tk_nn_classifier.data_loader package¶
Submodules¶
tk_nn_classifier.data_loader.common_loader module¶
Base class for reading files and retrieving the field names relevant to training
tk_nn_classifier.data_loader.csv_loader module¶
CSV file reader: import data from csv files
-
class tk_nn_classifier.data_loader.csv_loader.CSVLoader(config)¶
Bases: tk_nn_classifier.data_loader.base_loader.BaseLoader
-
get_details(data_path)¶
-
get_train_data(data_path)¶
-
split_data(data_path, ratio=0.8, des='models')¶ split the data into train and eval sets
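As an illustration of what a ratio-based train/eval split like the one above involves, here is a minimal sketch. The helper name `split_rows`, the shuffling step, and the `seed` parameter are assumptions made for this example and are not taken from the package; the `des` argument (which appears to name an output location) is left out.

```python
import random


def split_rows(rows, ratio=0.8, seed=0):
    """Shuffle rows and split them into (train, eval) lists by the given ratio."""
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * ratio)         # number of rows assigned to train
    return shuffled[:cut], shuffled[cut:]


train, evaluation = split_rows(range(10), ratio=0.8)
print(len(train), len(evaluation))
```

With the default ratio of 0.8, four fifths of the rows go to the training set and the remainder to the eval set.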
-
tk_nn_classifier.data_loader.data_reader module¶
tk_nn_classifier.data_loader.label_class_mapper module¶
Label mapping: create mapping from label to class_id
-
class tk_nn_classifier.data_loader.label_class_mapper.LabelClassMapper(classid_to_label, label_mapper_file='label_mapper.json')¶
Bases: object
-
class_id(label)¶
-
classmethod from_file(label_mapper_file)¶
-
classmethod from_labels(labels, label_mapper_file='label_mapper.json')¶
-
label_name(class_id)¶
-
write()¶
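The methods above are undocumented, so the following is only a sketch of the general idea behind a label-to-class-id mapper that is persisted as JSON. The function name `build_label_mapping` and the choice to assign ids in sorted label order are assumptions for this example, not the package's confirmed behaviour.

```python
import json


def build_label_mapping(labels):
    """Map each distinct label to a class id, assigning ids in sorted label order."""
    classid_to_label = dict(enumerate(sorted(set(labels))))
    label_to_classid = {label: cid for cid, label in classid_to_label.items()}
    return classid_to_label, label_to_classid


classid_to_label, label_to_classid = build_label_mapping(["spam", "ham", "spam"])

# JSON object keys are always strings, so ids must be converted back to int
# after a round trip through a file such as label_mapper.json.
serialized = json.dumps(classid_to_label)
restored = {int(cid): label for cid, label in json.loads(serialized).items()}
```

Keeping both directions of the mapping makes lookups like `class_id(label)` and `label_name(class_id)` constant-time dictionary accesses.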
-
tk_nn_classifier.data_loader.spacy_data_reader module¶
SpaCy data reader: prepare the train/eval data in spaCy format
-
class tk_nn_classifier.data_loader.spacy_data_reader.SpacyDataReader(config)¶
Bases: tk_nn_classifier.data_loader.data_reader.DataReader
-
get_data(data_path, shuffle=False, train_mode=False)¶
-
tk_nn_classifier.data_loader.tf_data_reader module¶
TF data reader: prepare the input data
-
class tk_nn_classifier.data_loader.tf_data_reader.TFDataReader(config)¶
Bases: tk_nn_classifier.data_loader.data_reader.DataReader
-
get_data(data_path)¶
-
tk_nn_classifier.data_loader.tokenizer module¶
-
tk_nn_classifier.data_loader.tokenizer.tokenize(string)¶ tokenize string, and return the list of normalized tokens
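The normalization rules applied by `tokenize` are not described here. Purely as an illustration of "tokenize and normalize", the sketch below lowercases the input and keeps runs of word characters; both rules and the name `simple_tokenize` are assumptions, and the real function may behave differently.

```python
import re

# Runs of letters, digits and underscores count as one token (an assumed rule).
_TOKEN_RE = re.compile(r"\w+")


def simple_tokenize(string):
    """Return the lowercased word tokens found in the string."""
    return [token.lower() for token in _TOKEN_RE.findall(string)]


tokens = simple_tokenize("Hello, World!")
```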
tk_nn_classifier.data_loader.trxml_loader module¶
TRXML file reader: import data from trxml files
-
class tk_nn_classifier.data_loader.trxml_loader.TRXMLLoader(config)¶
Bases: tk_nn_classifier.data_loader.base_loader.BaseLoader
-
get_details(data_path)¶
-
get_train_data(data_path)¶
-
split_data(data_path, ratio=0.8, des='models')¶ split the data into train and eval sets
-
tk_nn_classifier.data_loader.word_vector module¶
Basic class for word embedding
-
class tk_nn_classifier.data_loader.word_vector.WordVector(inputfile)¶
Bases: object
Word embedding class, created from a word2vec-type word embedding. It contains:
- vocab
- vectors
-
PAD= 'xxPADxx'¶
-
PAD_ID= 0¶
-
UNK= 'xxUNKxx'¶
-
UNK_ID= 1¶
-
cosine_nearest_neighbors(input_vector, nr_neighbors=10)¶ compute the nearest n neighbours of any input vector
-
static create_vocab_index_dict(vocab)¶
-
get_index(word)¶ lookup the index given the word in the embedding
-
get_vector(word)¶ lookup the vector given the word in the embedding
-
get_vectors(words)¶ lookup the vectors given words
-
get_word(index)¶ look up the token in vocabulary with given index
-
classmethod read_embeddings(inputfile, vacab_unicode_size=78)¶ Read embeddings files and return a vocabulary and vectors array.
- params:
  - inputfile: a single filepath as string
  - vacab_unicode_size: max length of words in vocab (chars)
- returns:
  - vocab: [vocab_size, vacab_unicode_size] unicode array
  - vectors: [vocab_size, vector_size] float array containing embeddings
-
classmethod read_embeddings_header(inputfile, mimetype='text/plain')¶ Read the header from the file. Note that the binary and text formats share the same header.
-
save_sublist(words, output_file)¶ Generate a smaller binary word-embeddings model file with only the given list of words.
-
unk_vector¶ get the default vector for the unknown word
-
vector_size¶ get the length of the word embedding
-
vocab_size¶ get the number of tokens in vocabulary
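To make the members above more concrete, here is a minimal sketch of the same ideas on a tiny word2vec text-format sample: reading the `vocab_size vector_size` header, building a word-to-index dictionary, looking up an index, and ranking neighbours by cosine similarity. It uses plain lists rather than arrays. All function names are hypothetical, the fallback to `UNK_ID` for unknown words is an assumption, and the sample places the PAD and UNK markers at indices 0 and 1 only to mirror `PAD_ID` and `UNK_ID`; whether real embedding files contain them is not stated here.

```python
import io
import math

PAD, UNK = "xxPADxx", "xxUNKxx"   # same marker strings as the class constants
UNK_ID = 1


def read_text_embeddings(stream):
    """Parse word2vec text format: a 'vocab_size vector_size' header, then one word per line."""
    vocab_size, vector_size = (int(field) for field in stream.readline().split())
    vocab, vectors = [], []
    for _ in range(vocab_size):
        word, *values = stream.readline().split()
        if len(values) != vector_size:
            raise ValueError(f"expected {vector_size} values for {word!r}")
        vocab.append(word)
        vectors.append([float(value) for value in values])
    return vocab, vectors


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(input_vector, vocab, vectors, nr_neighbors=10):
    """Return the nr_neighbors (word, similarity) pairs most similar to input_vector."""
    scored = sorted(
        ((cosine(input_vector, vec), word) for word, vec in zip(vocab, vectors)),
        reverse=True,
    )
    return [(word, score) for score, word in scored[:nr_neighbors]]


sample = io.StringIO(
    "5 2\nxxPADxx 0.0 0.0\nxxUNKxx 0.1 0.1\ncat 1.0 0.0\ndog 0.9 0.1\ncar 0.0 1.0\n"
)
vocab, vectors = read_text_embeddings(sample)
index_of = {word: index for index, word in enumerate(vocab)}


def get_index(word):
    # Assumption: out-of-vocabulary words fall back to UNK_ID.
    return index_of.get(word, UNK_ID)


neighbors = nearest_neighbors(vectors[get_index("cat")], vocab, vectors, nr_neighbors=2)
```

Sorting every vocabulary entry is fine for a toy sample; for a full-size vocabulary a vectorized similarity computation would be the more usual choice.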
-
-
tk_nn_classifier.data_loader.word_vector.maxabs(embeddings, axis=0)¶ Return slice of embeddings, keeping only those values that are furthest away from 0 along axis
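One way to read the description above is as a pooling step: along the chosen axis, keep the value with the largest magnitude and preserve its sign. The sketch below shows this for `axis=0` only, on nested lists; the name `maxabs_columns` is hypothetical, and the real function presumably operates on arrays and supports other axes.

```python
def maxabs_columns(rows):
    """For each column, return the value with the largest absolute value (sign kept)."""
    return [max(column, key=abs) for column in zip(*rows)]


pooled = maxabs_columns([[0.2, -0.9], [-0.5, 0.3], [0.4, 0.1]])
```

Unlike a plain maximum, this keeps strongly negative components, which carry as much information in an embedding as strongly positive ones.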
-
tk_nn_classifier.data_loader.word_vector.unitvec(vec)¶ normalize the vector
Module contents¶
-
class tk_nn_classifier.data_loader.WordVector(inputfile)¶
Bases: object
Word embedding class, created from a word2vec-type word embedding. It contains:
- vocab
- vectors
-
PAD= 'xxPADxx'¶
-
PAD_ID= 0¶
-
UNK= 'xxUNKxx'¶
-
UNK_ID= 1¶
-
cosine_nearest_neighbors(input_vector, nr_neighbors=10)¶ compute the nearest n neighbours of any input vector
-
static create_vocab_index_dict(vocab)¶
-
get_index(word)¶ lookup the index given the word in the embedding
-
get_vector(word)¶ lookup the vector given the word in the embedding
-
get_vectors(words)¶ lookup the vectors given words
-
get_word(index)¶ look up the token in vocabulary with given index
-
classmethod read_embeddings(inputfile, vacab_unicode_size=78)¶ Read embeddings files and return a vocabulary and vectors array.
- params:
  - inputfile: a single filepath as string
  - vacab_unicode_size: max length of words in vocab (chars)
- returns:
  - vocab: [vocab_size, vacab_unicode_size] unicode array
  - vectors: [vocab_size, vector_size] float array containing embeddings
-
classmethod read_embeddings_header(inputfile, mimetype='text/plain')¶ Read the header from the file. Note that the binary and text formats share the same header.
-
save_sublist(words, output_file)¶ Generate a smaller binary word-embeddings model file with only the given list of words.
-
unk_vector¶ get the default vector for the unknown word
-
vector_size¶ get the length of the word embedding
-
vocab_size¶ get the number of tokens in vocabulary
-
-
class tk_nn_classifier.data_loader.DataReader(config)¶
Bases: object
-
get_data_set(data_path)¶
-
get_data_set_with_detail(data_path)¶
-
get_split_data()¶
-
-
class tk_nn_classifier.data_loader.SpacyDataReader(config)¶
Bases: tk_nn_classifier.data_loader.data_reader.DataReader
-
get_data(data_path, shuffle=False, train_mode=False)¶
-
-
class tk_nn_classifier.data_loader.TFDataReader(config)¶
Bases: tk_nn_classifier.data_loader.data_reader.DataReader
-
get_data(data_path)¶
-
-
tk_nn_classifier.data_loader.tokenize(string)¶ tokenize string, and return the list of normalized tokens
-
tk_nn_classifier.data_loader.download_tk_embedding(language: str, target_file: str) → None¶ Download the word-embeddings if not already present
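The "download if not already present" pattern can be sketched as follows. How `download_tk_embedding` maps a language to a download location is not documented here, so the sketch takes an explicit URL instead; the name `download_if_missing` and the injectable `fetch` parameter are hypothetical and exist only to keep the example testable without network access.

```python
import os
import urllib.request


def download_if_missing(url, target_file, fetch=urllib.request.urlretrieve):
    """Download url to target_file unless the file already exists.

    Returns True if a download was performed, False if the file was already present.
    """
    if os.path.exists(target_file):
        return False
    target_dir = os.path.dirname(target_file)
    if target_dir:
        os.makedirs(target_dir, exist_ok=True)   # create missing parent directories
    fetch(url, target_file)
    return True
```

Checking for the file first makes repeated calls cheap, which matters because embedding files are typically large.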