tk_nn_classifier.data_loader package¶

Submodules¶

tk_nn_classifier.data_loader.common_loader module¶

Base class for reading files and retrieving the field names relevant to training

class tk_nn_classifier.data_loader.base_loader.BaseLoader(config)¶

Bases: object

get_details(data_path)¶

get_train_data(data_path)¶

tk_nn_classifier.data_loader.csv_loader module¶

CSV file reader: import data from csv files

class tk_nn_classifier.data_loader.csv_loader.CSVLoader(config)¶

Bases: tk_nn_classifier.data_loader.base_loader.BaseLoader

get_details(data_path)¶

get_train_data(data_path)¶

split_data(data_path, ratio=0.8, des='models')¶: split the data into train and eval sets
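The train/eval split controlled by `ratio` can be illustrated with a minimal sketch. This is not the loader's own code: the helper name `split_rows`, the shuffling step, and the `seed` parameter are assumptions made for illustration, and the `des` argument of `split_data` is left out because its behaviour is not described here.

```python
import random


def split_rows(rows, ratio=0.8, seed=0):
    """Shuffle rows and split them into (train, eval) lists.

    `ratio` is the fraction of rows that ends up in the training list.
    Hypothetical helper, not part of tk_nn_classifier.
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


train, eval_rows = split_rows(range(10), ratio=0.8)
```

With ten rows and the default ratio, eight rows land in the training list and two in the eval list.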

tk_nn_classifier.data_loader.data_reader module¶

class tk_nn_classifier.data_loader.data_reader.DataReader(config)¶

Bases: object

get_data_set(data_path)¶

get_data_set_with_detail(data_path)¶

get_split_data()¶

tk_nn_classifier.data_loader.label_class_mapper module¶

Label mapping: create mapping from label to class_id

class tk_nn_classifier.data_loader.label_class_mapper.LabelClassMapper(classid_to_label, label_mapper_file='label_mapper.json')¶

Bases: object

class_id(label)¶

classmethod from_file(label_mapper_file)¶

classmethod from_labels(labels, label_mapper_file='label_mapper.json')¶

label_name(class_id)¶

write()¶
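The idea of a label-to-class-id mapping can be sketched as follows. The class `SimpleLabelMapper` is a hypothetical stand-in, not the library's `LabelClassMapper`: the sorted ordering of labels and the JSON list layout are assumptions, and the real `label_mapper.json` format may differ.

```python
import json


class SimpleLabelMapper:
    """Hypothetical stand-in that maps labels to integer class ids and back."""

    def __init__(self, classid_to_label):
        # Position in the list is the class id.
        self.classid_to_label = list(classid_to_label)
        self.label_to_classid = {
            label: class_id
            for class_id, label in enumerate(self.classid_to_label)
        }

    @classmethod
    def from_labels(cls, labels):
        # Deduplicate and sort so that ids are stable across runs (assumption).
        return cls(sorted(set(labels)))

    def class_id(self, label):
        return self.label_to_classid[label]

    def label_name(self, class_id):
        return self.classid_to_label[class_id]

    def to_json(self):
        # Assumed serialization: a plain JSON list of labels.
        return json.dumps(self.classid_to_label)

    @classmethod
    def from_json(cls, text):
        return cls(json.loads(text))


mapper = SimpleLabelMapper.from_labels(["spam", "ham", "spam"])
restored = SimpleLabelMapper.from_json(mapper.to_json())
```

Storing the mapping next to the model lets evaluation code translate predicted class ids back into label names.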

tk_nn_classifier.data_loader.spacy_data_reader module¶

SpaCy data reader: prepare the train/eval data in spaCy format

class tk_nn_classifier.data_loader.spacy_data_reader.SpacyDataReader(config)¶

Bases: tk_nn_classifier.data_loader.data_reader.DataReader

get_data(data_path, shuffle=False, train_mode=False)¶
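As a rough illustration of what "spaCy format" can mean for a text classifier, the sketch below builds training tuples in the style used by spaCy v2 text categorizers, where each example is a text paired with a `cats` dictionary. Which spaCy version this package targets is not stated here, and the helper name `to_textcat_examples` is hypothetical.

```python
def to_textcat_examples(texts, labels, all_labels):
    """Build (text, {"cats": {...}}) tuples, one boolean per known label.

    Hypothetical helper, not part of tk_nn_classifier.
    """
    examples = []
    for text, label in zip(texts, labels):
        # Mark the gold label True and every other label False.
        cats = {name: name == label for name in all_labels}
        examples.append((text, {"cats": cats}))
    return examples


examples = to_textcat_examples(
    ["good movie", "bad movie"],
    ["pos", "neg"],
    ["neg", "pos"],
)
```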

tk_nn_classifier.data_loader.tf_data_reader module¶

TF data reader: prepare the input data

class tk_nn_classifier.data_loader.tf_data_reader.TFDataReader(config)¶

Bases: tk_nn_classifier.data_loader.data_reader.DataReader

get_data(data_path)¶

tk_nn_classifier.data_loader.tokenizer module¶

tk_nn_classifier.data_loader.tokenizer.tokenize(string)¶: tokenize the string and return the list of normalized tokens
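The normalization rules applied by `tokenize` are not documented here. As a minimal sketch of the general idea, the hypothetical function below splits on word characters and punctuation and lowercases each token; both choices are assumptions and may not match the library.

```python
import re

# Runs of word characters, or single non-space punctuation characters.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(string):
    """Return lowercased tokens; hypothetical, not the library's tokenize."""
    return [token.lower() for token in _TOKEN_RE.findall(string)]


tokens = simple_tokenize("Hello, World!")
```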

tk_nn_classifier.data_loader.trxml_loader module¶

TRXML file reader: import data from trxml files

class tk_nn_classifier.data_loader.trxml_loader.TRXMLLoader(config)¶

Bases: tk_nn_classifier.data_loader.base_loader.BaseLoader

get_details(data_path)¶

get_train_data(data_path)¶

split_data(data_path, ratio=0.8, des='models')¶: split the data into train and eval sets

tk_nn_classifier.data_loader.word_vector module¶

Basic class for word embedding

class tk_nn_classifier.data_loader.word_vector.WordVector(inputfile)¶

Bases: object

Word embedding class. It is created from a word2vec-type word embedding, and it contains:

- vocab
- vectors

PAD = 'xxPADxx'¶

PAD_ID = 0¶

UNK = 'xxUNKxx'¶

UNK_ID = 1¶

cosine_nearest_neighbors(input_vector, nr_neighbors=10)¶: compute the nearest n neighbours of any input vector

static create_vocab_index_dict(vocab)¶

get_index(word)¶: look up the index of the given word in the embedding

get_vector(word)¶: look up the vector of the given word in the embedding

get_vectors(words)¶: look up the vectors of the given words

get_word(index)¶: look up the token in vocabulary with given index

classmethod read_embeddings(inputfile, vacab_unicode_size=78)¶

Read an embeddings file and return a vocabulary array and a vectors array.

params:

- inputfile: a single file path, as a string
- vacab_unicode_size: maximum length of words in the vocab, in characters

returns:

- vocab: [vocab_size, vacab_unicode_size] unicode array
- vectors: [vocab_size, vector_size] float array containing the embeddings

classmethod read_embeddings_header(inputfile, mimetype='text/plain')¶: read the header from the file. Note that the binary and text formats share the same header.

save_sublist(words, output_file)¶: Generate a smaller binary word-embeddings model file with only the given list of words.

unk_vector¶: get the default vector for unknown words

vector_size¶: get the length of the word embedding

vocab_size¶: get the number of tokens in vocabulary
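To make the lookup and neighbour methods more concrete, here is a minimal NumPy sketch over a toy vocabulary. It is not the `WordVector` implementation: the toy vectors are invented for illustration, and falling back to `UNK_ID` for out-of-vocabulary words is an assumption suggested by the `UNK` constants rather than confirmed behaviour.

```python
import numpy as np

PAD, UNK = "xxPADxx", "xxUNKxx"
PAD_ID, UNK_ID = 0, 1

# Toy data, invented for this sketch.
vocab = [PAD, UNK, "cat", "dog", "car"]
vectors = np.array([
    [0.0, 0.0],  # PAD
    [0.1, 0.1],  # UNK
    [1.0, 0.0],  # cat
    [0.9, 0.1],  # dog
    [0.0, 1.0],  # car
])
index = {word: i for i, word in enumerate(vocab)}


def get_index(word):
    # Assumption: unknown words map to the UNK id.
    return index.get(word, UNK_ID)


def cosine_nearest_neighbors(input_vector, nr_neighbors=10):
    """Return (word, similarity) pairs sorted by descending cosine similarity."""
    query = np.asarray(input_vector, dtype=float)
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    # Zero-length vectors (such as PAD) get -inf so they sort last.
    sims = np.divide(
        vectors @ query,
        norms,
        out=np.full(len(vocab), -np.inf),
        where=norms > 0,
    )
    best = np.argsort(-sims)[:nr_neighbors]
    return [(vocab[i], float(sims[i])) for i in best]


neighbors = cosine_nearest_neighbors(vectors[get_index("cat")], nr_neighbors=2)
```

In this toy setup the two nearest neighbours of the "cat" vector are "cat" itself and "dog".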

tk_nn_classifier.data_loader.word_vector.maxabs(embeddings, axis=0)¶: Return a slice of embeddings, keeping only those values that are furthest away from 0 along axis
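One way to read the `maxabs` description is as a signed max-pooling over `axis`: for each position, keep whichever of the maximum or minimum has the larger absolute value. The sketch below follows that reading. The name `maxabs_sketch` is hypothetical, and the tie-breaking rule (the positive value wins) is an assumption.

```python
import numpy as np


def maxabs_sketch(embeddings, axis=0):
    """Keep, along `axis`, the value furthest from zero (sign preserved)."""
    embeddings = np.asarray(embeddings)
    maxima = embeddings.max(axis=axis)
    minima = embeddings.min(axis=axis)
    # Pick the maximum when it is at least as far from zero as the minimum.
    return np.where(np.abs(maxima) >= np.abs(minima), maxima, minima)


pooled = maxabs_sketch([[1.0, -4.0], [-3.0, 2.0]])
```

Here the first column keeps -3.0 and the second keeps -4.0, because those are the entries furthest from zero in each column.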

tk_nn_classifier.data_loader.word_vector.unitvec(vec)¶: normalize the vector
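Assuming "normalize" means scaling to unit Euclidean length, `unitvec` can be sketched as below. The name `unitvec_sketch` is hypothetical, and returning a zero vector unchanged is an assumption about an edge case that is not documented here.

```python
import numpy as np


def unitvec_sketch(vec):
    """Scale `vec` to unit L2 norm; a zero vector is returned unchanged."""
    vec = np.asarray(vec, dtype=float)
    norm = np.linalg.norm(vec)
    if norm == 0.0:
        # Avoid dividing by zero for an all-zero vector.
        return vec
    return vec / norm


unit = unitvec_sketch([3.0, 4.0])
```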

Module contents¶

class tk_nn_classifier.data_loader.WordVector(inputfile)¶

Bases: object

Word embedding class. It is created from a word2vec-type word embedding, and it contains:

- vocab
- vectors

PAD = 'xxPADxx'¶

PAD_ID = 0¶

UNK = 'xxUNKxx'¶

UNK_ID = 1¶

cosine_nearest_neighbors(input_vector, nr_neighbors=10)¶: compute the nearest n neighbours of any input vector

static create_vocab_index_dict(vocab)¶

get_index(word)¶: look up the index of the given word in the embedding

get_vector(word)¶: look up the vector of the given word in the embedding

get_vectors(words)¶: look up the vectors of the given words

get_word(index)¶: look up the token in vocabulary with given index

classmethod read_embeddings(inputfile, vacab_unicode_size=78)¶

Read an embeddings file and return a vocabulary array and a vectors array.

params:

- inputfile: a single file path, as a string
- vacab_unicode_size: maximum length of words in the vocab, in characters

returns:

- vocab: [vocab_size, vacab_unicode_size] unicode array
- vectors: [vocab_size, vector_size] float array containing the embeddings

classmethod read_embeddings_header(inputfile, mimetype='text/plain')¶: read the header from the file. Note that the binary and text formats share the same header.

save_sublist(words, output_file)¶: Generate a smaller binary word-embeddings model file with only the given list of words.

unk_vector¶: get the default vector for unknown words

vector_size¶: get the length of the word embedding

vocab_size¶: get the number of tokens in vocabulary

class tk_nn_classifier.data_loader.DataReader(config)¶

Bases: object

get_data_set(data_path)¶

get_data_set_with_detail(data_path)¶

get_split_data()¶

class tk_nn_classifier.data_loader.SpacyDataReader(config)¶

Bases: tk_nn_classifier.data_loader.data_reader.DataReader

get_data(data_path, shuffle=False, train_mode=False)¶

class tk_nn_classifier.data_loader.TFDataReader(config)¶

Bases: tk_nn_classifier.data_loader.data_reader.DataReader

get_data(data_path)¶

tk_nn_classifier.data_loader.tokenize(string)¶: tokenize the string and return the list of normalized tokens

tk_nn_classifier.data_loader.download_tk_embedding(language: str, target_file: str) → None¶: Download the word-embeddings if not already present
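The "download only if not already present" behaviour can be sketched with the standard library. This is not the implementation of `download_tk_embedding`: the helper name `download_if_missing`, the explicit `url` argument, and the injectable `fetch` callable are assumptions for illustration, and how a `language` maps to a download location is not described here.

```python
import os
import urllib.request


def download_if_missing(url, target_file, fetch=urllib.request.urlretrieve):
    """Fetch `url` into `target_file` unless the file already exists.

    Returns True when a download happened, False when it was skipped.
    Hypothetical helper, not part of tk_nn_classifier.
    """
    if os.path.exists(target_file):
        # File is already present, so there is nothing to do.
        return False
    target_dir = os.path.dirname(target_file)
    if target_dir:
        os.makedirs(target_dir, exist_ok=True)
    fetch(url, target_file)
    return True
```

Passing the fetch function as a parameter keeps the existence check testable without network access.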