tk_nn_classifier.data_loader package

Submodules

tk_nn_classifier.data_loader.base_loader module

Base class that reads files and gets the field names relevant to training

class tk_nn_classifier.data_loader.base_loader.BaseLoader(config)

Bases: object

get_details(data_path)
get_train_data(data_path)

tk_nn_classifier.data_loader.csv_loader module

CSV file reader: import data from csv files

class tk_nn_classifier.data_loader.csv_loader.CSVLoader(config)

Bases: tk_nn_classifier.data_loader.base_loader.BaseLoader

get_details(data_path)
get_train_data(data_path)
split_data(data_path, ratio=0.8, des='models')

Split the data into train and eval sets
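As an illustration of what a ratio-based split involves, here is a minimal sketch. It is not the CSVLoader implementation: the function name split_rows, the seed parameter, and the choice to shuffle before cutting are assumptions made for this example, and the des argument (presumably the output location) is left out.

```python
import random


def split_rows(rows, ratio=0.8, seed=0):
    """Shuffle a copy of rows and split it into (train, eval) by ratio.

    Hypothetical helper: the real split_data may read from data_path and
    write files instead of returning lists.
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


train, evaluation = split_rows(range(10), ratio=0.8)
```

With ratio=0.8 and ten rows, eight rows land in the train part and two in the eval part.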

tk_nn_classifier.data_loader.data_reader module

class tk_nn_classifier.data_loader.data_reader.DataReader(config)

Bases: object

get_data_set(data_path)
get_data_set_with_detail(data_path)
get_split_data()

tk_nn_classifier.data_loader.label_class_mapper module

Label mapping: create mapping from label to class_id

class tk_nn_classifier.data_loader.label_class_mapper.LabelClassMapper(classid_to_label, label_mapper_file='label_mapper.json')

Bases: object

class_id(label)
classmethod from_file(label_mapper_file)
classmethod from_labels(labels, label_mapper_file='label_mapper.json')
label_name(class_id)
write()
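The method names above suggest a two-way lookup between label names and integer class ids. The sketch below shows one way such a mapping can work. SimpleLabelMapper and to_json are hypothetical names, and sorting the labels to assign ids is an assumption, not documented behaviour of LabelClassMapper.

```python
import json


class SimpleLabelMapper:
    """Hypothetical stand-in that maps labels to class ids and back."""

    def __init__(self, classid_to_label):
        # The position in the list is the class id.
        self.classid_to_label = list(classid_to_label)
        self.label_to_classid = {
            label: class_id
            for class_id, label in enumerate(self.classid_to_label)
        }

    @classmethod
    def from_labels(cls, labels):
        # Sorting the distinct labels makes the ids reproducible (assumption).
        return cls(sorted(set(labels)))

    def class_id(self, label):
        return self.label_to_classid[label]

    def label_name(self, class_id):
        return self.classid_to_label[class_id]

    def to_json(self):
        # Serialise the id-to-label list; a real mapper might write a file.
        return json.dumps(self.classid_to_label)


mapper = SimpleLabelMapper.from_labels(["spam", "ham", "spam"])
```

Here mapper.class_id("ham") is 0 and mapper.label_name(1) is "spam", because the distinct labels are sorted before ids are assigned.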

tk_nn_classifier.data_loader.spacy_data_reader module

SpaCy data reader: prepare the train/eval data in spaCy format

class tk_nn_classifier.data_loader.spacy_data_reader.SpacyDataReader(config)

Bases: tk_nn_classifier.data_loader.data_reader.DataReader

get_data(data_path, shuffle=False, train_mode=False)

tk_nn_classifier.data_loader.tf_data_reader module

TF data reader: prepare the input data

class tk_nn_classifier.data_loader.tf_data_reader.TFDataReader(config)

Bases: tk_nn_classifier.data_loader.data_reader.DataReader

get_data(data_path)

tk_nn_classifier.data_loader.tokenizer module

tk_nn_classifier.data_loader.tokenizer.tokenize(string)

Tokenize a string and return the list of normalized tokens
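The documentation does not say which normalization is applied. The sketch below assumes the simplest plausible scheme, lowercasing and splitting on runs of word characters, purely for illustration. simple_tokenize and TOKEN_PATTERN are hypothetical names, and the package's own tokenize may behave differently.

```python
import re

# Runs of letters, digits and underscores count as tokens (assumption).
TOKEN_PATTERN = re.compile(r"\w+")


def simple_tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = simple_tokenize("Hello, World! It's 2024.")
```

Under these assumptions punctuation is dropped, so "It's" becomes the two tokens "it" and "s".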

tk_nn_classifier.data_loader.trxml_loader module

TRXML file reader: import data from trxml files

class tk_nn_classifier.data_loader.trxml_loader.TRXMLLoader(config)

Bases: tk_nn_classifier.data_loader.base_loader.BaseLoader

get_details(data_path)
get_train_data(data_path)
split_data(data_path, ratio=0.8, des='models')

Split the data into train and eval sets

tk_nn_classifier.data_loader.word_vector module

Basic class for word embedding

class tk_nn_classifier.data_loader.word_vector.WordVector(inputfile)

Bases: object

Word embedding class: it is created from a word2vec-type word embedding, and it contains:

- vocab
- vectors

PAD = 'xxPADxx'
PAD_ID = 0
UNK = 'xxUNKxx'
UNK_ID = 1
cosine_nearest_neighbors(input_vector, nr_neighbors=10)

Compute the nr_neighbors nearest neighbours of an input vector by cosine similarity
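For readers unfamiliar with cosine-based neighbour search, the following pure-Python sketch ranks vocabulary entries by cosine similarity to a query vector. It only illustrates the technique: the function names, the list-based inputs, and the (word, score) return shape are assumptions, and the actual method most likely operates on the class's own vector array.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_a * norm_b)


def nearest_neighbors(input_vector, vocab, vectors, nr_neighbors=10):
    """Return up to nr_neighbors (word, similarity) pairs, best first."""
    scored = [
        (word, cosine_similarity(input_vector, vec))
        for word, vec in zip(vocab, vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:nr_neighbors]


vocab = ["cat", "dog", "car"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
neighbors = nearest_neighbors([1.0, 0.0], vocab, vectors, nr_neighbors=2)
```

In this toy vocabulary the query points in the same direction as "cat", so "cat" ranks first and "dog" second.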

static create_vocab_index_dict(vocab)
get_index(word)

Look up the index of the given word in the embedding

get_vector(word)

Look up the vector of the given word in the embedding

get_vectors(words)

Look up the vectors of the given words

get_word(index)

Look up the token in the vocabulary at the given index
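The constants PAD_ID = 0 and UNK_ID = 1 suggest that the first two vocabulary slots are reserved for padding and unknown tokens. The sketch below shows how an index dictionary with such reserved slots could be built and queried. build_index and lookup_index are hypothetical names, and falling back to UNK_ID for out-of-vocabulary words is an assumption about the intended behaviour.

```python
PAD, PAD_ID = "xxPADxx", 0
UNK, UNK_ID = "xxUNKxx", 1


def build_index(words):
    """Return (vocab, word-to-index dict) with PAD and UNK in slots 0 and 1."""
    vocab = [PAD, UNK] + [w for w in words if w not in (PAD, UNK)]
    return vocab, {word: index for index, word in enumerate(vocab)}


def lookup_index(index_by_word, word):
    # Words missing from the vocabulary map to the UNK slot (assumption).
    return index_by_word.get(word, UNK_ID)


vocab, index_by_word = build_index(["cat", "dog"])
```

With this layout "cat" gets index 2, and a word that is not in the vocabulary resolves to index 1.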

classmethod read_embeddings(inputfile, vacab_unicode_size=78)

Read embeddings files and return a vocabulary and vectors array.

params:
    inputfile: a single filepath as string
    vacab_unicode_size: max length of words in vocab (chars)
returns:
    vocab: [vocab_size, vacab_unicode_size] unicode array
    vectors: [vocab_size, vector_size] float array containing embeddings

classmethod read_embeddings_header(inputfile, mimetype='text/plain')

Read the header from the file. Binary and text embedding files share the same header format.
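In the word2vec format, the first line of a file holds two integers: the vocabulary size and the vector dimension. This holds for both the text and the binary variant, which is why one header reader can serve both. The sketch below parses such a line. parse_word2vec_header is a hypothetical helper, not the method documented above, and it takes the already-read first line rather than a file path.

```python
def parse_word2vec_header(first_line):
    """Parse 'vocab_size vector_size' from the first line of a word2vec file.

    Accepts bytes (binary files) or str (text files). Raises ValueError if
    the line does not hold exactly two integers.
    """
    if isinstance(first_line, bytes):
        first_line = first_line.decode("utf-8")
    vocab_size, vector_size = (int(field) for field in first_line.split())
    return vocab_size, vector_size


header = parse_word2vec_header(b"3000000 300\n")
```

The same call works on a str line read from a text-format file.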

save_sublist(words, output_file)

Generate a smaller binary word-embeddings model file with only the given list of words.

unk_vector

Get the default vector for unknown words

vector_size

Get the length of the word embedding vectors

vocab_size

Get the number of tokens in the vocabulary

tk_nn_classifier.data_loader.word_vector.maxabs(embeddings, axis=0)

Return slice of embeddings, keeping only those values that are furthest away from 0 along axis
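One reading of this description, for axis=0, is a max-abs pooling: for every column, keep the entry with the largest absolute value and preserve its sign. The sketch below implements that reading on plain lists. maxabs_columns is a hypothetical name, and the real function presumably works on arrays and supports other axes.

```python
def maxabs_columns(rows):
    """For each column, return the value furthest from 0, keeping its sign."""
    # zip(*rows) iterates over columns; key=abs compares magnitudes only.
    return [max(column, key=abs) for column in zip(*rows)]


pooled = maxabs_columns([[0.2, -0.9, 0.1], [-0.5, 0.3, 0.1]])
```

In the example the second column yields -0.9 rather than 0.3, because the selection is by magnitude while the sign is kept.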

tk_nn_classifier.data_loader.word_vector.unitvec(vec)

Normalize the vector to unit length
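Normalizing to unit length means dividing each component by the vector's Euclidean norm. A minimal pure-Python sketch follows. unit_vector is a hypothetical name, and returning a zero vector unchanged is a choice made for this example, not documented behaviour of unitvec.

```python
import math


def unit_vector(vec):
    """Scale vec so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        # A zero vector has no direction; return a copy unchanged.
        return list(vec)
    return [x / norm for x in vec]


unit = unit_vector([3.0, 4.0])
```

The vector (3, 4) has norm 5, so the result is approximately (0.6, 0.8).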

Module contents

class tk_nn_classifier.data_loader.WordVector(inputfile)

Bases: object

Word embedding class: it is created from a word2vec-type word embedding, and it contains:

- vocab
- vectors

PAD = 'xxPADxx'
PAD_ID = 0
UNK = 'xxUNKxx'
UNK_ID = 1
cosine_nearest_neighbors(input_vector, nr_neighbors=10)

Compute the nr_neighbors nearest neighbours of an input vector by cosine similarity

static create_vocab_index_dict(vocab)
get_index(word)

Look up the index of the given word in the embedding

get_vector(word)

Look up the vector of the given word in the embedding

get_vectors(words)

Look up the vectors of the given words

get_word(index)

Look up the token in the vocabulary at the given index

classmethod read_embeddings(inputfile, vacab_unicode_size=78)

Read embeddings files and return a vocabulary and vectors array.

params:
    inputfile: a single filepath as string
    vacab_unicode_size: max length of words in vocab (chars)
returns:
    vocab: [vocab_size, vacab_unicode_size] unicode array
    vectors: [vocab_size, vector_size] float array containing embeddings

classmethod read_embeddings_header(inputfile, mimetype='text/plain')

Read the header from the file. Binary and text embedding files share the same header format.

save_sublist(words, output_file)

Generate a smaller binary word-embeddings model file with only the given list of words.

unk_vector

Get the default vector for unknown words

vector_size

Get the length of the word embedding vectors

vocab_size

Get the number of tokens in the vocabulary

class tk_nn_classifier.data_loader.DataReader(config)

Bases: object

get_data_set(data_path)
get_data_set_with_detail(data_path)
get_split_data()

class tk_nn_classifier.data_loader.SpacyDataReader(config)

Bases: tk_nn_classifier.data_loader.data_reader.DataReader

get_data(data_path, shuffle=False, train_mode=False)

class tk_nn_classifier.data_loader.TFDataReader(config)

Bases: tk_nn_classifier.data_loader.data_reader.DataReader

get_data(data_path)

tk_nn_classifier.data_loader.tokenize(string)

Tokenize a string and return the list of normalized tokens

tk_nn_classifier.data_loader.download_tk_embedding(language: str, target_file: str) → None

Download the word embeddings if they are not already present
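The "download only if missing" pattern can be sketched as follows. This is not the implementation of download_tk_embedding: the source does not state where the embeddings are hosted or how language maps to a URL, so the hypothetical helper download_if_missing takes the URL as an explicit argument. The fetch parameter is also an assumption, added so the transfer function can be swapped out.

```python
import os
import urllib.request


def download_if_missing(url, target_file, fetch=urllib.request.urlretrieve):
    """Fetch url into target_file unless the file already exists.

    Returns True if a download happened and False if it was skipped.
    """
    if os.path.exists(target_file):
        return False
    target_dir = os.path.dirname(target_file)
    if target_dir:
        # Create the destination directory on first use.
        os.makedirs(target_dir, exist_ok=True)
    fetch(url, target_file)
    return True
```

A second call with the same target_file returns False without touching the network, which is the behaviour the description above implies.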