src package¶
Subpackages¶
Submodules¶
src.encoders module¶
-
class
src.encoders.ALBERTEncoder(max_seq_length=512)¶ Bases:
src.encoders.Encoder-
encode(text, context=None, string_type='response')¶ Encode an iterable of strings
-
finetune_weights(question, answer, margin=0.3, loss='triplet', context=[], neg_answer=[], neg_answer_context=[], label=[])¶ Finetune the model with GradientTape
Parameters: - question (list of str) – List of string queries
- answer (list of str) – List of string responses
- context (list of str) – List of string response contexts, this is applicable to the USE model
- neg_answer (list of str) – List of string responses that do not match with the queries. This is applicable for triplet / contrastive loss.
- neg_answer_context (list of str) – Similar to neg_answer for the USE model to ingest
- label (list of int) – List of int
- margin (float) – Margin tuning parameter for triplet / contrastive loss
- loss (str) – Specify loss function
Returns: numpy array of mean loss value
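The label and margin parameters above are used by the contrastive loss option. As a rough illustration only, the sketch below uses one common textbook form of contrastive loss; the function name `contrastive_loss`, the convention that label 1 means "matching pair", and the use of precomputed distances are all assumptions and are not taken from the package source.

```python
def contrastive_loss(distances, labels, margin=0.3):
    """Mean contrastive loss over a batch of (distance, label) pairs.

    Assumed convention: label 1 = matching question/answer pair,
    label 0 = non-matching pair.
    """
    losses = []
    for d, y in zip(distances, labels):
        if y == 1:
            # Matching pairs are penalised for being far apart.
            losses.append(d ** 2)
        else:
            # Non-matching pairs are penalised only inside the margin.
            losses.append(max(margin - d, 0.0) ** 2)
    return sum(losses) / len(losses)


# A perfect batch: the match has distance 0, the mismatch lies beyond the margin.
perfect = contrastive_loss([0.0, 1.0], [1, 0])
# A mismatch that sits inside the margin contributes a positive loss.
inside_margin = contrastive_loss([0.1], [0], margin=0.3)
```

Under this form, a larger margin pushes non-matching pairs further apart before they stop contributing to the loss.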
-
init_signatures()¶ Re-initialize references to layers and model attributes. When the model is restored, the references to the vocab file and layers are lost.
-
restore_weights(save_dir=None)¶ Load weights from savepath
-
save_weights(save_dir=None)¶ Save the BERT model weights into a directory
-
-
class
src.encoders.BERTEncoder(max_seq_length=512)¶ Bases:
src.encoders.Encoder-
encode(text, context=None, string_type='response')¶ Return the tensor representing the embedding of the input text. string_type can be ‘query’ or ‘response’
Parameters: - text (str or iterable of str) – The text to be encoded
- string_type (str) – Either ‘response’ or ‘query’. Default is ‘response’. In the case of BERT, this argument is ignored
Returns: a tf.Tensor that contains the 768-dimensional encoding of the input text
-
finetune_weights(question, answer, margin=0.3, loss='triplet', context=[], neg_answer=[], neg_answer_context=[], label=[])¶ Finetune the model with GradientTape
Parameters: - question (list of str) – List of string queries
- answer (list of str) – List of string responses
- context (list of str) – List of string response contexts, this is applicable to the USE model
- neg_answer (list of str) – List of string responses that do not match with the queries. This is applicable for triplet / contrastive loss.
- neg_answer_context (list of str) – Similar to neg_answer for the USE model to ingest
- label (list of int) – List of int
- margin (float) – Margin tuning parameter for triplet / contrastive loss
- loss (str) – Specify loss function
Returns: numpy array of mean loss value
-
init_signatures()¶ Re-initialize references to layers and model attributes. When the model is restored, the references to the vocab file and layers are lost.
-
restore_weights(save_dir=None)¶ Load saved model from savepath
-
save_weights(save_dir=None)¶ Save the BERT model into a directory
-
-
class
src.encoders.Encoder¶ Bases:
abc.ABC
A shared encoder interface. Each encoder should provide an encode() method
-
encode()¶
-
finetune_weights()¶
-
restore_weights()¶
-
save_weights()¶
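To illustrate what implementing this interface might look like, here is a minimal sketch. `EncoderInterface` is a hypothetical stand-in for `src.encoders.Encoder`, and `LengthEncoder` is a toy class invented for this example; whether the real base class marks its methods with `@abstractmethod` is an assumption.

```python
from abc import ABC, abstractmethod


class EncoderInterface(ABC):
    """Hypothetical stand-in for src.encoders.Encoder."""

    @abstractmethod
    def encode(self, text, context=None, string_type="response"):
        """Return one vector per input string."""

    @abstractmethod
    def finetune_weights(self, question, answer, **kwargs):
        """Update the weights and return a loss value."""

    @abstractmethod
    def save_weights(self, save_dir=None):
        """Persist the weights."""

    @abstractmethod
    def restore_weights(self, save_dir=None):
        """Load previously saved weights."""


class LengthEncoder(EncoderInterface):
    """Toy encoder: maps each string to [character count, word count]."""

    def encode(self, text, context=None, string_type="response"):
        # Accept a single string as well as an iterable of strings.
        if isinstance(text, str):
            text = [text]
        return [[float(len(t)), float(len(t.split()))] for t in text]

    def finetune_weights(self, question, answer, **kwargs):
        return 0.0  # nothing to train in this toy example

    def save_weights(self, save_dir=None):
        pass  # no weights to save

    def restore_weights(self, save_dir=None):
        pass  # no weights to restore


vectors = LengthEncoder().encode(["hello world", "hi"])
```

Because the base class is abstract in this sketch, instantiating it directly raises a TypeError, which surfaces a missing method early rather than at query time.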
-
-
class
src.encoders.USEEncoder(max_seq_length=None, **kwargs)¶ Bases:
src.encoders.Encoder-
encode(text, context=None, string_type=None)¶
-
finetune_weights(question, answer, margin=0.3, loss='triplet', context=[], neg_answer=[], neg_answer_context=[], label=[])¶ Finetune the model with GradientTape
Parameters: - question (list of str) – List of string queries
- answer (list of str) – List of string responses
- context (list of str) – List of string response contexts, this is applicable to the USE model
- neg_answer (list of str) – List of string responses that do not match with the queries. This is applicable for triplet / contrastive loss.
- neg_answer_context (list of str) – Similar to neg_answer for the USE model to ingest
- label (list of int) – List of int
- margin (float) – Margin tuning parameter for triplet / contrastive loss
- loss (str) – Specify loss function
Returns: numpy array of mean loss value
-
init_signatures()¶
-
restore_weights(save_dir=None)¶ Load weights from save_dir. Signatures need to be re-initialized after the weights are loaded.
-
save_weights(save_dir=None)¶ Save model weights in folder directory
-
src.loss_functions module¶
-
src.loss_functions.triplet_loss(anchor_vector, positive_vector, negative_vector, metric='cosine_dist', margin=0.009)¶ Computes the triplet loss with semi-hard negative mining. The loss encourages the positive distance (between a pair of embeddings with the same label) to be smaller than the smallest negative distance in the mini-batch that is still greater than the positive distance plus the margin constant (the semi-hard negative). If no such negative exists, the largest negative distance is used instead. See: https://arxiv.org/abs/1503.03832.
Parameters: - anchor_vector (tf.Tensor) – The anchor vector in this use case should be the encoded query.
- positive_vector (tf.Tensor) – The positive vector in this use case should be the encoded response.
- negative_vector (tf.Tensor) – The negative vector in this use case should be the wrong encoded response.
- metric (str) – Distance metric used to compare the vectors. Default is ‘cosine_dist’.
- margin (float) – Margin parameter in loss function. See link above.
Returns: the triplet loss value, as a tf.float32 scalar.
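As an illustration of the quantity being minimised, the sketch below computes a plain per-triplet hinge loss with cosine distance. It deliberately omits the semi-hard negative mining step and works on Python lists rather than tf.Tensor inputs; `simple_triplet_loss` and `cosine_distance` are hypothetical names, not functions from this module.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def simple_triplet_loss(anchors, positives, negatives, margin=0.009):
    """Mean hinge loss over a batch of (anchor, positive, negative) triplets."""
    losses = []
    for a, p, n in zip(anchors, positives, negatives):
        d_pos = cosine_distance(a, p)  # query vs. correct response
        d_neg = cosine_distance(a, n)  # query vs. wrong response
        # Zero loss once the negative is further away than the positive
        # by at least the margin.
        losses.append(max(d_pos - d_neg + margin, 0.0))
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.1], [0.1, 1.0]]
negatives = [[0.0, 1.0], [1.0, 0.0]]

# Positives are close to the anchors, negatives are orthogonal: no loss.
loss_easy = simple_triplet_loss(anchors, positives, negatives)
# Swapping the roles makes every triplet violate the margin.
loss_hard = simple_triplet_loss(anchors, negatives, positives)
```

With semi-hard mining, the negative for each anchor would be chosen from the mini-batch rather than supplied one-to-one as it is here.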
src.minio_handler module¶
-
class
src.minio_handler.MinioClient(url_endpoint, access_key, secret_key)¶ Bases:
object-
download_emb_index(bucket_name, emb_obj_name, emb_file_path)¶
-
download_model_weights(bucket_name, model_obj_name, model_file_path)¶
-
make_bucket(bucket_name)¶
-
rm_bucket(bucket_name)¶
-
upload_emb_index(bucket_name, emb_obj_name, emb_file_path)¶
-
upload_model_weights(bucket_name, model_obj_name, model_file_path)¶
-
src.models module¶
-
class
src.models.GoldenRetriever(encoder)¶ Bases:
src.models.Model-
export_encoder(save_dir)¶ Path should include partial filename. https://www.tensorflow.org/api_docs/python/tf/saved_model/save
-
finetune(question, answer, margin=0.3, loss='triplet', context=[], neg_answer=[], neg_answer_context=[], label=[])¶ finetunes encoder
-
load_kb(kb_)¶ Load the knowledge base or bases
Parameters: kb_ – kb object as defined in kb_handler
-
make_query(querystring, top_k=5, index=False, predict_type='query', kb_name='default_kb')¶ Make a query against the stored vectorized knowledge.
Parameters: - predict_type (str) – can be ‘query’ or ‘response’. Used to compare statements
- kb_name (str) – the name of knowledge base in the knowledge dictionary
- index (boolean) – Choose index=True to return sorted index of matches.
Returns: Top K vectorized answers and their scores
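The retrieval step can be pictured as ranking stored response vectors by similarity to the query vector. The sketch below is a simplified illustration under stated assumptions: `top_k_responses` is a hypothetical helper, cosine similarity is assumed as the scoring function, and the return shapes are guesses rather than the documented behaviour of make_query.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_responses(query_vector, response_vectors, responses, top_k=5, index=False):
    """Rank responses by similarity to the query and keep the best top_k."""
    scores = [cosine_similarity(query_vector, v) for v in response_vectors]
    # Indices of the responses, best match first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    if index:
        return order
    return [(responses[i], scores[i]) for i in order]


responses = ["answer a", "answer b", "answer c"]
response_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [1.0, 0.1]

best_two = top_k_responses(query_vector, response_vectors, responses, top_k=2)
best_two_idx = top_k_responses(query_vector, response_vectors, responses, top_k=2, index=True)
```

Here index=True returns positions into the knowledge base instead of the response strings, mirroring the index flag described above.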
-
predict(text, context=None, string_type='response')¶ encode method of encoder will be used to vectorize texts
-
restore_encoder(save_dir)¶ Restore the encoder from save_dir. Signatures need to be re-initialized after the weights are loaded.
-
-
class
src.models.Model¶ Bases:
abc.ABC
A shared model interface where each model should provide finetune, predict, make_query, export_encoder, restore_encoder methods
-
export_encoder()¶ export finetuned weights
-
finetune()¶ finetunes encoder
-
load_kb()¶ load and encode knowledge bases to return predictions
-
make_query()¶ uses predict method to vectorize texts and provides relevant responses based on given specifications (eg. num responses) to user
-
predict()¶ encode method of encoder will be used to vectorize texts
-
restore_encoder()¶ restores encoder with finetuned weights
-
src.prebuilt_index module¶
-
class
src.prebuilt_index.SimpleNNIndex(emb_dim_size, metric='angular')¶ Bases:
simpleneighbors.SimpleNeighbors
Simple Neighbors Index for calculating similarity between queries and responses vectorized by Golden Retriever
This class wraps the SimpleNeighbors python package. SimpleNeighbors will select a backend implementation depending on what packages are available in your environment. Therefore it is recommended that you install Annoy (pip install annoy) to enable the Annoy backend.
Parameters: - emb_dim_size – number of dimensions in the data (eg. 512)
- metric – distance metric to use. Default is ‘angular’, which is an approximation of cosine distance
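To make the relationship between the ‘angular’ metric and cosine distance concrete, the sketch below assumes the Annoy definition of angular distance, namely the Euclidean distance between the normalized vectors, which equals sqrt(2 * (1 - cos)). The helper names are hypothetical, and other SimpleNeighbors backends may define the metric differently.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def angular_distance(u, v):
    """Euclidean distance between the unit-length versions of u and v."""
    nu, nv = normalize(u), normalize(v)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(nu, nv)))


def cosine_distance(u, v):
    """Return 1 - cosine similarity."""
    nu, nv = normalize(u), normalize(v)
    return 1.0 - sum(a * b for a, b in zip(nu, nv))


query = [1.0, 0.0]
near = [1.0, 0.2]
far = [0.0, 1.0]

# Both metrics order the candidates the same way.
same_ranking = (angular_distance(query, near) < angular_distance(query, far)) == (
    cosine_distance(query, near) < cosine_distance(query, far)
)
```

Because one quantity is a monotonic function of the other, nearest-neighbour rankings under this angular distance match rankings under cosine distance, even though the raw values differ.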
-
build(sentences, sentence_embeddings)¶ builds precomputed vector index from QA responses. uses the Annoy library by default.
Parameters: - sentences – responses in string form
- sentence_embeddings – responses in embedding form
Returns: simpleneighbors index for nearest neighbors vector lookup
-
classmethod
load(prefix)¶ restores a previously-saved index
Parameters: prefix – prefix used when saving index
Returns: SimpleNNIndex object restored from specified files
-
query(query_embeddings, num_nbrs)¶ finds response closest to the query vector
The query vector should have the same number of dimensions as the index. Search is limited to the given number of items. Results are given in order of proximity.
Parameters: - query_embeddings – query in embedding form
- num_nbrs – number of results to return
Returns: list of items sorted by proximity
-
save(index_prefix)¶ Saves the index to disk. With the Annoy backend, two files are produced: the serialized Annoy index and a pickle with other data from the object
Parameters: index_prefix – filename prefix for the Annoy index and object data
Returns: None