Data Handlers

GoldenRetriever’s knowledge bases can be preprocessed, parsed and uploaded into Elasticsearch with the following subpackages.

Elasticsearch Interface

Knowledge Base Handler

kb_handler converts knowledge bases of various data types (txt, csv, sql) into a kb object. This kb object may then be used for finetuning and evaluation.

src.data_handler.kb_handler.generate_mappings(responses, queries)

Generate a list of list mappings between responses and queries. The length of responses and queries must be the same. Note that the arguments are given as responses then queries, but each returned mapping lists the query then the response, for convenient use in downstream scripts.

Parameters:

responses (pd.Series) – contains response strings, may be non-unique
queries (pd.Series) – contains query strings, may be non-unique

Returns:

mappings, a list of lists of ints, each mapping a query to a response
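As a rough illustration of the idea, the sketch below builds such mappings from two equal-length plain Python lists. It is not the library's implementation: the helper names `unique_positions` and `sketch_generate_mappings` are hypothetical, and it assumes each mapping is a `[query_index, response_index]` pair whose indices point into the de-duplicated queries and responses in first-seen order.

```python
def unique_positions(values):
    """Return (unique_values, positions), where positions[i] is the index
    of values[i] within unique_values (first-seen order)."""
    seen = {}
    positions = []
    for value in values:
        if value not in seen:
            seen[value] = len(seen)
        positions.append(seen[value])
    return list(seen), positions


def sketch_generate_mappings(responses, queries):
    """Hypothetical stand-in: takes responses then queries, and returns
    [query_index, response_index] pairs (an assumed layout)."""
    if len(responses) != len(queries):
        raise ValueError("responses and queries must have the same length")
    _, response_idx = unique_positions(responses)
    _, query_idx = unique_positions(queries)
    return [[q, r] for q, r in zip(query_idx, response_idx)]


# Two different queries share the same response.
responses = ["Reset it online", "Reset it online", "Call support"]
queries = ["forgot password", "lost password", "account locked"]
mappings = sketch_generate_mappings(responses, queries)
print(mappings)  # [[0, 0], [1, 0], [2, 1]]
```

With a pd.Series as input, the same sketch should work after converting each series with `.tolist()`.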

class src.data_handler.kb_handler.kb(name, responses, queries, mapping, vectorised_responses=None)

Bases: object

create_df()

Create pandas DataFrame in a similar format to dataloader.py which may be used for finetuning and evaluation.

Importantly, if there are many-to-many matches between queries and responses, the returned DataFrame will contain duplicated queries and responses.

Returns:	pd.DataFrame that contains the columns query_string, processed_string, kb_name
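To show why many-to-many matches produce duplicates, here is a minimal sketch that expands a mapping into one row per query-response pair. It makes several assumptions that the documentation does not confirm: the function name `sketch_rows` is hypothetical, each mapping is a `[query_index, response_index]` pair, and `processed_string` holds the response text. Plain dicts stand in for DataFrame rows.

```python
def sketch_rows(kb_name, responses, queries, mapping):
    """Expand [query_index, response_index] pairs into row dicts.
    Assumed layout only; not the library's create_df()."""
    rows = []
    for query_idx, response_idx in mapping:
        rows.append({
            "query_string": queries[query_idx],
            "processed_string": responses[response_idx],  # assumed to be the response
            "kb_name": kb_name,
        })
    return rows


responses = ["Reset it online", "Call support"]
queries = ["forgot password", "account locked"]
# Query 0 matches both responses, and response 1 matches both queries.
mapping = [[0, 0], [0, 1], [1, 1]]

rows = sketch_rows("demo_kb", responses, queries, mapping)
for row in rows:
    print(row)
```

Three rows come out of two queries and two responses, so "forgot password" and "Call support" each appear twice. A list of dicts like this can be passed directly to `pd.DataFrame(rows)`.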

json(hashkey=None)

Create a JSON dict to use with a Flask endpoint.

class src.data_handler.kb_handler.kb_handler

Bases: object

kb_handler loads knowledge bases from text, CSV and PDF files, from pandas DataFrames, and from Elasticsearch.

load_es_kb(kb_names=[])

Load the knowledge bases from Elasticsearch.

Parameters:

kb_names (list) – names of the specific knowledge bases to load; if empty, all of them are loaded

Returns:

list of kb class objects

parse_csv(path, answer_col='', query_col='', context_col='', kb_name='')

Parse a CSV file into kb format. Because pandas can rely on Python's csv.Sniffer to detect the CSV dialect, this function uses pandas to read the file.

Parameters:

path (str) – path to the CSV file that contains the queries, responses and context strings
answer_col (str) – name of the column that holds the responses
query_col (str) – name of the column that holds the queries
context_col (str) – name of the column that holds the context strings
kb_name (str) – name of output kb object

Returns:

kb class object
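The dialect detection mentioned above can be tried directly with the standard library. The sketch below is only an illustration of `csv.Sniffer`, not the code inside parse_csv; the sample data and the `query` and `answer` column names are made up. In pandas, `read_csv(..., sep=None, engine="python")` delegates separator detection to the same sniffer, though whether parse_csv calls it that way is an assumption.

```python
import csv
import io

# Hypothetical semicolon-separated knowledge base.
sample = (
    "query;answer\n"
    "What is A?;A is one\n"
    "What is B?;B is two\n"
)

# Restrict the candidate delimiters so the guess is less ambiguous.
dialect = csv.Sniffer().sniff(sample, delimiters=";,")

reader = csv.DictReader(io.StringIO(sample), dialect=dialect)
records = list(reader)

print(repr(dialect.delimiter))
print(records[0])
```

Each record is a dict keyed by the header row, so the column names passed as `answer_col` and `query_col` would correspond to keys such as `answer` and `query` here.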

parse_df(kb_name, df, answer_col, query_col='', context_col='')

Parse a pandas DataFrame into responses, queries and mappings.

Parameters:

kb_name (str) – name of kb to be held in kb object
df (pd.DataFrame) – contains the queries, responses and context strings
answer_col (str) – name of the column that holds the responses
query_col (str) – name of the column that holds the queries
context_col (str) – name of the column that holds the context strings

Returns:

kb object

parse_pdf(PDF_file_path, header='', NumOfAppendix=0, kb_name='pdf_kb')

Convert a PDF into a DataFrame whose columns are the index number and the paragraph text, and return it as a kb object.

Parameters:

PDF_file_path (str) – the filename and path of the PDF
header (str) – header text to remove from each page
NumOfAppendix (int) – number of appendices to remove after the main content
kb_name (str) – name of returned kb object

Returns:

kb class object

parse_text(path, clause_sep='/n', inner_clause_sep='', query_idx=None, context_idx=None, kb_name=None)

Parse text file from kb path into query, response and mappings

Parameters:

path (str) – path to txt file, or raw text
clause_sep (str) – string that separates clauses in the text. If the query or context string is encoded within the first few sentences of a clause, inner_clause_sep may be used to separate those sentences, and query_idx and context_idx then select the query and context strings accordingly
inner_clause_sep (str) – See clause_sep
query_idx (int) – See clause_sep
context_idx (int) – See clause_sep
kb_name (str) – name of output kb object

Returns:

kb class object
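The interplay of the four separator and index parameters is easier to see in code. The following is a minimal sketch under stated assumptions, not the library's parse_text: the function name `sketch_parse_text` is hypothetical, it works on raw text only, it treats every clause as one response, and it returns plain dicts instead of a kb object.

```python
def sketch_parse_text(text, clause_sep, inner_clause_sep="",
                      query_idx=None, context_idx=None):
    """Split text into clauses, then optionally pick a query and a context
    out of each clause by position. Assumed behaviour only."""
    parsed = []
    for clause in text.split(clause_sep):
        clause = clause.strip()
        if not clause:
            continue  # skip empty clauses left by trailing separators
        if inner_clause_sep:
            parts = [p.strip() for p in clause.split(inner_clause_sep)]
        else:
            parts = [clause]
        parsed.append({
            "response": clause,  # assumption: the whole clause is the response
            "query": parts[query_idx] if query_idx is not None else None,
            "context": parts[context_idx] if context_idx is not None else None,
        })
    return parsed


raw = (
    "Refunds | Billing policy | Refunds are issued within 14 days.\n\n"
    "Shipping | Delivery policy | Orders ship within 2 days."
)
parsed = sketch_parse_text(raw, clause_sep="\n\n", inner_clause_sep="|",
                           query_idx=0, context_idx=1)
for item in parsed:
    print(item["query"], "->", item["context"])
```

Leaving inner_clause_sep empty keeps each clause whole, in which case no query or context is extracted unless an index of 0 is given.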

preview(path, N=20)

Print the first N lines of the file in path.

src.data_handler.kb_handler.unique_indexing(non_unique)

Convert a non-unique string pd.Series into a list that gives, for each element, its index in the list of unique values.

Parameters:

non_unique (pd.Series) – contains non-unique values

Returns:

list containing, for each value in non_unique, the index of that value among the unique values
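A small sketch may make the indexing concrete. It assumes that unique values are numbered in order of first appearance, which the documentation does not state, and the name `sketch_unique_indexing` is hypothetical. It takes any iterable of strings rather than a pd.Series.

```python
def sketch_unique_indexing(non_unique):
    """Map each value to the position of its first occurrence among the
    unique values (first-seen order is an assumption)."""
    first_seen = {}
    indices = []
    for value in non_unique:
        if value not in first_seen:
            first_seen[value] = len(first_seen)
        indices.append(first_seen[value])
    return indices


values = ["a", "b", "a", "c", "b"]
indices = sketch_unique_indexing(values)
print(indices)  # [0, 1, 0, 2, 1]
```

For a pd.Series, `pd.factorize(series)[0]` yields the same first-seen codes as a NumPy array.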