Data Handlers

GoldenRetriever’s knowledge bases can be preprocessed, parsed and uploaded into Elasticsearch with the following subpackages.

Elasticsearch Interface

Knowledge Base Handler

kb_handler converts knowledge bases of various data types (txt, csv, sql) into a kb object. This kb object may then be used for finetuning and evaluation.

src.data_handler.kb_handler.generate_mappings(responses, queries)

Generate a list of list mappings between responses and queries. The lengths of responses and queries must be the same. Note that the arguments are given as responses then queries, but the returned mappings list queries then responses, for convenient use in downstream scripts.

Parameters:
  • responses (pd.Series) – contains response strings, may be non-unique
  • queries (pd.Series) – contains query strings, may be non-unique
Returns:

mappings, a list of lists of ints containing the mappings between queries and responses
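
The argument order and the return order are easy to mix up, so a minimal sketch of the idea in plain Python may help. It is not the package's implementation: sketch_generate_mappings and _unique_index are hypothetical names, plain lists stand in for the pd.Series arguments, and it assumes each mapping is a [query index, response index] pair into the de-duplicated queries and responses.

```python
def _unique_index(values):
    """Map each value to the position of its first occurrence among the unique values."""
    positions = {}
    for value in values:
        # setdefault only assigns a new position the first time a value is seen
        positions.setdefault(value, len(positions))
    return [positions[value] for value in values]


def sketch_generate_mappings(responses, queries):
    """Hypothetical stand-in: responses come first, pairs come back as [query, response]."""
    if len(responses) != len(queries):
        raise ValueError("responses and queries must have the same length")
    response_idx = _unique_index(responses)
    query_idx = _unique_index(queries)
    return [[q, r] for q, r in zip(query_idx, response_idx)]


responses = ["Reset it online.", "Reset it online.", "Call the helpdesk."]
queries = ["How do I reset my password?", "I forgot my password", "Who do I call?"]
mappings = sketch_generate_mappings(responses, queries)
print(mappings)  # [[0, 0], [1, 0], [2, 1]]
```

Here two different queries point at the same response, so the response index 0 appears twice while each query index appears once.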

class src.data_handler.kb_handler.kb(name, responses, queries, mapping, vectorised_responses=None)

Bases: object

create_df()

Create pandas DataFrame in a similar format to dataloader.py which may be used for finetuning and evaluation.

Importantly, if there are many-to-many matches between queries and responses, the returned DataFrame will contain duplicates.

Returns: pd.DataFrame that contains the columns query_string, processed_string, kb_name
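
To see why many-to-many matches produce duplicates, the following sketch expands a mapping into rows using plain dicts instead of a DataFrame. It is only an illustration: sketch_rows is a hypothetical name, and it assumes that each [query index, response index] pair yields one row and that processed_string holds the response text.

```python
def sketch_rows(kb_name, responses, queries, mapping):
    """Hypothetical: build one row per [query_idx, response_idx] pair."""
    return [
        {
            "query_string": queries[q],
            "processed_string": responses[r],  # assumption: this column holds the response
            "kb_name": kb_name,
        }
        for q, r in mapping
    ]


responses = ["Reset it online.", "Call the helpdesk."]
queries = ["I forgot my password", "Who can help me?"]
# The first query maps to both responses, and both queries map to the second response.
mapping = [[0, 0], [0, 1], [1, 1]]

rows = sketch_rows("it_faq", responses, queries, mapping)
for row in rows:
    print(row)
```

Two queries and two responses yield three rows, so the first query string and the second response string each appear twice.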

json(hashkey=None)

Create a JSON dict for use with a Flask endpoint

class src.data_handler.kb_handler.kb_handler

Bases: object

kb_handler loads knowledge bases from text files

load_es_kb(kb_names=[])

Load the knowledge bases from elasticsearch

Parameters: kb_names (list) – names of the specific knowledge bases to parse; if empty, all of them are parsed
Returns: list of kb class objects

parse_csv(path, answer_col='', query_col='', context_col='', kb_name='')

Parse a CSV file into kb format. As pandas can leverage csv.Sniffer to parse the CSV, this function leverages pandas.

Parameters:
  • kb_name (str) – name of output kb object
  • path (str) – path to the CSV file that contains the queries, responses and context strings
  • answer_col (str) – column name string that points to responses
  • query_col (str) – column name string that points to queries
  • context_col (str) – column name string that points to context strings
Returns:

kb class object
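
The delimiter sniffing mentioned above can be tried out with the standard library alone. The sketch below is not how parse_csv is implemented: sketch_parse_csv is a hypothetical helper, it reads from a string rather than a path, and it returns plain lists rather than a kb object. It only shows csv.Sniffer detecting the separator before the answer and query columns are selected by name.

```python
import csv
import io


def sketch_parse_csv(text, answer_col, query_col=""):
    """Hypothetical helper: sniff the delimiter, then pull out the named columns."""
    # Restricting the candidate delimiters makes sniffing more predictable
    dialect = csv.Sniffer().sniff(text, delimiters=",;\t|")
    rows = list(csv.DictReader(io.StringIO(text), dialect=dialect))
    responses = [row[answer_col] for row in rows]
    queries = [row[query_col] for row in rows] if query_col else []
    return dialect.delimiter, responses, queries


sample = (
    "question;answer;topic\n"
    "How do I reset my password?;Reset it online.;accounts\n"
    "Who do I call?;Call the helpdesk.;support\n"
    "Where is the office?;See the contact page.;general\n"
)

delimiter, responses, queries = sketch_parse_csv(
    sample, answer_col="answer", query_col="question"
)
print(repr(delimiter))
print(responses)
print(queries)
```

Leaving query_col empty in this sketch returns an empty query list, mirroring the empty-string default in the signature above.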

parse_df(kb_name, df, answer_col, query_col='', context_col='')

Parse a pandas DataFrame into responses, queries and mappings

Parameters:
  • kb_name (str) – name of kb to be held in kb object
  • df (pd.DataFrame) – contains the queries, responses and context strings
  • answer_col (str) – column name string that points to responses
  • query_col (str) – column name string that points to queries
  • context_col (str) – column name string that points to context strings
Returns:

kb object

parse_pdf(PDF_file_path, header='', NumOfAppendix=0, kb_name='pdf_kb')

Convert a PDF into a DataFrame whose columns are the index number and the paragraphs.

Parameters:
  • PDF_file_path (str) – The filename and path of pdf
  • header (str) – header text to remove from each page
  • NumOfAppendix (int) – number of appendices to remove after the main content
  • kb_name (str) – Name of returned kb object
Returns:

kb class object

parse_text(path, clause_sep='/n', inner_clause_sep='', query_idx=None, context_idx=None, kb_name=None)

Parse text file from kb path into query, response and mappings

Parameters:
  • path (str) – path to txt file, or raw text
  • clause_sep (str) – When the query or context string is encoded within the first few sentences of a clause, inner_clause_sep separates those sentences, and query_idx and context_idx select the query and context strings respectively
  • inner_clause_sep (str) – See clause_sep
  • query_idx (int) – See clause_sep
  • context_idx (int) – See clause_sep
  • kb_name (str) – name of output kb object
Returns:

kb class object
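
How the four separator and index arguments interact is easier to see on a small input. The following is a sketch under stated assumptions, not the parse_text implementation: sketch_split_clauses is a hypothetical name, it works on raw text only, it returns dicts rather than a kb object, and it assumes the whole clause is kept as the response.

```python
def sketch_split_clauses(text, clause_sep="\n", inner_clause_sep="",
                         query_idx=None, context_idx=None):
    """Hypothetical: split text into clauses and pick query/context sentences by index."""
    records = []
    for clause in text.split(clause_sep):
        clause = clause.strip()
        if not clause:
            continue  # skip empty clauses left by repeated separators
        query = context = None
        if inner_clause_sep:
            parts = [part.strip() for part in clause.split(inner_clause_sep)]
            if query_idx is not None:
                query = parts[query_idx]
            if context_idx is not None:
                context = parts[context_idx]
        # Assumption: the full clause is kept as the response
        records.append({"response": clause, "query": query, "context": context})
    return records


text = (
    "Leave policy. Staff get 14 days of leave.\n\n"
    "Claims policy. Submit claims within 30 days."
)
records = sketch_split_clauses(text, clause_sep="\n\n", inner_clause_sep=". ", query_idx=0)
for record in records:
    print(record)
```

With query_idx=0, the first sentence of each clause becomes its query string, and context stays None because context_idx was not given.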

preview(path, N=20)

Print the first N lines of the file in path
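
A preview like this can be written so that only the first N lines are read, which matters for large knowledge base files. The sketch below is a hypothetical stand-in (sketch_preview is not the package's method) and assumes a UTF-8 text file.

```python
import os
import tempfile
from itertools import islice


def sketch_preview(path, n=20):
    """Hypothetical stand-in: print and return the first n lines of a text file."""
    with open(path, encoding="utf-8") as handle:
        # islice stops after n lines, so the rest of the file is never read
        lines = [line.rstrip("\n") for line in islice(handle, n)]
    for line in lines:
        print(line)
    return lines


# Write a throwaway file so the example runs on its own
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("\n".join(f"clause {i}" for i in range(50)))
    tmp_path = tmp.name

first_three = sketch_preview(tmp_path, n=3)
os.remove(tmp_path)
```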

src.data_handler.kb_handler.unique_indexing(non_unique)

Convert a non-unique string pd.Series into a list of indices into its list of unique values

Parameters: non_unique (pd.Series) – contains non-unique values
Returns: list containing, for each value in non_unique, its index in the list of unique values
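
A small worked example makes the return value concrete. This is a plain-Python sketch rather than the package's code: sketch_unique_indexing is a hypothetical name, a list stands in for the pd.Series, and unique values are assumed to be numbered in order of first appearance.

```python
def sketch_unique_indexing(non_unique):
    """Hypothetical: replace each value with its index in the ordered list of unique values."""
    unique = list(dict.fromkeys(non_unique))  # de-duplicate, keeping first-seen order
    return [unique.index(value) for value in non_unique]


values = ["apple", "pear", "apple", "fig", "pear"]
indices = sketch_unique_indexing(values)
print(indices)  # [0, 1, 0, 2, 1]
```

The output has the same length as the input, and repeated values share an index.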