Data Handlers
GoldenRetriever’s knowledge bases can be preprocessed, parsed and uploaded into Elasticsearch with the following subpackages.
Elasticsearch Interface
Knowledge Base Handler
kb_handler converts knowledge bases in various data formats (txt, csv, sql) into a kb object. This kb object may then be used for finetuning and evaluation.
-
src.data_handler.kb_handler.generate_mappings(responses, queries) Generate a list-of-lists mapping between responses and queries. responses and queries must have the same length. Note that the arguments are passed as responses then queries, but each returned mapping lists the query first and then the response, for convenient use in downstream scripts.
Parameters: - responses (pd.Series) – contains response strings, may be non-unique
- queries (pd.Series) – contains query strings, may be non-unique
Returns: mappings, a list of lists of ints that maps queries to responses
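The query-then-response ordering described above can be illustrated with a minimal sketch. This is not the library's implementation: sketch_generate_mappings is a hypothetical name, and the use of pd.factorize to assign integer ids is an assumption made only for illustration.

```python
import pandas as pd


def sketch_generate_mappings(responses: pd.Series, queries: pd.Series) -> list:
    """Hypothetical re-implementation for illustration; not the library's code."""
    if len(responses) != len(queries):
        raise ValueError("responses and queries must have the same length")
    # factorize gives each distinct string an integer id, in order of first appearance
    response_ids, _ = pd.factorize(responses)
    query_ids, _ = pd.factorize(queries)
    # Arguments arrive as (responses, queries), but each pair is [query_id, response_id]
    return [[int(q), int(r)] for q, r in zip(query_ids, response_ids)]


responses = pd.Series(["Reset it online", "Reset it online", "Call support"])
queries = pd.Series(["How do I reset?", "Forgot password", "Who do I call?"])
mappings = sketch_generate_mappings(responses, queries)
print(mappings)  # [[0, 0], [1, 0], [2, 1]]
```

Here the first two queries share response 0, which is why the response strings are allowed to be non-unique.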
-
class src.data_handler.kb_handler.kb(name, responses, queries, mapping, vectorised_responses=None)
Bases: object
-
create_df() Create a pandas DataFrame in a format similar to that of dataloader.py, which may be used for finetuning and evaluation.
Importantly, if there are many-to-many matches between queries and responses, the returned DataFrame will contain duplicates.
Returns: pd.DataFrame that contains the columns query_string, processed_string, kb_name
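To see why many-to-many matches produce duplicates, consider the following sketch. It does not call create_df; it builds a comparable DataFrame by hand, and it assumes that processed_string holds the response text and that each mapping is a [query index, response index] pair. The data and the name demo_kb are hypothetical.

```python
import pandas as pd

# Hypothetical data: query 0 maps to both responses, and response 0 serves two queries
queries = ["How do I reset?", "Forgot password", "Who do I call?"]
responses = ["Reset it online", "Call support"]
mappings = [[0, 0], [1, 0], [2, 1], [0, 1]]  # assumed [query index, response index]

# One row per mapping, so a string involved in several mappings is repeated
rows = [
    {"query_string": queries[q], "processed_string": responses[r], "kb_name": "demo_kb"}
    for q, r in mappings
]
df = pd.DataFrame(rows)

print(df)
print(df["query_string"].duplicated().sum())      # repeated query strings
print(df["processed_string"].duplicated().sum())  # repeated response strings
```

Downstream code that needs unique queries or responses would have to deduplicate on the relevant column.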
-
json(hashkey=None) Create a JSON dict for use with a Flask endpoint
-
-
class src.data_handler.kb_handler.kb_handler
Bases: object
kb_handler loads knowledge bases from text files
-
load_es_kb(kb_names=[]) Load the knowledge bases from Elasticsearch
Parameters: kb_names (list) – names of the specific knowledge bases to parse; if empty, all of them are parsed
Returns: list of kb class objects
-
parse_csv(path, answer_col='', query_col='', context_col='', kb_name='') Parse a CSV file into kb format. The file is read with pandas, which can use Python's csv.Sniffer to detect the delimiter.
Parameters: - kb_name (str) – name of output kb object
- path (str) – path to the CSV file that contains the queries, responses and context strings
- answer_col (str) – column name string that points to responses
- query_col (str) – column name string that points to queries
- context_col (str) – column name string that points to context strings
Returns: kb class object
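The delimiter detection mentioned for parse_csv can be reproduced with pandas alone. The sketch below is an illustration under assumptions: whether parse_csv passes sep=None and engine="python" is not stated here, and the sample data and column names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited knowledge base
raw = (
    "question;answer\n"
    "How do I reset?;Reset it online\n"
    "Who do I call?;Call support\n"
)

# With sep=None and the Python engine, pandas sniffs the delimiter via csv.Sniffer
df = pd.read_csv(io.StringIO(raw), sep=None, engine="python")

print(list(df.columns))
print(df)
```

The resulting column names are what would then be supplied as query_col and answer_col.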
-
parse_df(kb_name, df, answer_col, query_col='', context_col='') Parse a pandas DataFrame into responses, queries and mappings
Parameters: - kb_name (str) – name of kb to be held in kb object
- df (pd.DataFrame) – contains the queries, responses and context strings
- answer_col (str) – column name string that points to responses
- query_col (str) – column name string that points to queries
- context_col (str) – column name string that points to context strings
Returns: kb object
-
parse_pdf(PDF_file_path, header='', NumOfAppendix=0, kb_name='pdf_kb') Convert a PDF into a DataFrame whose columns are the index number and the paragraphs.
Parameters: - PDF_file_path (str) – the filename and path of the PDF
- header (str) – header text to remove from each page
- NumOfAppendix (int) – used to remove the appendix after the main content
- kb_name (str) – Name of returned kb object
Returns: kb class object
-
parse_text(path, clause_sep='/n', inner_clause_sep='', query_idx=None, context_idx=None, kb_name=None) Parse a text file from the kb path into queries, responses and mappings
Parameters: - path (str) – path to txt file, or raw text
- clause_sep (str) – separator between clauses. If the query or context string is encoded within the first few sentences of a clause, inner_clause_sep may be used to separate the sentences, and query_idx and context_idx then select the query and context strings accordingly
- inner_clause_sep (str) – See clause_sep
- query_idx (int) – See clause_sep
- context_idx (int) – See clause_sep
- kb_name (str) – name of output kb object
Returns: kb class object
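How the two separators and the two indices interact can be sketched as follows. This is only one reading of the parameter descriptions above, not the library's parse_text: sketch_split_clauses is a hypothetical helper, the separators used in the example are chosen for illustration, and treating the whole clause as the response is an assumption.

```python
def sketch_split_clauses(text, clause_sep="\n\n", inner_clause_sep="\n",
                         query_idx=None, context_idx=None):
    """Hypothetical illustration of clause splitting; not the library's code."""
    records = []
    for clause in text.split(clause_sep):
        clause = clause.strip()
        if not clause:
            continue  # skip empty clauses left by trailing separators
        # An empty inner separator means the clause is not split any further
        parts = clause.split(inner_clause_sep) if inner_clause_sep else [clause]
        query = parts[query_idx] if query_idx is not None else None
        context = parts[context_idx] if context_idx is not None else None
        # Assumption: the full clause is kept as the response
        records.append({"query": query, "context": context, "response": clause})
    return records


text = (
    "Refunds\nRefunds are issued within 14 days.\n\n"
    "Shipping\nOrders ship in 2 days."
)
records = sketch_split_clauses(text, query_idx=0)
for record in records:
    print(record["query"], "->", record["response"])
```

With query_idx=0, the first sentence of each clause acts as its query; leaving both indices as None yields responses only.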
-
preview(path, N=20)¶ Print the first N lines of the file in path
-
-
src.data_handler.kb_handler.unique_indexing(non_unique) Convert a non-unique string pd.Series into a list of indices into its list of unique values
Parameters: non_unique (pd.Series) – contains non-unique values
Returns: list that contains, for each value in non_unique, its index in the list of unique values
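A minimal sketch of this kind of indexing, assuming unique values are numbered in order of first appearance (the ordering is not specified here, and sketch_unique_indexing is a hypothetical name rather than the library's function):

```python
import pandas as pd


def sketch_unique_indexing(non_unique: pd.Series) -> list:
    """Hypothetical illustration; not the library's code."""
    # Series.unique() keeps values in order of first appearance
    unique_values = list(non_unique.unique())
    position = {value: i for i, value in enumerate(unique_values)}
    # Replace every value by the position of its unique counterpart
    return [position[value] for value in non_unique]


series = pd.Series(["a", "b", "a", "c", "b"])
indices = sketch_unique_indexing(series)
print(indices)  # [0, 1, 0, 2, 1]
```

The output has the same length as the input, so it can be zipped with another indexed series of the same length.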