src.tokenizers package¶
Submodules¶
src.tokenizers.bert_tokenization module¶
Tokenization classes implementation.
The file is forked from: https://github.com/google-research/bert/blob/master/tokenization.py.
-
class src.tokenizers.bert_tokenization.BasicTokenizer(do_lower_case=True)¶ Bases: object
Runs basic tokenization (punctuation splitting, lower casing, etc.).
-
tokenize(text)¶ Tokenizes a piece of text.
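A minimal sketch of what "punctuation splitting, lower casing" can look like in practice. The function name `basic_tokenize_sketch` is hypothetical, and the punctuation rule (any Unicode category starting with "P") is an assumption; the actual class may handle accents, control characters, and CJK text differently.

```python
import unicodedata


def basic_tokenize_sketch(text, do_lower_case=True):
    """Split on whitespace, optionally lowercase, and isolate punctuation.

    Illustrative only: this is not the module's implementation.
    """
    tokens = []
    for word in text.strip().split():
        if do_lower_case:
            word = word.lower()
        current = ""
        for ch in word:
            # Assumption: treat every Unicode "P*" category as punctuation.
            if unicodedata.category(ch).startswith("P"):
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)  # each punctuation mark is its own token
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens


print(basic_tokenize_sketch("Hello, World!"))
# ['hello', ',', 'world', '!']
```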
-
-
class src.tokenizers.bert_tokenization.FullTokenizer(vocab_file, do_lower_case=True)¶ Bases: object
Runs end-to-end tokenization.
-
convert_ids_to_tokens(ids)¶
-
convert_tokens_to_ids(tokens)¶
-
tokenize(text)¶
-
-
class src.tokenizers.bert_tokenization.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)¶ Bases: object
Runs WordPiece tokenization.
-
tokenize(text)¶ Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.
- For example:
- input = "unaffable"
- output = ["un", "##aff", "##able"]
- Args:
- text: A single token or whitespace-separated tokens. This should have already been passed through `BasicTokenizer`.
- Returns:
- A list of wordpiece tokens.
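The greedy longest-match-first procedure described above can be sketched as a standalone function. The name `wordpiece_tokenize_sketch` and the toy vocabulary are hypothetical; the parameters mirror the documented constructor arguments, but this is an illustration under those assumptions, not the class's source.

```python
def wordpiece_tokenize_sketch(text, vocab, unk_token="[UNK]",
                              max_input_chars_per_word=200):
    """Greedy longest-match-first WordPiece split (illustrative sketch)."""
    output_tokens = []
    for token in text.split():
        chars = list(token)
        if len(chars) > max_input_chars_per_word:
            output_tokens.append(unk_token)
            continue

        sub_tokens = []
        start = 0
        is_bad = False
        while start < len(chars):
            end = len(chars)
            cur_substr = None
            # Try the longest remaining substring first, then shrink it.
            while start < end:
                substr = "".join(chars[start:end])
                if start > 0:
                    substr = "##" + substr  # continuation pieces get a prefix
                if substr in vocab:
                    cur_substr = substr
                    break
                end -= 1
            if cur_substr is None:
                # No piece matches at this position: the whole word is unknown.
                is_bad = True
                break
            sub_tokens.append(cur_substr)
            start = end

        if is_bad:
            output_tokens.append(unk_token)
        else:
            output_tokens.extend(sub_tokens)
    return output_tokens


toy_vocab = {"un", "##aff", "##able"}  # hypothetical vocabulary
print(wordpiece_tokenize_sketch("unaffable", toy_vocab))
# ['un', '##aff', '##able']
```

Because matching is greedy, a word maps to the unknown token as soon as any position has no matching piece, even if a different split would have succeeded.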
-
-
src.tokenizers.bert_tokenization.convert_by_vocab(vocab, items)¶ Converts a sequence of [tokens|ids] using the vocab.
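Since one helper serves both directions, a plausible reading is a plain dictionary lookup per item. The sketch below assumes exactly that; `convert_by_vocab_sketch` and the toy vocabulary are hypothetical names used only for illustration.

```python
def convert_by_vocab_sketch(vocab, items):
    """Look up each item in a mapping (token -> id, or id -> token)."""
    # Assumption: an item missing from the mapping raises KeyError.
    return [vocab[item] for item in items]


toy_vocab = {"[UNK]": 0, "un": 1, "##aff": 2, "##able": 3}
toy_inv_vocab = {idx: tok for tok, idx in toy_vocab.items()}

ids = convert_by_vocab_sketch(toy_vocab, ["un", "##aff", "##able"])
tokens = convert_by_vocab_sketch(toy_inv_vocab, ids)
print(ids)     # [1, 2, 3]
print(tokens)  # ['un', '##aff', '##able']
```

Under this reading, `convert_tokens_to_ids` and `convert_ids_to_tokens` would be thin wrappers that pass the forward or inverse vocabulary.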
-
src.tokenizers.bert_tokenization.convert_ids_to_tokens(inv_vocab, ids)¶
-
src.tokenizers.bert_tokenization.convert_to_unicode(text)¶ Converts text to Unicode (if it’s not already), assuming utf-8 input.
-
src.tokenizers.bert_tokenization.convert_tokens_to_ids(vocab, tokens)¶
-
src.tokenizers.bert_tokenization.get_ids(tokens, tokenizer, max_seq_length)¶ Token ids from Tokenizer vocab
-
src.tokenizers.bert_tokenization.get_masks(tokens, max_seq_length)¶ Mask for padding
-
src.tokenizers.bert_tokenization.get_segments(tokens, max_seq_length)¶ Segments: 0 for the first sequence, 1 for the second
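The one-line descriptions of `get_masks` and `get_segments` suggest fixed-length lists padded out to `max_seq_length`. The sketch below is written under that assumption, and additionally assumes that `[SEP]` marks the end of the first sequence and that over-long inputs are rejected; the function names are hypothetical.

```python
def padding_mask_sketch(tokens, max_seq_length):
    """1 for real tokens, 0 for padding positions."""
    if len(tokens) > max_seq_length:
        raise ValueError("token count exceeds max_seq_length")
    return [1] * len(tokens) + [0] * (max_seq_length - len(tokens))


def segment_ids_sketch(tokens, max_seq_length):
    """0 for the first sequence (through its [SEP]), 1 afterwards."""
    if len(tokens) > max_seq_length:
        raise ValueError("token count exceeds max_seq_length")
    segments = []
    current_segment = 0
    for token in tokens:
        segments.append(current_segment)
        if token == "[SEP]":  # assumption: [SEP] closes the first sequence
            current_segment = 1
    # Padding positions are assigned segment 0.
    return segments + [0] * (max_seq_length - len(tokens))


pair = ["[CLS]", "hello", "[SEP]", "world", "[SEP]"]
print(padding_mask_sketch(pair, 8))  # [1, 1, 1, 1, 1, 0, 0, 0]
print(segment_ids_sketch(pair, 8))   # [0, 0, 0, 1, 1, 0, 0, 0]
```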
-
src.tokenizers.bert_tokenization.load_vocab(vocab_file)¶ Loads a vocabulary file into a dictionary.
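As a rough illustration, assuming the vocabulary file holds one token per line and that a token's id is its zero-based line number (the convention used by BERT vocabulary files), loading could look like the sketch below. `load_vocab_sketch` and the temporary file contents are hypothetical.

```python
import collections
import os
import tempfile


def load_vocab_sketch(vocab_file):
    """Read one token per line; the line index becomes the token id."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.strip()] = index
    return vocab


# Write a tiny hypothetical vocabulary file to demonstrate.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as handle:
    handle.write("[PAD]\n[UNK]\nun\n##aff\n##able\n")
    vocab_path = handle.name

loaded_vocab = load_vocab_sketch(vocab_path)
os.remove(vocab_path)
print(loaded_vocab["##aff"])  # 3
```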
-
src.tokenizers.bert_tokenization.preprocess_one_str(text, max_seq_length, tokenizer)¶ Convert strings into the 3 arrays that BERT takes in as input: tokens, masks, segments
-
src.tokenizers.bert_tokenization.preprocess_str(text, max_seq_length, tokenizer)¶ Preprocesses a string, or a list of strings, into the corresponding lists of int32 arrays.
Accepts either a single string or an iterable of strings.
-
src.tokenizers.bert_tokenization.printable_text(text)¶ Returns text encoded in a way suitable for print or tf.logging.
-
src.tokenizers.bert_tokenization.truncate_str(text, max_seq_length)¶
-
src.tokenizers.bert_tokenization.validate_case_matches_checkpoint(do_lower_case, init_checkpoint)¶ Checks whether the casing config is consistent with the checkpoint name.
-
src.tokenizers.bert_tokenization.whitespace_tokenize(text)¶ Runs basic whitespace cleaning and splitting on a piece of text.
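A minimal sketch of whitespace cleaning and splitting, assuming it amounts to trimming the text and splitting on runs of whitespace; `whitespace_tokenize_sketch` is a hypothetical name.

```python
def whitespace_tokenize_sketch(text):
    """Strip surrounding whitespace, then split on runs of whitespace."""
    text = text.strip()
    if not text:
        return []
    return text.split()


print(whitespace_tokenize_sketch("  hello   world \n"))
# ['hello', 'world']
```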