src.tokenizers package

Submodules

src.tokenizers.bert_tokenization module

Tokenization classes implementation.

The file is forked from: https://github.com/google-research/bert/blob/master/tokenization.py.

class src.tokenizers.bert_tokenization.BasicTokenizer(do_lower_case=True)

Bases: object

Runs basic tokenization (punctuation splitting, lower casing, etc.).

tokenize(text)

Tokenizes a piece of text.
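As a rough illustration of the punctuation splitting and lower casing described above, here is a minimal sketch. The name `basic_tokenize_sketch` is hypothetical, and the sketch is deliberately simplified: it treats only Unicode punctuation categories as split points and does not cover anything else the real class may do.

```python
import unicodedata


def basic_tokenize_sketch(text, do_lower_case=True):
    """Illustrative only: split on whitespace, then split off punctuation."""
    tokens = []
    for word in text.strip().split():
        if do_lower_case:
            word = word.lower()
        current = ""
        for ch in word:
            # Unicode general categories starting with "P" are punctuation.
            if unicodedata.category(ch).startswith("P"):
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)  # each punctuation mark becomes its own token
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens


print(basic_tokenize_sketch("Hello, World!"))  # ['hello', ',', 'world', '!']
```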

class src.tokenizers.bert_tokenization.FullTokenizer(vocab_file, do_lower_case=True)

Bases: object

Runs end-to-end tokenization.

convert_ids_to_tokens(ids)
convert_tokens_to_ids(tokens)
tokenize(text)
class src.tokenizers.bert_tokenization.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)

Bases: object

Runs WordPiece tokenization.

tokenize(text)

Tokenizes a piece of text into its word pieces.

This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.

For example:
    input = "unaffable"
    output = ["un", "##aff", "##able"]

Args:
    text: A single token or whitespace-separated tokens. This should have
        already been passed through BasicTokenizer.

Returns:
    A list of wordpiece tokens.
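The greedy longest-match-first idea can be sketched for a single word as follows. This is a minimal illustration, not the module's code: the function name `wordpiece_sketch` and the toy vocabulary are assumptions, while the `unk_token` and `max_input_chars_per_word` defaults mirror the class signature above.

```python
def wordpiece_sketch(word, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
    """Illustrative greedy longest-match-first split of one word."""
    if len(word) > max_input_chars_per_word:
        return [unk_token]
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matched at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able"}  # hypothetical vocabulary
print(wordpiece_sketch("unaffable", toy_vocab))  # ['un', '##aff', '##able']
```

In this sketch a word that cannot be fully covered by vocabulary pieces is mapped to the unknown token as a whole, which is one common design choice.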
src.tokenizers.bert_tokenization.convert_by_vocab(vocab, items)

Converts a sequence of [tokens|ids] using the vocab.
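A lookup of this kind can be sketched with a plain dictionary. The name `convert_by_vocab_sketch` and the toy vocabulary are hypothetical; the point is that the same helper works in both directions, depending on whether it receives the token-to-id mapping or its inverse.

```python
def convert_by_vocab_sketch(vocab, items):
    """Illustrative only: map each item through the given dictionary."""
    return [vocab[item] for item in items]


toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3}  # hypothetical
inv_vocab = {index: token for token, index in toy_vocab.items()}

ids = convert_by_vocab_sketch(toy_vocab, ["[CLS]", "hello", "[SEP]"])
print(ids)  # [1, 3, 2]
print(convert_by_vocab_sketch(inv_vocab, ids))  # ['[CLS]', 'hello', '[SEP]']
```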

src.tokenizers.bert_tokenization.convert_ids_to_tokens(inv_vocab, ids)
src.tokenizers.bert_tokenization.convert_to_unicode(text)

Converts text to Unicode (if it’s not already), assuming utf-8 input.

src.tokenizers.bert_tokenization.convert_tokens_to_ids(vocab, tokens)
src.tokenizers.bert_tokenization.get_ids(tokens, tokenizer, max_seq_length)

Returns the ids of the tokens in the tokenizer's vocabulary.

src.tokenizers.bert_tokenization.get_masks(tokens, max_seq_length)

Returns the mask that marks padding positions.
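One common way to build such a mask is to mark real tokens with 1 and padding positions with 0. The sketch below assumes that convention, and assumes that an over-long input is an error; the name `padding_mask_sketch` is hypothetical and the module's actual behavior may differ.

```python
def padding_mask_sketch(tokens, max_seq_length):
    """Illustrative only: 1 for each real token, 0 for each padding slot."""
    if len(tokens) > max_seq_length:
        # Assumption: inputs longer than max_seq_length are rejected.
        raise ValueError("more tokens than max_seq_length")
    return [1] * len(tokens) + [0] * (max_seq_length - len(tokens))


print(padding_mask_sketch(["[CLS]", "hi", "[SEP]"], 5))  # [1, 1, 1, 0, 0]
```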

src.tokenizers.bert_tokenization.get_segments(tokens, max_seq_length)

Returns the segment ids: 0 for tokens in the first sequence, 1 for tokens in the second.
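The segment assignment can be sketched by switching from 0 to 1 after the first separator token. The name `segment_ids_sketch`, the `sep_token` parameter, and the choice to pad with 0 are assumptions made for illustration.

```python
def segment_ids_sketch(tokens, max_seq_length, sep_token="[SEP]"):
    """Illustrative only: 0 up to and including the first separator, then 1."""
    segments = []
    current = 0
    for token in tokens:
        segments.append(current)
        if token == sep_token:
            current = 1  # everything after the first separator is segment 1
    # Assumption: padding positions are filled with 0.
    return segments + [0] * (max_seq_length - len(tokens))


pair = ["[CLS]", "a", "[SEP]", "b", "[SEP]"]
print(segment_ids_sketch(pair, 7))  # [0, 0, 0, 1, 1, 0, 0]
```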

src.tokenizers.bert_tokenization.load_vocab(vocab_file)

Loads a vocabulary file into a dictionary.
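A vocabulary file in the BERT style holds one token per line, and the line number serves as the token id. The sketch below assumes that layout; the name `load_vocab_sketch` and the temporary demo file are hypothetical.

```python
import collections
import os
import tempfile


def load_vocab_sketch(vocab_file):
    """Illustrative only: map each line's token to its zero-based line index."""
    vocab = collections.OrderedDict()
    with open(vocab_file, encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.strip()] = index
    return vocab


# Write a tiny demo vocabulary to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("[PAD]\n[UNK]\nhello\n")
    demo_vocab = load_vocab_sketch(path)

print(demo_vocab["hello"])  # 2
```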

src.tokenizers.bert_tokenization.preprocess_one_str(text, max_seq_length, tokenizer)

Converts a string into the three arrays that BERT takes as input: tokens, masks, and segments.

src.tokenizers.bert_tokenization.preprocess_str(text, max_seq_length, tokenizer)

Preprocess string inputs or list of string inputs into their respective lists of int32 arrays.

Intended to work whether the input is a single string or an iterable of strings.
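Accepting either a string or an iterable of strings usually comes down to normalizing the input first. The helper below is a hypothetical sketch of that step only, not the function's actual code. The check for `str` has to come first because a string is itself iterable.

```python
def as_list_of_str_sketch(text):
    """Illustrative only: wrap a lone string, otherwise materialize the iterable."""
    if isinstance(text, str):
        return [text]
    return list(text)


print(as_list_of_str_sketch("one sentence"))      # ['one sentence']
print(as_list_of_str_sketch(("first", "second")))  # ['first', 'second']
```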

src.tokenizers.bert_tokenization.printable_text(text)

Returns text encoded in a way suitable for print or tf.logging.

src.tokenizers.bert_tokenization.truncate_str(text, max_seq_length)
src.tokenizers.bert_tokenization.validate_case_matches_checkpoint(do_lower_case, init_checkpoint)

Checks whether the casing config is consistent with the checkpoint name.

src.tokenizers.bert_tokenization.whitespace_tokenize(text)

Runs basic whitespace cleaning and splitting on a piece of text.
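Whitespace cleaning and splitting of this kind can be sketched in a few lines. The name `whitespace_tokenize_sketch` is hypothetical; the sketch strips leading and trailing whitespace and splits on any run of whitespace.

```python
def whitespace_tokenize_sketch(text):
    """Illustrative only: strip the text, then split on runs of whitespace."""
    text = text.strip()
    if not text:
        return []
    return text.split()


print(whitespace_tokenize_sketch("  hello \t world\n"))  # ['hello', 'world']
```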

Module contents