Module Saga_tokenizers.Wordpiece: WordPiece tokenization module
WordPiece is the subword tokenization algorithm used by BERT. It splits each word into vocabulary subwords using a greedy longest-match-first strategy, falling back to the unknown token when no match exists.
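To make the greedy longest-match-first loop concrete, here is a minimal standalone sketch. It is an illustration, not this module's implementation; the toy vocabulary, the `##` continuation prefix, and the `[UNK]` fallback follow BERT's conventions and are assumptions for the example:

```ocaml
(* Standalone sketch of WordPiece's greedy longest-match-first loop.
   Not this module's implementation; vocab and defaults are illustrative. *)
let wordpiece_tokenize ~vocab ~unk_token ~prefix word =
  let n = String.length word in
  let rec go start acc =
    if start >= n then Some (List.rev acc)
    else
      (* Try the longest substring starting at [start] that is in the vocab,
         shrinking one character at a time. *)
      let rec try_len len =
        if len = 0 then None
        else
          let piece = String.sub word start len in
          let piece = if start > 0 then prefix ^ piece else piece in
          if List.mem piece vocab then Some (piece, len) else try_len (len - 1)
      in
      match try_len (n - start) with
      | Some (piece, len) -> go (start + len) (piece :: acc)
      | None -> None (* no subword matches: the whole word becomes unk_token *)
  in
  match go 0 [] with Some pieces -> pieces | None -> [ unk_token ]

let () =
  let vocab = [ "un"; "##aff"; "##able" ] in
  let pieces =
    wordpiece_tokenize ~vocab ~unk_token:"[UNK]" ~prefix:"##" "unaffable"
  in
  assert (pieces = [ "un"; "##aff"; "##able" ]);
  (* A word with no matching pieces collapses to the unknown token. *)
  assert (wordpiece_tokenize ~vocab ~unk_token:"[UNK]" ~prefix:"##" "xyz"
          = [ "[UNK]" ])
```

Note that the fallback is all-or-nothing: if any position in the word fails to match, the entire word is replaced by the unknown token rather than partially tokenized.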
WordPiece model
type config = {
  vocab : vocab;
  unk_token : string;
  continuing_subword_prefix : string;
  max_input_chars_per_word : int;
}

WordPiece configuration
create config creates a new WordPiece model with the given configuration
from_file ~vocab_file loads a WordPiece model from a vocab.txt file with default settings
val from_file_with_config :
  vocab_file:string ->
  unk_token:string ->
  continuing_subword_prefix:string ->
  max_input_chars_per_word:int ->
  t

from_file_with_config ~vocab_file ~unk_token ~continuing_subword_prefix ~max_input_chars_per_word loads a WordPiece model from a vocab.txt file with custom configuration
Token with ID, string value, and character offsets
tokenize model text tokenizes text into tokens using greedy longest-match-first
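A hedged usage sketch of loading and tokenizing, based only on the descriptions above. The record fields of the returned tokens (`id`, `value`) and the exact module path are assumptions for illustration:

```ocaml
(* Hypothetical usage sketch; token field names (id, value) are assumed
   from the description "Token with ID, string value, and character offsets". *)
open Saga_tokenizers

let () =
  (* Load a model with default settings from a BERT-style vocab file. *)
  let model = Wordpiece.from_file ~vocab_file:"vocab.txt" in
  (* Tokenize with greedy longest-match-first. *)
  let tokens = Wordpiece.tokenize model "unaffable" in
  List.iter
    (fun tok -> Printf.printf "%d\t%s\n" tok.Wordpiece.id tok.Wordpiece.value)
    tokens
```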
get_vocab model returns the vocabulary as a list of (token, id) pairs
get_continuing_subword_prefix model returns the continuing subword prefix
get_max_input_chars_per_word model returns the maximum input characters per word
save model ~path ?name () saves the model to a vocab.txt file and returns the file path
to_yojson model converts model to JSON
of_yojson json creates model from JSON
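A hedged sketch of persisting a model both ways. The output directory and file name are illustrative, and whether of_yojson returns a result (as ppx_deriving_yojson conventionally does) or the model directly is an assumption:

```ocaml
(* Hypothetical persistence sketch; paths are illustrative, and the
   result-returning of_yojson is an assumption, not a confirmed signature. *)
open Saga_tokenizers

let () =
  let model = Wordpiece.from_file ~vocab_file:"vocab.txt" in
  (* Write the vocabulary under ./out and print the returned file path. *)
  let path = Wordpiece.save model ~path:"./out" ~name:"my-vocab" () in
  print_endline path;
  (* JSON round-trip. *)
  let json = Wordpiece.to_yojson model in
  match Wordpiece.of_yojson json with
  | Ok _restored -> print_endline "round-trip ok"
  | Error msg -> prerr_endline msg
```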