Saga_tokenizers.Bpe
Byte Pair Encoding (BPE) tokenization module
BPE model
List of merge operations
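To illustrate what a list of merge operations does, here is a minimal, self-contained sketch of applying ranked merges to a word. It assumes merges are ordered pairs of strings where earlier entries take priority; the type alias and the helper names (rank, best_pair, merge_pair, apply_merges) are hypothetical and not part of this module's documented API.

```ocaml
(* Hypothetical stand-in: an ordered list of pairs; earlier means higher priority. *)
type merges = (string * string) list

(* Position of a pair in the merge list, if present. *)
let rank (merges : merges) pair =
  let rec go i = function
    | [] -> None
    | p :: rest -> if p = pair then Some i else go (i + 1) rest
  in
  go 0 merges

(* Find the best-ranked adjacent pair in the symbol list. *)
let best_pair merges symbols =
  let rec go best = function
    | a :: (b :: _ as rest) ->
        let best =
          match rank merges (a, b), best with
          | Some r, Some (r', _) when r < r' -> Some (r, (a, b))
          | Some r, None -> Some (r, (a, b))
          | _ -> best
        in
        go best rest
    | _ -> best
  in
  go None symbols

(* Replace each adjacent occurrence of (a, b) with a ^ b, left to right. *)
let rec merge_pair (a, b) = function
  | x :: y :: rest when x = a && y = b -> (a ^ b) :: merge_pair (a, b) rest
  | x :: rest -> x :: merge_pair (a, b) rest
  | [] -> []

(* Keep merging until no adjacent pair appears in the merge list. *)
let rec apply_merges merges symbols =
  match best_pair merges symbols with
  | None -> symbols
  | Some (_, pair) -> apply_merges merges (merge_pair pair symbols)
```

With the merges `[("l", "o"); ("lo", "w"); ("e", "r")]`, the symbols `["l"; "o"; "w"; "e"; "r"]` reduce to `["low"; "er"]`. A production implementation would typically use a priority queue and a cache rather than rescanning the list on every step.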
type config = {
  vocab : vocab;
  merges : merges;
  cache_capacity : int;
  dropout : float option;
  unk_token : string option;
  continuing_subword_prefix : string option;
  end_of_word_suffix : string option;
  fuse_unk : bool;
  byte_fallback : bool;
  ignore_merges : bool;
}
BPE configuration
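As a rough sketch of how such a record might be filled in, the snippet below mirrors the config fields locally so that it compiles on its own. The representations chosen for vocab and merges (a hash table from token to ID, and a list of string pairs) are assumptions, as are all the field values; consult the module's actual type definitions before relying on them.

```ocaml
(* Assumed representations; the real definitions live in Saga_tokenizers.Bpe. *)
type vocab = (string, int) Hashtbl.t
type merges = (string * string) list

(* Local mirror of the documented record so the sketch is standalone. *)
type config = {
  vocab : vocab;
  merges : merges;
  cache_capacity : int;
  dropout : float option;
  unk_token : string option;
  continuing_subword_prefix : string option;
  end_of_word_suffix : string option;
  fuse_unk : bool;
  byte_fallback : bool;
  ignore_merges : bool;
}

let example_config : config =
  let vocab = Hashtbl.create 8 in
  (* Assign IDs 0, 1, 2, ... in list order. *)
  List.iteri
    (fun id tok -> Hashtbl.replace vocab tok id)
    [ "<unk>"; "l"; "o"; "w"; "lo"; "low" ];
  {
    vocab;
    merges = [ ("l", "o"); ("lo", "w") ];
    cache_capacity = 10_000; (* illustrative value only *)
    dropout = None;
    unk_token = Some "<unk>";
    continuing_subword_prefix = None;
    end_of_word_suffix = None;
    fuse_unk = false;
    byte_fallback = false;
    ignore_merges = false;
  }
```

Every optional field is left as `None` except `unk_token`, which keeps the example small; the values are placeholders rather than recommended defaults.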
from_files ~vocab_file ~merges_file loads a BPE model from vocab.json and merges.txt files
Token with ID, string value, and character offsets
get_vocab model returns the vocabulary as a list of (token, id) pairs
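A list of (token, id) pairs is convenient to post-process with the standard library. The sketch below uses a hard-coded list as a stand-in for the result of get_vocab, and the helper names id_to_token and by_id are hypothetical.

```ocaml
(* Stand-in for the result of get_vocab: a list of (token, id) pairs. *)
let pairs = [ ("low", 5); ("<unk>", 0); ("lo", 4) ]

(* Build an id -> token table, e.g. for decoding IDs back to strings. *)
let id_to_token (pairs : (string * int) list) : (int, string) Hashtbl.t =
  let tbl = Hashtbl.create (List.length pairs) in
  List.iter (fun (tok, id) -> Hashtbl.replace tbl id tok) pairs;
  tbl

(* Sort pairs by ascending ID, e.g. for stable printing. *)
let by_id (pairs : (string * int) list) =
  List.sort (fun (_, a) (_, b) -> compare a b) pairs
```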
get_unk_token model returns the unknown token if configured
get_continuing_subword_prefix model returns the continuing subword prefix if configured
get_end_of_word_suffix model returns the end-of-word suffix if configured
save model ~path ?name () saves the model to vocab.json and merges.txt files
read_files ~vocab_file ~merges_file reads vocabulary and merges from files
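For orientation, here is a minimal sketch of parsing merges.txt-style content into an ordered list of pairs. It assumes the convention used by Hugging Face tokenizers (an optional "#version" header line followed by one space-separated pair per line); whether read_files follows exactly this format is an assumption, and parse_merges is a hypothetical helper that works on an in-memory string rather than a file.

```ocaml
(* Parse merges.txt-style text into an ordered list of pairs.
   Assumed format: optional "#version" header, then "left right" per line.
   String.starts_with requires OCaml 4.13 or later. *)
let parse_merges (text : string) : (string * string) list =
  String.split_on_char '\n' text
  |> List.filter_map (fun line ->
         let line = String.trim line in
         if line = "" || String.starts_with ~prefix:"#version" line then None
         else
           match String.split_on_char ' ' line with
           | [ left; right ] -> Some (left, right)
           | _ -> None (* this sketch silently skips malformed lines *))
```

Line order is preserved because, in BPE, a pair's position in the file determines its merge priority.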