Saga_tokenizers.Tokenizer

type t

Main tokenizer type.
type padding_config = {
  direction : direction;
  pad_id : int;
  pad_type_id : int;
  pad_token : string;
  length : int option;
  pad_to_multiple_of : int option;
}

Record for padding config.
Record for truncation config.
Load from a pretrained source; returns a result and uses defaults for omitted options.
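A minimal loading sketch; the loader's exact name and argument are not shown in this section, so [from_pretrained] and the path are assumptions:

(* Assumed loader name and argument; only "load from pretrained, returning a
   result" is documented above. *)
let load () =
  match Saga_tokenizers.Tokenizer.from_pretrained "path/to/tokenizer.json" with
  | Ok tokenizer -> tokenizer
  | Error _ -> failwith "could not load tokenizer"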
Configuration
Set normalizer.
Get normalizer.
Set pre-tokenizer.
Get pre-tokenizer.
Set post-processor.
Get post-processor.
Set decoder.
Get decoder.
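A configuration sketch; the setter identifiers ([set_normalizer], [set_pre_tokenizer]) and their unit return type are assumed for illustration, since only the operations themselves are listed above. The components are taken as parameters rather than constructed here:

(* Assumed setter names and return types; the normalizer and pre-tokenizer
   values are passed in by the caller. *)
let configure tokenizer ~normalizer ~pre_tokenizer =
  Saga_tokenizers.Tokenizer.set_normalizer tokenizer normalizer;
  Saga_tokenizers.Tokenizer.set_pre_tokenizer tokenizer pre_tokenizer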
Padding and Truncation
Enable padding using a padding_config record.
Get the current padding config.
Enable truncation using a truncation config record.
Get the current truncation config.
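A sketch of enabling padding from the record type above; [enable_padding] is an assumed name for the "enable padding" operation, and the direction value is taken as a parameter because the direction constructors are not shown in this section:

(* [enable_padding] is an assumed identifier; the padding_config record
   matches the type shown earlier. *)
let pad_batches tokenizer ~direction =
  let config : Saga_tokenizers.Tokenizer.padding_config =
    { direction;
      pad_id = 0;
      pad_type_id = 0;
      pad_token = "<pad>";
      length = None;               (* no fixed target length (assumed to mean: pad within the batch) *)
      pad_to_multiple_of = Some 8; (* round padded lengths up to a multiple of 8 *)
    }
  in
  Saga_tokenizers.Tokenizer.enable_padding tokenizer config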
Vocabulary Management
Add tokens; returns the number of tokens actually added.
Add special tokens.
Get the vocabulary as a list (optional arguments take defaults).
Get the added tokens.
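A vocabulary sketch; [add_tokens] and [add_special_tokens] are assumed identifiers for the two operations listed above, and plain string lists are assumed as input:

(* Assumed identifiers; the docs above only state that tokens can be added
   and that the number actually added is returned. *)
let extend_vocab tokenizer =
  let n = Saga_tokenizers.Tokenizer.add_tokens tokenizer [ "<url>"; "<email>" ] in
  let _ = Saga_tokenizers.Tokenizer.add_special_tokens tokenizer [ "<pad>"; "<eos>" ] in
  Printf.printf "added %d tokens\n" n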
Training
Train from files.
val train_from_iterator :
  t ->
  string Seq.t ->
  ?trainer:Trainers.t ->
  ?length:int ->
  unit ->
  unit

Train from text sequence.
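A training sketch that follows the signature above; the in-memory corpus is illustrative, ?trainer is left at its default, and ?length is presumably a size hint for the iterator (an assumption about its purpose):

(* Follows the signature above: the corpus is a [string Seq.t]. *)
let train_small tokenizer =
  let corpus = [ "hello world"; "tokenizers in OCaml" ] in
  Saga_tokenizers.Tokenizer.train_from_iterator tokenizer
    (List.to_seq corpus) ~length:(List.length corpus) ()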
Encoding and Decoding

val encode :
  t ->
  sequence:(string, string list) Either.t ->
  ?pair:(string, string list) Either.t ->
  ?is_pretokenized:bool ->
  ?add_special_tokens:bool ->
  unit ->
  Encoding.t

Encode single or pair, allowing pretokenized lists.
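A sketch against the signature above: a raw string is passed as [Either.Left], a pretokenized list of strings as [Either.Right]:

(* Grounded in the signature above: Left carries a raw string, Right a
   pretokenized list of strings. *)
let encode_text tokenizer text =
  Saga_tokenizers.Tokenizer.encode tokenizer
    ~sequence:(Either.Left text) ~add_special_tokens:true ()

let encode_pretokenized tokenizer words =
  Saga_tokenizers.Tokenizer.encode tokenizer
    ~sequence:(Either.Right words) ~is_pretokenized:true ()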
val encode_batch :
  t ->
  input:
    ((string, string list) Either.t,
     (string, string list) Either.t * (string, string list) Either.t)
    Either.t
    list ->
  ?is_pretokenized:bool ->
  ?add_special_tokens:bool ->
  unit ->
  Encoding.t list

Batch encode with flexible inputs.
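Each batch element is itself an [Either.t]: [Left] for a single sequence, [Right] for a pair. A sketch against the signature above, building question/context pairs:

(* Grounded in the signature above: singles are Left, pairs are Right, and
   each side of a pair is again a raw-string-or-pretokenized Either.
   [List.map2] raises if the two lists differ in length. *)
let encode_qa_batch tokenizer questions contexts =
  let input =
    List.map2
      (fun q c -> Either.Right (Either.Left q, Either.Left c))
      questions contexts
  in
  Saga_tokenizers.Tokenizer.encode_batch tokenizer ~input ~add_special_tokens:true ()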
val decode :
  t ->
  int list ->
  ?skip_special_tokens:bool ->
  ?clean_up_tokenization_spaces:bool ->
  unit ->
  string

Decode with defaults.
val decode_batch :
  t ->
  int list list ->
  ?skip_special_tokens:bool ->
  ?clean_up_tokenization_spaces:bool ->
  unit ->
  string list

Batch decode with defaults.
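A decoding sketch that follows the two signatures above; the id lists would normally come from previously computed encodings:

(* Grounded in the signatures above; special tokens are stripped and the
   other optional argument keeps its default. *)
let show_decoded tokenizer ids =
  print_endline
    (Saga_tokenizers.Tokenizer.decode tokenizer ids ~skip_special_tokens:true ())

let show_decoded_batch tokenizer id_batches =
  Saga_tokenizers.Tokenizer.decode_batch tokenizer id_batches
    ~skip_special_tokens:true ()
  |> List.iter print_endline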
val post_process :
  t ->
  encoding:Encoding.t ->
  ?pair:Encoding.t ->
  ?add_special_tokens:bool ->
  unit ->
  Encoding.t

Post-process manually.
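A sketch against the signature above, applying the configured post-processor to two precomputed encodings:

(* Grounded in the signature above. *)
let join_pair tokenizer enc_a enc_b =
  Saga_tokenizers.Tokenizer.post_process tokenizer
    ~encoding:enc_a ~pair:enc_b ~add_special_tokens:true ()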
Serialization
Save to a file; pretty-printing is enabled by default.
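A saving sketch; [save] and its path argument are assumptions, and the pretty-printing option documented above is left at its default:

(* [save] is an assumed name; the doc above only states that saving to a file
   pretty-prints by default. *)
let persist tokenizer =
  Saga_tokenizers.Tokenizer.save tokenizer "tokenizer.json"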