Saga_tokenizers.Processors
Post-processing module for tokenization output.
Post-processors handle special tokens and formatting after tokenization, such as adding CLS and SEP tokens for BERT, or handling sentence pairs.
type encoding = {
  ids : int array;
  type_ids : int array;
  tokens : string array;
  offsets : (int * int) array;
  special_tokens_mask : int array;
  attention_mask : int array;
  overflowing : encoding list;
  sequence_ranges : (int * int * int) list;
}
Type representing an encoding to be processed.
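As a rough illustration of the record above, the following sketch builds one encoding value by hand. The record is redeclared locally so the snippet compiles without the library; the token ids are made up, `sequence_ranges` is left empty because its meaning is not described here, and the `is_aligned` helper is hypothetical, encoding the assumption that the per-token arrays all share one length.

```ocaml
(* Local mirror of the documented record, so the example compiles on its own. *)
type encoding = {
  ids : int array;
  type_ids : int array;
  tokens : string array;
  offsets : (int * int) array;
  special_tokens_mask : int array;
  attention_mask : int array;
  overflowing : encoding list;
  sequence_ranges : (int * int * int) list;
}

(* Hypothetical encoding of "hello world"; the ids are invented. *)
let example =
  {
    ids = [| 7592; 2088 |];
    type_ids = [| 0; 0 |];
    tokens = [| "hello"; "world" |];
    offsets = [| (0, 5); (6, 11) |];
    special_tokens_mask = [| 0; 0 |];
    attention_mask = [| 1; 1 |];
    overflowing = [];
    sequence_ranges = [];
  }

(* Assumed invariant (not stated by the source): every per-token array
   has the same length as [ids]. *)
let is_aligned e =
  let n = Array.length e.ids in
  Array.length e.type_ids = n
  && Array.length e.tokens = n
  && Array.length e.offsets = n
  && Array.length e.special_tokens_mask = n
  && Array.length e.attention_mask = n
```

A post-processor that inserts special tokens would need to extend each of these arrays together to keep them aligned.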
type t
Main post-processor type.
val roberta :
  sep:(string * int) ->
  cls:(string * int) ->
  ?trim_offsets:bool ->
  ?add_prefix_space:bool ->
  unit ->
  t
Create a RoBERTa post-processor.
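To make the effect of the `sep` and `cls` arguments concrete, here is a minimal self-contained sketch of wrapping a single sequence in RoBERTa's `<s> ... </s>` layout. It does not call the library and is not its implementation: the `mini` record and `wrap_single` function are hypothetical, only three of the per-token arrays are modeled, and the ids of the ordinary tokens are invented.

```ocaml
(* Simplified stand-in for an encoding: only three per-token arrays. *)
type mini = {
  ids : int array;
  tokens : string array;
  special_tokens_mask : int array;
}

(* Wrap a single sequence as: cls, sequence, sep.
   [cls] and [sep] are (token, id) pairs, shaped like the ~cls and ~sep
   arguments in the signature above. Added tokens get mask value 1. *)
let wrap_single ~cls:(cls_tok, cls_id) ~sep:(sep_tok, sep_id) e =
  {
    ids = Array.concat [ [| cls_id |]; e.ids; [| sep_id |] ];
    tokens = Array.concat [ [| cls_tok |]; e.tokens; [| sep_tok |] ];
    special_tokens_mask =
      Array.concat [ [| 1 |]; e.special_tokens_mask; [| 1 |] ];
  }

(* Hypothetical input; the ids 31414 and 232 are invented. *)
let wrapped =
  wrap_single ~cls:("<s>", 0) ~sep:("</s>", 2)
    {
      ids = [| 31414; 232 |];
      tokens = [| "Hello"; "there" |];
      special_tokens_mask = [| 0; 0 |];
    }
```

Here `wrapped.tokens` is `[| "<s>"; "Hello"; "there"; "</s>" |]`, so this single-sequence case adds two tokens. Offset trimming and prefix-space handling, which the optional arguments suggest, are not modeled.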
val template :
  single:string ->
  ?pair:string ->
  ?special_tokens:(string * int) list ->
  unit ->
  t
Create a template post-processor.
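The template syntax is not specified here. As a sketch only, the following assumes whitespace-separated pieces in which `$A` and `$B` stand for the first and second sequence and every other piece is a special token, in the style of similar tokenizer libraries. The `expand` function and the sample values are hypothetical and do not use the library.

```ocaml
(* Special tokens as (token, id) pairs, shaped like ?special_tokens. *)
let special_tokens = [ ("[CLS]", 101); ("[SEP]", 102) ]

(* Expand a template into a token list.
   Assumed syntax (not confirmed by the source): pieces separated by
   spaces, "$A" = first sequence, "$B" = second sequence,
   anything else = a literal special token. *)
let expand ~template ~a ~b =
  String.split_on_char ' ' template
  |> List.filter (fun piece -> piece <> "")
  |> List.concat_map (function
       | "$A" -> a
       | "$B" -> b
       | special -> [ special ])

(* Template for one sequence, as would be passed via ~single. *)
let single = expand ~template:"[CLS] $A [SEP]" ~a:[ "hello"; "world" ] ~b:[]

(* Template for a sentence pair, as would be passed via ?pair. *)
let pair =
  expand ~template:"[CLS] $A [SEP] $B [SEP]" ~a:[ "hello" ] ~b:[ "world" ]

(* Look up the id of a special token, if it was declared. *)
let id_of_special tok = List.assoc_opt tok special_tokens
```

Under these assumptions the single template yields `[CLS] hello world [SEP]` and the pair template yields `[CLS] hello [SEP] world [SEP]`.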
Process encodings with the post-processor.
Get the number of tokens added by this post-processor.
Convert a post-processor to its JSON representation.
Create a post-processor from its JSON representation.