Saga_tokenizers.Normalizers
Text normalization module matching HuggingFace tokenizers.
Normalizers are responsible for cleaning and transforming text before tokenization. This includes operations like lowercasing, accent removal, Unicode normalization, and handling special characters.
type normalized_string = {
  normalized : string;  (** The normalized text *)
  original : string;  (** The original text *)
  alignments : (int * int) array;  (** Alignment mappings from normalized to original positions *)
}
Type representing a normalized string with alignment information
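To make the role of the alignments field concrete, here is a minimal, self-contained sketch. It declares a local record with the same fields instead of calling the library. It assumes that each entry of alignments is a half-open (start, end) byte range in original for the corresponding byte of normalized; the documentation above does not state that convention. The helpers lowercase_and_drop_spaces and original_span are hypothetical names used only for illustration.

```ocaml
(* Local mirror of the documented record; not the library's own type. *)
type normalized_string = {
  normalized : string;
  original : string;
  alignments : (int * int) array;
}

(* Hypothetical toy normalizer: lowercases ASCII and drops spaces.
   Assumption: alignments.(i) is the half-open byte range in [original]
   that produced byte [i] of [normalized]. *)
let lowercase_and_drop_spaces (s : string) : normalized_string =
  let buf = Buffer.create (String.length s) in
  let spans = ref [] in
  String.iteri
    (fun i c ->
      if c <> ' ' then begin
        Buffer.add_char buf (Char.lowercase_ascii c);
        spans := (i, i + 1) :: !spans
      end)
    s;
  {
    normalized = Buffer.contents buf;
    original = s;
    alignments = Array.of_list (List.rev !spans);
  }

(* Map a byte position in the normalized text back to the original. *)
let original_span (n : normalized_string) (pos : int) : int * int =
  n.alignments.(pos)
```

For the input "A b C" this sketch produces the normalized text "abc", and original_span at position 1 returns (2, 3), the location of "b" in the original string. Keeping such a mapping is what allows token offsets computed on normalized text to be reported against the text the user supplied.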
type t
Main normalizer type
val bert :
  ?clean_text:bool ->
  ?handle_chinese_chars:bool ->
  ?strip_accents:bool option ->
  ?lowercase:bool ->
  unit ->
  t
Create a BERT normalizer.
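The optional arguments of bert can be passed in any combination. The sketch below shows the call syntax, in particular for strip_accents, whose argument type is itself bool option. So that it compiles on its own, the snippet uses a stand-in module, Normalizers_stub, that copies only the shape of the signature above; it is not the library's implementation. The default values are an assumption borrowed from HuggingFace's BertNormalizer, since this page does not list them.

```ocaml
(* Stand-in for the real module so the snippet compiles on its own.
   Only the shape of [bert] is taken from the documented signature. *)
module Normalizers_stub = struct
  type t = {
    clean_text : bool;
    handle_chinese_chars : bool;
    strip_accents : bool option;
    lowercase : bool;
  }

  (* Assumption: defaults copied from HuggingFace's BertNormalizer. *)
  let bert ?(clean_text = true) ?(handle_chinese_chars = true)
      ?(strip_accents = None) ?(lowercase = true) () : t =
    { clean_text; handle_chinese_chars; strip_accents; lowercase }
end

(* All defaults: the trailing unit argument is still required. *)
let default_bert = Normalizers_stub.bert ()

(* A cased configuration: keep case and explicitly keep accents.
   [strip_accents] takes a [bool option], hence the [Some]. *)
let cased_bert =
  Normalizers_stub.bert ~lowercase:false ~strip_accents:(Some false) ()
```

With the real module, the same calls would presumably read Saga_tokenizers.Normalizers.bert in place of Normalizers_stub.bert.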
Unicode NFC (Canonical Decomposition, followed by Canonical Composition) normalizer
Unicode NFKC (Compatibility Decomposition, followed by Canonical Composition) normalizer
Create a byte-level normalizer.
Apply normalization to a string, preserving alignment information
Apply normalization to a string, returning only the normalized text
Convert normalizer to JSON representation
Create normalizer from JSON representation