Module Saga

Saga - Fast tokenization and text processing for ML in OCaml.

Saga is a comprehensive text processing library for machine learning applications, providing fast tokenization and modern text generation capabilities. It combines simplicity for common use cases with flexibility for advanced workflows.

Library Overview

Saga consists of three main components:

  • Tokenization (Tokenizer, Encoding, and supporting modules): converting text to token IDs and back
  • Text generation (Sampler): sampling from model logits with composable processors
  • N-gram language models (Ngram): lightweight statistical models

All components work together seamlessly but can be used independently.

Quick Start

Advanced text generation

  (* Create a model function (typically a neural network). It maps
     token ids to logits over the vocabulary. *)
  let model_fn _token_ids =
    (* Your neural network forward pass *)
    Array.make 50000 0.0 (* Example: uniform logits *)

  (* Build a tokenizer *)
  let tok = Tokenizer.chars ()

  (* Create encoder and decoder functions *)
  let tokenizer_fn text =
    Tokenizer.encode tok text |> Encoding.get_ids |> Array.to_list

  let decoder_fn ids = Tokenizer.decode tok ids

  (* Configure generation with custom processors *)
  let config =
    Sampler.default
    |> Sampler.with_temperature 0.9
    |> Sampler.with_top_k 40
    |> Sampler.with_repetition_penalty 1.1

  (* Generate with fine-grained control *)
  let result =
    Sampler.generate_text ~model:model_fn ~tokenizer:tokenizer_fn
      ~decoder:decoder_fn ~prompt:"Hello" ~generation_config:config ()

Performance Tips

Tokenization

Fast and flexible tokenization supporting multiple algorithms and custom patterns. Handles everything from simple word splitting to advanced subword tokenization.

Quick Start

Load a pretrained tokenizer:

  let tokenizer = Tokenizer.from_file "tokenizer.json" |> Result.get_ok
  let encoding = Tokenizer.encode tokenizer "Hello world!"
  let ids = Encoding.get_ids encoding

Create a BPE tokenizer from scratch:

  let tokenizer =
    Tokenizer.bpe
      ~vocab:[ ("hello", 0); ("world", 1); ("[PAD]", 2) ]
      ~merges:[]
      ()

  let encoding = Tokenizer.encode tokenizer "hello world"
  let text = Tokenizer.decode tokenizer [ 0; 1 ]

Train a new tokenizer:

  let texts = [ "Hello world"; "How are you?"; "Hello again" ] in
  let tokenizer =
    Tokenizer.train_bpe (`Seq (List.to_seq texts)) ~vocab_size:1000 ()
  in
  Tokenizer.save_pretrained tokenizer ~path:"./my_tokenizer"

Architecture

Tokenization proceeds through stages:

  1. Normalization: canonicalize raw text (lowercasing, NFC/NFD, accent stripping)
  2. Pre-tokenization: split text into coarse pieces (whitespace, punctuation)
  3. Model: apply the tokenization algorithm (BPE, WordPiece, character-level, ...)
  4. Post-processing: add special tokens and type IDs

Each stage is optional and configurable via builder methods.
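As a sketch of what stage-by-stage configuration could look like: the setter names below (Tokenizer.with_normalizer, Tokenizer.with_pre_tokenizer) and the constructors (Normalizers.lowercase, Pre_tokenizers.whitespace) are hypothetical illustrations, not confirmed API; consult the module signatures for the actual builder methods.

  (* Hypothetical builder-style pipeline; names are illustrative only *)
  let tok =
    Tokenizer.bpe ~vocab:[ ("hello", 0); ("world", 1) ] ~merges:[] ()
    |> Tokenizer.with_normalizer (Normalizers.lowercase ())
    |> Tokenizer.with_pre_tokenizer (Pre_tokenizers.whitespace ())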

Post-processing patterns are model-specific: BERT-style models wrap input as [CLS] A [SEP] (or [CLS] A [SEP] B [SEP] for pairs) and assign segment type IDs, while GPT-style models typically add no special tokens at all.

module Unicode = Saga_tokenizers.Unicode

Unicode utilities for normalization.

module Normalizers = Saga_tokenizers.Normalizers

Text normalization (lowercase, NFD/NFC, accent stripping, etc.).

module Pre_tokenizers = Saga_tokenizers.Pre_tokenizers

Pre-tokenization (whitespace splitting, punctuation handling, etc.).

module Post_processors = Saga_tokenizers.Post_processors

Post-processing (adding CLS/SEP, setting type IDs, etc.).

module Decoders = Saga_tokenizers.Decoders

Decoding token IDs back to text.

module Encoding = Saga_tokenizers.Encoding

Encoding representation (output of tokenization).

type direction = [
  | `Left
  | `Right
]

Direction for padding or truncation: `Left (beginning) or `Right (end).

type special = {
  token : string;
    (** The token text (e.g., "<pad>", "<unk>"). *)
  single_word : bool;
    (** Whether this token must match whole words only. Default: false. *)
  lstrip : bool;
    (** Whether to strip whitespace on the left. Default: false. *)
  rstrip : bool;
    (** Whether to strip whitespace on the right. Default: false. *)
  normalized : bool;
    (** Whether to apply normalization to this token. Default: true for
        regular tokens, false for special tokens. *)
}

Special token configuration.

Special tokens are not split during tokenization and can be skipped during decoding. Token IDs are assigned automatically when added to the vocabulary.

All special token types are uniform - the semantic meaning (pad, unk, bos, etc.) is contextual, not encoded in the type.
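For example, a pad token can be constructed directly as a record (the Special module below presumably provides convenience constructors for common cases):

  (* A pad token: matched anywhere, whitespace kept, not normalized *)
  let pad : special =
    {
      token = "<pad>";
      single_word = false;
      lstrip = false;
      rstrip = false;
      normalized = false;
    }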

type pad_length = [
  | `Batch_longest
  | `Fixed of int
  | `To_multiple of int
]

Padding length strategy.

  • `Batch_longest: Pad to longest sequence in batch
  • `Fixed n: Pad all sequences to fixed length n
  • `To_multiple n: Pad to smallest multiple of n >= sequence length
type padding = {
  length : pad_length;
  direction : direction;
  pad_id : int option;
  pad_type_id : int option;
  pad_token : string option;
}

Padding configuration.

When the optional fields are None, padding falls back to the tokenizer's configured padding token. If the tokenizer has no padding token configured and these fields are None, padding operations raise Invalid_argument.
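For example, a configuration that left-pads every sequence in a batch to the longest, inheriting the tokenizer's pad token:

  (* None fields defer to the tokenizer's configured pad token;
     Invalid_argument is raised if none is configured *)
  let pad_cfg : padding =
    {
      length = `Batch_longest;
      direction = `Left;
      pad_id = None;
      pad_type_id = None;
      pad_token = None;
    }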

type truncation = {
  max_length : int;
  direction : direction;
}

Truncation configuration.

Limits sequences to max_length tokens, removing from specified direction.
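For example, keeping the first 512 tokens and dropping the rest from the right:

  let trunc : truncation = { max_length = 512; direction = `Right }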

type data = [
  | `Files of string list
  | `Seq of string Seq.t
  | `Iterator of (unit -> string option)
]

Training data source.

  • `Files paths: Read training text from files
  • `Seq seq: Use sequence of strings
  • `Iterator f: Pull training data via iterator (None signals end)
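Each variant can be constructed directly; the file paths below are illustrative:

  (* Three ways to provide the same training text *)
  let from_files : data = `Files [ "corpus/part1.txt"; "corpus/part2.txt" ]
  let from_seq : data = `Seq (List.to_seq [ "Hello world"; "How are you?" ])

  let from_iter : data =
    let remaining = ref [ "Hello world"; "How are you?" ] in
    `Iterator
      (fun () ->
        match !remaining with
        | [] -> None
        | line :: rest ->
            remaining := rest;
            Some line)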

Special Token Constructors

module Special : sig ... end
module Tokenizer : sig ... end

File I/O

Efficient file I/O utilities optimized for large text corpora and ML workflows.

val read_lines : ?buffer_size:int -> string -> string list

read_lines ?buffer_size filename efficiently reads all lines from a file.

  • parameter buffer_size

    Size of the read buffer in bytes (default: 65536)

  • returns

    List of lines without trailing newlines

  • raises Sys_error

    if file cannot be opened or read

Features:

  • Efficient buffered reading for large files
  • Automatic resource cleanup on errors
  • Windows/Unix line ending compatibility
  • Memory-efficient for files with many lines
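For example, counting the non-empty lines of a corpus file (the filename is illustrative):

  let non_empty =
    read_lines "corpus.txt"
    |> List.filter (fun line -> String.trim line <> "")
    |> List.length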
val read_lines_lazy : ?buffer_size:int -> string -> string Seq.t

read_lines_lazy ?buffer_size filename returns a lazy sequence of lines.

  • parameter buffer_size

    Size of the read buffer in bytes (default: 65536)

  • returns

    Lazy sequence of lines that are read on-demand

Use this for very large files to avoid loading everything into memory. The file is automatically closed when the sequence is fully consumed or when an error occurs.

Note: If the sequence is only partially consumed and then abandoned (e.g., using Seq.take), the file descriptor may remain open until garbage collection. For guaranteed cleanup, consume the sequence fully or handle resources explicitly.
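For example, summing line lengths over a large file while holding only one line in memory at a time; consuming the sequence fully also ensures the file is closed promptly:

  let total_chars =
    read_lines_lazy "big_corpus.txt"
    |> Seq.fold_left (fun acc line -> acc + String.length line) 0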

val write_lines : ?append:bool -> string -> string list -> unit

write_lines ?append filename lines writes lines to a file.

  • parameter append

    If true, append to existing file (default: false)

  • parameter filename

    Target file path

  • parameter lines

    List of lines to write (newlines are added automatically)
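For example, writing a file and then appending to it (the filename is illustrative):

  let () = write_lines "notes.txt" [ "first line"; "second line" ]
  let () = write_lines ~append:true "notes.txt" [ "third line" ]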

Advanced Text Generation

Modern text generation with composable processors and fine-grained control, designed for integration with neural language models.

module Sampler : sig ... end

Advanced text generation and sampling utilities.

N-grams

module Ngram : sig ... end

Low-level n-gram language models.

Examples

Quick tokenization

  open Saga

  (* Character tokenization *)
  let tok = Tokenizer.chars ()
  let enc = Tokenizer.encode tok "Hello world!"
  let ids = Encoding.get_ids enc
  let text = Tokenizer.decode tok (Array.to_list ids)

  (* BPE tokenization with batch processing *)
  let tok = Tokenizer.from_file "tokenizer.json" |> Result.get_ok
  let batch_enc = Tokenizer.encode_batch tok [ "Hello"; "World" ]

Neural Model Integration

  (* Wraps a neural model for Sampler integration. Example - illustrative
     pseudocode, adapt to your model API. *)
  let setup_neural_generation neural_model =
    let tok = Tokenizer.from_file "tokenizer.json" |> Result.get_ok in

    (* Model function: token_ids -> logits *)
    let model_fn token_ids =
      (* Convert to your model's input format *)
      let input_tensor = your_tensor_creation_fn token_ids in
      let output_tensor = neural_model input_tensor in
      (* Convert output to float array *)
      your_tensor_to_array_fn output_tensor
    in

    (* Encoder and decoder functions, as in the quick-start example *)
    let tokenizer_fn text =
      Tokenizer.encode tok text |> Encoding.get_ids |> Array.to_list
    in
    let decoder_fn ids = Tokenizer.decode tok ids in

    (* Configure generation with custom processors *)
    let config =
      Sampler.creative_writing
      |> Sampler.with_max_new_tokens 200
      |> Sampler.with_repetition_penalty 1.15
    in

    (* Generate text *)
    Sampler.generate_text ~model:model_fn ~tokenizer:tokenizer_fn
      ~decoder:decoder_fn ~prompt:"Hello" ~generation_config:config ()