Saga.Ngram

Low-level n-gram language models over integer token sequences.

Implements n-gram models with configurable smoothing for language modeling, text generation, and perplexity evaluation. Operates on pre-tokenized integer sequences for efficiency.
An n-gram model predicts the probability of a token given the previous n-1 tokens. The model maintains counts of all observed n-grams during training and uses smoothing to handle unseen contexts.
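Before smoothing, each such probability is simply a ratio of stored counts; for a trigram model, for example, where w1 and w2 are the two preceding tokens:

P(token | w1, w2) = count(w1, w2, token) / count(w1, w2)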
Key features:

- Configurable n-gram order
- Add-k (Laplace) and stupid backoff smoothing
- Incremental training with add_sequence
- Next-token logits, sequence log probability, and perplexity evaluation
- Binary serialization with save

Train a trigram model:
(* Pre-tokenized sequences as integer arrays *)
let sequences = [
  [|0; 1; 2; 3|];  (* "the cat sat down" *)
  [|0; 4; 5; 3|]   (* "the dog ran down" *)
] in
let model = Ngram.of_sequences ~order:3 sequences

Compute next-token probabilities:
let context = [|0; 1|] in (* "the cat" *)
let logits = Ngram.logits model ~context in
(* logits.(token) = log P(token | context) *)

Evaluate model quality:
let test_seq = [| 0; 1; 2 |] in
let ppl = Ngram.perplexity model test_seq in
Printf.printf "Perplexity: %.2f\n" pplAdd-k smoothing adds a constant k to all counts before normalization. This prevents zero probabilities but can over-smooth for large vocabularies.
Stupid backoff recursively falls back to lower-order n-grams when a context is unseen, multiplying by a scaling factor alpha at each backoff level. More efficient than interpolation-based smoothing.
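The two strategies reduce to simple count formulas. The sketch below shows one way to write them as standalone helpers; the names add_k_prob, backoff_score, and count_of are illustrative and not part of the Saga.Ngram API:

(* Illustrative only: standalone formulas, not Saga.Ngram internals. *)

(* Add-k: bump every count by k so unseen tokens keep non-zero mass.
   count     = times (context, token) was observed
   ctx_total = times the context was observed
   vocab     = vocabulary size *)
let add_k_prob ~k ~count ~ctx_total ~vocab =
  (float_of_int count +. k)
  /. (float_of_int ctx_total +. k *. float_of_int vocab)

(* Stupid backoff: use the full-order relative frequency if the context was
   seen, otherwise recurse on the shorter context and scale by alpha.
   count_of n returns (count, ctx_total) at order n.
   The result is a score, not a normalized probability. *)
let rec backoff_score ~alpha ~count_of order =
  let count, ctx_total = count_of order in
  if count > 0 && ctx_total > 0 then
    float_of_int count /. float_of_int ctx_total
  else if order > 1 then
    alpha *. backoff_score ~alpha ~count_of (order - 1)
  else 0.

Under add-k, a completely unseen token still receives k / (ctx_total + k * vocab), which is why zero probabilities never occur.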
Smoothing strategy for handling unseen n-grams.
`Add_k k: Add-k (Laplace) smoothing. Adds k to all counts. Common values are 0.01 to 0.1. Larger values increase smoothing strength.

`Stupid_backoff alpha: Backoff to lower-order n-grams with scaling factor alpha. Typical values are 0.4 to 0.6. Does not produce normalized probabilities but works well in practice.

Summary statistics describing the trained model.
vocab_size: Number of unique token IDs observed during training. This is the size of the output space for logits.

total_tokens: Total count of all tokens across all training sequences, including overlaps.

unique_ngrams: Count of distinct highest-order n-grams. Lower-order n-grams are not included in this count.

N-gram language model.
Stores n-gram counts and smoothing configuration. Training functions mutate internal counts but return the same model for convenience in pipelines.
empty ~order () creates an untrained model of the given order.
of_sequences ~order sequences builds and trains a model in one step.
Equivalent to creating an empty model and calling add_sequence on each element of sequences.
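For example, the two constructions below should yield equivalently trained models; this sketch uses only the functions documented on this page:

let seqs = [ [|0; 1; 2|]; [|0; 2; 1|] ] in
let m1 = Ngram.of_sequences ~order:2 seqs in
(* Equivalent: start from an empty model and add each sequence. *)
let m2 = List.fold_left Ngram.add_sequence (Ngram.empty ~order:2 ()) seqs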
add_sequence model tokens updates the model with a single token sequence.
Increments counts for all n-grams observed in tokens. Token IDs must be non-negative integers. The vocabulary size expands automatically to include any new token IDs.
Returns the same model (which has been mutated) for use in pipelines.
Example
let model = Ngram.empty ~order:2 () in
let model = Ngram.add_sequence model [|0; 1; 2|] in
let model = Ngram.add_sequence model [|0; 2; 1|]

is_trained model returns true if the model has seen at least one token.
Untrained models cannot compute probabilities.
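One way to use this flag is to guard probability queries. safe_logits below is an illustrative helper, not part of the API:

(* Illustrative helper: avoid querying an untrained model. *)
let safe_logits model ~context =
  if Ngram.is_trained model then Some (Ngram.logits model ~context) else None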
stats model returns summary statistics about the trained model.
Useful for diagnosing model size and vocabulary coverage.
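For example, the fields described above can be printed as a quick sanity check. This sketch assumes the stats record is defined in Ngram with exactly those field names:

let s = Ngram.stats model in
Printf.printf "vocab_size=%d total_tokens=%d unique_ngrams=%d\n"
  s.Ngram.vocab_size s.Ngram.total_tokens s.Ngram.unique_ngrams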
logits model ~context computes log probabilities for the next token.
Returns an array where logits.(token) is the natural logarithm of P(token | context). The array length equals the vocabulary size.
The context is automatically truncated or padded to match the model order. If context is shorter than order - 1, the model uses the available context. For unigram models (order=1), context is ignored.
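Because the array covers the whole vocabulary, a simple greedy decoder can pick the index with the highest log probability. The helpers argmax and next_token below are illustrative, not part of this module:

(* Illustrative helper: index of the largest log probability. *)
let argmax a =
  let best = ref 0 in
  Array.iteri (fun i x -> if x > a.(!best) then best := i) a;
  !best

(* Greedy next-token choice built on the documented logits function. *)
let next_token model ~context = argmax (Ngram.logits model ~context)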
Smoothing behavior: the configured smoothing strategy determines how unseen contexts are scored; tokens unseen at all orders receive a log probability of negative infinity.

log_prob model tokens computes the log probability of a token sequence.
Returns the sum of log probabilities for each token given its context. The first order - 1 tokens are skipped since they lack full context.
For a sequence t_0, t_1, ..., t_n, computes the sum over i from order - 1 to n of log P(t_i | t_{i-order+1}, ..., t_{i-1}).
Token IDs outside the vocabulary are skipped.
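A minimal usage sketch (the token values are arbitrary):

let tokens = [|0; 1; 2; 3|] in
let lp = Ngram.log_prob model tokens in
(* exp (-. lp /. float_of_int n), where n = Array.length tokens - (order - 1),
   gives the per-token perplexity described below. *)
Printf.printf "log P(sequence) = %.3f\n" lp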
perplexity model tokens computes per-token perplexity.
Returns exp(-log_prob / N) where N is the number of tokens scored (the sequence length minus (order - 1)). Lower perplexity indicates better model fit.
Returns infinity for empty sequences.
Example
let ppl = Ngram.perplexity model test_tokens in
Printf.printf "Perplexity: %.2f\n" ppl
(* Lower is better; typical values range from 10 to 1000+ *)save model path serializes the model to a binary file.
Uses OCaml's Marshal module for serialization. The file includes all n-gram counts, vocabulary size, and configuration.
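A usage sketch (the filename is arbitrary, and the snippet assumes save returns unit):

Ngram.save model "ngram.bin"
(* As with any Marshal-based file, read it back only with a binary-compatible
   build of the library. *)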