Module Glyph

Unicode glyphs for terminal rendering.

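A glyph is a packed, unboxed integer representing a visual character in a terminal cell. Glyphs come in two kinds: simple glyphs, which are stored directly in the packed integer with no pool allocation, and complex glyphs (multi-codepoint grapheme clusters), which are backed by a Pool entry.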

Multi-column characters (wide CJK, emoji) are represented as one start glyph followed by one or more continuation glyphs that reference the same pool entry. Control characters and zero-width sequences map to empty.

Quick start

Create a pool, encode a string, and process glyphs via callback:

  let pool = Pool.create () in
  Pool.encode pool ~width_method:`Unicode ~tab_width:2
    (fun glyph -> Printf.printf "%s " (Pool.to_string pool glyph))
    "Hello 👋 World"

Memory safety

The Pool uses manual reference counting with automatic slot recycling. Pool-backed glyph IDs include a generation counter so that accessing a glyph whose slot has been recycled returns safe defaults (empty, zero width) rather than stale data. This guarantee holds across normal Pool.incref/Pool.decref cycles. Pool.clear resets the pool and invalidates all previously issued IDs.
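The generation check described above can be illustrated with a toy model. This is a minimal sketch, not the library's implementation: the slot and handle records, the fixed-size slots array, and the functions intern, decref, and to_string below are all hypothetical stand-ins defined only for this example.

```ocaml
(* Toy model of generation-checked handles; NOT the library's code. *)
type slot = {
  mutable value : string;
  mutable refs : int;
  mutable generation : int;
}

(* A handle remembers the generation of the slot at the time it was issued. *)
type handle = { index : int; gen : int }

let slots = Array.init 4 (fun _ -> { value = ""; refs = 0; generation = 0 })

(* Claim the first free slot and stamp the handle with its generation. *)
let intern s =
  let rec find i =
    if i >= Array.length slots then failwith "toy pool full"
    else if slots.(i).refs = 0 then i
    else find (i + 1)
  in
  let i = find 0 in
  let slot = slots.(i) in
  slot.value <- s;
  slot.refs <- 1;
  { index = i; gen = slot.generation }

(* Drop one reference; when the last one goes, bump the generation so
   that every handle issued for the old contents becomes stale. *)
let decref h =
  let slot = slots.(h.index) in
  if slot.generation = h.gen && slot.refs > 0 then begin
    slot.refs <- slot.refs - 1;
    if slot.refs = 0 then begin
      slot.value <- "";
      slot.generation <- slot.generation + 1
    end
  end

(* A stale handle yields the safe default "" rather than the new occupant. *)
let to_string h =
  let slot = slots.(h.index) in
  if slot.generation = h.gen && slot.refs > 0 then slot.value else ""

let () =
  let a = intern "flag" in
  decref a;
  let b = intern "family" in
  assert (b.index = a.index);        (* the slot was recycled *)
  assert (to_string a = "");         (* stale handle: safe default *)
  assert (to_string b = "family")
```

In this sketch the stale handle a and the live handle b share a slot index, and only the generation stamp tells them apart.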

Width calculation

Display width follows UAX #11 and UAX #29, correctly handling ZWJ emoji sequences, regional indicator (flag) pairs, variation selectors, and skin-tone modifiers. See width_method for the available strategies.

Types

type t = private int

The type for glyphs. A packed 63-bit integer, always unboxed.

The type is private to prevent construction of invalid values. Use of_uchar, Pool.intern, Pool.encode, empty, or space to create glyphs. The integer representation is readable (e.g. for storage in Bigarray); use unsafe_of_int when loading from external storage.

Note. The bit layout is not a stable serialization format across major versions.
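The private int pattern can be tried in isolation. The sketch below uses a hypothetical Toy_glyph module (its of_code constructor and validity rule are assumptions made for this example, not part of this library) to show that values can be read as integers through a coercion, while construction has to go through the module's functions.

```ocaml
(* Hypothetical stand-in module illustrating [private int]. *)
module Toy_glyph : sig
  type t = private int
  val of_code : int -> t        (* validated constructor (toy rule) *)
  val unsafe_of_int : int -> t  (* trusted reload, no validation *)
end = struct
  type t = int
  (* Toy rule: anything outside the Unicode range maps to 0. *)
  let of_code c = if c < 0 || c > 0x10FFFF then 0 else c
  let unsafe_of_int n = n
end

let g = Toy_glyph.of_code 0x41

(* Reading is allowed: a private int coerces to int. *)
let raw : int = (g :> int)

(* Store in a plain int array, then reload through the unsafe constructor. *)
let storage = [| raw |]
let g' = Toy_glyph.unsafe_of_int storage.(0)

let () = assert ((g' :> int) = 0x41)

(* Writing [let bad : Toy_glyph.t = 0x41] would be rejected by the
   type checker, because int does not coerce to a private type. *)
```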

type width_method = [
  | `Unicode
  | `Wcwidth
  | `No_zwj
]

The type for width calculation methods. Determines how grapheme cluster display widths are computed:

  • `Unicode — full UAX #29 segmentation with ZWJ emoji composition. Use for correct emoji and flag rendering.
  • `Wcwidth — grapheme boundary segmentation for rendering, but each grapheme's width is the sum of per-codepoint wcwidth-style widths. Use for legacy compatibility.
  • `No_zwj — UAX #29 segmentation that forces a break after ZWJ (no emoji ZWJ sequences), but keeps the full grapheme-aware width logic (RI pairs, VS16, Indic virama).
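To see why the choice of method matters, consider a family emoji written as a ZWJ sequence. The sketch below is a toy under stated assumptions: toy_wcwidth is a made-up three-rule width table, and split_after_zwj only models the forced break after ZWJ, not real UAX #29 segmentation. Neither function belongs to this library.

```ocaml
(* Toy per-code-point width table; an assumption for illustration only. *)
let toy_wcwidth cp =
  if cp = 0x200D then 0                          (* ZERO WIDTH JOINER *)
  else if cp >= 0x1F300 && cp <= 0x1FAFF then 2  (* treated as wide emoji *)
  else 1

(* Summing per-code-point widths, in the spirit of the `Wcwidth method. *)
let sum_widths cps =
  List.fold_left (fun acc cp -> acc + toy_wcwidth cp) 0 cps

(* Break after every ZWJ, in the spirit of the `No_zwj method. *)
let split_after_zwj cps =
  let rec go current acc = function
    | [] -> List.rev (if current = [] then acc else List.rev current :: acc)
    | cp :: rest ->
      let current = cp :: current in
      if cp = 0x200D then go [] (List.rev current :: acc) rest
      else go current acc rest
  in
  go [] [] cps

(* U+1F468 ZWJ U+1F469 ZWJ U+1F467: man, woman, girl joined by ZWJ. *)
let family = [ 0x1F468; 0x200D; 0x1F469; 0x200D; 0x1F467 ]

let () =
  assert (sum_widths family = 6);
  assert (split_after_zwj family
          = [ [ 0x1F468; 0x200D ]; [ 0x1F469; 0x200D ]; [ 0x1F467 ] ])
```

Under the toy table the summed width is 6 columns, whereas a method that composes the whole sequence into one grapheme would report a single cluster width instead.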

type line_break_kind = [
  | `LF
  | `CR
  | `CRLF
]

The type for line terminator kinds.

  • `LF — line feed (U+000A).
  • `CR — carriage return (U+000D).
  • `CRLF — the two-byte CR LF sequence.
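The three terminator kinds can be told apart with a one-byte lookahead. The following is a minimal sketch: line_break_at is a hypothetical helper written for this example, and the type is redeclared locally so the snippet stands alone.

```ocaml
(* Local copy of the variant so the example is self-contained. *)
type line_break_kind = [ `LF | `CR | `CRLF ]

(* Return the terminator starting at byte [i] of [s], if any, together
   with its length in bytes. CR followed by LF counts as one `CRLF. *)
let line_break_at s i : (line_break_kind * int) option =
  if i < 0 || i >= String.length s then None
  else
    match s.[i] with
    | '\n' -> Some (`LF, 1)
    | '\r' ->
      if i + 1 < String.length s && s.[i + 1] = '\n' then Some (`CRLF, 2)
      else Some (`CR, 1)
    | _ -> None

let () =
  assert (line_break_at "a\nb" 1 = Some (`LF, 1));
  assert (line_break_at "a\rb" 1 = Some (`CR, 1));
  assert (line_break_at "a\r\nb" 1 = Some (`CRLF, 2));
  assert (line_break_at "ab" 1 = None)
```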

Constants

val empty : t

empty is the empty glyph (0). It represents control characters, zero-width sequences, and U+0000. This is the only glyph for which is_empty is true.

val space : t

space is the space glyph (U+0020, width 1). It is the default blank-cell content in terminal grids.

Creating

val of_uchar : Uchar.t -> t

of_uchar u is a glyph for the single Unicode scalar u.

The result is empty for control or zero-width codepoints. Simple glyphs are stored directly in the packed integer with no pool allocation.

See also Pool.intern and Pool.encode.

Predicates

val is_empty : t -> bool

is_empty g is true iff g is empty.

val is_inline : t -> bool

is_inline g is true iff g requires no pool lookup. Useful for skipping reference counting on simple glyphs.

val is_start : t -> bool

is_start g is true iff g is the start of a character (simple or complex start).

val is_continuation : t -> bool

is_continuation g is true iff g is a wide-character continuation placeholder. See make_continuation.

val is_complex : t -> bool

is_complex g is true iff g is pool-backed (complex start or complex continuation).

Properties

val grapheme_width : ?tab_width:int -> t -> int

grapheme_width g is the full display width of the grapheme represented by g. For complex glyphs (start or continuation) the result is the total cluster width (1–4). For tab glyphs the result is tab_width.

tab_width defaults to 2.

See also cell_width.

val cell_width : t -> int

cell_width g is the display width that g occupies in a single cell. The result is 0 for empty and continuation cells. For start cells, the result is the character's display width (1 for most characters, 2 for wide CJK/emoji). Tab glyphs return 1.

Unlike grapheme_width, cell_width returns 0 for continuation cells, because they occupy no additional columns beyond the start cell.

val left_extent : t -> int

left_extent g is the distance from a continuation cell to its start cell. The result is 0 for simple and complex-start glyphs.

val right_extent : t -> int

right_extent g is the distance from a glyph to the rightmost continuation cell. For a complex start glyph this is width - 1.
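The relationship between the two extents and the two width functions can be modelled with plain records. This is a toy sketch, assuming only what the descriptions above state: toy_cell, span, toy_grapheme_width, and toy_cell_width are hypothetical names, and real glyphs are packed integers rather than records.

```ocaml
(* Toy cell carrying only its extents; real glyphs are packed integers. *)
type toy_cell = { left : int; right : int }

(* Lay out one character [width] columns wide: cell 0 is the start cell,
   the remaining cells are continuations. *)
let span width =
  List.init width (fun i -> { left = i; right = width - 1 - i })

(* In this model every cell of the span reports the full cluster width... *)
let toy_grapheme_width c = c.left + c.right + 1

(* ...while only the start cell (left extent 0) reports a cell width. *)
let toy_cell_width c = if c.left = 0 then toy_grapheme_width c else 0

let () =
  let cells = span 2 in
  assert (cells = [ { left = 0; right = 1 }; { left = 1; right = 0 } ]);
  assert (List.map toy_grapheme_width cells = [ 2; 2 ]);
  assert (List.map toy_cell_width cells = [ 2; 0 ])
```

Summing the toy cell widths over a span gives the number of columns it covers, which is why the continuation cells must contribute 0.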

val codepoint : t -> int

codepoint g is the Unicode codepoint of a simple glyph g (U+0000 – U+10FFFF).

Warning. The result is undefined for complex glyphs.

val pool_key : t -> int option

pool_key g is Some key if g is a pool-backed glyph (complex start or continuation), and None otherwise. The key is a stable, process-local identity for deduplicating interned grapheme references.

The key is only meaningful for glyphs originating from the same pool.

Construction

val make_continuation : code:t -> left:int -> right:int -> t

make_continuation ~code ~left ~right is a continuation cell referencing the same pool entry as code with the given left and right extents. left and right are clamped to [0;3]. If code is a simple glyph the continuation carries no pool reference.

Note. Intended for renderer and grid internals that materialize wide-cell spans.
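The clamping rule can be sketched on its own. The helpers below are hypothetical and say nothing about how the extents are actually packed into the glyph; they only show which left and right values a caller would end up with for each cell of a span.

```ocaml
(* Clamp an extent into the documented range [0;3]. *)
let clamp_extent n = max 0 (min 3 n)

(* Hypothetical helper: the (left, right) pair for cell [i] of a span
   that is [width] cells wide, after clamping. *)
let continuation_extents ~width i =
  (clamp_extent i, clamp_extent (width - 1 - i))

let () =
  (* Second cell of a two-column character. *)
  assert (continuation_extents ~width:2 1 = (1, 0));
  (* Out-of-range requests are pulled back into [0;3]. *)
  assert (clamp_extent 7 = 3);
  assert (clamp_extent (-1) = 0)
```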

Converting

val to_int : t -> int

to_int g is the raw integer representation of g.

Note. The integer layout is not a stable serialization format across major versions. Use for in-process storage only (e.g. Bigarray).

See also unsafe_of_int.

val unsafe_of_int : int -> t

unsafe_of_int n is n interpreted as a glyph without validation.

Warning. The caller must ensure n was produced by to_int or read from trusted storage. An invalid integer causes undefined behaviour in pool operations.

See also to_int.

Pool

A Pool.t manages the storage and lifecycle of complex glyphs (multi-codepoint grapheme clusters) through manual reference counting with generation-based use-after-free protection.

Warning. Pools are not thread-safe. Use one pool per thread or provide external synchronization.

module Pool : sig ... end

String utilities

Pool-free measurement and iteration on raw string values. These functions do not require a Pool.t.

module String : sig ... end