The Parseff.Utf8 module provides primitives that operate on Unicode code points (Uchar.t) instead of byte sequences (string). The input is still an OCaml string, but characters are decoded as UTF-8 sequences.
Primitives operate at the code point level. A base character followed by a combining accent is two separate code points. Parseff.Utf8.satisfy returns them individually. For grapheme-level parsing, compose the code-point primitives with a grapheme segmentation library like uuseg.
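To make the code point versus grapheme distinction concrete, here is a minimal sketch that uses only the OCaml standard library decoder (String.get_utf_8_uchar, available from OCaml 4.14), not Parseff itself. The helper name count_code_points is hypothetical and exists only for illustration.

```ocaml
(* Count the code points in a UTF-8 string using only the standard library.
   count_code_points is an illustrative helper, not part of Parseff. *)
let count_code_points (s : string) : int =
  let rec go pos n =
    if pos >= String.length s then n
    else
      let d = String.get_utf_8_uchar s pos in
      (* Each decode consumes 1 to 4 bytes; an invalid decode still consumes
         at least 1, so the loop always terminates. *)
      go (pos + Uchar.utf_decode_length d) (n + 1)
  in
  go 0 0

(* "e" followed by U+0301 COMBINING ACUTE ACCENT: one grapheme on screen. *)
let decomposed = "e\u{0301}"

let () =
  Printf.printf "bytes=%d code_points=%d\n"
    (String.length decomposed)
    (count_code_points decomposed)
```

The decomposed string occupies 3 bytes and decodes to 2 code points, even though it renders as a single character. A code-point-level primitive would see the base letter and the accent as two separate items.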
UTF-8 and byte-level primitives can be freely mixed in the same parser. Use byte-level Parseff.consume or Parseff.satisfy for ASCII structural tokens, and Parseff.Utf8 for multilingual text content:
let field () =
  let key = Parseff.take_while ~at_least:1
    (fun c -> c >= 'a' && c <= 'z')
    ~label:"key"
  in
  let _ = Parseff.consume ":" in
  Parseff.Utf8.skip_whitespace ();
  let value = Parseff.Utf8.take_while ~at_least:1
    ~label:"value"
    (fun u -> Uchar.to_int u <> 0x0A)
  in
  (key, value)

satisfy
Parseff.Utf8.satisfy decodes the next UTF-8 code point and tests it against a predicate. It advances by 1 to 4 bytes depending on the encoding.
val satisfy : (Uchar.t -> bool) -> label:string -> Uchar.t

(* Match a code point in the CJK Unified Ideographs block (U+4E00 to U+9FFF) *)
let cjk_char () =
  Parseff.Utf8.satisfy
    (fun u ->
      let i = Uchar.to_int u in
      i >= 0x4E00 && i <= 0x9FFF)
    ~label:"CJK character"

char
Parseff.Utf8.char matches an exact Unicode code point.
val char : Uchar.t -> Uchar.t

let lambda () = Parseff.Utf8.char (Uchar.of_int 0x03BB) (* λ *)
let arrow () = Parseff.Utf8.char (Uchar.of_int 0x2192) (* → *)

take_while
Parseff.Utf8.take_while consumes code points while the predicate holds and returns the matched UTF-8 bytes as a string. The optional ~at_least parameter counts code points, not bytes: users think in characters when using Unicode primitives, so the count matches that mental model.
val take_while : ?at_least:int -> ?label:string -> (Uchar.t -> bool) -> string

(* Parse a Unicode word *)
let word () =
  Parseff.Utf8.take_while
    Uucp.Alpha.is_alphabetic
    ~at_least:1
    ~label:"letter"
(* Parses "hello", "café", "東京", "Москва", etc. *)

skip_while
Parseff.Utf8.skip_while advances past code points without building a string. It is more efficient than Parseff.Utf8.take_while when you don't need the result.
val skip_while : (Uchar.t -> bool) -> unit

take_while_span
Parseff.Utf8.take_while_span returns a zero-copy Parseff.span instead of allocating a new string. Use Parseff.span_to_string to materialize the string when needed.
val take_while_span : (Uchar.t -> bool) -> span

skip_while_then_char
Parseff.Utf8.skip_while_then_char skips code points matching a predicate, then matches a specific terminating code point. It is more efficient than calling Parseff.Utf8.skip_while followed by Parseff.Utf8.char separately.
val skip_while_then_char : (Uchar.t -> bool) -> Uchar.t -> unit

The combinators below are built on top of the structural primitives using Unicode character properties from the uucp library.

letter
Parseff.Utf8.letter matches any Unicode alphabetic character using Uucp.Alpha.is_alphabetic. This covers Latin, Greek, Cyrillic, CJK, Arabic, Devanagari, and all other Unicode scripts.
let l = Parseff.Utf8.letter ()
(* Matches: 'a', 'é', 'λ', '中', 'д', 'ع', 'अ', ... *)

digit
Parseff.Utf8.digit matches ASCII digits 0-9 only and returns an int. Unicode digit categories (Nd) include Arabic-Indic, Devanagari, and other numeral systems where mapping to int is non-trivial. Keeping it ASCII-only makes the return value unambiguous. For Unicode digit handling, use Parseff.Utf8.satisfy with a custom predicate.
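As a rough illustration of the custom predicate suggested for Unicode digits, the sketch below maps three contiguous digit blocks to their integer values using only the standard library. The names digit_value and is_digit are hypothetical helpers, not part of Parseff, and the choice of blocks is an assumption made for the example.

```ocaml
(* Hypothetical helper, not part of Parseff: return the numeric value of a
   decimal digit code point for three contiguous digit blocks. *)
let digit_value (u : Uchar.t) : int option =
  let i = Uchar.to_int u in
  if i >= 0x0030 && i <= 0x0039 then Some (i - 0x0030) (* ASCII 0-9 *)
  else if i >= 0x0660 && i <= 0x0669 then Some (i - 0x0660) (* Arabic-Indic *)
  else if i >= 0x0966 && i <= 0x096F then Some (i - 0x0966) (* Devanagari *)
  else None

(* Predicate form, with the (Uchar.t -> bool) shape a satisfy-style
   primitive expects. *)
let is_digit (u : Uchar.t) : bool = Option.is_some (digit_value u)
```

A predicate like is_digit could be passed to Parseff.Utf8.satisfy together with a ~label, with digit_value then applied to the returned code point. Supporting further numeral systems means adding further ranges, which is the ambiguity the ASCII-only digit sidesteps.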

alphanum
Parseff.Utf8.alphanum matches a Unicode alphabetic character or an ASCII digit. It combines Uucp.Alpha.is_alphabetic with the ASCII digit range.

whitespace and skip_whitespace
Parseff.Utf8.whitespace and Parseff.Utf8.skip_whitespace use the full Unicode White_Space property (Uucp.White.is_white_space). This includes ASCII whitespace plus:
The ~at_least parameter on Parseff.Utf8.whitespace counts code points, not bytes.
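The difference between counting code points and counting bytes shows up quickly with non-ASCII whitespace. The sketch below uses a hand-written stand-in for Uucp.White.is_white_space, written from the Unicode White_Space property list so that the example runs with the standard library alone. Uucp remains the authoritative source, and leading_whitespace is a hypothetical helper rather than a Parseff function.

```ocaml
(* Hand-written stand-in for Uucp.White.is_white_space, listing the code
   points that carry the Unicode White_Space property. *)
let is_white_space (u : Uchar.t) : bool =
  match Uchar.to_int u with
  | 0x0009 | 0x000A | 0x000B | 0x000C | 0x000D | 0x0020 -> true
  | 0x0085 | 0x00A0 | 0x1680 -> true
  | i when i >= 0x2000 && i <= 0x200A -> true
  | 0x2028 | 0x2029 | 0x202F | 0x205F | 0x3000 -> true
  | _ -> false

(* Return (code_points, bytes) for the leading whitespace run of [s]. *)
let leading_whitespace (s : string) : int * int =
  let rec go pos n =
    if pos >= String.length s then (n, pos)
    else
      let d = String.get_utf_8_uchar s pos in
      if Uchar.utf_decode_is_valid d
         && is_white_space (Uchar.utf_decode_uchar d)
      then go (pos + Uchar.utf_decode_length d) (n + 1)
      else (n, pos)
  in
  go 0 0

let () =
  (* U+0020 SPACE (1 byte) followed by U+3000 IDEOGRAPHIC SPACE (3 bytes). *)
  let count, bytes = leading_whitespace " \u{3000}x" in
  Printf.printf "code_points=%d bytes=%d\n" count bytes
```

Here the whitespace run is 2 code points but 4 bytes long, so a minimum of ~at_least:2 counted in code points would be satisfied, while a byte-based count would give a different answer.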
(* Skip any Unicode whitespace before a value *)
Parseff.Utf8.skip_whitespace ();
let value = Parseff.Utf8.take_while ~at_least:1
  Uucp.Alpha.is_alphabetic
  ~label:"word"

is_whitespace
Parseff.Utf8.is_whitespace exposes the Unicode whitespace predicate for use with Parseff.Utf8.take_while or Parseff.Utf8.skip_while directly.
All UTF-8 primitives raise a parse error when they encounter an invalid byte sequence. This includes:
The error message is "invalid UTF-8" and the position points to the first invalid byte.
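The idea of reporting the first invalid byte can be sketched with the standard library decoder. This is not Parseff's implementation, only an illustration of what "position of the first invalid byte" means. The function name first_invalid_offset is hypothetical.

```ocaml
(* Return the byte offset of the first invalid UTF-8 sequence, if any.
   Illustrative helper built on the standard library, not part of Parseff. *)
let first_invalid_offset (s : string) : int option =
  let rec go pos =
    if pos >= String.length s then None
    else
      let d = String.get_utf_8_uchar s pos in
      if Uchar.utf_decode_is_valid d then go (pos + Uchar.utf_decode_length d)
      else Some pos
  in
  go 0

let () =
  (* "ok" followed by a lone continuation byte 0x80, which cannot start a
     sequence. *)
  match first_invalid_offset "ok\x80" with
  | Some pos -> Printf.printf "invalid UTF-8 at byte %d\n" pos
  | None -> print_endline "valid"
```

For the input above the reported offset is 2, the index of the stray 0x80 byte. A sequence that is cut short at the end of the input, such as the first two bytes of a three-byte character, is reported at the offset of its lead byte.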
Positions remain byte offsets, consistent with the rest of Parseff. A single Parseff.Utf8.satisfy call advances the position by 1 to 4 bytes depending on the UTF-8 encoding of the matched code point. Parseff.position always returns a byte offset.
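To see how byte offsets move as code points of different widths are consumed, the following sketch lists the starting offset of each code point in a string. It relies only on the standard library, and code_point_offsets is a hypothetical helper, not a Parseff function.

```ocaml
(* List the byte offset at which each code point of [s] starts. *)
let code_point_offsets (s : string) : int list =
  let rec go pos acc =
    if pos >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s pos in
      go (pos + Uchar.utf_decode_length d) (pos :: acc)
  in
  go 0 []

(* 'a' (1 byte), U+00E9 (2 bytes), U+4E2D (3 bytes), U+1F600 (4 bytes). *)
let sample = "a\u{E9}\u{4E2D}\u{1F600}"

let () =
  List.iter (Printf.printf "%d ") (code_point_offsets sample);
  print_newline ()
```

The four code points start at byte offsets 0, 1, 3, and 6, and the string is 10 bytes long. A position reported after the third code point would therefore be 6, not 3.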
All UTF-8 primitives work with streaming input (Parseff.parse_source and Parseff.parse_source_until_end). Multi-byte UTF-8 sequences that span chunk boundaries are handled correctly: the streaming handlers ensure enough bytes are available before decoding each code point.
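The chunk-boundary check can be pictured with a simplified sketch. It does not reflect how Parseff's streaming handlers are actually written. It only shows that the lead byte of a sequence tells a decoder how many bytes it needs before decoding is safe. Both sequence_length and needs_more_input are hypothetical names, and continuation bytes are deliberately not validated here.

```ocaml
(* Expected length of a UTF-8 sequence given its lead byte; None for bytes
   that cannot start a sequence. *)
let sequence_length (lead : char) : int option =
  let b = Char.code lead in
  if b < 0x80 then Some 1
  else if b >= 0xC2 && b <= 0xDF then Some 2
  else if b >= 0xE0 && b <= 0xEF then Some 3
  else if b >= 0xF0 && b <= 0xF4 then Some 4
  else None

(* True when the sequence starting at [pos] runs past the end of [chunk], so
   more input must be read before decoding.
   Precondition: 0 <= pos < String.length chunk. *)
let needs_more_input (chunk : string) (pos : int) : bool =
  match sequence_length chunk.[pos] with
  | Some n -> pos + n > String.length chunk
  | None -> false

let () =
  (* U+4E2D is encoded as E4 B8 AD; here it is split across two chunks. *)
  let chunk1 = "\xE4\xB8" and chunk2 = "\xAD" in
  if needs_more_input chunk1 0 then begin
    let d = String.get_utf_8_uchar (chunk1 ^ chunk2) 0 in
    Printf.printf "decoded U+%04X\n" (Uchar.to_int (Uchar.utf_decode_uchar d))
  end
```

Once the second chunk is appended, the three bytes decode to U+4E2D. Decoding the first chunk alone would instead look like a truncated, invalid sequence, which is why the availability check has to come first.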