utf8 (p.parseff.0.3.0.doc.utf8)

Mixing with byte-level primitives

UTF-8 and byte-level primitives can be freely mixed in the same parser. Use byte-level Parseff.consume or Parseff.satisfy for ASCII structural tokens, and Parseff.Utf8 for multilingual text content:

let field () =
  let key = Parseff.take_while ~at_least:1
    (fun c -> c >= 'a' && c <= 'z')
    ~label:"key"
  in
  let _ = Parseff.consume ":" in
  Parseff.Utf8.skip_whitespace ();
  let value = Parseff.Utf8.take_while ~at_least:1
    ~label:"value"
    (fun u -> Uchar.to_int u <> 0x0A)
  in
  (key, value)

`satisfy`

Parseff.Utf8.satisfy decodes the next UTF-8 code point and tests it against a predicate. On success it advances the position by 1 to 4 bytes, depending on how the matched code point is encoded.

val satisfy : (Uchar.t -> bool) -> label:string -> Uchar.t

(* Match any CJK Unified Ideograph *)
let cjk_char () =
  Parseff.Utf8.satisfy
    (fun u ->
      let i = Uchar.to_int u in
      i >= 0x4E00 && i <= 0x9FFF)
    ~label:"CJK character"
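To make the 1 to 4 byte range concrete, here is a minimal sketch that uses only the OCaml standard library (Uchar.utf_8_byte_length, assuming OCaml 4.14 or later), not Parseff itself. The helper name encoded_length is hypothetical and exists only for this illustration.

```ocaml
(* Standard library only; requires OCaml >= 4.14. Not part of Parseff. *)

(* Number of bytes the UTF-8 encoding of code point [cp] occupies.
   [encoded_length] is a hypothetical helper name. *)
let encoded_length (cp : int) : int =
  Uchar.utf_8_byte_length (Uchar.of_int cp)

let () =
  List.iter
    (fun (name, cp) ->
      Printf.printf "%s (U+%04X): %d byte(s)\n" name cp (encoded_length cp))
    [ ("LATIN SMALL LETTER A", 0x61);
      ("GREEK SMALL LETTER LAMDA", 0x03BB);
      ("CJK UNIFIED IDEOGRAPH-4E2D", 0x4E2D);
      ("GRINNING FACE", 0x1F600) ]
```

A predicate passed to Parseff.Utf8.satisfy never sees these byte counts; it only receives the decoded Uchar.t.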

`char`

Parseff.Utf8.char matches an exact Unicode code point.

val char : Uchar.t -> Uchar.t

let lambda () = Parseff.Utf8.char (Uchar.of_int 0x03BB)  (* λ *)
let arrow () = Parseff.Utf8.char (Uchar.of_int 0x2192)   (* → *)

`take_while`

Parseff.Utf8.take_while consumes code points while the predicate holds and returns the matched UTF-8 bytes as a string. The optional ~at_least parameter counts code points, not bytes: users think in characters when working with Unicode primitives, so the count matches that mental model.

val take_while : ?at_least:int -> ?label:string -> (Uchar.t -> bool) -> string

(* Parse a Unicode word *)
let word () =
  Parseff.Utf8.take_while
    Uucp.Alpha.is_alphabetic
    ~at_least:1
    ~label:"letter"

(* Parses "hello", "café", "東京", "Москва", etc. *)
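The difference between counting code points and counting bytes can be seen with the standard library alone. The sketch below assumes OCaml 4.14 or later (for String.get_utf_8_uchar) and does not use Parseff; count_code_points is a hypothetical helper name.

```ocaml
(* Standard library only; requires OCaml >= 4.14. Not part of Parseff. *)

(* Count the code points in a UTF-8 string by decoding one at a time.
   An invalid sequence is counted as a single item, because
   String.get_utf_8_uchar reports it as one replacement character. *)
let count_code_points (s : string) : int =
  let rec go i n =
    if i >= String.length s then n
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (n + 1)
  in
  go 0 0

let () =
  let s = "caf\xC3\xA9" in (* "café" written with explicit bytes *)
  Printf.printf "bytes: %d, code points: %d\n"
    (String.length s) (count_code_points s)
```

Under the rule stated above, it is the code point count, not the value of String.length, that ~at_least is compared against.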

`skip_while`

Parseff.Utf8.skip_while advances past code points without building a string. More efficient than Parseff.Utf8.take_while when you don't need the result.

val skip_while : (Uchar.t -> bool) -> unit

`take_while_span`

Parseff.Utf8.take_while_span returns a zero-copy Parseff.span instead of allocating a new string. Use Parseff.span_to_string to materialize when needed.

val take_while_span : (Uchar.t -> bool) -> span

`skip_while_then_char`

Parseff.Utf8.skip_while_then_char skips code points matching a predicate, then matches a specific terminating code point. More efficient than calling Parseff.Utf8.skip_while followed by Parseff.Utf8.char separately.

val skip_while_then_char : (Uchar.t -> bool) -> Uchar.t -> unit

Convenience combinators

These are built on top of the structural primitives using Unicode character properties from the uucp library.

`letter`

Parseff.Utf8.letter matches any Unicode alphabetic character using Uucp.Alpha.is_alphabetic. This covers Latin, Greek, Cyrillic, CJK, Arabic, Devanagari, and all other Unicode scripts.

let l = Parseff.Utf8.letter ()
(* Matches: 'a', 'é', 'λ', '中', 'д', 'ع', 'अ', ... *)

`digit`

Parseff.Utf8.digit matches ASCII digits 0 to 9 only and returns an int. The Unicode decimal digit category (Nd) includes Arabic-Indic, Devanagari, and other numeral systems where mapping to int is non-trivial, so keeping the combinator ASCII-only makes the return value unambiguous. For Unicode digit handling, use Parseff.Utf8.satisfy with a custom predicate.
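As an illustration of such a custom predicate, the sketch below handles one numeral system, the Arabic-Indic digits U+0660 to U+0669. Both helper names are hypothetical and not part of Parseff; the runnable part uses only the standard library, and the Parseff call is shown in a comment based on the satisfy signature documented above.

```ocaml
(* Hypothetical helpers, not part of Parseff. Standard library only. *)

(* True for ARABIC-INDIC DIGIT ZERO..NINE (U+0660..U+0669). *)
let is_arabic_indic_digit (u : Uchar.t) : bool =
  let i = Uchar.to_int u in
  i >= 0x0660 && i <= 0x0669

(* Numeric value of an Arabic-Indic digit, or None for any other code point. *)
let arabic_indic_digit_value (u : Uchar.t) : int option =
  if is_arabic_indic_digit u then Some (Uchar.to_int u - 0x0660) else None

(* With Parseff available, the predicate could be used roughly as:
     let u =
       Parseff.Utf8.satisfy is_arabic_indic_digit ~label:"Arabic-Indic digit"
     in
     arabic_indic_digit_value u *)

let () =
  match arabic_indic_digit_value (Uchar.of_int 0x0663) with
  | Some n -> Printf.printf "U+0663 has value %d\n" n
  | None -> print_endline "not an Arabic-Indic digit"
```

Other numeral systems would need their own ranges, or a lookup through a Unicode property library.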

`alphanum`

Parseff.Utf8.alphanum matches a Unicode alphabetic character or an ASCII digit. Combines Uucp.Alpha.is_alphabetic with the ASCII digit range.

`whitespace` and `skip_whitespace`

Parseff.Utf8.whitespace and Parseff.Utf8.skip_whitespace use the full Unicode White_Space property (Uucp.White.is_white_space). This includes ASCII whitespace plus:

NO-BREAK SPACE (U+00A0)
EN SPACE (U+2002), EM SPACE (U+2003)
IDEOGRAPHIC SPACE (U+3000)
and others

The ~at_least parameter on Parseff.Utf8.whitespace counts code points, not bytes.

(* Skip any Unicode whitespace before a value *)
let word_after_space () =
  Parseff.Utf8.skip_whitespace ();
  Parseff.Utf8.take_while ~at_least:1
    Uucp.Alpha.is_alphabetic
    ~label:"word"

`is_whitespace`

Parseff.Utf8.is_whitespace exposes the Unicode whitespace predicate for use with Parseff.Utf8.take_while or Parseff.Utf8.skip_while directly.

Invalid UTF-8

All UTF-8 primitives raise a parse error when they encounter an invalid byte sequence. This includes:

Bare continuation bytes (0x80 to 0xBF)
Invalid lead bytes (0xFE, 0xFF)
Overlong encodings
Truncated multi-byte sequences

The error message is "invalid UTF-8" and the position points to the first invalid byte.
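The categories listed above can be reproduced with the standard library decoder. This is a sketch under the assumption of OCaml 4.14 or later; it does not call Parseff, and first_invalid_offset is a hypothetical name. It reports the offset where the offending sequence starts, which may not match the position Parseff reports in every case.

```ocaml
(* Standard library only; requires OCaml >= 4.14. Not part of Parseff. *)

(* Byte offset of the first invalid UTF-8 sequence in [s], or None if the
   whole string is valid. [first_invalid_offset] is a hypothetical name. *)
let first_invalid_offset (s : string) : int option =
  let rec go i =
    if i >= String.length s then None
    else
      let d = String.get_utf_8_uchar s i in
      if Uchar.utf_decode_is_valid d then go (i + Uchar.utf_decode_length d)
      else Some i
  in
  go 0

let show s =
  match first_invalid_offset s with
  | None -> print_endline "valid UTF-8"
  | Some i -> Printf.printf "invalid UTF-8 at byte %d\n" i

let () =
  show "ok";              (* plain ASCII *)
  show "ab\x80";          (* bare continuation byte *)
  show "a\xFF";           (* invalid lead byte *)
  show "\xC0\xAF";        (* overlong encoding of '/' *)
  show "\xE4\xB8"         (* truncated three-byte sequence *)
```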

Position tracking

Positions remain byte offsets, consistent with the rest of parseff. A single Parseff.Utf8.satisfy call advances the position by 1 to 4 bytes, depending on the UTF-8 encoding of the matched code point. Parseff.position always returns a byte offset, never a code point index.
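To show how byte offsets drift away from character indices, the following sketch lists the byte offset at which each code point of a string starts. It uses only the standard library (OCaml 4.14 or later assumed), and code_point_offsets is a hypothetical helper, not a Parseff function.

```ocaml
(* Standard library only; requires OCaml >= 4.14. Not part of Parseff. *)

(* Byte offset at which each code point of [s] starts.
   [code_point_offsets] is a hypothetical helper name. *)
let code_point_offsets (s : string) : int list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (i :: acc)
  in
  go 0 []

let () =
  (* "aλ→" written with explicit bytes: 1 + 2 + 3 bytes *)
  let s = "a\xCE\xBB\xE2\x86\x92" in
  List.iter (Printf.printf "%d ") (code_point_offsets s);
  print_newline ()
```

The third character starts at byte 3, not at index 2, so an error position reported as a byte offset has to be converted before it is shown as a character column.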

Streaming support

All UTF-8 primitives work with streaming input (Parseff.parse_source and Parseff.parse_source_until_end). Multi-byte UTF-8 sequences that span chunk boundaries are handled correctly: the streaming handlers ensure that enough bytes are available before decoding each code point.
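Parseff's streaming handlers are not shown on this page, so the following is only an illustration of the general idea, not the library's implementation. It assumes the usual UTF-8 rule that the lead byte announces the sequence length; announced_length and bytes_missing are hypothetical names.

```ocaml
(* Illustration only; not Parseff's implementation. Standard library only. *)

(* Length in bytes that the lead byte [c] announces, or 1 for ASCII and for
   bytes that cannot start a sequence (a decoder would reject the latter). *)
let announced_length (c : char) : int =
  match Char.code c with
  | n when n >= 0xC2 && n <= 0xDF -> 2
  | n when n >= 0xE0 && n <= 0xEF -> 3
  | n when n >= 0xF0 && n <= 0xF4 -> 4
  | _ -> 1

(* How many more bytes must arrive before the code point starting at
   offset [i] of [chunk] can be decoded; 0 means it is already complete. *)
let bytes_missing (chunk : string) (i : int) : int =
  max 0 (i + announced_length chunk.[i] - String.length chunk)

let () =
  (* "é" is C3 A9; here the chunk ends right after the lead byte. *)
  Printf.printf "missing: %d\n" (bytes_missing "caf\xC3" 3)
```

A streaming decoder following this approach would request at least that many additional bytes from the source before attempting to decode.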
