In many cases it is appropriate to separate parsing and lexing. A lexer breaks up the input stream into tokens like identifiers, parentheses, numbers, strings, etc. The lexer usually also strips off whitespace. The parser handles the grammar of the language by using the tokens as primitives.
This approach has several advantages:
However, many combinator libraries offer no way to split the parsing task into a lexer and a parser. `Fmlib_parse` provides extensive support for separating lexing and parsing.
A lexer analyzes the input stream consisting of characters in the following way:
WS Token WS Token WS .... WS EOF
where WS is a possibly empty sequence of whitespace like blanks, tabs, newlines, comments etc. Token is a lexically correct token. EOF represents the end of the input stream.
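What counts as WS depends on the language. The following minimal sketch shows one possible definition, written with the OCaml standard library only and not with the combinators of Fmlib_parse. The function name `skip_ws` and the choice of `//` line comments are assumptions made for illustration.

```ocaml
(* Hypothetical whitespace definition: blanks, tabs, newlines and
   line comments starting with "//".  Returns the index of the first
   character which is not whitespace (or the length of the input). *)
let rec skip_ws (s : string) (i : int) : int =
  let n = String.length s in
  if i >= n then i
  else
    match s.[i] with
    | ' ' | '\t' | '\n' | '\r' -> skip_ws s (i + 1)
    | '/' when i + 1 < n && s.[i + 1] = '/' ->
        (* Skip to the end of the line, or to the end of input. *)
        let j =
          match String.index_from_opt s i '\n' with
          | Some k -> k
          | None -> n
        in
        skip_ws s j
    | _ -> i
```

Because WS may be empty, `skip_ws` returns its start index unchanged when the input already begins with a token.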
Since the lexer has to succeed immediately after recognizing a syntactically correct token, it is not a normal parser, which succeeds only after having seen the end of input. Therefore a lexer is a partial parser. After having successfully recognized a token, the lexer must be restartable to recognize the next token or the end of input.
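The idea of a restartable partial parser can be sketched by hand. The example below uses only the OCaml standard library, not Fmlib_parse. The token type and the names `scan`, `next_token` and `tokens` are hypothetical. Each call to `next_token` succeeds as soon as one token has been recognized and returns the position from which the next call restarts.

```ocaml
type token =
  | Number of int
  | Ident of string
  | LParen
  | RParen
  | End                          (* end of input *)

let is_digit c = '0' <= c && c <= '9'
let is_letter c = ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')

(* Advance from [i] while [p] holds; return the first index where it fails. *)
let rec scan p s i =
  if i < String.length s && p s.[i] then scan p s (i + 1) else i

(* Recognize one token starting at [pos]: skip whitespace, read the token
   and return it with the position where the next call must restart. *)
let next_token (s : string) (pos : int) : (token * int, string) result =
  let i = scan (fun c -> c = ' ' || c = '\t' || c = '\n') s pos in
  if i >= String.length s then Ok (End, i)
  else
    match s.[i] with
    | '(' -> Ok (LParen, i + 1)
    | ')' -> Ok (RParen, i + 1)
    | c when is_digit c ->
        let j = scan is_digit s i in
        Ok (Number (int_of_string (String.sub s i (j - i))), j)
    | c when is_letter c ->
        let j = scan is_letter s i in
        Ok (Ident (String.sub s i (j - i)), j)
    | c -> Error (Printf.sprintf "unexpected character %C at offset %d" c i)

(* Restart the lexer after every token until it reports [End]. *)
let tokens (s : string) : (token list, string) result =
  let rec go pos acc =
    match next_token s pos with
    | Error e -> Error e
    | Ok (End, _) -> Ok (List.rev (End :: acc))
    | Ok (tok, pos') -> go pos' (tok :: acc)
  in
  go 0 []
```

In this sketch the restart state is just an integer offset. A lexer built with a combinator library has to carry its own parser state across restarts instead.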
The easiest way to write a lexer with the help of Fmlib_parse is to use Fmlib_parse.Character and follow these steps:
Define the modules Token and Token_plus of the following form:
    module Token = struct
        type t =
            | T1 of ...
            | T2 of ...
            ...
            | End (* end of input *)
            ...
    end

    module Token_plus = struct
        type t = Position.range * Token.t
    end
Define a combinator whitespace which recognizes zero or more occurrences of whitespace. The definition of whitespace depends on the language.
Define combinators tok1, tok2, ... which recognize the individual tokens.
Use Fmlib_parse.Character.Make.lexer with the definition
    let token: Token_plus.t t =
        lexer
            whitespace
            Token.End
            (
                tok1 </> tok2 </> tok3 </> ...
            )
to have a combinator which recognizes tokens and strips off whitespace.
Use Fmlib_parse.Character.Make.make_partial and Fmlib_parse.Character.Make.restart_partial to make the lexer satisfy the interface Fmlib_parse.Interfaces.LEXER.
Look into https://github.com/hbr/fmlib/blob/master/src/parse/test_json.ml to see how this works in an example with a simple JSON parser.
Write the parser using Fmlib_parse.Token_parser, with Token.t as the primitive tokens. Look into the same example as above.
Use Fmlib_parse.Parse_with_lexer to generate the final parser, which scans a stream of characters, breaks the input up into tokens by using the lexer, and analyzes the grammar by using the token parser. See the same example as above.
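To illustrate the division of labor described in the steps above, here is a self-contained sketch of a parser which uses tokens as its primitives. It is written with the OCaml standard library only and does not use Fmlib_parse.Token_parser or Fmlib_parse.Parse_with_lexer. The concrete token constructors, the small grammar, the exception `Syntax_error` and the use of plain character offsets in place of Position.range are all assumptions made for illustration.

```ocaml
module Token = struct
  type t =
    | Number of int
    | Plus
    | LParen
    | RParen
    | End                        (* end of input *)
end

module Token_plus = struct
  (* Plain character offsets stand in for [Position.range]. *)
  type t = (int * int) * Token.t
end

exception Syntax_error of int    (* start offset of the offending token *)

(* Report the token at the head of [ts] as unexpected. *)
let fail (ts : Token_plus.t list) =
  match ts with
  | ((start, _), _) :: _ -> raise (Syntax_error start)
  | [] -> invalid_arg "token stream must end with Token.End"

(* Grammar over tokens:
     expr ::= term { Plus term }
     term ::= Number | LParen expr RParen
   Each function returns the computed sum and the remaining tokens. *)
let rec expr (ts : Token_plus.t list) : int * Token_plus.t list =
  let v, rest = term ts in
  sum v rest

and sum acc ts =
  match ts with
  | (_, Token.Plus) :: rest ->
      let v, rest = term rest in
      sum (acc + v) rest
  | _ -> (acc, ts)

and term ts =
  match ts with
  | (_, Token.Number n) :: rest -> (n, rest)
  | (_, Token.LParen) :: rest -> (
      match expr rest with
      | v, (_, Token.RParen) :: rest -> (v, rest)
      | _, rest -> fail rest)
  | _ -> fail ts

(* A complete parse must consume everything up to [End]. *)
let parse (ts : Token_plus.t list) : int =
  match expr ts with
  | v, [ (_, Token.End) ] -> v
  | _, rest -> fail rest

(* Tokens a lexer could produce for the input "1 + (2 + 3)". *)
let example : Token_plus.t list =
  [ ((0, 1), Token.Number 1); ((2, 3), Token.Plus); ((4, 5), Token.LParen);
    ((5, 6), Token.Number 2); ((7, 8), Token.Plus); ((9, 10), Token.Number 3);
    ((10, 11), Token.RParen); ((11, 11), Token.End) ]
```

The grammar functions never look at characters or whitespace. They see only tokens, and the attached ranges are used solely for error reporting.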