So what are we doing here?
Our aim is to write a simple parser from scratch and parse a number from a String. For this post let’s scope it down to positive numbers (unsigned integers).
The first question that arises is: why do we need to start from scratch? We could utilise existing libraries or regular expressions to achieve similar goals. There are times when delving deeper into the construction process is necessary for learning purposes.
In my first attempt to write a parser, I ended up writing my implementation top-down, writing implementations for things as I came across them and split them later. While the initial effort in doing it this way is lesser, the problems with testing these implementations became more and more complicated and I was repeating a lot of the effort. So what’s the alternative way? Can we build our parser from the ground up — reuse components?
There are multiple options for parsing and I won’t be diving into the choices (Recursive Descent Parsing, LL Parser, LR Parser, etc). For this post we are sticking to writing our parser using Parser Combinators.
Well, what do we need to make this happen?
What is the smallest block in the string we are trying to parse?
How do we build up blocks (ie) what operations/combinators do we need to make this happen?How do we interact with a parser?
How do we interact with a parser?
If we think about it, we are going to deal with some string, we recognize a pattern and we extract the information we are interested in.
What happens if there is more than what we need? For example — if the goal is to recognize a single digit from a string and we have a string — “1abc”, ideally we would want to split the string and get back ‘1’ and “abc”. In parsing we read characters till we get the token that we are interested in and return the remaining string (also known as Follow). If we get something else, we stop and return the follow.
Interface Parser<Token> {
function parse(value: String) -> (Token, String)
}
// Token is a generic data type that determines what the Parser returns
Parser<Token> = function (String) -> (Token, String)
// How you want to represent the parser will vary from language to language
Here (Token, String) is a tuple — a pair of values.
Our input is a String, output is a token that we are interested in along with the remaining string. A token is a single character that we are interested in.
Smallest block in a number
So what do we need to represent a number? A number is a combination of digits. Is that all we need?
Okay, so what’s a digit?
A digit can be represented as one of the characters from 0 to 9.
We have our smallest block — a single digit.
/ Here a Token is a parser that takes a single character as state.
// As a parser it is only interested in the one character
// that it is initialized with.
Token('1').parse("123") -> ('1', "23")
Token('2').parse("23") -> ('2', "3")
We have our happy path. What do we do if we don’t get the token we are interested in?
Token('1').parse("23") -> ?
Handling errors
Either we get a tuple of (Token, String) or we get an Error (check out Result type)
enum Result<T, E> {
Ok(T),
Err(E)
}
// a result has an Ok type and an error type.
// In this case T -> (Token, String)
// E -> String
This is what the Parser “interface” and return type would look like in Rust.
/ T is a generic that represent any data type
pub trait Parser<T> {
fn parse_from(&self, val: &String) -> ParseResult<T>;
}
pub type ParseResult<T> = Result<(T, String), String>;
Token('1').parse("123") -> Ok( ('1', "23") )
Token('1').parse("23") -> Error("Couldn't find 1")
Parsing a digit
We have 10 different digits we are interested in. How do we combine 10 different blocks?
This leads us up to our first operation/combinator!
A combinator is a higher order function that takes one or more parsers and returns a new parser.
function combinator1(p1: Parser<X>) -> Parser<Z>
function combinator2(p1: Parser<X>, p2: Parser<Y>) -> Parser<Z>
// Parser<X> is a function on (String) -> (X, String)
// Both 1 and 2 are valid combinators
// X, Y, Z can be any token or combination of tokens
How we choose to represent a combinator will change based on the language you choose, just like the Parser itself. The way I represented a combinator was as something that was a Parser — took one or more parsers and did some operation using these parsers.
Can a combinator take anything other than Parsers? A combinator can take other functions as input — it’s a higher order function after all.
OR
A digit can be represented as
digit_parser = Token('0') or Token('1') or .. or Token('9')
digit_parser = or([Token('0'), Token('1'), Token('2'), ..., Token('9')])
digit_parser = or(Token('0'), or(Token('1'), or(Token('2'), ....
// all of them are valid representations
// Example 1 uses 'or' as an operator
// Example 2 is a function that takes a list of parsers
// Example 3 is a function that takes 2 parser, so we end up chaning all 10
where or represents the OR combinator. So what we are saying here is that a digit is 0 or 1 or so on.
If we have a string “123”, what we want to do is read one character at a time and change that into a digit. So the first time we run the string through our parser we get (‘1’, “23”). The 2nd time gives us (‘2’, “3”) and the 3rd time gives us (‘3’, “”).
digit_parser = Token('1') or Token('2') ....
digit_parser.parse("123") -> Ok( ('1', "23") )
digit_parser.parse("23") -> Ok( ('2', "3") )
digit_parser.parse("abc") -> Err("Could not find a digit")
The general idea is if you find one of the tokens you are looking for, then you return that token and the follow. If none of them match then you return an Error.
This is what the operation would look like in Rust. We take two Parsers (doesn’t matter what the parser is interested in), and try to parse the input string. If the 1st parser is successful you would return Either::Left or if the 2nd parser is successful you return Either::Right. Error if neither matches. An Either is how we represent an OR here 🙂
pub struct OrParser<P1, P2>(P1, P2);
// Implement Parser<Either<U,V>> for OrParser
// The OrParser takes in two parsers as it's input.
// The implementation makes use of the parsers and
// returns the first valid result that is available
impl<P1, P2, U, V> Parser<Either<U, V>> for OrParser<P1, P2>
where
P1: Parser<U>,
P2: Parser<V>,
{
fn parse_from(&self, val: &String) -> ParseResult<Either<U, V>> {
self.0
.parse_from(val)
.map(|(val, follow)| (Either::Left(val), follow))
.or_else(|_| {
self.1
.parse_from(val)
.map(|(val, follow)| (Either::Right(val), follow))
})
}
}
Notice that what we are parsing is a digit character. We will need to convert these into an unsigned integer. How do we do it?
Map
What does a Map combinator do?
Before we dive into that, what does a map do in the normal world? A map is a higher order function that takes a Collection (list or a set) of items and a function. The function is applied on each item to produce an output.
The concept of a map is not limited to lists/sets and can be applied to other abstract containers.
map(Collection[U], function U -> V) -> Collection[V] // U is the input // The function transforms U to V // V is the output
list = [1, 2, 3, 4] double = x -> 2*x // function that doubles the value of input new_list = map(list, double) // [2, 4, 6, 8]
In the above example, both the input and output are a list of numbers. But all the items in the new list were doubled in value.
What does a Map combinator do?
Well it’s pretty much the same thing as our traditional map, but now instead of a Collection we have a Parser as a input.
// Parser(A) with a function(A) -> B gives us Parser(B)
// A and B can be any Token or combination of Tokens
function map(parser: Parser(A), mapper: function(A) -> B) -> Parser(B)
What we are looking for is mapping the character to a unsigned integer that represents the digit:
'0' -> 0
'1' -> 1
'2' -> 2
..
..
'9' -> 9
If we are representing the Digit as an enum of integer values from 0 to 9, then digit would have a method like:
enum Digit {
...
function from(value: Char) -> Digit;
// Digit is an enum of 0 to 9
}
The final digit parser:
raw_digit_parser = Token('1') or Token('2') ....
digit_parser = map(raw_digit_parser, Digit::from)
digit_parser.parse("123") -> Ok( (Digit(1), "23") )
digit_parser.parse("23") -> Ok( (Digit(2), "3") )
We can recognize a digit now! Onto the next hurdle — recognize all digits in a number 🙂
Zero or Many
The idea behind Zero or Many is pretty simple — if there are multiple instances of the same token — Zero or More -> parse them and return all of them. This is also popularly known as Many0 in combinator libraries.
Many0(Token('1')).parse("1112") -> Ok( (['1', '1', '1'], "2") )
Many0(digit_parser).parse("123") -> Ok( ([Digit(1), Digit(2), Digit(3)] , "") )
Since we are okay with 0 tokens being available in the string that is being parsed, a Many0 parser will never error out.
Many0(Token('1')).parse("234") -> Ok( ([], "234") )
Great, is it in line with what we need from a number?
Close, but not exactly — a number has at least one digit. Well technically a number has to start with a non-zero digit but for the purposes of this exercise let’s assume that prefixed 0’s will be ignored.
Before we think about this hurdle, let’s talk about AND. We already covered OR but what does AND mean in the context of a parser?
AND
Let’s say we want to parse one digit and an alphabet.
alphabet = Token('a') or Token('b') .... or Token('z') or Token('A') .. or Token('Z')
parser = digit and alphabet
// parser = and(digit, alphabet) is also a valid combinator
parser.parse("1a") -> Ok(('1', 'a'), "")
parser.parse(".$") -> Error("Could not find a digit")
So it’s the same as saying parse with a digit, get the remaining string and run it through the 2nd parser. This gives us two tokens and the remaining string.
This is how I represented an AND for two parsers in Rust.
A tuple of parser represents an AND and the idea is to pass the follow from the first parse operation onto the 2nd parser. If there are failures along the way, there is an early exit and the parser returns Error whenever it can.
impl<P1, P2, X, Y> Parser<(X, Y)> for (P1, P2)
where
P1: Parser<X>,
P2: Parser<Y>,
{
fn parse_from(&self, val: &String) -> ParseResult<(X, Y)> {
self.0.parse_from(val).and_then(|(x, follow)| {
self.1
.parse_from(&follow)
.map(|(y, remaining)| ((x, y), remaining))
})
}
}
One or Many
As the name suggests, the parser expect atleast one instance of the token to be available in the string. This is commonly known as a Many1 Parser.
Okay, can we now think of a way to represent a Many1 with whatever we have so far?
Many1(parser) = parser and Many0(parser)
If we have atleast one token, then the first parser works, Many0 returns the remaining. If there are zero tokens, then the first parser fails and errors out.
many1_digit = Many1(digit)
many1_digit.parse("123") -> Ok( (('1', ['2', '3']), ""))
many1_digit.parse("abc") -> Error("Digit not found")
Note that a Many1 based on our currently implementation provides a tuple of (Token, List[Token]) and what we need is a single list. We would ideally want to flatten this to just one List[Token].
So here we want a Function that takes in a tuple of (Token, List[Token]) and return us List[Token]
concat = ( token: Token, list : List[Token] ) -> [token].concat(list)
// A-> (Token, List[Token])
// B-> (List[Token])
digits = map(many1_digit, concat)
digit.parse("123") -> Ok( ([Digit(1), Digit(2), Digit(3)], "") )
Putting everything together
The final piece we need is a way to “map” a list of digits into a number.
['1', '2', '3'] -> 123
How would we go about doing that?
123 -> 100 * 1 + 2 * 10 + 3 * 1
For a list of digits, the map function we are looking for would look something like this:
reduce_digits = (digits) -> digits.reduce(0, (acc, d) -> acc*10 + d)
Given some digits, reduce it in a way where we start with our state as 0 and keep multiplying it by 10 (radix) and add the digit.
[1, 2, 3]
1 -> 0*10 + 1 -> 1
2 -> 1*10 + 2 -> 12
3 -> 12*10 + 3 -> 123
The final parser looks something like this:
raw_digit_parser = Token('0') or Token('1') or Token('2') or .. or Token('9')
digit_parser = map(digit_parser, Digit::from)
concat = ( token: Token, list : List[Token] ) -> [token].concat(list)
digits_parser = map(many1(digit_parser), concat)
reduce_digits = (digits) -> digits.reduce(0, (acc, d) -> acc*10 + d)
number_parser = map(digits_parser, reduce_digits)
number_parser.parse("123") -> Ok( (123, "") )
number_parser.parse("123abc") -> Ok( (123, "abc") )
Conclusion
While it might seem like a lot of work to parse something as simple as a number, the underlying idea is that once you have all the smaller pieces in place, creating a parser on top of the pieces is as simple as declaring them.
Think about this:1If we wanted to move onto parsing a negative number — what is the effort to get that done?
If we wanted to move onto parsing a negative number — what is the effort to get that done?
Parse an integer
a floating point number
I’d like to extend a special shoutout to Sudarsan, whose guidance and enthusiasm introduced me to the fascinating world of parser combinators. Here’s to the mentors who ignite our curiosity and propel us forward in our exploration of new concepts and technologies.
Code sample for the Number parser in Rust
Nom — a Rust combinator library
Fparsec — F# combinator library
Building a useful set of parser combinators