Lexical Analysis
Overview of lexers and how to use them
Lexical Analysis
Converting the sequence of characters into a sequence of tokens for the language grammar.
Theory
Programming languages have a finite set of keywords and rules for defining numbers and identifiers.
These are most easily modeled as a regular language, so you can use regex and DFA to recognize the tokens.
A short refresher on RE
- a matches a
- ab matches ab
- a|b matches a or b
- a* matches a zero or more times
- a? matches a or empty
- [0-9] matches any digit
- a{m,n} matches between m and n repetitions of a
What we can do with regular expressions
- We can define a regular language.
- We can recognize words from a regular language.
- We can find patterns in a sequence of characters.
What we can’t do with regular expressions
- We can’t recognize a{n}b{2n}
- We can’t recognize if something is valid HTML
- We can’t recognize if something is valid (place here almost every programming language)
Token
A token is a single terminal symbol for the language grammar.
Tokens have a type and a value - e.g. 3.14 is a number and has the value 3.14
Tokens
Typical tokens are:
- keyword - if, else, for, while, do, function
- number - integer - 42, double - 3.14, float - 3.14f
- string
- operator - ==, +
- symbol - (, ) - it might be easier to split these into LPAREN, RPAREN.
- identifier - starts with [a-zA-Z_]. Why? Because it is easy to distinguish between identifiers, numbers, strings and symbols by looking at the first character
How to implement a token?
What goes into a token
- type
- value
- file, line and column
Token Type
enum TokenType
{
LeftParen,
RightParen,
// ...
While,
// ...
True,
False,
Eof,
};
Token struct
struct Token
{
TokenType Type;
unsigned Line;
IPLString Lexeme;
double Number = 0.0;
};
double + string ~ wasteful
The token is either a string or a number, but we are keeping both, so it is somewhat wasteful.
Keeping data in token
- std::variant<double, IPLString> - will require C++17
- union - doesn’t work well with strings - non-trivial destructor inside the union
- string_view over the source contents -> C++17 - this means we have to keep the contents of the source alive as long as any token is still alive
- tokens can just keep strings and leave the parser to turn them into numbers
Lexer
The lexer is the component of the compiler that does lexical analysis.
How a lexer works
Reads the source character by character and tries to match that against the regexes that define the language.
- Generally the longest match is preferred, so that >= is preferred over >.
How to write a lexer
- generate it with a tool
- write one by hand
- for an existing language
- search in google: “Lexical grammar for language you prefer”
- find an open-source implementation of the language
- it might be in a format ready for some tool
- for a custom language write down on paper the lexical grammar
- define your keywords
- rules for identifiers, numbers, strings, …
- …
Generate a lexer
Tools use finite automata to recognize the tokens. When a token is recognized, a user action code is executed, so that the type and the value of the token can be stored.
Tools
- flex - uses tables for the states of the automaton
- re2c - generates code for the automaton in C
- claims easier debuggability, but still the generated code is pretty complex
- some parser generators have a built-in lexer as well
Flex
definitions
%%
rules
%%
user code
%{
/* include verbatim code */
#include <math.h>
%}
DIGIT [0-9]
ID [a-z][a-z0-9]*
%%
{DIGIT}+ {
printf("An int:%s:%d\n", yytext, atoi(yytext));
}
{DIGIT}+"."{DIGIT}* {
printf("A float:%s:%g\n", yytext, atof(yytext));
}
if|then|while|do|for|function {
printf("A keyword:%s\n", yytext);
}
{ID} printf("An identifier:%s\n", yytext);
Input
https://github.com/SofiaCPP/IPL/blob/master/demo/flex/JavaScript.flex
Output
https://github.com/SofiaCPP/IPL/blob/master/demo/flex/lex.yy.c
Disclaimer
Default options used!
flex is quite customizable, either by options, or by defines for the file.
We have used the default options for simplicity.
re2c
- re2c.org
- https://github.com/SofiaCPP/IPL/tree/master/spasm/src/asm
- lexer.re
- dfa.svg
Debugging tool-generated code
- pretty much impossible in the generated code
- the tool is created to allow defining the lexer in a high-level language
Debug the high-level language definition, not the generated code
Writing a lexer by hand
- Start with the lexical grammar
- Read input symbol by symbol, for each symbol follow rules that are still possible
- Continue reading until no rule can match any longer -> emit the last possible token
Writing a lexer by hand
For our sample lexical grammar:
// c is the current character; the read_* helpers consume input and advance c
while (c != '\0') {
    if (isdigit(c)) {
        emit_number(read_number()); // consumes the whole digit run
    }
    else if (c >= 'a' && c <= 'z') {
        auto word = read_word(); // consumes the whole [a-z]+ run
        if (is_keyword(word)) {
            emit_keyword(word);
        } else {
            emit_identifier(word);
        }
    }
    else { error(); }
}
Keyword vs identifier
The set of keywords is fixed and finite.
- it is possible to create a perfect hash
- gperf is a tool for that
- use a good data-structure that will allow for very fast look-up
Demo for gperf
https://github.com/SofiaCPP/IPL/tree/master/demo/gperf
Note
Some languages do not have/need a grammar for compilation
- assembler languages
- Languages based on reverse or prefix notation
- Forth
Lexer in JSImpl
Lexer.cpp
- by hand
- std::unordered_map to recognize keywords
Homework
Create a xxx2html syntax highlighter for a language of your choice.