Converting the sequence of characters into a sequence of tokens for the language grammar.
Programming languages have a finite set of keywords and rules for defining numbers and identifiers.
These are most easily modeled as a regular language, so you can use regex and DFA to recognize the tokens.
a*
matches a zero or more timesA token is a single terminal symbol for the language grammar.
Tokens have type and value - i.e. 3.14
is a number and has value 3.14
Typical tokens are:
==
, +
,(
, )
,it is easy to distinguish between identifiers, numbers, strings and symbols by looking at the first character
enum TokenType
{
LeftParen,
RightParen,
// ...
While,
// ...
True,
False,
Eof,
};
struct Token
{
TokenType Type;
unsigned Line;
IPLString Lexeme;
double Number = 0.0;
};
The token is either a string or a number, but we are keeping both, so it is somewhat wasteful.
std::variant<double, IPLString
will require C++17union
does’t work good with stringsunion
union
with string_view
over the source contents -> C++17The lexer is the component of the compiler that does lexical analysis.
Reads the source character by character and tries to match that against the regexes that define the language.
>=
is preferred over >
.Tools use finite automata to recognize the tokens. When a token is recognized, an user action code is executed, so that the type and the value of the token can be stored.
definitions
%%
rules
%%
user code
%{
/* include verbatim code */
#include <math.h>
%}
DIGIT [0-9]
ID [a-z][a-z0-9]*
%%
{DIGIT}+ {
printf("An int:%s:%d\n", yytext, atoi(yytext));
}
{DIGIT}+"."{DIGIT}* {
printf("A float:%s:%g\n", yytext, atof(yytext));
}
if|then|while|do|for|function {
printf("A keyword:%s\n", yytext);
}
{ID} printf("An identifier:%s\n", yytext);
https://github.com/SofiaCPP/IPL/blob/master/demo/flex/JavaScript.flex
https://github.com/SofiaCPP/IPL/blob/master/demo/flex/lex.yy.c
Default options used!
flex
is quite customizable, either by options, or by defines for the file.
We have used the default options for simplicity.
Debug the high-level language definition, not the generated code
For our sample lexical grammar:
while (c != '\0') {
if (isdigit(c)) {
emit_number();
}
else if (c >= 'a' && c <= 'z') {
auto word = read_word();
if (keyword(word)) {
emit_keyword(word);
} else {
emit_identifier(word);
}
}
else { error(); }
}
The set of keywords is fixed and finite.
Some languages do not have/need grammar for compilation
Lexer.cpp
std::unordered_map
to recognize keywordsCreate a xxx2html syntax highlighter for a language of your choice.