Recognizes the rules of the language and builds the Abstract Syntax Tree (AST) of the program from the stream of tokens.
The grammar of a languages defines the structure of correct sentences and how to derive their meaning.
Context Free Grammars are:
<expr> ::= <expr> <op> <expr> | (<expr>) | <term>
<op> ::= + | - | * | /
<term> ::= [0-9]+
Does this grammar allow for operator precedence? (No, more later)
The grammar has two alphabets:
In the expression grammar:
Notation for describing context-free grammars.
<expr> ::= <expr> <op> <expr> | "(" <expr> ")" | <term>
<op> ::= "+" | "-" | "*" | "/"
<term> ::= <digit> | <digit><term>
<digit>::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
Extended BNF for more compact represenation of grammars. There are different specifications for EBNF, but they have the same power, just different syntax.
https://www.w3.org/TR/REC-xml/#sec-notation
expr = expr , op , expr | "(" , expr , ")" | term
op = "+" | "-" | "*" | "/"
term = { digit }
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
Parser Expression Grammars are similar to CFG, but are more convenient for parsing, since the choice operator is not ambiguous.
/
(not |
) and it is not commutative<expr> ::= <sum>
<sum> ::= <prod> ([+-] <prod>)*
<prod> ::= <value> ([*/] <value>)*
<value> ::= [0-9]+ / '(' <expr> ')'
Where the else
goes?
<if> ::= if <expr> <stmnt> else <stmnt>
/ if <expr> <stmnt>
if x0 if x1 s1 else s2
if x0 { if x1 s1 } else s2 // 1
if x0 { if x1 s1 else s2 } // 2
function answer() {
return 6*7;
}
function
answer
(
)
{
return
*
;
}
The tree that starts from the grammar inital non-terminal and generates a string in the language.
AST differ from CST because superficial distinctions of form, unimportant for translation, do not appear in AST.
We will be focusing solely on AST, since it is used for translation and it easy to skip the CST and generate directly AST.
Most implementations use a common base class for each Node type and a derived class for each specific node
struct Node { virtual ~Node() = 0; };
typedef std::unique_ptr NodePtr;
for
cyclestruct ForNode : Node {
virtual ~ForNode() override = default;
NodePtr m_Init;
NodePtr m_Condition;
NodePtr m_Increment;
NodePtr m_Body;
};
virtual
method for each operation?switch
for each node type?No, adding a new algorithm or a node will be difficult in both cases.
Represent an operation to be performed on elements of an object structure.
Visitor lets you define a new operation without changing the classes of the elements on which it operates.
struct NodeVisitor {
virtual ~NodeVisitor() = 0;
virtual Visit(IfNode& node) = 0;
virtual Visit(ForNode& node) = 0;
// etc
};
struct NodeVisitor;
struct Node {
virtual ~Node() = 0;
virtual void Visit(NodeVisitor& visitor) = 0;
};
for
cyclestruct ForNode : Node {
virtual ~ForNode() override = default;
virtual void Visit(NodeVisitor& visitor) override {
visitor.Visit(*this);
}
NodePtr m_Init;
NodePtr m_Condition;
NodePtr m_Increment;
NodePtr m_Body;
};
Macro iterators
Variadic Templates
C++Future - metaclasses
There a lot of algorithms for parsing grammars, with different time / memory tradeoffs
The algorithms can be:
Most of the algorithms require making the grammar follow a specific form and then explain how to create a parser for the language.
These are the major grammar forms and parsing algorithms. While they are not exactly the same in terms of algorithms and power.
Disclaimer: It is possible to write every parser manually, but we’ll be discussing:
When we get to generated parsers, will be discussing bottom-up parsers as well.
How to write a hand-crafted parser?
for each rule that has E as a symbol on the left, make a case in the function that parser the rule.
if
and match(XXX)
or check whether a subexpression
has been parsed.*
(repeats) are translated to while (match(XXX))
require(XXX)
Where XXX
is a token (aka terminal).
<expr> ::= <sum>
<sum> ::= <prod> ([+-] <prod>)*
<prod> ::= <value> ([*/] <value>)*
<value> ::= [0-9]+ / '(' <expr> ')'
The higher the precedence -> the lower in the tree
Lower precedence non-terminals generate higher ones
<sum> ::= <prod> ([+-] <prod>)*
<prod> ::= <value> ([*/] <value>)*
Left associative operators go deep (loop) on the left
Right associative operators go deep (loop) on the right
Or can be handled in the parser
<sum> ::= <prod> [+-] <prod> | <sum> [+-] <prod>
// <expr> ::= <sum>
AST parse_expr() {
return parse_sum();
}
// <sum> ::= <prod> ([+-] <prod>)*
AST parse_sum() {
AST left = parse_prod();
auto token = next_token();
while (token == '+' || token == '-') {
AST right = parse_prod();
left = make_op(token, left, right);
}
return left;
}
// <prod> ::= <value> ([*/] <value>)*
AST parse_prod() {
AST left = parse_value();
auto token = next_token();
while (token == '*' || token == '/') {
AST right = parse_value();
left = make_op(token, left, right);
}
return left;
}
// <value> ::= [0-9]+ / '(' <expr> ')'
AST parse_value() {
auto token = next_token();
if (token == NUMBER) {
return make_number(token);
}
else if (token == '(') {
AST expr = parse_expr();
require_token(')');
return expr;
}
throw error;
}
int precedence(token) {
switch (token) {
case EOF: return 0;
case '(': return 2;
case ')': return 5;
case '+':
case '-': return 10;
case '*':
case '/': return 20;
}
}
AST parse_expr() {
stack<AST> output;
stack<Token> operators;
do {
auto token = next_token();
handle_token(token, output, operators);
} while (token != EOF);
AST result = output.top();
output.pop();
assert(output.empty());
return result;
}
void handle_token(token, output, operators) {
if (token == NUMBER) {
output.push(make_number(token));
} else {
handle_operator(token, output, operators);
}
}
void handle_operator(token, output, operators) {
auto prec = precedence(token);
while (!operators.empty() &&
(precedence(operators.top()) >= prec)) {
output_tree(operators.top());
operators.pop();
}
if (token == ')') {
assert(operators.top() == '(');
operators.pop();
} else if (token != EOF) { // '(' or an operator
operators.push(token);
}
}
void output_tree(Token operator, output) {
auto rhs = output.top();
output.pop();
auto lhs = output.top();
output.pop();
output.push(make_op(operator, lhs, rhs));
}
Most hand-written parsers use combination of recursive descent and operator precendence for expression for better performance when parsing expressions.
Extend the xxx2html tool to:
Do only one of these.