A formal language, consisting of a set of instructions and used to implement programs.
A set of instructions used to control the behaviour of a machine.
Non human language.
A set of strings and symbols together with a set of rules.
Most often the tokens are defined with regular grammar over the alphabet.
The rules for composing sentences from words.
Most often the sentences are defined with context free grammar over the tokens.
Takes a program written in a high-level programming language and translates it to machine language.
Compiler + linker = executable
Three-stage compiler structure
Understands the program?
Builds an Abstract Syntax Tree (AST) of the program.
Recognizes all the tokens in the program. Most often that involves regular expressions and finite automata.
Recognizes the rules from stream of tokens and builds the AST of the program.
The grammar of a languages defines the structure of correct sentences and how to derive their meaning.
Context Free Grammars
<expr> := <expr> <op> <expr> | (<expr>) | <term>
<op> := + | - | * | /
<term> := [0-9]+
Parser Expression Grammars are similar to CFG, but are more convenient for
parsing, since the |
operator is not ambiguous.
<expr> := <sum>
<sum> := <prod> ([+-] <prod>)*
<prod> := <value> ([*/] <value>)*
<value> := [0-9]+ / '(' <expr> ')'
Where the else goes?
<if> := if <expr> <stmnt> else <stmnt>
| if <expr> <stmnt>
if x0 if x1 s1 else s2
if x0 { if x1 s1 } else s2 // 1
if x0 { if x1 s1 else s2 } // 2
LR(n) and LALR(n) are family of parsing algorithms for CFG
function answer() {
return 6*7;
}
function
answer
(
)
{
return
*
;
}
https://esprima.org/demo/parse.html?code=function%20answer()%20%7B%0A%20%20%20%20return%206%20*%207%3B%0A%7D
{
"type": "FunctionDeclaration",
"id": { "type": "Identifier", "name": "answer" },
"body": {
"type": "BlockStatement",
"body": [
{
"type": "ReturnStatement",
"argument": {
"type": "BinaryExpression",
"operator": "*",
"left": { "type": "Literal", "value": 6 },
"right": { "type": "Literal", "value": 7 }
}
}
]
},
}
Takes the AST and transforms that to some Intermediate Representation (IR) that is convenient for:
int mul_add(int x, int y, int z) {
return x * y + z;
}
define i32 @mul_add(i32 %x, i32 %y, i32 %z) {
entry:
%tmp = mul i32 %x, %y
%tmp2 = add i32 %tmp, %z
ret i32 %tmp2
}
WebAssembly is actually an IR.
(module
(func (export "add") (param $n1 i32) (param $n2 i32) (result i32)
get_local $n1
get_local $n2
i32.add
)
)
C#, F#, Basic (some versions), etc target the .NET virtual machine
Common Intermediate Language
.class public Foo
{
.method public static int32 Add(int32, int32) cil managed
{
.maxstack 2
ldarg.0 // load the first argument;
ldarg.1 // load the second argument;
add // add them;
ret // return the result;
}
}
Takes the IR and produces native code for a particular machine
Like the compiler, but instead of producing machine code, executes the program instruction by instruction translating from program instructions to machine instructions.
Takes program written in one language and translates that to another language.
Lots of tools can be created based on the AST representation of a program
Takes machine code and executes that on the real machine.
Typically the VM implements most of the language features like automatic memory collection and most of the standard library of the language.
Calculating an expression using postfix notation is actually an interpreter for expressions.
vector<Token>
vector<Token>
-> AST