Programming Languages, Compilers, Interpreters
Overview of compiler and interpreter architecture.
The dragons
- Programming languages
- Architectures of
  - compilers
  - interpreters
  - virtual machines
Programming languages
A formal language, consisting of a set of instructions and used to implement programs.
Program
A set of instructions used to control the behaviour of a machine.
Formal language
Not a natural (human) language.
A set of strings and symbols together with a set of rules.
String and symbols
- The language has an alphabet and words.
- The words are also known as tokens.
Most often the tokens are defined with a regular grammar over the alphabet (a small example follows the list below):
- regular grammars
- regular expressions
- finite automata
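A small sketch of the idea, written for this overview: token classes such as identifiers and numbers can be recognized with regular expressions (std::regex here; a real lexer would typically use a generated or hand-written automaton).

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Hypothetical token classes, each described by a regular expression.
    const std::regex identifier("[A-Za-z_][A-Za-z0-9_]*");
    const std::regex number("[0-9]+");

    for (const std::string word : {"answer", "42", "6*7"}) {
        if (std::regex_match(word, identifier))
            std::cout << word << " -> identifier\n";
        else if (std::regex_match(word, number))
            std::cout << word << " -> number\n";
        else
            std::cout << word << " -> not a single token\n";
    }
}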
Rules
The rules for composing sentences from words.
Most often the sentences are defined with a context-free grammar over the tokens.
Compiler
Takes a program written in a high-level programming language and translates it to machine language.
Building an executable
Compiler + linker = executable
Compiler
- Takes a program in a high-level language
- Understands the program meaning
- Generates a machine language program with the same meaning
Compiler architecture
Three-stage compiler structure
- Front end
  - understands the program
- Middle
  - does the optimizations and removes syntactic sugar
- Back end
  - generates the machine code
Front end
Understands the program?
Builds an Abstract Syntax Tree (AST) of the program.
- Lexer - converts the source to a stream of tokens.
- Parser - converts the stream of tokens to an AST.
Lexer
Recognizes all the tokens in the program. Most often that involves regular expressions and finite automata.
- hand-written lexer (a sketch follows below)
- lexer generation tool - takes a list of regular expressions and the code to execute for each match
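A minimal sketch of a hand-written lexer for arithmetic expressions, assuming an invented Token type (an illustration for this overview, not JSImpl's actual lexer):

#include <cctype>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token type for the illustration.
struct Token {
    enum Kind { Number, Plus, Minus, Star, Slash, LParen, RParen } kind;
    std::string text;
};

std::vector<Token> Lex(const std::string& source) {
    std::vector<Token> tokens;
    for (size_t i = 0; i < source.size();) {
        char c = source[i];
        if (std::isspace(static_cast<unsigned char>(c))) { ++i; continue; }
        if (std::isdigit(static_cast<unsigned char>(c))) {
            size_t start = i;  // a number token: [0-9]+
            while (i < source.size() && std::isdigit(static_cast<unsigned char>(source[i]))) ++i;
            tokens.push_back({Token::Number, source.substr(start, i - start)});
            continue;
        }
        Token::Kind kind;
        switch (c) {
            case '+': kind = Token::Plus; break;
            case '-': kind = Token::Minus; break;
            case '*': kind = Token::Star; break;
            case '/': kind = Token::Slash; break;
            case '(': kind = Token::LParen; break;
            case ')': kind = Token::RParen; break;
            default: throw std::runtime_error("unexpected character");
        }
        tokens.push_back({kind, std::string(1, c)});
        ++i;
    }
    return tokens;
}

A lexer generator such as flex produces essentially this kind of loop from the list of regular expressions it is given.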
Parser
Recognizes the grammar rules in the stream of tokens and builds the AST of the program.
- hand-written parser
- parser generation tool
Grammars
The grammar of a language defines the structure of correct sentences and how to derive their meaning.
CFG
Context Free Grammars
<expr> := <expr> <op> <expr> | (<expr>) | <term>
<op> := + | - | * | /
<term> := [0-9]+
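This grammar is ambiguous: the sentence 1 - 2 - 3 has two parse trees, (1 - 2) - 3 and 1 - (2 - 3), with different meanings. A tiny check, written for this overview, shows why the distinction matters:

#include <iostream>

int main() {
    int left_grouping  = (1 - 2) - 3;  // one parse tree
    int right_grouping = 1 - (2 - 3);  // the other parse tree
    std::cout << left_grouping << " vs " << right_grouping << "\n";  // -4 vs 2
}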
PEG
Parsing Expression Grammars are similar to CFGs, but are more convenient for parsing, since the choice operator (written /) is ordered and therefore not ambiguous.
- CFG allows selecting any matching variant, so a sentence can have more than one parse tree.
  - parsing algorithms try to resolve this
- PEG prioritizes the variants in the order in which they are written.
PEG
<expr> := <sum>
<sum> := <prod> ([+-] <prod>)*
<prod> := <value> ([*/] <value>)*
<value> := [0-9]+ / '(' <expr> ')'
- That is a CFG as well, so for some grammars there is no difference.
- Some languages can be expressed only with ambiguous grammars.
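A PEG like the one above maps almost directly onto a hand-written recursive-descent parser. The sketch below is an illustration written for this overview (not JSImpl's parser); it evaluates the expression while parsing, instead of building an AST, to stay short.

#include <cctype>
#include <stdexcept>
#include <string>

// Recursive-descent evaluator following the PEG above:
//   expr  := sum
//   sum   := prod ([+-] prod)*
//   prod  := value ([*/] value)*
//   value := [0-9]+ / '(' expr ')'
struct Parser {
    std::string src;
    size_t pos = 0;

    char peek() const { return pos < src.size() ? src[pos] : '\0'; }

    long value() {
        if (peek() == '(') {             // '(' expr ')'
            ++pos;
            long v = expr();
            if (peek() != ')') throw std::runtime_error("expected ')'");
            ++pos;
            return v;
        }
        if (!std::isdigit(static_cast<unsigned char>(peek())))
            throw std::runtime_error("expected a number");
        long v = 0;                      // [0-9]+
        while (std::isdigit(static_cast<unsigned char>(peek())))
            v = v * 10 + (src[pos++] - '0');
        return v;
    }

    long prod() {                        // value ([*/] value)*
        long v = value();
        while (peek() == '*' || peek() == '/') {
            char op = src[pos++];
            long rhs = value();
            v = (op == '*') ? v * rhs : v / rhs;
        }
        return v;
    }

    long sum() {                         // prod ([+-] prod)*
        long v = prod();
        while (peek() == '+' || peek() == '-') {
            char op = src[pos++];
            long rhs = prod();
            v = (op == '+') ? v + rhs : v - rhs;
        }
        return v;
    }

    long expr() { return sum(); }
};

For example, Parser{"6*7+1"}.expr() returns 43; the ordered choice in value (a number / a parenthesised expression) is tried exactly in the written order.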
CFG vs PEG
Where does the else go?
<if> := if <expr> <stmnt> else <stmnt>
| if <expr> <stmnt>
if x0 if x1 s1 else s2
if x0 { if x1 s1 } else s2 // 1
if x0 { if x1 s1 else s2 } // 2
- CFG - the grammar is ambiguous, both readings are valid parse trees
- PEG - reading 2, because the variant with the else is listed first and the inner if consumes the else
LALR(n)
LR(n) and LALR(n) are a family of parsing algorithms for CFGs
- n is the number of lookahead terminals
- L - left-to-right scan of the input
- R - rightmost derivation (constructed in reverse) - the rightmost non-terminal is replaced
Program
function answer() {
    return 6*7;
}
Tokens
- keyword: function
- identifier: answer
- symbol: (
- symbol: )
- symbol: {
- keyword: return
- number: 6
- symbol: *
- number: 7
- symbol: ;
- symbol: }
AST
https://esprima.org/demo/parse.html?code=function%20answer()%20%7B%0A%20%20%20%20return%206%20*%207%3B%0A%7D
{
    "type": "FunctionDeclaration",
    "id": { "type": "Identifier", "name": "answer" },
    "body": {
        "type": "BlockStatement",
        "body": [
            {
                "type": "ReturnStatement",
                "argument": {
                    "type": "BinaryExpression",
                    "operator": "*",
                    "left": { "type": "Literal", "value": 6 },
                    "right": { "type": "Literal", "value": 7 }
                }
            }
        ]
    }
}
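In a compiler written in C++ the same tree is typically a hierarchy of node classes rather than JSON. A rough, illustrative sketch (these are not JSImpl's actual AST classes):

#include <memory>
#include <string>
#include <vector>

// Illustrative AST node hierarchy for the example above.
struct Node { virtual ~Node() = default; };

struct Literal : Node {
    double value;
    explicit Literal(double v) : value(v) {}
};

struct BinaryExpression : Node {
    std::string op;
    std::unique_ptr<Node> left, right;
    BinaryExpression(std::string o, std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : op(std::move(o)), left(std::move(l)), right(std::move(r)) {}
};

struct ReturnStatement : Node {
    std::unique_ptr<Node> argument;
    explicit ReturnStatement(std::unique_ptr<Node> a) : argument(std::move(a)) {}
};

struct FunctionDeclaration : Node {
    std::string name;
    std::vector<std::unique_ptr<Node>> body;
};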
Middle
Takes the AST and transforms it into some Intermediate Representation (IR) that is convenient for:
- optimization - constant propagation, loop unrolling, etc. (see the constant-folding sketch below)
- machine code generation - register allocation, etc.
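For example, constant folding can replace 6 * 7 with 42 directly on the tree. A minimal, self-contained sketch of the idea (the Expr type is invented for the example; internal nodes are assumed to always have both children):

#include <memory>

struct Expr {
    char op = 0;                        // 0 means "literal"
    double value = 0;                   // used when op == 0
    std::unique_ptr<Expr> left, right;  // used when op != 0
};

// Constant folding: if both operands of an operator are literals,
// replace the whole subtree with a single literal node.
// Only '*' and '+' are handled in this sketch.
void Fold(std::unique_ptr<Expr>& e) {
    if (!e || e->op == 0) return;
    Fold(e->left);
    Fold(e->right);
    if (e->left->op == 0 && e->right->op == 0) {
        double result = (e->op == '*') ? e->left->value * e->right->value
                                       : e->left->value + e->right->value;
        auto folded = std::make_unique<Expr>();
        folded->value = result;
        e = std::move(folded);
    }
}

Running it on the tree for 6 * 7 leaves a single literal node with value 42, which is exactly what an optimizing middle end would do before code generation.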
IR
- A compiler might have different levels and types of IR
- text and binary
- Static Single Assignment form
IR in LLVM - C
int mul_add(int x, int y, int z) {
    return x * y + z;
}
IR in LLVM - IR
define i32 @mul_add(i32 %x, i32 %y, i32 %z) {
entry:
    %tmp = mul i32 %x, %y
    %tmp2 = add i32 %tmp, %z
    ret i32 %tmp2
}
WebAssembly
WebAssembly is actually an IR.
(module
    (func (export "add") (param $n1 i32) (param $n2 i32) (result i32)
        get_local $n1
        get_local $n2
        i32.add
    )
)
.NET CIL
C#, F#, Visual Basic (some versions), etc. target the .NET virtual machine
Common Intermediate Language
.class public Foo {
    .method public static int32 Add(int32, int32) cil managed {
        .maxstack 2
        ldarg.0 // load the first argument
        ldarg.1 // load the second argument
        add     // add them
        ret     // return the result
    }
}
Back end
Takes the IR and produces native code for a particular machine
Interpreter
Like a compiler, but instead of producing machine code it executes the program instruction by instruction, translating each program instruction into machine operations as it runs (a tiny sketch follows after the list below).
Interpreter
- Most interpreters have a REPL, which makes them great for experiments.
- You don't need to build an executable to run your program.
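A minimal sketch of a tree-walking interpreter for an expression tree like the one above (illustrative types, not a real interpreter):

#include <iostream>
#include <memory>

// Minimal expression tree, as in the earlier sketches.
struct Expr {
    char op = 0;                       // 0 means "literal"
    double value = 0;
    std::unique_ptr<Expr> left, right;
};

// Tree-walking evaluation: each node is executed directly,
// without ever producing machine code.
double Eval(const Expr& e) {
    if (e.op == 0) return e.value;
    double l = Eval(*e.left), r = Eval(*e.right);
    switch (e.op) {
        case '+': return l + r;
        case '*': return l * r;
        default:  return 0;
    }
}

int main() {
    // The tree for 6 * 7
    Expr mul;
    mul.op = '*';
    mul.left = std::make_unique<Expr>();
    mul.left->value = 6;
    mul.right = std::make_unique<Expr>();
    mul.right->value = 7;
    std::cout << Eval(mul) << "\n";  // 42
}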
Compiler vs Interpreter
AOT compilation
- Ahead of Time - the program is compiled before it is run
JIT compilation
- Just in Time - the program is running (typically in an interpreter) and parts of it are being compiled to native machine code while it is being executed
AOT vs JIT
- AOT - less overhead, faster startup, generally better performance
- JIT - better chance for optimizations, since some of the program arguments will be constants once the program has started, and some of the virtual calls can be devirtualized at runtime (illustrated below)
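As an illustration of devirtualization, written for this overview and not tied to any particular JIT: once the runtime observes that a virtual call always hits the same concrete type, it can replace the dispatch with direct, inlinable code.

struct Shape {
    virtual ~Shape() = default;
    virtual double Area() const = 0;
};

struct Square : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double Area() const override { return side * side; }
};

// Compiled ahead of time, this loop must dispatch Area() virtually,
// because the dynamic type of shape is unknown.
double TotalArea(const Shape& shape, int count) {
    double total = 0;
    for (int i = 0; i < count; ++i)
        total += shape.Area();               // virtual call
    return total;
}

// A JIT that observes at runtime that the argument is always a Square
// can specialize the hot loop to the equivalent of:
double TotalAreaSquare(const Square& square, int count) {
    double total = 0;
    for (int i = 0; i < count; ++i)
        total += square.side * square.side;  // direct, inlinable code
    return total;
}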
Transpiler
Takes a program written in one language and translates it to another language.
- cfront
- CoffeeScript, TypeScript
- Nim
Code transformation tools
Lots of tools can be created based on the AST representation of a program
- code formatting - clang-format, gofmt, autopep8
- code transformation - clang-modernize
- code analysis - clang-tidy
Virtual Machine
Takes virtual machine code (bytecode) and executes it on the real machine.
Typically the VM implements most of the language's runtime features, like automatic memory management (garbage collection), and most of the standard library of the language.
VMs
- stack based
  - all computations are done with operands pushed onto a stack, then popped, and the result is pushed back
  - simple, but inefficient (see the sketch after this list)
- register based
  - some of the computations use registers (generally each local variable gets a register)
  - code generation is a bit harder, but the machine is more efficient
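A minimal sketch of a stack-based bytecode machine (the instruction set is invented for the example):

#include <iostream>
#include <vector>

// Hypothetical bytecode for the illustration.
enum class Op { Push, Add, Mul };

struct Instr {
    Op op;
    int operand = 0;  // only used by Push
};

// Stack-based execution: operands are pushed, popped, and the result pushed back.
int Run(const std::vector<Instr>& code) {
    std::vector<int> stack;
    for (const Instr& ins : code) {
        switch (ins.op) {
            case Op::Push:
                stack.push_back(ins.operand);
                break;
            case Op::Add:
            case Op::Mul: {
                int rhs = stack.back(); stack.pop_back();
                int lhs = stack.back(); stack.pop_back();
                stack.push_back(ins.op == Op::Add ? lhs + rhs : lhs * rhs);
                break;
            }
        }
    }
    return stack.back();
}

int main() {
    // 6 * 7, in postfix order: push 6, push 7, mul
    std::cout << Run({{Op::Push, 6}, {Op::Push, 7}, {Op::Mul}}) << "\n";  // 42
}

The bytecode for 6 * 7 is just its postfix (RPN) form - push 6, push 7, multiply - which leads directly into the next point.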
RPN as an interpreter
Evaluating an expression written in postfix (Reverse Polish) notation is effectively interpreting that expression.
- Forth is a complete programming language based on that idea.
What is a machine?
- x86 is a virtual machine
- the CPU is an interpreter with microcode
Tools
- lex / flex
- re2c
- yacc / bison
- lemon
- antlr
Crash course in JSImpl
- repository organization
- continuous integration
- pull requests
Crash course in JSImpl
- projects
- project generation
- allocators
- tests
Architecture of JSImpl
- frontend
  - Lexer - text -> vector<Token>
  - Parser - vector<Token> -> AST
- middle
  - AST
  - JSON
  - Bytecode
- backend
  - spasm