Weekend compiler: symbol tables

Weekend compiler: symbol tables

We continue our evening concert at the request of radio listeners. The topic of today’s conversation is symbol tables. I remind you that in the past we talked about syntactic trees and the way to build them from the source of the language I invented wend (abbreviation of week-end). Here is a summary of the series of articles:

  1. Syntax trees and a naive python translator

  2. Lexer/parser

    2′. Cursed fire, or the magic of the C preprocessor

  3. Symbol tables: variable scopes and type checking (this article)

  4. Stack and translator to python without using python variables

  5. Translator to assembler

  6. Raytracing 🙂

In general, the target language is assembly language, but for now, as an intermediate result, I generate code in python. At the moment, all code generation is just pretty print superimposition over the syntax tree. I remind you that at the moment the “compilation” process looks like this:

A lexer converts a stream of source code symbols into a stream of tokens, a parser parses them, building a syntactic tree. And then, by simply traversing the tree in depth, I output the code in the target language. In simple cases this works well, but at the end of the last article I specifically left out a few cases where the compiler breaks. Let’s recall one of them, on the left is the source on wend, on the right is an incorrect translation into python:

I expect the code to display 3, while python shows me 0. Why? Let’s understand each other. For starters, let’s put wend aside and just talk about python.

Scopes of variables in Python

I will not open any America, but I was surprised to find that quite a number of people do not know how variable binding works in this language. Let’s look at the simplest example: I have a function foo()which displays the value of the variable on the screen barwhich is defined outside the function:

ssloy@home:~$ python3 <<<'
def foo():
    print(bar)

bar = 0
foo()
print(bar)
'
0
0

As before, I now provide both the code and the result of its execution. A language with implicit variable assignment would be expected to find a global variable bar and will display two zeros. And what if I try not only to output values barbut also write a unit in it?

ssloy@home:~$ python3 <<<'
def foo():
    print(bar)
    bar = 1

bar = 0
foo()
print(bar)
'
Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
  File "<stdin>", line 3, in foo
UnboundLocalError: cannot access local variable 'bar' where it is not associated with a value

And we had a compilation error, and it was a compilation error, and it did not fail during execution. If, inside the function, the assignment and display operations are rearranged, then everything starts perfectly, and the value of the global variable is clearly bar does not change:

ssloy@home:~$ python3 <<<'
def foo():
    bar = 1
    print(bar)

bar = 0
foo()
print(bar)
'
1
0

This is not magic at all, it is completely normal, and at the same time extremely counterintuitive for beginners who came to Python from languages ​​with stricter variable declarations. Python, like most other languages, divides variables into local and global, and the context is global for reading, and local for writing. So it turns out that if we first try to do print(bar)and then appropriate bar = 1then the python knows that bar – a local variable, but it was still initialized, and it breaks. You can use the word globalto clearly state that bar must be a global variable despite being written to:

ssloy@home:~$ python3 <<<'
def foo():
    global bar
    bar = 1
    print(bar)

bar = 0
foo()
print(bar)
'
1
1

My practice shows that usually people limit themselves to two keywords global/localforgetting one more extremely interesting thing: nonlocal. Let’s look at the following code example:

ssloy@home:~$ python3 <<<'
def counter():
    count = -1
    def increment():
        nonlocal count
        count += 1
        return count
    return increment

counter1 = counter()
counter2 = counter()

for _ in range(3):
    print("outer counter:", counter1())
    for _ in range(2):
        print("  inner counter:", counter2())
'
outer counter: 0
  inner counter: 0
  inner counter: 1
outer counter: 1
  inner counter: 2
  inner counter: 3
outer counter: 2
  inner counter: 4
  inner counter: 5

There is a function here counter() with nested function increment()with increment() refers to a variable countwhich is neither local to increment(), no global. Changeable count is local to the function counter()and increment() refers to a local environment variable.

I created two counters counter1 and counter2they can count independently of each other because they refer to two different instances of the variable count. This magic is called closure and it’s an extremely powerful tool to use, but be careful not to break the brains of those who will be maintaining your code 🙂

Back to our rams, this is what a correct translation from wend to python should look like, notice the appearance of two lines with the keyword nonlocal:

Christmas tree toys

Now let’s figure out how to teach the compiler to add such information. We need to modify the compilation process by adding a single semantic analysis step that will be called directly after the syntax tree is built:

The parser grows a Christmas tree for us, and the semantic analyzer hangs Christmas tree toys on it (by the way, this is not a joke, compilers really use such jargon). Well, then the code generation in the target language comes from a decorated Christmas tree. This is what the syntax tree for our example looks like after passing the semantic analyzer:

Note the information marked in red, it was added by the analyzer. The most convenient place to store it is a dictionary decoI predicted for each node a tree. So far I’ve been storing any type of line number in the source file there for error signaling, but now it’s useful for semantic analysis.

I hope it is generally clear what information we need to add to the tree. But how to do it? And this is where symbol tables come to the rescue.

Tables of symbols

Let me draw an animation of the work of a primitive semantic analyzer:

I need a data structure into which I will incrementally add (and remove!) information about variable scopes. I will do this by bypassing the syntactic tree in depth. For the compactness of the display, my animation does not draw the syntactic tree itself, but the source code, but it must be understood that it has long been discarded at this point, and I am working with the tree. It’s just that traversing the lines of the source code is very easy to draw, and it exactly corresponds to traversing the syntax tree in depth.

In my language, scopes correspond exactly to functions, so starting at the root of the tree, I open a new variable scope: push_scope(main). Then I add all the local variables to my table: add_var(x). In my animation, I’ve sketched out the variable type as it’s still looming in the code, but I don’t need it at the moment, I’ll look at it later when I do some type checking.

The next node in my syntax tree is a nested function f. We open a new scope by creating a nested table: push_scope(f)and add all local variables to it, there is only one argument add_var(y). The same happens with the function g: push_scope(g), add_var(y).

And here is the most interesting thing: we meet the node Assignwhich matches the string x = y + 1. In our syntax tree about the variable x we only know her ID, just a string, nothing more. Let’s learn more about her: find_var(x). We know we’re at the third nesting level, so let’s see if the current scope has a record about x.? No, there is not. On the second level? Neither is there. On the third? Yes, I found it! So we can tell the second and third visibility blocks that they use a non-local variable x.

And now (for the future) we have found the type of the variable, and now we can check whether the type of the variable in the assignment statement matches: we know what is going on ArithOp, it can be an integer. If x has a different type, it’s time to crash.

We have dealt with the functionality of the symbol table, let’s move on to the implementation. I did very primitively:

class SymbolTable():
    def __init__(self):
        self.variables = [{}]     # stack of variable symbol tables
        self.ret_stack = [ None ] # stack of enclosing function symbols, useful for return statements

    def add_var(self, name, deco):
        if name in self.variables[-1]:
            raise Exception('Double declaration of the variable %s' % name)
        self.variables[-1][name] = deco

    def push_scope(self, deco):
        self.variables.append({})
        self.ret_stack.append(deco)

    def pop_scope(self):
        self.variables.pop()
        self.ret_stack.pop()

    def find_var(self, name):
        for i in reversed(range(len(self.variables))):
            if name in self.variables[i]:
                return self.variables[i][name]
        raise Exception('No declaration for the variable %s' % name)

I store the nested scopes as a list of dictionaries whose keys are IDs and whose values ​​are decoration deco corresponding syntactic tree node (y deco the variable type is stored). On entering the scope, I add a dictionary (push_scope), when exiting, delete (pop_scope). Well, in parallel, I keep the same list of decorations from function nodes, which will allow the parser to add information about non-local variables to the tree.

Here’s what the semantic parser code looks like, it’s just a primitive depth-first tree traversal that queries the symbol table every time it encounters a variable:

Hidden text
from syntree import *
from symtable import *

def build_symtable(ast):
    if not isinstance(ast, Function) or ast.name != 'main' or ast.deco['type'] != Type.VOID or len(ast.args)>0:
        raise Exception('Cannot find a valid entry point')
    symtable = SymbolTable()
    process_scope(ast, symtable)

def process_scope(fun, symtable):
    fun.deco['nonlocal'] = set() # set of nonlocal variable names in the function body, used in "readable" python transpilation only
    symtable.push_scope(fun.deco)
    for v in fun.args: # process function arguments
        symtable.add_var(*v)
    for v in fun.var:  # process local variables
        symtable.add_var(*v)
    for f in fun.fun:  # then process nested function bodies
        process_scope(f, symtable)
    for s in fun.body: # process the list of statements
        process_stat(s, symtable)
    symtable.pop_scope()

def process_stat(n, symtable): # process "statement" syntax tree nodes
    if isinstance(n, Print):
        process_expr(n.expr, symtable)
    elif isinstance(n, Return):
        if n.expr is None: return
        process_expr(n.expr, symtable)
    elif isinstance(n, Assign):
        process_expr(n.expr, symtable)
        deco = symtable.find_var(n.name)
        update_nonlocals(n.name, symtable) # used in "readable" python transpilation only
    elif isinstance(n, FunCall): # no type checking is necessary
        process_expr(n, symtable)
    elif isinstance(n, While):
        process_expr(n.expr, symtable)
        for s in n.body:
            process_stat(s, symtable)
    elif isinstance(n, IfThenElse):
        process_expr(n.expr, symtable)
        for s in n.ibody + n.ebody:
            process_stat(s, symtable)
    else:
        raise Exception('Unknown statement type')

def process_expr(n, symtable): # process "expression" syntax tree nodes
    if isinstance(n, ArithOp):
        process_expr(n.left,  symtable)
        process_expr(n.right, symtable)
    elif isinstance(n, LogicOp):
        process_expr(n.left,  symtable)
        process_expr(n.right, symtable)
    elif isinstance(n, Integer):
        pass
    elif isinstance(n, Boolean):
        pass
    elif isinstance(n, Var):
        deco = symtable.find_var(n.name)
        update_nonlocals(n.name, symtable) # used in "readable" python transpilation only
    elif isinstance(n, FunCall):
        for s in n.args:
            process_expr(s, symtable)
    elif isinstance(n, String):
        pass
    else:
        raise Exception('Unknown expression type', n)

def update_nonlocals(var, symtable):                    # add the variable name to the set of nonlocals
    for i in reversed(range(len(symtable.variables))):  # for all the enclosing scopes until we find the instance
        if var in symtable.variables[i]: break          # used in "readable" python transpilation only
        symtable.ret_stack[i]['nonlocal'].add(var)

Working committee available here, now scope.wend compiles correctly!

Type checking and function overloading

We’ve already encountered some rudimentary type checking of variables, but for the full picture of the world we need to add functions to the symbol table, because when we encounter the call f(x+1)then we know nothing about the return type fbecause a tree nodeFunCall stores only the ID f, nothing more. Everything is extremely trivial: parallel to the list of dictionaries of symbols of variables, let’s start a list of dictionaries of symbols of functions, and add them before opening the corresponding scope.

In the variable dictionary, the key was the id of the variable, with functions it’s a bit more complicated: I want to be able to overload functions, so the id is not a unique key. No worries, I’ll store the function signature as a key, which is a simple tuple (id, list of argument types).

I added a dozen lines to the symtable.py module and threw type mismatch exceptions to the semantic analyzer analyzer.py, see the changes in the code.

As a final touch, I added function overloading: for this, I just need to stick a unique suffix to the function name, and voila, we have a properly working wend compiler in python!

Buy the current code by tag v0.0.3.

In the next issue

This time we fixed the compiler using the keyword nonlocal. But let’s not lose sight of the fact that the target language is not Python, but assembler, and it knows absolutely nothing about closures! Therefore, next time we will talk about code generation, as before, in Python, but in principle without using variables inside functions. I will have only four global variables, and besides them, no other can be used: I will emulate the registers and the stack. And this is where the code will really lose readability, so that, after removing the head, they do not cry, you can already generate the assembler 🙂

Stay tuned!

Entertainment

And as a final entertainment, I allowed myself to program a fantasy on the theme of the game arcanoid:

If anyone is interested, here is the corresponding C code, only 65 lines. Any thoughts on how to improve are welcome, and throw in some interesting shortcode ideas!

Related posts