UMBC CMSC431, Compiler Design Principles, Fall 2009

Project 2

Due: Monday, October 5, 2009, 12:00pm (noon)


In this project the students will take the first step in designing their own programming language and will incorporate symbol tables in their compilers.


Your assignment is to extend the compiler you wrote for Project 1 to allow variables to be used in expressions. To do this, revise the compiler so that a "program" is now a sequence of "statements" where each statement might be: Furthermore, the variables and constants in the expressions can be integer or floating point values (details below).

As in Project 1, the output of your compiler must be an assembly language program that can be assembled with NASM to produce machine code that corresponds to the input source code.

As will be the case for the remainder of the semester, you will have much flexibility in the design of the syntax and semantics of your programming language. For example, this project description does not specify the rules for variable names. You should pick something reasonable. The same holds for the syntax of variable declarations, print statements and assignments.


Expressions. Expressions can be floating point or integer, but not both. Use a decimal point to distinguish between a floating point constant from an integer. (E.g. 3.0 is a floating point constant and 3 is an integer constant.) If the source code mixes floating point and integer expressions (e.g., (x + 3) * (y - 4.0), your compiler should point this out as a type mismatch error. (You don't need to worry about type promotion in this project.) Floating point values should be stored as 8-byte (quad-word) values.

Declarations. Variables must be declared before their first use. The declaration must specify the type of the variable. Otherwise, the compiler cannot infer the type of an expression like x + y. Have an extensible scheme for declarations. You will at least have to allow for Boolean types when we add if statements to the language.

Assignments. As with expressions, you do not have to worry about type promotion. Assigning a floating point expression to an integer variable and vice versa are errors (for now).

Printing. In Project 1, the program automatically printed the value of expression. We don't want that behavior here because an expression might be part of an assignment statement. Instead, we require that the programmer use a print statement explicitly. For flexibility in formatting the output, have two versions of the print statement: one that prints a newline and one that does not.

Implementation Notes

CVS. You should first use CVS to checkout Project 2.

cvs -d /afs/ checkout -d MyProj2 user1 No files are distributed with Project 2, but this sets up the CVS directory and lets you submit your files for debugging help. Copy the files from your Project 1 directory that you want to carry over. Then for each file use the cvs add filename command to tell CVS that you want that file to be synchronized. (This does not checkin the file until your next cvs commit.)

Yacc. You should review the lex and yacc programs that allow assignment to single character variables and note how %union is used in conjunction with %token, %left, %right and %type to declare the type of tokens and nonterminals. Also, note the use of union type field names in the lex program: ch3-03.l, ch3-03.y and ch3-03.txt.

For the project, you would not want the expression nonterminal to have double type since you are generating assembly language code rather than evaluating the expression. Instead you want the value associated with expression to tell you the type of the expression (floating-point or integer). For example, when the rule

expression : expression '+' expression ; is reduced, you need to know the type of the expressions on the right so you can generate code for either integer addition or for floating-point addition (or print out a type mismatch error message).

Assembly language. For this project, you will still keep the results of sub-expressions on the top of the stack pointed by the ESP register, even for 8-byte floating-point values. Although the x87 FPU has its own stack, that stack is limited to 8 items and it is difficult in yacc to determine whether the depth of an expression exceeds 8.

Memory allocation (assembly). The variables used in this phase of the compiler will correspond to global variables in the final compiler. Memory for the variables should be allocated in a .DATA section of the assembly language program generated by your compiler. (Note: you can have more than one .DATA section in your assembly code.) There are two general approaches for memory allocation. You can either allocate memory all at once or use the variable name as a label.

In the first approach, you keep track of the total amount of memory you need for all of the variables and reserve that amount of memory in one big chunk. For example, if you need 32 bytes of memory, then the following assembler directive will do the trick:

global_vars: times 32 db 0 The db 0 directive defines a single byte of memory initialized to 0. The times 32 prefix says to do this 32 times. The address of the first byte of memory is labeled by global_vars. Each variable in your program can be accessed as an offset from global_vars. For example: push DWORD [global_vars + 16] will push 4 bytes (double word) from memory onto the stack starting with the 16th byte after global_vars. Note that this is not a special addressing mode, it is the plain vanilla indirect addressing mode. Both global_vars and 16 are constants. NASM just does the addition for you (with some help from the linker). You will need to store the offset for each variable in the symbol table entry for that variable.

The advantage of the first approach is that global variables are accessed in the same way that local variables and actual parameters will be accessed in later projects. (Those are offsets from the [EBP].) The disadvantage is that these global variables will not be accessible from external code.

In the second approach, you simply use the variable's name as a label in the assembly code. For example, a floating point variable x will be allocated by:

x: dq 0.0 The advantage here is that the label x can be declared global and accessed from code in other object files. There is also much elegance in the simplicity of the naming scheme. However, in this approach, you have to be very careful about the labels you use in other parts of the program. For example, a program that uses a variable named printf_ctrl should not cause problems. If you take this approach, you should read Section 3.1 of the NASM documentation which explains the rules for NASM labels. (N.B.: the characters . and $ have special meaning as the first character of a label.)

C Programming Practices. When you use lex and yacc, you pretty much have to lift the ban on using global variables. (Alternatives are possible. See Section 3.7.10 A Pure (Reentrant) Parser of the Bison Manual.)

However, this does not mean that we want to throw away every rule of good programming practice. One rule that you should still very much stick to is the edict to avoid ``magic numbers''. For example, if you use a constant 4 in the middle of your yacc source file, it is not clear whether the 4 is there because integer values take 4 bytes or if addresses take 4 bytes. (I.e., if you wanted to modify the code to work for 2-byte integers, should you change this 4?) It is much better to use a constant, say INT_SIZE, defined using the #define preprocessor directive:

#define INT_SIZE 4

Furthermore, in cases where you don't really need the value of a constant, but just want some way to label things, it is better to use an enumeration type. For example, you can define etype as follows:

typedef enum etype_tag {int_type, float_type} etype ; Henceforth, etype can be used to declare variables that take on the values int_type and float_type. This is preferable to defining two constants #define INT_TYPE 0 #define FLOAT_TYPE 1 because the C compiler will do type checking with enumeration types whereas INT_TYPE and FLOAT_TYPE are just place holders for integer constants 0 and 1.

Symbol table. The symbol table is a critical part of your compiler. For now, you just need the symbol table to remember the type of each variable and possibly its offset in memory (depending on the memory allocation scheme you choose to follow). Later, you will also use the symbol table to remember whether an identifier is a variable or a function. In case of variables, the symbol table must also keep track of whether the variable is a local variable, a global variable or a parameter. In case of functions, the types of its return value and of its parameters are also stored in the symbol table. At this stage, you should keep the design of your symbol table flexible. You will almost certainly need to add fields to its entries.

The two basic functions of the symbol table are 1) inserting a new item and 2) finding an existing item. Since a symbol table find operation is needed every time the compiler encounters an identifier, we need the data structure to support very fast find operations. This can be accomplished using a hash table which supports O(1) average running time for insertion and find. You can choose any hashing scheme as long as:

  1. the hash function depends on all characters of the identifier;
  2. supports O(1) average running time for insertion and find, at least in practice (i.e., there's no proof for the running time); and
  3. handles collisions.

Alternatively, you can use the hash_map class from the C++ Standard Template Library. (See documentation.) Actually, hash_map is not quite part of the STL, so you need to add "ext" to the include path, like so:

#include <ext/hashmap> Using hashmap from C++ is not for free. You will need to figure out how use hashmap properly. (Hint: yes, you do need to write a callback function for comparisons.) Also, since C does not "understand" C++ classes, you will need to write wrapper functions for C in C++. These have to prefixed with extern "C" The C++ FAQ Lite has a section on mixing C and C++ code: [32] How to mix C and C++. Finally, you must make sure that dynamically allocated members are destroyed properly. This is problematic since anything that a C program "understands" will not have a destructor.

Memory allocation (in C). You have to be very careful with dynamically allocated strings. Between lex, yacc and the symbol table, you need to have a clear protocol for whether a function has its own copy of the string and when the memory for that string is released.

Recall that the characters matched by lex are stored in yytext. In the case when you've matched a variable name, you will want to give the yacc parser access to that name. If you store a pointer in yylval, is that a separate copy of the string? or a pointer into yytext? Is that safe? If you store a name in a hash table, does the hash table make a copy of that string? or does it only store a pointer to that string? If it makes a copy, when is memory for the string released? If the hash table doesn't make a copy, you have to be sure that memory for the string is not freed until the hash table is done with it. Here are some issues:

What to turn in

By the due date and time, the CVS repository for your project should have updated versions of the flex and bison programs. Include an explanation of any oddities in a README file. Please update the Makefile to reflect any changes you have made. (The expectation is that a person grading the project can compile your program just by typing "make".) Also, include 3 test programs that fully exercise your compiler. The test programs should also demonstrate the syntax of your programming language's variables, assignments and print statements.

Last Modified: 22 Sep 2009 09:25:20 EDT by Richard Chang
to Fall 2009 CMSC 431 Homepage