A deep dive into the Nand2Tetris VM language, and what it means to write an interpreter for a compiled language.
The point of a virtual machine is to make writing multi-layered software easier. The idea is we have multiple computer architectures, and multiple computer languages that we want to eventually compile to those target architectures. But this means a lot of duplicated work - each language needs:
So if you have N
languages and M
architectures, then supporting the full matrix takes N*M
effort:
Figure 1: N times M compilers
However, if you split the "front-end" tasks of tokenizing and parsing from "back-end" tasks like code generation, you get a different pattern:
Figure 2: N plus M compilers
By adding a Virtual Machine as an intermediate stage, you can make it less costly to add new languages and/or architectures, because the complexity is N+M
rather than N*M
.
Now, I should take a moment here to point out that this is only worth doing if 3 things are simultaneously true:
For most people building a programming language, the number of architectures worth supporting is likely 1 to 3. Likewise, part of the reason to make a programming language is to do things differently from others, so your new language might not have much in common with existing languages. This means more difficulty finding commonalities a VM can exploit.
However, there is an important exception to this advice. The LLVM project demonstrates that a modular compiler architecture is useful for multiple different languages. Front end interfaces exist for LLVM to support languages such as C, C++, Rust, Pony, and Scheme. If you're interested in making a real programming language, it might be a great idea to take advantage of the great work of LLVM's optimization and code generation back-ends, and focus your efforts on the syntax and semantics of your language by writing a front-end for LLVM.
That said, if you want to learn more about writing a stripped-down VM, read on.
In the Nand2Tetris curriculum, chapters 7 and 8 guide you through the task of building a VM Translator program.
This program translates Hack Virtual Machine sources in .vm
file format into Hack Assembly .asm
files.
Here's a sample of Hack VM language:
push constant 7 push constant 8 sub
This program pushes 7, then 8, onto a stack, pops both to perform the subtraction (7-8) and pushes the result (-1) onto the stack. In the end, the stack will contain only the value -1.
The language also supports user-defined functions and function calls:
function foo 1 // declare a function named foo that has 1 local variable push constant 42 pop local 0 // local[0] = 42 push argument 0 // copy arg[0] to stack push argument 1 // copy arg[1] to stack sub // top of stack: arg[0]-arg[1] push local 0 // 42 call bar 2 // call bar with 2 arguments (top 2 stack values) neg // negate it return // return the negated vesion of the above expression
Following the Nand2Tetris curriculum for chapter 7, implement support first for the arithmetic operations (add
, sub
, etc.) and then for control flow commands like if-goto
and label
.
Then for chapter 8, implement function
, call
, and return
.
The function-related commands of chapter 8 are non-trivial and require pushing values to the stack at runtime.
This is also in my opinion the most interesting part of the whole course, because it requires you to think carefully about what kinds of transformations your program can do to the source code, and what tasks belong to the runtime environment.
The key part of chapter 8 is the function calling convention.
This convention dictates what happens at runtime when the code issues a call
instruction, and what happens during a return
instruction.
In short, the caller first pushes any arguments the callee needs to the stack, then it pushes its own return address, then it pushes the values of LCL
, ARG
, THIS
, and THAT
to the stack.
It then sets the value of ARG
to point at the first argument it pushed earlier.
Finally, it sets LCL
to the current SP
value and jumps to the function it's calling.
This jump is technically a goto
that is allowed to leave the function's scope.
For example, the function calling convention requires pushing the current values for SP
, ARG
, LCL
, THIS
, and THAT
to the stack before a function call.
These are the names of host RAM addresses 0-4, respectively.
The currently-running function uses these values to determine where it should look in RAM for some values it needs.
Other functions will probably have different values, so something needs to keep track.
That "something" happens at runtime.
By the end of chapter 8, you'll have a translator that can take multiple .vm
files and combine them to make a single .asm
file.
Individual files can contain multiple functions, but "static" variables are private to their containing file.
The biggest assumption of the translator is that there is a Sys.init
function which sets up some pointers, then calls main
, and finally when main
returns, goes into an infinite loop.
In chapter 7, these aspects are not as important, and you can test just the arithmetic and control flow commands by allowing the VM emulator to do the tasks which would otherwise be done by Sys.init
.
But the additions of chapter 8 introduce some subtle edge cases in the language.
For example, the language specifies that goto
and if-goto
may not leave their containing function.
However, the chapter 7 tests use gotos outside of any function; in other words, the language has global gotos.
But these are more like unit tests than system or integration tests, and I think the intention is that "real" programs will always start by implicitly calling Sys.init
, which never returns.
So even though the language permits global gotos, in practice they are never used.
Another edge case of the language is the fact that there is no syntax to mark the end of a function. To demonstrate, first consider a function with an obvious ending:
function f1 0 return // f1 obviously ends here
But some functions have ambiguous endings:
function f2 0 goto NEXT label START return // f2 returns here, so the function's control ends here label NEXT goto START // but the text of the function actually ends here
In practice, the VM Translator can't tell where a function ends unless it either: a) reaches the end of the file, or b) reaches a new function definition.
A subtle consequence of this is that it's not possible to mix global code with functions, and therefore a VM Translator program must be entirely function definitions.
These are over-simplified, but workable definitions. In particular, the Nand2Tetris VM Translator is a compiler, because it transforms VM sources into Hack assembly code. But I think it would be fun to have an interpreter for this language, to enable interactive explorations and to get a more visceral feel for how it works. So I'm building one.
In this interpreter, I want to retain the feel of the original VM language.
It still uses an array of signed 16 bit values as its host RAM, and the first 5 addresses represent SP
, LCL
, ARG
, THIS
, and THAT
pointers.
But I'm building it in a webpage, so I need to make some minor changes.
The first is that filesystem access is a bit… weird in a webpage context, so I need to figure out how to handle multiple .vm
sources.
Maybe they will be files you have to upload to the webpage, or maybe there will be a place to copy-paste code.
That's a problem for my future self, though.
What I have at the moment is closer to a compiler than an interpreter. It passes all the chapter 7 tests and most of the chapter 8 tests. The current failing test that I'm debugging involves function calls, so I suspect either my caller or callee is not doing the right thing.
end
token
The (compiled) language in practice does not permit mixing functions and global code in the same .vm
file.
But in an interpreted context, most of the code will be global.
This means the interpreter needs to know where functions end. One way to do this would be a keyboard shortcut or button in the interpreter, to switch from "regular" mode to "function defining" mode. But I dislike this approach - it hides too much information.
I prefer to change the language.
A syntactic token such as end
could spell this out in the syntax:
function f3 0 return end push constant 21 // executable code... call f3 0 // between function definitions! function f4 0 return end
This new language is incompatible with the original langauge, but at least it's easy to see the incompatibility. When writing software, I prefer "obviously no bugs" to "no obvious bugs"1, so I like this kind of visibility.
goto
An ahead-of-time compiled program knows all the possible destinations for all the goto
commands in the program, so jumping forward or backward is no problem.
But an interpreted program executes one line at a time, so jumping forward would require time travel, and even jumping backward would imply taking control away from the user, making the program non-interactive.
Both of these situations seem bad, so I hereby abolish the global goto
.
This is a half-baked idea (something for my future self to figure out) but if files are currently used for namespaces, then in a webpage context, we could use something else to provide namespace-like behavior.
In the compiled VM Translator world, references to static
locations resolve to unique names in the generated assembly.
For example, the command push static 3
in a VM file Abc.vm
becomes a Hack assembly symbol Abc.3
.
Similarly, you can use the same name for different functions, as long as those functions are in different files. So if we're changing the language syntax, we may as well add namespaces:
namespace Abc function f 0 push argument 0 return end end namespace Xyz function f 0 //Xyz.f is a different function, same name as Abc.f pop local 2 return end end
In the above example I reused the function-ending end
token for ending namespaces too.
C.A.R. Hoare, The Emperor's Old Clothes, 1980.