Stack VM Language

A deep dive into the Nand2Tetris VM language, and what it means to write an interpreter for a compiled language.

Why virtual machines?

The point of a virtual machine is to make writing multi-layered software easier. The idea is we have multiple computer architectures, and multiple computer languages that we want to eventually compile to those target architectures. But this means a lot of duplicated work - each language needs:

tokenizing
parsing
code generation

So if you have N languages and M architectures, then supporting the full matrix takes N*M effort:

Figure 1: N times M compilers

However, if you split the "front-end" tasks of tokenizing and parsing from "back-end" tasks like code generation, you get a different pattern:

Figure 2: N plus M compilers

By adding a Virtual Machine as an intermediate stage, you can make it less costly to add new languages and/or architectures, because the complexity is N+M rather than N*M.

Now, I should take a moment here to point out that this is only worth doing if 3 things are simultaneously true:

Extracting the VM into a separate part is not a big deal.
You plan to have more than 2 languages.
You plan to have more than 2 architecture targets.

For most people building a programming language, the number of architectures worth supporting is likely 1 to 3. Likewise, part of the reason to make a programming language is to do things differently from others, so your new language might not have much in common with existing languages. This means more difficulty finding commonalities a VM can exploit.

However, there is an important exception to this advice. The LLVM project demonstrates that a modular compiler architecture is useful for multiple different languages. Front end interfaces exist for LLVM to support languages such as C, C++, Rust, Pony, and Scheme. If you're interested in making a real programming language, it might be a great idea to take advantage of the great work of LLVM's optimization and code generation back-ends, and focus your efforts on the syntax and semantics of your language by writing a front-end for LLVM.

That said, if you want to learn more about writing a stripped-down VM, read on.

Nand2Tetris VM translator

In the Nand2Tetris curriculum, chapters 7 and 8 guide you through the task of building a VM Translator program. This program translates Hack Virtual Machine sources in .vm file format into Hack Assembly .asm files.

Here's a sample of Hack VM language:

push constant 7
push constant 8
sub

This program pushes 7, then 8, onto a stack, pops both to perform the subtraction (7-8) and pushes the result (-1) onto the stack. In the end, the stack will contain only the value -1.

The language also supports user-defined functions and function calls:

function foo 1   // declare a function named foo that has 1 local variable
  push constant 42
  pop local 0      // local[0] = 42
  push argument 0  // copy arg[0] to stack
  push argument 1  // copy arg[1] to stack
  sub              // top of stack: arg[0]-arg[1]
  push local 0     // 42
  call bar 2       // call bar with 2 arguments (top 2 stack values)
  neg              // negate it
  return           // return the negated vesion of the above expression

Following the Nand2Tetris curriculum for chapter 7, implement support first for the arithmetic operations (add, sub, etc.) and then for control flow commands like if-goto and label. Then for chapter 8, implement function, call, and return.

The function-related commands of chapter 8 are non-trivial and require pushing values to the stack at runtime. This is also in my opinion the most interesting part of the whole course, because it requires you to think carefully about what kinds of transformations your program can do to the source code, and what tasks belong to the runtime environment. The key part of chapter 8 is the function calling convention. This convention dictates what happens at runtime when the code issues a call instruction, and what happens during a return instruction.

In short, the caller first pushes any arguments the callee needs to the stack, then it pushes its own return address, then it pushes the values of LCL, ARG, THIS, and THAT to the stack. It then sets the value of ARG to point at the first argument it pushed earlier. Finally, it sets LCL to the current SP value and jumps to the function it's calling. This jump is technically a goto that is allowed to leave the function's scope.

For example, the function calling convention requires pushing the current values for SP, ARG, LCL, THIS, and THAT to the stack before a function call. These are the names of host RAM addresses 0-4, respectively. The currently-running function uses these values to determine where it should look in RAM for some values it needs. Other functions will probably have different values, so something needs to keep track. That "something" happens at runtime.

"Translator" assumptions

By the end of chapter 8, you'll have a translator that can take multiple .vm files and combine them to make a single .asm file. Individual files can contain multiple functions, but "static" variables are private to their containing file.

The biggest assumption of the translator is that there is a Sys.init function which sets up some pointers, then calls main, and finally when main returns, goes into an infinite loop.

In chapter 7, these aspects are not as important, and you can test just the arithmetic and control flow commands by allowing the VM emulator to do the tasks which would otherwise be done by Sys.init.

But the additions of chapter 8 introduce some subtle edge cases in the language. For example, the language specifies that goto and if-goto may not leave their containing function. However, the chapter 7 tests use gotos outside of any function; in other words, the language has global gotos.

But these are more like unit tests than system or integration tests, and I think the intention is that "real" programs will always start by implicitly calling Sys.init, which never returns. So even though the language permits global gotos, in practice they are never used.

Another edge case of the language is the fact that there is no syntax to mark the end of a function. To demonstrate, first consider a function with an obvious ending:

function f1 0
  return  // f1 obviously ends here

But some functions have ambiguous endings:

function f2 0
  goto NEXT
  label START
  return     // f2 returns here, so the function's control ends here
  label NEXT
  goto START // but the text of the function actually ends here

In practice, the VM Translator can't tell where a function ends unless it either: a) reaches the end of the file, or b) reaches a new function definition.

A subtle consequence of this is that it's not possible to mix global code with functions, and therefore a VM Translator program must be entirely function definitions.

Compiled versus interpreted

compilation means transforming source code from one language to another; often, the destination language executable
interpretation means executing source code of a language

These are over-simplified, but workable definitions. In particular, the Nand2Tetris VM Translator is a compiler, because it transforms VM sources into Hack assembly code. But I think it would be fun to have an interpreter for this language, to enable interactive explorations and to get a more visceral feel for how it works. So I'm building one.

In this interpreter, I want to retain the feel of the original VM language. It still uses an array of signed 16 bit values as its host RAM, and the first 5 addresses represent SP, LCL, ARG, THIS, and THAT pointers. But I'm building it in a webpage, so I need to make some minor changes. The first is that filesystem access is a bit… weird in a webpage context, so I need to figure out how to handle multiple .vm sources. Maybe they will be files you have to upload to the webpage, or maybe there will be a place to copy-paste code. That's a problem for my future self, though.

What I have at the moment is closer to a compiler than an interpreter. It passes all the chapter 7 tests and most of the chapter 8 tests. The current failing test that I'm debugging involves function calls, so I suspect either my caller or callee is not doing the right thing.

Language changes

Explicit function `end` token

The (compiled) language in practice does not permit mixing functions and global code in the same .vm file. But in an interpreted context, most of the code will be global.

This means the interpreter needs to know where functions end. One way to do this would be a keyboard shortcut or button in the interpreter, to switch from "regular" mode to "function defining" mode. But I dislike this approach - it hides too much information.

I prefer to change the language. A syntactic token such as end could spell this out in the syntax:

function f3 0
  return
end

push constant 21  // executable code...
call f3 0         // between function definitions!

function f4 0
  return
end

This new language is incompatible with the original langauge, but at least it's easy to see the incompatibility. When writing software, I prefer "obviously no bugs" to "no obvious bugs"¹, so I like this kind of visibility.

No more global `goto`

An ahead-of-time compiled program knows all the possible destinations for all the goto commands in the program, so jumping forward or backward is no problem. But an interpreted program executes one line at a time, so jumping forward would require time travel, and even jumping backward would imply taking control away from the user, making the program non-interactive. Both of these situations seem bad, so I hereby abolish the global goto.

Modules instead of files(?)

This is a half-baked idea (something for my future self to figure out) but if files are currently used for namespaces, then in a webpage context, we could use something else to provide namespace-like behavior.

In the compiled VM Translator world, references to static locations resolve to unique names in the generated assembly. For example, the command push static 3 in a VM file Abc.vm becomes a Hack assembly symbol Abc.3.

Similarly, you can use the same name for different functions, as long as those functions are in different files. So if we're changing the language syntax, we may as well add namespaces:

namespace Abc
 function f 0
  push argument 0
  return
 end
end

namespace Xyz
 function f 0 //Xyz.f is a different function, same name as Abc.f
  pop local 2
  return
 end
end

In the above example I reused the function-ending end token for ending namespaces too.

Footnotes:

C.A.R. Hoare, The Emperor's Old Clothes, 1980.