Trying to write the smallest assembler possible.
The excellent Nand2Tetris curriculum includes a project in which the student writes an assembler to transform Hack assembly language into Hack machine code. The Hack assembly language is minimal, with just two instruction types: A and C.
A
Instruction
The Hack A instruction loads a value into the A register, funnily enough.
This is the only way to load literal values into the system.
But the A register also doubles as the address register for both data (in RAM) and instructions (in ROM), so when you put a value like 24
in the A register, subsequent instructions can treat this as either RAM[24] or ROM[24] (or both, but that would be very unusual).
The syntax looks like this:
@24 // load a literal value @foo // load the value associated with a symbol
If loading a symbolic value, it must resolve to either a label (which denotes a ROM address) or an address in RAM.
A instructions are translated into a 16-bit binary number where the most significant bit is 0.
So if foo
happens to be a variable located at RAM[17], then it would be equivalent to write @17
.
After being translated to "binary", the program above would end up looking like this:
0000000000011000 0000000000010001
NOTE - it's not actual "binary", but a text representation with '0' and '1' characters, to make it easier to visually debug. This is an educational artifact, so the space and speed of turning it into true binary data is not needed.
C
Instruction
The C instruction is simultaneously more complex and more constrained than the A instruction.
It's more complex because instead of a single @
followed by a value, there are multiple parts:
dest = comp ; jump
And in addition to these multiple parts, the dest=
and ;jump
parts are optional.
However, it's also more constrained than an A instruction, because every possible value for each of the parts is known ahead of time, and can be looked up in a table.
For example, destinations are one of: A
, D
, or M
, or a combination of up to all 3.
Jumps can be one of JMP JEQ JLE JGE JNE JLT JGT
.
There are quite a few more comp
options, such as D-M
, D&M
, A+1
and others.
In total, the Hack CPU Emulator supports 28 comp
variants, and 8 each for dest
and jump
(including no dest
and no jump
since they're optional).
This means there are a total of 28*8*8 or 1792 different C instructions possible. By even ARM standards, that is a small instruction set.
However, this is the upper bound on the number of instructions. The A instruction, by contrast, can support any numeric value from 0 to 32767 (0000000000000000 to 0111111111111111 in binary).
It wouldn't be a very good assembly language without some form of goto
, and the Hack assembly comes through with symbolic labels.
In source code, they look like this:
(LOOP) D=M @LOOP 0;JMP // always jump back to LOOP
The above example executes D=M
, then loads the address of LOOP
, which refers to the instruction number of D=M
, then unconditionally jumps there.
There are several names which are built-in to the language, and which can not be redefined (they are "constant" values).
These include 15 "virtual registers" R0
through R15
, and some others like SP
, ARG
, SCREEN
, and KBD
.
Each of these refer to a specific address in RAM.
The textbook suggests building the assembler in a Java-like OOP language, but that is a terrible idea for code golf, because golfing code means making it as short as possible (like in the real golf game, lower scores are better).
However, there are some suggestions which are quite suitable in any style:
The reason to scan twice is that you can't tell without context whether @ASDF
is a label or a variable.
So you have to scan through the entire file.
If you see one occurrence of (ASDF)
, then ASDF
is a label.
Otherwise, it's a variable.
Technically, it's possible to build up a symbol table which finds ASDF
and isn't sure yet if it's a label or variable, and then resolve at the end once you've either found a label or not.
But instead of scanning the source code twice, it instead must scan the symbol table twice.
I imagine this approach would be slightly more complex, but it's possible it could be slightly faster when translating a very large assembly file with very few symbols.
The largest Hack assembly file I have is the pong program provided by the authors, and my assembler chews through it in a few milliseconds, so I'm not concerned about speed.
First, I used Python3.9.12 for this challenge, rather than a more verbose language like C or Java. I have in the past written this assembler in J, which is usually more terse than Python, but the approach I used didn't benefit very much from J's implicitly vectorized nature.
Specifically, I use a dictionary for the symbol table. Dicts in python can be initialized with a "dict comprehension" like this:
D = {k:v for k,v in zip(keys, values)}
Which can be much shorter than setting key-value pairs explicitly like this, especially if you have lots of key-value pairs:
D = {'a':1, 'b':2}
I defined single-letter aliases for some builtins like str.split
and also for a zip-to-dict function:
S=str.split Z=lambda x,y=-1:dict(zip(S(x,x[0]),range(y,120)))
C
Instruction Encoding
When scanning the file, the comp
parts of the C instruction are retrieved from a look-up table, so for example if you find the code M+1
then it should translate into 1110111
according to the documentation.
Well, building a table like this is not too hard:
lookup = { 'M+1': '1110111', ... }
However, it's very verbose.
One optimization is to notice that 1110111
in binary is the same as 119
in decimal, so you can change this lookup table to use decimal encoded numbers instead, and turn them into binary at the very end:
lookup = { 'M+1': 119, ... }
The same idea can be applied to the dest
and jump
parts, but it's even simpler than for the comp
part, because there are no "missing" entries:
D=Z(' M D MD A AM AD AMD',0) J=Z(' GT EQ GE LT NE LE MP',0)
This produces a dest
dictionary whose keys match every possible dest
and whose values can be converted to binary to form the dest
part of an instruction.
{'': 0, 'M': 1, 'D': 2, 'MD': 3, 'A': 4, 'AM': 5, 'AD': 6, 'AMD': 7}
;
statement separator when it helpsx:=y
assignment expression when it helps4**7
instead of 16384
to save one byte{**D1,**D2}
to merge dictionarieswith ...
context manager, just exit the program to "clean up"I don't want to spoil the fun for any prospective learners, so I won't share my complete code here. But I will share some information about it:
sys
because sys.argv[1]
has the name of the file to assemblecomp
encoding) is 165 charactersstr
methods used: split
, strip
, isdigit
print
and range
both appear twice, zip
, dict
, and open
each appear once+=
and one :=
)dict
comprehensionfor
loopsif
elif