Assembly Golf

Trying to write the smallest assembler possible.

Hack Assembler

The excellent Nand2Tetris curriculum includes a project in which the student writes an assembler to transform Hack assembly language into Hack machine code. The Hack assembly language is minimal, with just two instruction types: A and C.

`A` Instruction

The Hack A instruction loads a value into the A register, funnily enough. This is the only way to load literal values into the system. But the A register also doubles as the address register for both data (in RAM) and instructions (in ROM), so when you put a value like 24 in the A register, subsequent instructions can treat this as either RAM[24] or ROM[24] (or both, but that would be very unusual).

The syntax looks like this:

@24 // load a literal value
@foo // load the value associated with a symbol

If loading a symbolic value, it must resolve to either a label (which denotes a ROM address) or an address in RAM.

A instructions are translated into a 16-bit binary number where the most significant bit is 0. So if foo happens to be a variable located at RAM[17], then it would be equivalent to write @17.

After being translated to "binary", the program above would end up looking like this:

0000000000011000
0000000000010001

NOTE - it's not actual "binary", but a text representation with '0' and '1' characters, to make it easier to visually debug. This is an educational artifact, so the space and speed of turning it into true binary data is not needed.

`C` Instruction

The C instruction is simultaneously more complex and more constrained than the A instruction. It's more complex because instead of a single @ followed by a value, there are multiple parts:

dest = comp ; jump

And in addition to these multiple parts, the dest= and ;jump parts are optional.

However, it's also more constrained than an A instruction, because every possible value for each of the parts is known ahead of time, and can be looked up in a table.

For example, destinations are one of: A, D, or M, or a combination of up to all 3. Jumps can be one of JMP JEQ JLE JGE JNE JLT JGT.

There are quite a few more comp options, such as D-M, D&M, A+1 and others. In total, the Hack CPU Emulator supports 28 comp variants, and 8 each for dest and jump (including no dest and no jump since they're optional).

This means there are a total of 28*8*8 or 1792 different C instructions possible. By even ARM standards, that is a small instruction set.

However, this is the upper bound on the number of instructions. The A instruction, by contrast, can support any numeric value from 0 to 32767 (0000000000000000 to 0111111111111111 in binary).

Labels

It wouldn't be a very good assembly language without some form of goto, and the Hack assembly comes through with symbolic labels. In source code, they look like this:

(LOOP)
  D=M
  @LOOP
  0;JMP // always jump back to LOOP

The above example executes D=M, then loads the address of LOOP, which refers to the instruction number of D=M, then unconditionally jumps there.

Built-in Constant Symobls

There are several names which are built-in to the language, and which can not be redefined (they are "constant" values). These include 15 "virtual registers" R0 through R15, and some others like SP, ARG, SCREEN, and KBD. Each of these refer to a specific address in RAM.

Golfing an Assembler

The textbook suggests building the assembler in a Java-like OOP language, but that is a terrible idea for code golf, because golfing code means making it as short as possible (like in the real golf game, lower scores are better).

However, there are some suggestions which are quite suitable in any style:

initialize a symbol table with the constant symbols first
scan through the assembly file to find labels
scan through again to find variables

The reason to scan twice is that you can't tell without context whether @ASDF is a label or a variable. So you have to scan through the entire file. If you see one occurrence of (ASDF), then ASDF is a label. Otherwise, it's a variable.

Technically, it's possible to build up a symbol table which finds ASDF and isn't sure yet if it's a label or variable, and then resolve at the end once you've either found a label or not. But instead of scanning the source code twice, it instead must scan the symbol table twice. I imagine this approach would be slightly more complex, but it's possible it could be slightly faster when translating a very large assembly file with very few symbols. The largest Hack assembly file I have is the pong program provided by the authors, and my assembler chews through it in a few milliseconds, so I'm not concerned about speed.

Golfing Strategies

First, I used Python3.9.12 for this challenge, rather than a more verbose language like C or Java. I have in the past written this assembler in J, which is usually more terse than Python, but the approach I used didn't benefit very much from J's implicitly vectorized nature.

Use Dictionaries

Specifically, I use a dictionary for the symbol table. Dicts in python can be initialized with a "dict comprehension" like this:

D = {k:v for k,v in zip(keys, values)}

Which can be much shorter than setting key-value pairs explicitly like this, especially if you have lots of key-value pairs:

D = {'a':1, 'b':2}

Use Helper Functions

I defined single-letter aliases for some builtins like str.split and also for a zip-to-dict function:

S=str.split
Z=lambda x,y=-1:dict(zip(S(x,x[0]),range(y,120)))

`C` Instruction Encoding

When scanning the file, the comp parts of the C instruction are retrieved from a look-up table, so for example if you find the code M+1 then it should translate into 1110111 according to the documentation. Well, building a table like this is not too hard:

lookup = {
    'M+1': '1110111',
    ...
}

However, it's very verbose. One optimization is to notice that 1110111 in binary is the same as 119 in decimal, so you can change this lookup table to use decimal encoded numbers instead, and turn them into binary at the very end:

lookup = {
    'M+1': 119,
    ...
}

The same idea can be applied to the dest and jump parts, but it's even simpler than for the comp part, because there are no "missing" entries:

D=Z(' M D MD A AM AD AMD',0)
J=Z(' GT EQ GE LT NE LE MP',0)

This produces a dest dictionary whose keys match every possible dest and whose values can be converted to binary to form the dest part of an instruction.

{'': 0, 'M': 1, 'D': 2, 'MD': 3, 'A': 4, 'AM': 5, 'AD': 6, 'AMD': 7}

Miscellaneous

f-strings
1-space indent
; statement separator when it helps
x:=y assignment expression when it helps
4**7 instead of 16384 to save one byte
1-letter variable names
{**D1,**D2} to merge dictionaries
no error checking
don't close files or use the with ... context manager, just exit the program to "clean up"

Results

I don't want to spoil the fun for any prospective learners, so I won't share my complete code here. But I will share some information about it:

735 bytes
23 lines
the only library used is sys because sys.argv[1] has the name of the file to assemble
output is printed to console rather than written to a file
the longest line (used for comp encoding) is 165 characters
str methods used: split, strip, isdigit
print and range both appear twice, zip, dict, and open each appear once
there are 20 assignments (including three += and one :=)
1 dict comprehension
1 generator comprehension
2 for loops
six if
one elif