Bytecode Indexing

A compact method of decoding bytecode.

Bytecode

Let's pretend we have a programming language in which mnemonics like push or swap map to small numbers like 2 and 5. We want to transform the human-written version of the language into the machine-readable numbers. Here is the mapping of words-to-numbers:

{
  'push': 0,
  'pop': 1,
  'dup': 3,
  'swap': 7,
  'sub': 11,
  'store': 15
}

A small example program would look like push pop sub. Obviously, a high level language compiler could use a lookup table exactly like this and call it a day. But let's optimize it!

Indexing

Imagine for a moment that the numbers had no gaps:

{
  'push': 0,
  'pop': 1,
  'dup': 2,
  'swap': 3,
  'sub': 4,
  'store': 5
}

This would be much easier to optimize, because we could simply increment a counter while scanning through the words until we find a match:

words = ['push', 'pop', 'dup', 'swap', 'sub', 'store']
for word in ['push', 'pop', 'dup', 'sub']:  #example program
  print(words.index(word))
  # 0
  # 1
  # 2
  # 4

However, the actual numbers are unevenly spaced, but there is a workaround. We can pad our list so the indexes line up correctly:

words = ['push', 'pop', None, 'dup', None, None, None, 'swap',
         None, None, None, 'sub', None, None, None, 'store']
for word in ['push', 'pop', 'dup', 'sub']:
  print(words.index(word))
  # 0
  # 1
  # 3
  # 11

Why?

For general purpose code, of course you should not do this. In Python, the list.index method has linear time complexity, but the dictionary lookup is constant time on average, and only linear in worst case. Also, a small change to the bytecode could cause our padding to explode:

{
  'push': 0,
  'pop': 1,
  'dup': 3,
  'swap': 7,
  'sub': 11,
  'store': 150
}

Changing the opcode for store from 15 to 150 grows the padded list by almost 10x. This technique is neither general nor does it have particularly good performance.

But there is one situation in which it is optimal. Code golf.

Size

If run-time performance and code readability are not important, the indexing method can result in compact source code. Note the extra blank spaces in the string, which become padding in the list after calling the split method:

for word in ['push','pop','dup','sub']:
 print('push pop  dup    swap    sub    store'.split(' ').index(word))

Contrast with the dictionary version (which is actually shorter):

for word in ['push','pop','dup','sub']:
 print({'push':0,'pop':1,'dup':3,'swap':7,'sub':11,'store':15}[word])

You can also use the index of a word itself as an index into a list of all the bytecode values:

for word in ['push','pop','dup','sub']:
 print([0,1,3,7,11,15]['push pop dup swap sub store'.split().index(word)])

We could also look at unique prefixes of the words:

push  pu
pop   po
dup   du
swap  sw
sub   su
store st

…and calculate the appropriate indexes from those:

for word in ['push','pop','dup','sub']:
 print([0,1,3,7,11,15]['pupoduswsust'.index(word[:2])//2])

These examples so far are perfect hash functions. First of all, they are hash functions because they turn a string input into a numeric output. But because there is exactly one value for each key and no collisions, we can call them perfect hash functions.