A compact method of decoding bytecode.
Let's pretend we have a programming language in which mnemonics like push
or swap
map to small numbers like 2
and 5
.
We want to transform the human-written version of the language into the machine-readable numbers.
Here is the mapping of words-to-numbers:
{ 'push': 0, 'pop': 1, 'dup': 3, 'swap': 7, 'sub': 11, 'store': 15 }
A small example program would look like push pop sub
.
Obviously, a high level language compiler could use a lookup table exactly like this and call it a day.
But let's optimize it!
Imagine for a moment that the numbers had no gaps:
{ 'push': 0, 'pop': 1, 'dup': 2, 'swap': 3, 'sub': 4, 'store': 5 }
This would be much easier to optimize, because we could simply increment a counter while scanning through the words until we find a match:
words = ['push', 'pop', 'dup', 'swap', 'sub', 'store'] for word in ['push', 'pop', 'dup', 'sub']: #example program print(words.index(word)) # 0 # 1 # 2 # 4
However, the actual numbers are unevenly spaced, but there is a workaround. We can pad our list so the indexes line up correctly:
words = ['push', 'pop', None, 'dup', None, None, None, 'swap', None, None, None, 'sub', None, None, None, 'store'] for word in ['push', 'pop', 'dup', 'sub']: print(words.index(word)) # 0 # 1 # 3 # 11
For general purpose code, of course you should not do this.
In Python, the list.index
method has linear time complexity, but the dictionary lookup is constant time on average, and only linear in worst case.
Also, a small change to the bytecode could cause our padding to explode:
{ 'push': 0, 'pop': 1, 'dup': 3, 'swap': 7, 'sub': 11, 'store': 150 }
Changing the opcode for store
from 15
to 150
grows the padded list by almost 10x.
This technique is neither general nor does it have particularly good performance.
But there is one situation in which it is optimal. Code golf.
If run-time performance and code readability are not important, the indexing method can result in compact source code.
Note the extra blank spaces in the string, which become padding in the list after calling the split
method:
for word in ['push','pop','dup','sub']: print('push pop dup swap sub store'.split(' ').index(word))
Contrast with the dictionary version (which is actually shorter):
for word in ['push','pop','dup','sub']: print({'push':0,'pop':1,'dup':3,'swap':7,'sub':11,'store':15}[word])
You can also use the index of a word itself as an index into a list of all the bytecode values:
for word in ['push','pop','dup','sub']: print([0,1,3,7,11,15]['push pop dup swap sub store'.split().index(word)])
We could also look at unique prefixes of the words:
push pu pop po dup du swap sw sub su store st
…and calculate the appropriate indexes from those:
for word in ['push','pop','dup','sub']: print([0,1,3,7,11,15]['pupoduswsust'.index(word[:2])//2])
These examples so far are perfect hash functions. First of all, they are hash functions because they turn a string input into a numeric output. But because there is exactly one value for each key and no collisions, we can call them perfect hash functions.