Endianness

What is "endianness", and what are some examples?

Etymology

The term comes from the novel Gulliver's Travels, to distinguish between two types of people, based on whether they eat an egg starting at the "big end" or the "little end". While the word was originally clearly meant to point out the ridiculous ways in which we humans create tribal divisions among ourselves, it has come to describe a useful mathematical concept of ordering a sequence of values.

In Computing

In computer systems, byte order is one thing which can have an "endianness" property. Bytes (groups of 8 bits) can represent 256 different values (00000000, 00000001, 00000010, etc.) such as the range 0 to 255 or the range -128 to +127. But most computers natively support values of greater than a single byte, so it's necessary to put a group of bytes in some consistent order.

The order that maps directly to Arabic numerals is "big endian" (BE). A BE number has its most significant bytes written on the left, and its least significant bytes on the right.

For example, the integer value 2050 can be written in hexadecimal format (0x802) or as a sequence of 2 hex values (0x08 0x02), or as two 8-bit sequences:

      0x08             0x02
┌───────────────┬───────────────┐
│0 0 0 0 1 0 0 0│0 0 0 0 0 0 1 0│
└───────────────┴───────────────┘

But we could alternatively write this in a "little endian" (LE) format:

      0x02             0x08
┌───────────────┬───────────────┐
│0 0 0 0 0 0 1 0│0 0 0 0 1 0 0 0│
└───────────────┴───────────────┘

As long as the system is consistent, storing in either format is valid.

There may be good reasons to prefer one or the other, but if you must interface two systems, it's best if they both "speak" the same Endianness - otherwise you have to keep track of which one you're using, and convert back and forth between the two formats.

Domain Names

In a domain name, the top-level domain (TLD) could reasonably be considered the "most significant" part of the name. The hostname comes before the TLD, and is probably the next most significant part. Subdomains come before the hostname, separated by dot characters, and are less significant than the hostname. However, pages within a domain or subdomain come after the TLD, and are separated by (usually) slashes.

For example, if we write a generic URL and try to number its logical components from most significant (0) to least (4), we might come up with this:

proto://subdomain.domain.tld/page/subpage
            2       1     0   3    4
            v-------<-----<
            |
            >----------------->---->

If this counts as an example of "endianness", then it's a strange one, due to the spiral pattern.

File Paths

On Linux, MacOS, and Windows, the locations of specific files in a filesystem are given as "paths", and generally in BE format:

  • /usr/bin/python3 where /usr is higher in the hierarchy
  • C:\Documents\letter.docx where C: is higher in the hierarcy

In hierarchical filesystems like these, the subfolders are written to the right of their containing folder, and individual files (not folders) are listed last.

Outside Computing

This concept may seem at first to only exist in this tiny computing-specific niche. However, it actually appears in many different domains.

Dates

For example, dates are written in different formats depending sometimes on country and/or culture:

yyyy-mm-dd China, Japan BE
mm-dd-yyyy United States middle-endian (ME)
dd-mm-yyyy Parts of Europe LE

People's Names

Most people I know (reflecting my cultural biases) have 2 or 3 names. Their names are written left to right, with the leftmost one called the "first" name, and the rightmost called the "last" name. Stretching the definition of "significant" to include "formality", the last name would be the most formal/significant, and so in the US, names are written in little endian format.

In some cultures, it is common to have more than one "last" name, and yet rarely use all of these names at once. For example, someone may have one last name from each parent, and use both of these in "official" correspondence like birth certificates, but only use one in most other situations.

Other cultures use a more big endian ordering for their names, which can cause confusion with international teams, or whenever you try to make software that tries to make sense of human names.

Places

Again depending on culture, the names of places may be written in a particular BE or LE style. The style I see used most often is LE.

For example:

  • Bloomington, IN (city, state)
  • Bloomington, IN, USA (city, state, country)
  • Oslo, Norway (city, country)

City names are often reused in different regions, so only mentioning "Bloomington" or even "Bloomington, USA" is ambiguous. However, some especially large or famous cities (New York City, London, Tokyo) are globally unambiguous, because people will tacitly assume you mean the most famous one.

Which is Better?

In theory, neither; in practice, it depends.

My opinion is that Little Endian is preferred in situations where the global context can be easily inferred. For example, where I work there is a "campus mail" system. Within this system, the planet, country, state, city, and approximate geographic location are all implied, so there is no reason to list any of them. However, sending mail outside of this system requires more specificity.

One argument for little endian is that if you need to change between campus mail and worldwide mail, the only necessary change is to add a suffix to specify the state/country. This reduces the burden of writing addresses because they become easier to change.

A very niche argument for little endian is that prefixes are easier to search for than suffixes. So in systems like Integrated Development Environments (IDEs) naming variables with a common prefix will result in more false positives and require more typing, compared to if you name variables with a common suffix.

So names like these:

  • widgetController
  • widgetConstructor
  • widgetRemover

Are harder to search for compared to names like these:

  • controllerForWidget
  • constructorForWidget
  • removerForWidget

Granted, encoding this kind of information in variable names is probably not the best idea, but not all systems have namespaces or advanced type systems that more elegantly solve these problems.

Big Endian seems to be a better match for domains where speed is more important than accuracy, or where accuracy is more important than precision. For example, if you want to approximate the sum of a set of numbers, you might pay more attention to the leading digits (and the length of the number) than the trailing digits. Similarly, if you transmit information over a communication channel, it's good to put the important information first, in case the message is truncated or garbled over time.

This strategy is employed in subject lines in emails, headlines in newspapers, and abstracts in scientific writing.