Armchair Language Design (part 1)

Documenting features I like to see in a programming language.

What is Language Design?

Even though I’ve written a compiler for (a subset of) Racket, and I’ve written an RPN calculator that lets the user assign variables, I don’t consider myself a programming language implementor. I think to consider yourself a PL implementor, you need to also have designed the language. Design is about tradeoffs and choices, and what you include is just as important as what you leave out.

So, what do I think is most important in a programming language? The following are some goals and the associated values or tradeoffs that go along with them.

Ease of Experimentation

Having a bare-bones REPL is better than not having one, and having a really full-featured REPL is even better. While it’s possible to add REPL support to most languages (even C++ has ROOT), the user experience is better if the language itself is designed with quick prototyping in mind.

At a fundamental level, this implies the language prefers terseness to verbosity.

Values

The language should have a core set of literal values built in. This doesn’t have to be as rich as J’s “constants” mini-language, but it should support, at minimum:

  • booleans
  • integers
  • real values
  • ASCII characters
  • strings

This set is sufficient for most languages, but I also think times and dates are fundamental enough to practical programming that they should be built into the language rather than provided as a library.

Types

Closely related to values is the concept of types, which are a way of constraining what values fit into certain categories. I prefer a numeric hierarchy where booleans are 0 and 1, which means they’re a subset of integers, and integers in turn are a subset of real values. Mixing values at different positions on the numeric hierarchy should “promote when necessary, demote when possible”. The goal of this strategy is to be mathematically correct first and foremost, but also save space if possible.
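
As a rough illustration of “promote when necessary, demote when possible”, here is a minimal Python sketch. The function names (demote, add, divide) are made up for this example, and demoting all the way down to booleans is only one reading of the rule. Python's own arithmetic already handles the promotion half, so the sketch only adds the demotion step.

```python
def demote(value):
    """Return the narrowest of bool/int/float that holds value exactly."""
    if isinstance(value, float) and value.is_integer():
        value = int(value)      # real -> integer, e.g. 3.0 -> 3
    if type(value) is int and value in (0, 1):
        value = bool(value)     # integer -> boolean, e.g. 1 -> True
    return value


def add(a, b):
    # Python promotes mixed operands by itself (bool + int -> int,
    # int + float -> float); the extra step is demoting the result.
    return demote(a + b)


def divide(a, b):
    # True division always produces a float in Python, so this is
    # where demotion does most of its work.
    return demote(a / b)
```

With these definitions, divide(6, 3) gives the integer 2 rather than 2.0, while divide(1, 2) stays at 0.5 because no narrower type can hold it.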

Complex numbers or higher-dimensional numerics (quaternions, octonions), while critical in some fields, are not critical enough to be part of the language, so they should be provided by libraries.

Datetimes could be part of the numeric hierarchy, but their rules are different enough to warrant being separate. I think dates and times should be part of the language itself, to avoid having to convert among conflicting representations.

Textual values are as important as numbers in programming (and maybe more important than dates), so I want this language to painlessly support ASCII characters and strings (1-D arrays) of them. ASCII is only the lowest common denominator, though, so full UTF-8 support should be built in as well. Unfortunately, this is much more complex than ASCII, and some operations which are cheap to perform on ASCII are dangerous or impossible to do across all of Unicode. A compromise may be to treat ASCII/UTF-8 similarly to bool/int/real: promote when necessary, demote when possible. But this is outside of my comfort zone, so I fully expect there to be edge cases that complicate this idea.
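
One way to picture the ASCII/Unicode compromise is the Python sketch below. It is only an assumption about how such a scheme could look: text is kept as compact ASCII bytes while every character fits, and falls back to a full Unicode string otherwise. The helper names (demote_text, promote_text, concat) are invented for the example.

```python
def demote_text(text: str):
    """Store text as ASCII bytes when every character fits, else keep str."""
    return text.encode("ascii") if text.isascii() else text


def promote_text(value) -> str:
    """Widen ASCII bytes to a Unicode string; leave strings unchanged."""
    return value.decode("ascii") if isinstance(value, bytes) else value


def concat(a, b):
    # Promote both sides so they can be combined, then demote the result.
    return demote_text(promote_text(a) + promote_text(b))


# One of the edge cases: once text leaves ASCII, "length" is ambiguous.
word = "café"
code_points = len(word)                  # counts characters
utf8_bytes = len(word.encode("utf-8"))   # counts bytes, one more than above
```

Here concat(b"abc", b"def") stays in the compact form, while concat(b"caf", "é") has to be promoted. The last two lines show why ASCII-style assumptions (one byte per character) stop holding after promotion.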

Composite Types

The most essential composite type is the array. This language should be quite array-oriented, and support the basic operations that users of APL, J, and K take for granted, such as arithmetic operations that work on all mixtures of scalar and array arguments.

Specifically, the language should transparently support, at minimum:

  1 + 2 3 4  NB. scalar + array
3 4 5
  10 20 30 + 10 20 30  NB. array + array
20 40 60
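
For readers who don't know J, the same behaviour can be approximated in Python. This is a minimal sketch for one-dimensional lists only, and the function name plus is hypothetical. It is not how J implements the operation.

```python
from numbers import Number


def plus(x, y):
    """Add scalars and 1-D lists in any combination, element by element."""
    x_is_scalar = isinstance(x, Number)
    y_is_scalar = isinstance(y, Number)
    if x_is_scalar and y_is_scalar:
        return x + y
    if x_is_scalar:
        return [x + item for item in y]      # scalar + array
    if y_is_scalar:
        return [item + y for item in x]      # array + scalar
    if len(x) != len(y):
        raise ValueError("arrays must have the same length")
    return [a + b for a, b in zip(x, y)]     # array + array
```

Calling plus(1, [2, 3, 4]) and plus([10, 20, 30], [10, 20, 30]) reproduces the two results shown in the J session.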

For this language, an array is defined as a collection of 0 or more elements, arranged along 0 or more dimensions, where each element has the same type. Further, these elements are stored contiguously, which means arrays support random access to their elements.

This type of array is sometimes called “flat”, as opposed to “nested”, because it is a single container, rather than a container of containers.

Since the elements are not nested, a list of dimensions fully describes the shape of this type of array. For example, a matrix with 2 rows and 3 columns would have shape (2 3).

This type of array can be represented compactly in memory, and many operations on such arrays can be made fast.
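
To make the “flat” layout concrete, here is a small Python sketch of an array that stores a shape plus one contiguous run of elements, and turns a multi-dimensional index into a single offset. The class name FlatArray and the row-major ordering are assumptions made for this example. A Python list also doesn't enforce the same-type rule, so that part is left out.

```python
from math import prod


class FlatArray:
    """A shape plus one flat run of elements, stored in row-major order."""

    def __init__(self, shape, data):
        data = list(data)
        if len(data) != prod(shape):
            raise ValueError("data length does not match shape")
        self.shape = tuple(shape)
        self.data = data

    def offset(self, index):
        """Convert a tuple of per-dimension indices into a flat position."""
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        position = 0
        for i, size in zip(index, self.shape):
            if not 0 <= i < size:
                raise IndexError("index out of range")
            position = position * size + i
        return position

    def __getitem__(self, index):
        # Random access is a single arithmetic step plus one lookup.
        return self.data[self.offset(index)]


matrix = FlatArray((2, 3), [10, 20, 30, 40, 50, 60])
last = matrix[1, 2]   # row 1, column 2
```

Because the shape is the only structural information, a 2-by-3 matrix and a 6-element vector can share exactly the same storage and differ only in their shape tuple.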

Now, while it’s certainly possible to program using only arrays, there is one more composite data type that makes life a lot more pleasant: associative arrays (also known as dictionaries). This data type should support keys which are (nearly) any value. In Python, keys must be hashable and comparable by value. These restrictions allow the implementation to be reasonably fast, so they’re tolerable. In practice, most people use either strings or integers as dictionary keys, although I have used tuples once or twice. Being able to look up a value based on a programmatically-generated key is extremely useful.
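
Since the paragraph leans on Python's dictionaries, here is a short sketch of the two points it makes: keys can be generated by the program (tuples in this case), and keys that can't be hashed are rejected. The variable names are only for illustration.

```python
# Build a lookup table whose keys are (row, column) tuples generated in a loop.
grid = {}
for row in range(2):
    for col in range(3):
        grid[(row, col)] = row * 3 + col

# Keys are compared by value, so a freshly built tuple finds the same entry.
found = grid[tuple([1, 2])]

# A list is mutable and therefore unhashable, so it can't be used as a key.
try:
    {[1, 2]: "not allowed"}
    list_key_rejected = False
except TypeError:
    list_key_rejected = True
```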

Memory Management

Memory management should be as automatic as possible. I expand the domain of “memory management” to include things like open file handles and database connections, because the real pain of memory management is that you must remember to free these resources when you’re done using them.

While Python has the with statement and context managers to free resources automatically, Rust is actually much better about this: its ownership rules release a resource automatically when its owner goes out of scope, so there is no cleanup step to forget, and code that tries to use a resource after it has been released won’t compile.
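
The Python half of that comparison can be sketched as follows. Connection is a hypothetical stand-in for a file handle or database connection, and connect is an invented helper built on the standard contextlib module.

```python
from contextlib import contextmanager


class Connection:
    """Hypothetical resource standing in for a file or database handle."""

    def __init__(self):
        self.is_open = True

    def close(self):
        self.is_open = False


@contextmanager
def connect():
    conn = Connection()
    try:
        yield conn
    finally:
        # Runs when the with block exits, including when it exits via an exception.
        conn.close()


with connect() as conn:
    open_inside_block = conn.is_open

closed_after_block = not conn.is_open
```

The weak spot is that nothing forces anyone to use the with statement: calling Connection() directly and never closing it is still perfectly valid Python.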

Modularity vs. Extensibility

For me, being able to extend the language to suit my particular needs is not as important as being able to reuse functionality across different projects. While I have experienced the lispy epiphany of being able to extend the language with code that’s indistinguishable from the “host” language, I honestly was not all that impressed. To me, being able to write my own functions that look like the built-in functions is not a huge selling point of a language.

On the other hand, I very much appreciate modules that allow me to reuse functionality that someone else wrote. Maybe the best examples of this would be things like XML or HTML parsing. These are commonly-used formats that I have to interact with sometimes, but they’re also complex formats with lots of edge cases and quirks. I’d much rather use a battle-tested library for parsing these types of data than attempt to write my own. Granted, I can probably write my own for a particular subset of XML, but I will definitely forget some edge cases that are rare in my experience but not rare in the grand scheme of all XML that I may encounter in the wild.

So while extensibility of the language and modularity are somewhat orthogonal, I think they can be lumped together in some ways because they’re both dealing with adapting to problems beyond what the language was ostensibly designed to solve.

And if I have to pick one or the other, I would pick modularity.

As a consequence of this preference, my preferred language does not need a macro system, hygienic or otherwise. However, it does need a module system for importing (and exporting) code to share between codebases.

Performance

It should be fast.