Two Bytes: A Space-Conservative History of Characters in Computer Programming
On my journey into the strictly typed world of Java, I have begun to understand the role of memory in computer programming. I dug into the definition of a bit, which helped me understand why a byte, at 8 bits, can hold 256 unique values, while an int, at 32 bits, can hold over 4 billion. What I questioned, however, was why a char needed 16 bits of memory. The alphabet only has 26 letters in it — what could we possibly do with 65,536 (2¹⁶) unique values?
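Java can report these bit widths itself through each wrapper class's `SIZE` constant. A minimal sketch (the class name is my own invention):

```java
public class BitWidths {
    public static void main(String[] args) {
        // Each wrapper class reports its primitive's width in bits
        System.out.println("byte: " + Byte.SIZE + " bits");      // 8
        System.out.println("int:  " + Integer.SIZE + " bits");   // 32
        System.out.println("char: " + Character.SIZE + " bits"); // 16

        // 2^16 distinct values fit in a char
        System.out.println("char values: " + (1 << Character.SIZE)); // 65536
    }
}
```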
The short answer is that there are people who want or need to use characters not contained in the basic Latin alphabet. The long answer will take us back about 180 years, to a time before computers.
Giving letters numeric values, or character encoding, goes at least as far back as human communication via machine: the telegraph. Morse code, introduced in the 1840s, used combinations of four “symbols” to communicate messages, and in 1870, Émile Baudot introduced Baudot code, the first machine code to use fixed-length binary symbols. A descendant of Baudot’s 5-bit encoding for Latin characters was standardized as International Telegraph Alphabet №2 (ITA2) in 1930.
Over the next 30 years, the US military iterated on Baudot’s code with the goal of improving consistency and compatibility. Their version, called Fieldata code, was finalized in 1959, but we don’t remember much about it today because the American Standards Association’s X3.2 subcommittee released a much simpler and more durable code four years later. That code, called the American Standard Code for Information Interchange, or ASCII, set the stage for the next 20 years of character encoding.
The subcommittee settled on 7 bits, enough for 128 total combinations: 95 unique printable characters and 33 control codes. The printable characters included the digits 0–9, the upper- and lowercase letters of the Latin alphabet, punctuation, and mathematical and other symbols. The control codes applied to teletype machines rather than computers and are mostly obsolete now, but some, such as ESC, live on.
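In Java, casting a char to an int exposes the numeric code behind it, and for the basic Latin range those codes match the original ASCII table. A short sketch (the class name is mine):

```java
public class AsciiCodes {
    public static void main(String[] args) {
        // Casting a char to int reveals its numeric code; for the basic
        // Latin range, Java's values match the original ASCII table
        System.out.println((int) 'A'); // 65
        System.out.println((int) 'a'); // 97
        System.out.println((int) '0'); // 48

        // ESC, one of the surviving control codes, sits at 27
        char esc = '\u001B';
        System.out.println((int) esc); // 27
    }
}
```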
All was well — computer programmers had a standard for encoding characters that fit neatly into one byte and left them with an extra bit of space to boot. But that extra bit revealed a myriad of possibilities, few of which did much to uphold the Standard in American Standard Code for Information Interchange.
Joel Spolsky’s article on Unicode and character sets (listed below) does an excellent job of explaining the problems the spare bit caused. With the 128 ASCII combinations spoken for, there were 128 more unspoken for. And seemingly every programmer had a different idea about which symbols to add. Some decided on lines for drawing, others on characters from their native language, such as Greek letters, Hebrew letters, or letters with accent marks. It got to the point where one file might be totally illegible when moved from one computer to another, even within the same country and the same language.
The ANSI (American National Standards Institute) standard solved this problem by establishing codepages for different languages, countries, and operating systems. If I’m in Russia, I’ll stick to one specific ANSI codepage that includes Cyrillic characters, and any other computer set to the same codepage will be able to read my writing. Of course, this didn’t solve the problem of wanting to read two languages on one computer, not to mention the Han script and its descendants (used in Chinese, Japanese, Korean, and many other languages throughout Asia), which cannot be contained in 256 unique spaces.
My question has been answered: 8 bits is woefully inadequate to store the wide variety of characters used throughout the world. But Sun Microsystems didn’t invent the 16-bit character when they wrote Java — that work had already been done for them. This story would not be complete without mention of the true hero of character coding: The Unicode Consortium.
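Even 16 bits turn out not to be the end of the story: characters beyond the first 65,536 values don’t fit in a single Java char, so Strings store them as a pair of chars called a surrogate pair. A small sketch (the emoji is just a convenient example):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1F600 (a grinning-face emoji) lies beyond the 65,536 values a
        // single char can hold, so Java stores it as two chars: D83D + DE00
        String emoji = "\uD83D\uDE00";
        System.out.println(emoji.length());                          // 2 chars
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1 code point
    }
}
```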
The concept of Unicode dates back to at least 1988, when Joe Becker, an engineer at Xerox, published a draft proposal for an “international/multilingual text character encoding system.” Xerox had developed and maintained its own character code standard, called XCCS, since 1982. In his proposal, Unicode 88, Becker wrote that “[t]he name ‘Unicode’ is intended to suggest a unique, unified, universal encoding.”
The first volume of the Unicode standard was published in 1991. Its hallmark, besides becoming the universally accepted method for representing characters in computing, is that it organizes characters by hexadecimal code points. Rather than thinking in binary as we did with ASCII, we write the Unicode code point for A as U+0041.
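Java’s char literals accept these hexadecimal code points directly through the `\u` escape, and the standard library can even look up a code point’s official Unicode name. A short sketch (class name is mine):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // The \u escape embeds a hexadecimal Unicode code point in a char literal
        char a = '\u0041';
        System.out.println(a); // prints A

        // Character.getName (Java 7+) returns the code point's official name
        System.out.println(Character.getName(0x0041)); // LATIN CAPITAL LETTER A
    }
}
```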
This allows Unicode to break free from fixed bit limits and give users over one million possible characters. English-speaking space-savers can use the ever-popular UTF-8 (Unicode Transformation Format), which spends a single byte on each ASCII character and expands to two, three, or four bytes for everything else. Others have taken different routes: the government standard in China is a coded character set called GB18030, which maps every Unicode code point and supports both simplified and traditional Chinese.
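UTF-8’s variable width is easy to observe from Java, since `String.getBytes` accepts an explicit charset. A sketch, with sample characters of my own choosing:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Widths {
    public static void main(String[] args) {
        // UTF-8 spends one byte on ASCII and stretches to four bytes
        // for the farthest reaches of Unicode
        String[] samples = { "A", "\u00E9", "\u4E2D", "\uD83D\uDE00" }; // A, é, 中, 😀
        for (String s : samples) {
            System.out.println(s + " -> "
                    + s.getBytes(StandardCharsets.UTF_8).length + " byte(s)");
        }
    }
}
```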
As always, there is a tremendous amount of detail and computer science left to explore on this subject. If you are brave, check out some of the more scholarly sources and don’t hesitate to reach out if you have time to explain them to me.
- Supplementary Characters in the Java Platform by Norbert Lindenberg and Masayoshi Okutsu
- An in-depth look at Java’s character type by Chuck McManis
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
- Unicode Character Charts
- Wikipedia’s overviews of character encoding, Unicode, and ASCII