Two Bytes: A Space-Conservative History of Characters in Computer Programming

On my journey into the strictly typed world of Java, I have begun to understand the role of memory in computer programming. I dug into the definition of a bit, which helped me understand why a byte, at 8 bits, can hold 256 unique values, while an int, at 32 bits, can hold over 4 billion. What I questioned, however, was why a char needed 16 bits of memory. The alphabet only has 26 letters in it — what could we possibly do with 65,536 (2¹⁶) unique values?
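Java itself can confirm these widths (a minimal sketch; the class name is my own invention):

```java
public class Sizes {
    public static void main(String[] args) {
        // Each wrapper class reports its primitive's width in bits,
        // and shifting 1 left by that width gives the count of unique values.
        System.out.println("byte: " + Byte.SIZE + " bits, "
                + (1 << Byte.SIZE) + " values");           // 8 bits, 256 values
        System.out.println("int:  " + Integer.SIZE + " bits"); // 32 bits
        System.out.println("char: " + Character.SIZE + " bits, "
                + (1 << Character.SIZE) + " values");      // 16 bits, 65536 values
    }
}
```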

The short answer is that there are people who want or need to use characters not contained in the basic Latin alphabet. The long answer will take us back about 220 years, to a time before computers.

Telecommunications

The story begins with the telegraph. In the 1870s, the French engineer Émile Baudot devised a fixed-length five-bit code for telegraphy, assigning each letter its own pattern of on and off signals. Ironically, the Baudot keyboard had no means of communicating the first letter of its inventor's name: an accented É.

Over the next 30 years, the U.S. military iterated on Baudot's code with the goal of improving consistency and compatibility. Their version, called Fieldata code, was finalized in 1959, but we don't remember much about it today because the American Standards Association's X3.2 subcommittee released a much simpler and more durable code four years later. That code, called the American Standard Code for Information Interchange, or ASCII, set the stage for the next 20 years of character encoding.

ASCII

ASCII encodes each of its 128 characters (letters, digits, punctuation marks, and control codes) in seven bits. On the original chart, the first four bits appear horizontally in a row and the final three in the upper-most cells.

All was well — computer programmers had a standard for encoding characters that fit neatly into one byte and left them with an extra bit of space to boot. But that extra space revealed a myriad of possibilities that didn’t do much to uphold the Standard in American Standard Code for Information Interchange.
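You can watch ASCII's seven-bit economy from Java (a small sketch; the class name is mine):

```java
public class AsciiDemo {
    public static void main(String[] args) {
        // ASCII assigns 'A' the decimal value 65, which fits in seven bits.
        System.out.println((int) 'A');                   // 65
        System.out.println(Integer.toBinaryString('A')); // 1000001 (seven bits)
        // The eighth bit of the byte is the extra space left over.
    }
}
```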

Chaos Reigns

The ANSI (American National Standards Institute) standard solved this problem by establishing codepages for different languages, countries, and operating systems. If I'm in Russia, I'll stick to one specific ANSI codepage, which includes Russian characters, and any other computer set to the same codepage will be able to read my writing. Of course, this didn't solve the problem of wanting to read two languages on one computer, not to mention the Han script and the writing systems derived from it (used for Chinese, Japanese, Korean, and other languages throughout Asia), which cannot possibly fit into 256 unique spaces.

My question has been answered: 8 bits is woefully inadequate to store the wide variety of characters used throughout the world. But Sun Microsystems didn't invent the 16-bit character when they wrote Java — that work had already been done for them. This story would not be complete without mention of the true hero of character encoding: The Unicode Consortium.

Unicode

The concept of Unicode dates back to at least 1988, when Joe Becker, an engineer at Xerox, published a draft proposal for an “international/multilingual text character encoding system.” Xerox had developed and maintained their own character code standard, called XCCS, since 1982. In his proposal, Unicode 88, Becker wrote “[t]he name ‘Unicode’ is intended to suggest a unique, unified, universal encoding”.

The first volume of the Unicode standard was published in 1991. Beyond becoming the universally accepted method for representing characters in computing, its hallmark is that it organizes characters by hexadecimal code points. Rather than thinking in binary code like we saw with ASCII, we write the Unicode code point for A as U+0041.
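Java's char literals speak this notation natively; hexadecimal 41 is just decimal 65 in different clothes (a quick sketch, class name my own):

```java
public class CodePoints {
    public static void main(String[] args) {
        // '\u0041' is the Unicode escape for code point U+0041, i.e. 'A'.
        char a = '\u0041';
        System.out.println(a == 'A');            // true
        System.out.printf("U+%04X%n", (int) a);  // U+0041
    }
}
```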

This allows Unicode to break free from bit limits and give users the option of over one million characters. English-speaking space-savers can use the ever-popular UTF-8 (Unicode Transformation Format, 8-bit), which spends a single byte on each character in the ASCII range and expands to two, three, or four bytes only for everything beyond it. Others have standardized their own Unicode-compatible encodings: the government standard in China is a coded character set called GB18030, which supports Unicode code points for both simplified and traditional Chinese.
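UTF-8's variable width is easy to observe in Java (a minimal sketch; the class name is my own, and the non-ASCII characters are written as Unicode escapes):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Widths {
    public static void main(String[] args) {
        // UTF-8 spends one byte on ASCII and more only when it must.
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);      // 1 byte
        System.out.println("\u00C9".getBytes(StandardCharsets.UTF_8).length); // É: 2 bytes
        System.out.println("\u4E2D".getBytes(StandardCharsets.UTF_8).length); // 中: 3 bytes
    }
}
```

Baudot's accented É, unrepresentable on his own keyboard, costs just one extra byte here.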

As always, there is a tremendous amount of detail and computer science left to explore on this subject. If you are brave, check out some of the more scholarly sources and don’t hesitate to reach out if you have time to explain them to me.
