I was investigating an error with a service that parses a CSV file. The job failed to complete and, when I reviewed the logs, nearly every row of the file was logged as having invalid characters. When I read the file myself, it looked fine, so I was very confused.
So I consulted a teammate, who showed me how to set up Sublime Text to display a file’s encoding. For this file, it read UTF-16 LE with BOM.
He explained that our service expects files to be encoded in UTF-8 and that we were seeing invalid characters because of this mismatch. We brought this up in a larger meeting with the stakeholder who originally raised the issue, and my teammate said a dangerous set of words regarding LE with BOM: “I don’t know what this means, but we don’t support it.”
Hearing this, our Staff Engineer offered a brief Computer Science lesson and I immediately realized that this was the perfect subject for a blog post.
Unicode
Our goal is to understand “UTF-16 LE with BOM,” so let’s start with the U. Unicode is a text encoding standard designed to support digital text. Though the English alphabet has 26 letters, many thousands of characters need to be represented once we factor in all the world’s languages, mathematical symbols, emojis, and more. Unicode assigns each character a number (a code point) and defines binary encodings for those numbers, so that characters are represented consistently across systems.
UTF stands for Unicode Transformation Format, and the number after it is the size in bits of the unit used to encode characters: 8 for UTF-8, 16 for UTF-16. Each format can still represent every Unicode character; a character that does not fit in one unit simply takes several. So when our client sent us a UTF-16 file, it meant that the text was encoded in 16-bit units, with one or two units per character. Reading further, we see that it uses BOM, so we know that it will include a byte order mark for byte-order detection.
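To make the difference between the formats concrete, here is a small Python sketch that counts how many bytes a few characters take in each one. The sample characters and the helper name `encoded_lengths` are my own choices for illustration, not anything from our service.

```python
# Compare how many bytes the same characters take in UTF-8 and UTF-16.
samples = ["A", "é", "€", "😀"]

def encoded_lengths(ch):
    """Return (utf8_byte_count, utf16_byte_count) for a single character."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in samples:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

The letter A takes one byte in UTF-8 but two in UTF-16, while the emoji takes four bytes in both, because it needs two 16-bit units in UTF-16.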
Byte Order Mark (BOM)
A byte order mark can look like this:
U+FEFF ZERO WIDTH NO-BREAK SPACE
It will appear at the start of a text stream to indicate:
- the byte order (or endianness) of the text stream
- that the text stream’s encoding is Unicode
- which Unicode encoding is used
Reading the setting “UTF-16 LE with BOM,” we can already infer the second and third points; we know that we’re using Unicode and that the encoding is UTF-16. But we haven’t covered byte order yet.
FEFF is an indication of the order of our encoded bytes. When the byte FE comes first (FE FF), we know that the encoded bytes are in big-endian order. This means that the system stores the most significant byte of a word at the smallest memory address and the least significant byte at the largest. When the two bytes arrive reversed (FF FE), the order is little-endian. You can read more about byte significance in the sources listed at the end, but the gist is that “endianness” determines how our encoded bytes must be read.
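Here is a minimal Python sketch of that idea. It encodes U+FEFF in both byte orders and then inspects the first two bytes of some data to guess the order. The function `detect_utf16_byte_order` is a hypothetical helper written for this post; the `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` constants come from Python's standard library.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, the byte order mark character

# The same character serializes to different byte sequences per byte order.
big_endian_bom = BOM.encode("utf-16-be")     # bytes FE FF
little_endian_bom = BOM.encode("utf-16-le")  # bytes FF FE

def detect_utf16_byte_order(data: bytes) -> str:
    """Hypothetical helper: name the UTF-16 byte order implied by a leading BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    return "unknown (no UTF-16 BOM)"

print(big_endian_bom.hex(), little_endian_bom.hex())
print(detect_utf16_byte_order(b"\xff\xfeh\x00i\x00"))
```

This sketch only considers UTF-16. A UTF-32 little-endian BOM also begins with FF FE, so a more careful detector would need to check for that case first.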
The most engaging part of this whole lecture was the fact that “big endian” and “little endian” were inspired by Jonathan Swift’s Gulliver’s Travels. In the story, the fictional people of Lilliput traditionally broke boiled eggs on the broad side (the big end), but a few generations ago, their Emperor decreed that all boiled eggs should be broken on the small side (the little end). Of course, the Lilliputians were too opinionated to align on a single method and split into big-endians and little-endians (a reference to real-life religious disagreements in Great Britain).
Getting back to the code, big-endian vs little-endian can be indicated by the notation BE or LE. In our case, we’re seeing UTF-16 LE with BOM, which means the text is encoded in 16-bit units, little-endian (the least-significant byte of each unit is at the smallest address), as indicated by the byte order mark at the start of the file. And when a computer reads this, it will understand how to decode the text!
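To tie this back to the original bug, here is a rough Python sketch of what happens when UTF-16 LE bytes with a BOM are read as UTF-8 versus as UTF-16. The CSV content is made up for the example, and I am not claiming this is how our service reads files; it only illustrates the mismatch.

```python
import csv
import io

# Build a small CSV the way a "UTF-16 LE with BOM" export would store it:
# the BOM bytes FF FE, followed by each character as a little-endian 16-bit unit.
text = "name,city\nZoë,Zürich\n"
raw = "\ufeff".encode("utf-16-le") + text.encode("utf-16-le")

# Decoding as UTF-8 fails because the byte 0xFF is never valid in UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The "utf-16" codec reads the BOM, picks the byte order, and drops the BOM.
decoded = raw.decode("utf-16")
rows = list(csv.reader(io.StringIO(decoded, newline="")))
print(utf8_ok, rows)
```

Python raises an error on the UTF-8 attempt, whereas a more lenient parser might instead log each row as containing invalid characters, which matches what I saw in our logs.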
Fun at Work
I wasn’t sure how to react when my teammate started the aside that would take us off-track from the conversation at hand. Everybody is always very busy and their time is precious, so a fun aside in a meeting might not always be welcome. But I’m grateful that we all seemed to enjoy the lesson and it didn’t stop us from addressing the issue afterward. Personally, I’m not interested in any workplace where we can’t have a little fun like that.
Sources
- Unicode, Wikipedia
- Endianness, Wikipedia
- Bit significance and indexing, Wikipedia
- Lilliput and Blefuscu, Wikipedia