What I Learned at Work This Week: Base64 Encoding
Another week as a Solutions Engineer, another term I didn’t understand. While working on some new tickets, I saw base64 written a few times in our codebase and documentation. I was able to infer that it was changing the format of some data to encode it, but I wasn’t sure why or how…which made it the perfect subject for this week’s blog!
What and Why?
Base64 encoding changes binary data into an ASCII string format. This has a myriad of applications, but one easy-to-understand example is embedding images online, where a separate HTTP request for the image file can be replaced by including the image’s binary data, base64 encoded, directly in your code. If you’ve ever seen syntax like this before, you might now understand why:
<img src="data:image/png;base64,iVBOR…" />
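As a rough sketch of where a string like that comes from, the snippet below builds a data URI in Python. The helper name to_data_uri is my own invention for illustration, and instead of a full image it only encodes the fixed 8-byte signature that begins every PNG file. That signature is why base64-encoded PNGs start with iVBOR.

```python
import base64

# The fixed 8-byte signature that begins every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Build a data URI that could be used in an <img src="..."> attribute.

    Hypothetical helper for illustration; not taken from any library.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Only the signature is encoded here, so this is not a displayable image.
signature_uri = to_data_uri(PNG_SIGNATURE)
print(signature_uri)
```

With a real image, you would read the file's bytes (for example with open(path, "rb")) and pass them to the same helper.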
At work, I was exposed to base64 encoding because Fivetran’s REST API requires API keys and secrets to be base64 encoded in requests. I can’t say why they have this requirement since base64 is easily decoded and is an encoding rather than a form of encryption. But what also may stand out is that API keys and secrets are generally character strings. So if base64 encoding converts binary to string, how would it work…on a string?
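Here is a minimal sketch of encoding credentials with Python's standard base64 module. The key and secret are made-up placeholders, and joining them with a colon is an assumption on my part (it is the convention used by HTTP Basic authentication), not something confirmed above. The round trip at the end shows how easily the value can be decoded.

```python
import base64

# Hypothetical credentials, for illustration only.
api_key = "my-api-key"
api_secret = "my-api-secret"

# Assumption: key and secret are joined with a colon before encoding.
credentials = f"{api_key}:{api_secret}"

# A string has to become bytes first; b64encode only accepts binary input.
encoded = base64.b64encode(credentials.encode("utf-8")).decode("ascii")

# Anyone holding the encoded value can reverse it without a key or password.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded)
```

The call to encode("utf-8") hints at the answer to the question above: the string is turned into binary first, and that binary is what gets base64 encoded.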
Base64 Encoding of a String
Avid readers of my blog might remember that ASCII originally used 7 bits to represent 128 different letters, numbers, and special characters and commands, but that this didn’t prove nearly enough for handling international and historical characters. Base64 isn’t meant to be read by any human, so it only uses 6 bits for…you guessed it…64 binary combinations. Base64 connects 64 values with the 26 capital letters of the Latin alphabet, 26 lowercase letters, numbers 0–9, and usually a + and a / symbol:
In this image, we can see that the binary values of all the characters have 6 digits (6 bits). This is the key to base64 encoding because the values we’re translating from will contain 8 bits, the size of a byte and the standard unit of digital storage and communication.
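The lookup table is small enough to regenerate in a few lines. This is just a sketch of the standard alphabet described above (capital letters, lowercase letters, digits, then + and /). The names BASE64_ALPHABET and sextet_table are mine.

```python
import string

# The standard base64 alphabet: A-Z, a-z, 0-9, then "+" and "/".
BASE64_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)


def sextet_table() -> dict[str, str]:
    """Map each 6-bit binary string to its base64 character."""
    return {format(i, "06b"): ch for i, ch in enumerate(BASE64_ALPHABET)}


table = sextet_table()

# Print a few sample rows rather than all 64.
for bits in ("000000", "010000", "011010", "110100", "111111"):
    print(bits, "=>", table[bits])
```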
I’m still not sure how commonly base64 encoding is meant to be used on a string, but my real-life client needed it for their real-life API, so I figure it’s worth exploring. Let’s start with a simple string: Ahoy
To start, let’s plug this into a base64 encoder to see what we should expect to end up with:
Ahoy => QWhveQ==
We know that base64 encoding works with binary data, so we’ll first have to translate these ASCII characters. So the letters in our string would end up looking like this:
A => 01000001
h => 01101000
o => 01101111
y => 01111001
We’ve converted our string of characters to binary and here it is:
01000001 | 01101000 | 01101111 | 01111001
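If you'd rather not look up each character by hand, the same conversion can be sketched in Python. The function name to_octets is hypothetical, and I'm assuming the input only contains ASCII characters.

```python
def to_octets(text: str) -> list[str]:
    """Return each character of an ASCII string as an 8-bit binary string."""
    return [format(byte, "08b") for byte in text.encode("ascii")]


octets = to_octets("Ahoy")
print(" | ".join(octets))
# 01000001 | 01101000 | 01101111 | 01111001
```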
We’ve got a 32-bit input that was originally made up of octets. To convert it to base64, we’ll translate each segment of the input, but six bits at a time rather than 8. Let’s consult our base64 encoding table to translate the first sextet:
010000 => Q
And we’ll continue along until…
010000 | 010110 | 100001 | 101111 | 011110 | 01
Q W h v e ?
Oh no! We ran out of sextets because we started with 32 bits, which is not evenly divisible by 6. You probably noticed that our base64 output’s last three characters were a Q and two = symbols. When dealing with overflow, the encoder fills out the rest of the sextet with 0s, so in trying to translate 01, it actually translates 010000. So what’s up with the =?
Base64 uses the symbol = as a padding character to signify that the 0s used to fill out the code did not come from the original string. Because the input always arrives in whole octets, the leftover bits can only number 2 or 4. That means we’ll never have an odd number of bits, nor will we ever need more than 4 placeholder 0s, so each = symbol represents one pair of 0s. That’s why we see two at the end of our output: we needed two pairs of 0s to fill out a sextet. To further exemplify this principle, let’s amend our input by adding a character, an exclamation point: Ahoy!
By adding a character, our binary translation now contains 40 bits. 36 of those bits fit neatly into sextets, so we know that our remainder of 4 will need one set of 0s added before translation. That’s represented by one = symbol.
Octets: 01000001 | 01101000 | 01101111 | 01111001 | 00100001
Sextets: 010000 | 010110 | 100001 | 101111 | 011110 | 010010 | 0001
Base64: Q W h v e S E=
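To tie the steps together, here is a small by-hand encoder that follows the walkthrough: octets to bits, bits to sextets, 0s to fill the last sextet, and one = per pair of filler 0s. It's a sketch for learning purposes, and encode_by_hand is my own name for it. In real code you'd just call base64.b64encode, which is used here only to check the result.

```python
import base64
import string

# The standard base64 alphabet: A-Z, a-z, 0-9, then "+" and "/".
BASE64_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)


def encode_by_hand(data: bytes) -> str:
    """Base64-encode bytes by following the octet-to-sextet steps."""
    # 1. Write every byte out as 8 bits and join them into one long string.
    bits = "".join(format(byte, "08b") for byte in data)

    # 2. Fill the final, partial sextet with 0s (0, 2, or 4 of them).
    zero_bits = (6 - len(bits) % 6) % 6
    bits += "0" * zero_bits

    # 3. Translate 6 bits at a time using the alphabet.
    chars = [
        BASE64_ALPHABET[int(bits[i:i + 6], 2)]
        for i in range(0, len(bits), 6)
    ]

    # 4. Add one "=" for each pair of filler 0s.
    return "".join(chars) + "=" * (zero_bits // 2)


for text in ("Ahoy", "Ahoy!"):
    raw = text.encode("ascii")
    by_hand = encode_by_hand(raw)
    # Cross-check against the standard library's encoder.
    assert by_hand == base64.b64encode(raw).decode("ascii")
    print(text, "=>", by_hand)
```

Running it prints QWhveQ== for Ahoy and QWhveSE= for Ahoy!, matching the two results worked out above.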
Base64 has been used by MIME (Multipurpose Internet Mail Extensions) to add functionality to email communications, such as extended character sets and media embedding, as well as by spammers to hide keywords in their attacks. Though Fivetran probably wasn’t expecting the encoding to provide any security regarding the API key and secret, it’s possible that they use the string of data in another application that requires the format. The additional reading below provides a ton of additional insight that might help solve the mystery: