What I Learned at Work this Week: File Signatures and Hexadecimal Encoding

Mike Diaz
8 min readDec 8, 2024

--

Photo by Magda Ehlers: https://www.pexels.com/photo/brown-numbers-cutout-decors-1329295/

Loyal readers know that I’ve done a lot of work on a product that generates reports in a CSV format and sends them to an SFTP. But my team also manages a service that works in the opposite direction, ingesting data from CSVs that our clients put into an SFTP. This week, I was asked to investigate why that service was failing to process data that was compressed in .zip format. As these simple tasks so often do, it sent me down a rabbit hole that resulted in a Computer Science lesson.

Compression

To get some context out of the way, the files in question were in the ZIP format, meaning they had been compressed. The format uses an algorithm to transform data from a file or files and make it smaller, often for ease of transfer. ZIP is a popular compression format, but there are many different ones that use many different compression algorithms.

When we’re processing these files, we first have to decompress them, which means we need to know which compression type we’re dealing with. To solve that problem, we can turn to the file signature. File signatures are also called “magic numbers” or “magic bytes” because they are the first few bytes of any file. These bytes can be decoded to specify the file type we’re dealing with and therefore guide us in how to read it. Unsurprisingly, there was a method in my team’s service that verified the file type:

public static boolean isGZipped(File file) {
int magic = 0;
try {
RandomAccessFile raf = new RandomAccessFile(file, "r");
magic = raf.read() & 0xff | ((raf.read() << 8) & 0xff00);
raf.close();
} catch (Throwable e) {
LOG("isGZipped check failed");
}
return magic == GZIPInputStream.GZIP_MAGIC;
}

Hexadecimal Bit Shifting

This method accepts a file and returns a boolean describing whether or not it is GZipped. GZIP is a file format that compresses files, just like ZIP, but it uses a distinct algorithm and has its own magic number. That number, according to Wikipedia, is 1F8B. The first thing I needed to understand was that this number is hexadecimally encoded. Before we go any further, let’s examine what that means.

The most widely-understood numeric system among humans on earth is a “decimal” number system, “decimal” indicating that it’s organized by 10s. There are 10 characters (0–9), each representing one value, and when we run out, we combine the characters to represent a larger value. If we break down each digit, we know that some multiplication is required to determine values. For example, 11 is not 1 + 1, but 10 * 1 + 1. We understand that there is a “tens” column, a “hundreds” column, etc. Another way to think of it is that when we add a column, we’re increasing the value exponentially, or we’re multiplying the value of each column by 10^x. For example:

1 is 1 // (1**0) * 1 = 1
10 is 10 // (10**1) * 1 + (10**0) * 0 = 10
24 is 24 // (10**1) * 2 + (10**0) * 4 = 24
5060 is 5,060 // (10**3) * 5 + (10**2) * 0 + (10**1) * 6 + (10**0) + 0 = 5,060

Of course, there’s also the binary system, which only has two characters: 1 and 0. Binary notation is often represented in sets of 4 or 8 characters. Each column can be interpreted by taking the base number (2, because we’ve got 2 options in binary) and applying a continuously increasing exponent. 1111, for example, could also be interpreted as 2³ + 2² + 2¹ + 2⁰. The more digits we have, the higher the exponent. If there’s a 0 instead of a 1, we don’t count that space:

0000 is 0
0001 is 1 // 0 + 0 + 0 + 2**0 = 1
0010 is 2 // 0 + 0 + 2**1 + 0 = 2
0011 is 3 // 0 + 0 + 2**1 + 2**0 = 3
1101 is 13 // 2**3 + 2**2 + 0 + 2**0 = 13

Hexadecimal notation works the same way, but with a base of 16. The characters are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. To illustrate:

0001 is 1 // 0 + 0 + 0 + (16**0) * 1 = 1
0009 is 9 // 0 + 0 + 0 + (16**0) * 9 = 9
000A is 10 // 0 + 0 + 0 + (16**0) * 10 = 10
000F is 15 // 0 + 0 + 0 + (16**0) * 15 = 15
0010 is 16 // 0 + 0 + (16**1) * 1 + 0 = 16
ABCD is 43,981 // (16**3) * 10 + (16**2) * 11 + (16**1) * 12 + (16*0) * 13 = 43,981
// 40960 + 2,816 + 192 + 13 = 43,981

Why is this relevant? Because we have a hexadecimal magic number that represents a gzip file: 1F8B. And in order to determine the file type that’s causing the error, we have to understand this method, which is manipulating and comparing the bytes in question.

Bit Shifting

The return value of the method compares the variable magic, which represents the magic number we’ve derived, to a constant called GZIP_MAGIC. This is defined by the GZIPInputStream Java class, so we should be able to rely on it and we wouldn’t want to change it in any way. But for some reason, it represents the gzip magic byte in reverse. Instead of 1F8B, it’s 8B1F. So if we directly compare the magic number of a gzip file to this constant, we’ll return false. To address that issue, we just have to switch around the bits.

Unsurprisingly, the key here is this line:

magic = raf.read() & 0xff | ((raf.read() << 8) & 0xff00);

This method uses three key operators:

| // the OR operator
& // the AND operator
<< // the LEFT SHIFT operator

We can use these to verify or transform bit values through logical operations. Thinking back to binary, we can consider the two possible values to be booleans: 1 is true and 0 is false. If we compare two bits using OR or AND, we can get a new value:

0 | 0 // 0
0 | 1 // 1
1 | 0 // 1
1 | 1 // 1

OR means that if either side is true, the result is true. AND means both have to be true:

0 & 0 // 0
0 & 1 // 0
1 & 0 // 0
1 & 1 // 1

And we can do it for bytes, comparing column-to-column:

0000 | 1111 // 1111
0101 | 1100 // 1101
1100 & 0011 // 0000
1110 & 0011 // 0010

But in the method, we’re not seeing 1s and 0s, we’re seeing hexadecimal notation, as indicated by the leading 0x. In Java, if we see 0xff, that’s hex for FF. Likewise, 0xff00 is FF00 (capitalization doesn’t matter, I’ve just been using it to denote hex so I’m trying to stay consistent). How do comparative operators work when we’re dealing with hex? The same way, we just have to convert them to binary!

FF = (16**1) * 15 + (16**0) * 15 = 16*15 + 1*15 = 240 + 15 = 255
To get 255 in binary, we write:
11111111, or 2**7 + 2**6 + 2**5 + 2**4 + 2**3 + 2**2 + 2**1 + 2**0
or 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1 = 255

255 in decmial and FF in hex represent 11111111 in binary, which, in terms of comparison, means “all true.” In the method, we see raf.read() & 0xff, so we’re comparing the first eight bits of the result of read to “all true.” When you use & to compare any value to true, you’ll get that value back, right?

false & true = false
true & true = true
0 & 1 = 0
1 & 1 = 1
10101100 & 11111111 = 10101100

So the first thing we’re doing here is just identifying the first byte of the file. It’s very likely that the & here isn’t necessary at all, since it’s simply deriving the same values that are already there. The author was just being cautious or explicit.

We do something similar after the |, but we use a left shift operator, <<. This means that we’re taking 8 bits and shifting them to the left.

// If the first two bytes look like this
1111 0000 1010 0101
// << 8 will grab 1010 0101 and move it to the front, resulting in
1010 0101 1111 0000

We then use & again to compare, but this time we’re comparing to ff00. We know that FF gives us all 1s or all true, and so when we look to 00, it’s clear — that’s all 0s or all false. If anything is compared to false with an AND, it’s also false:

1 & 0 is 0
0 & 0 is 0
1010 & 0000 is 0000

So the 8 bits we shifted will be maintained and the 8 other bits will all be changed to 0. To illustrate

// If this is a gzip file, the magic number will be 1f 8b
// 1f in binary is 0001 1111
// 8b in binary is 1000 1011
// Assuming this is indeed a gzip file,
// we would expect the result of raf.read() to be 0001 1111 1000 1011
// raf.read() & 0xff would therefore return 0000 0000 0001 1111
// We front-load zeroes so this value will match the one we are comparing it with
// ((raf.read() << 8) & 0xff00) would return 1000 1011 0000 0000
// There are eight zeroes on the end because of the 00 in 0xff00
// If we put these on their assigned sides of the OR:
0000 0000 0001 1111 | 1000 1011 0000 0000
// We get
1000 1011 00001 1111

The order is reversed, and it matches the constant!

Once I understood this, it was easy to see that the file in question wasn’t a gzip file and therefore couldn’t be processed by this method. To solve the client’s problem, we’ll have to implement a similar decompressor that handles ZIP files.

Math

I didn’t take any math classes in college and I wasn’t required to study any math in boot camp. I think that, if I had tried to study Computer Science on an academic level at any point, I would have refused to become a software engineer because these concepts are extremely difficult for me to understand. Only after several reassurances that this math wasn’t required was I even willing to try later in life.

Now that I have the job, I’m of course open to attempting some of these “basic” concepts, though I’m extremely grateful that we have frameworks and libraries that largely abstract them. Even in this case, one could argue that the “better” code would be to import a library that checks the file rather than writing something custom that will confuse less knowledgable developers, like me.

Sources

--

--

Mike Diaz
Mike Diaz

No responses yet