What I Learned at Work this Week: Pandas Data Conversion Quirks

Mike Diaz
7 min read · Jul 3, 2022


Photo by Ann H: https://www.pexels.com/photo/yellow-jigsaw-puzzle-piece-3482442/

I’m getting better at my job. This week, I got a request to update the data formatting on a report we generate so that it matched the format of data elsewhere. I knew exactly where to find the SQL that generated the data, and it didn’t take much time to update the timestamp with a couple of to_char functions inside a CONCAT. There was only one small adjustment left.

The client was having trouble with automated ingestion of the data because one source appended a + to the beginning of phone numbers while the other didn’t. I figured this would be even simpler than the date — just use another CONCAT to add the symbol before the number. But when I ran the query I saw that…the + was already there before I made any changes. Maybe I’m not as good as I thought.

Next Steps

My initial assumption was that the data the clients were receiving was exactly what was generated from the SQL. Since I was now seeing a discrepancy, I dug a little bit deeper to see if the data was transformed at all before it got to the client. Of course, I learned that there were two key transformations that explained the issue:

My company offers a standard report service that will send a .xlsx file to a given email address on demand or on a scheduled cadence. Sometimes this data is too sensitive for email, so clients ask that it be sent to an SFTP server instead. We have a script for that! When our report service generates a file to send to a client, it stores that file in AWS S3 for reference and lists the file’s location in a report_queue DB table. We took advantage of that structure to write the SFTP-send script, which checks the same table, uses boto3 to fetch the file from S3, and then delivers it. But if we’re just moving the file, why would the data change?
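(As an aside, I can’t share the production script, but the S3-retrieval step presumably looks something like this minimal boto3 sketch. The bucket and key here are hypothetical stand-ins for what would really come out of the report_queue table.)

import boto3

# Hypothetical values; in production these would come from the
# report_queue row for the report in question
BUCKET = 'reports-bucket'
KEY = 'reports/2022-07-03/sample.xlsx'

# Download the generated report from S3 so it can be sent along over SFTP
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=BUCKET, Key=KEY)
excel_bytes = obj['Body'].read()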

The script has some additional functionality if the client wants us to send the data to a different location, encrypt it, or change the file’s format. .xlsx is great as a human-readable/editable format, but if we want to import raw data into another program, a csv is preferable. Let’s recreate the code that facilitates this change to better understand why we lose the +.

The Data and the Script

To start, I created a sample .xlsx file with two columns: name and phone. This should be pretty simple so feel free to follow along at home:
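If you’d rather generate the file in code than in a spreadsheet app, something like this sketch should produce an equivalent sample.xlsx. This is my own shortcut, not part of the original workflow, and it assumes openpyxl is installed:

import pandas as pd

# Sample data of my own invention; the phone values are strings,
# so the leading + is preserved in the file
df = pd.DataFrame({
    'name': ['Ada', 'Grace'],
    'phone': ['+15551234567', '+15559876543'],
})

# Requires the openpyxl engine; the sheet name matches the one used below
df.to_excel(
    '/Users/michaeldiaz/Desktop/sample.xlsx',
    sheet_name='Phone Numbers',
    index=False
)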

Note that, in order to get the + into the cell, I had to manually format it as “Text.” Otherwise, Numbers (and probably Excel if you’re on Windows) will interpret it as a number and remove the character. In our production code, the data is formatted, queried, and written as strings, so this mimics that. Next comes the Python:

import pandas as pd
import io

DESKTOP_PATH = '/Users/michaeldiaz/Desktop/'
FILENAME = 'sample.xlsx'

# Read the workbook's raw bytes ('rb' is explained below)
file = open(DESKTOP_PATH + FILENAME, 'rb')
excel_bytes = file.read()

# sheet_name=None reads every sheet, returning a dict of DataFrames
# keyed by sheet name
all_sheets = pd.read_excel(io.BytesIO(excel_bytes), sheet_name=None)
sheets = all_sheets.keys()

# Re-read each sheet individually and write it out as a csv
for sheet_name in sheets:
    sheet = pd.read_excel(
        io.BytesIO(excel_bytes),
        sheet_name=sheet_name
    )
    sheet.to_csv(DESKTOP_PATH + sheet_name)

As usual, we can go through the code one line at a time to make sure we really understand what’s happening. The first two lines are imports. I’ve done a little intro to pandas in the past, but for now the main thing to know is that importing pandas is going to help us read and write table-like data in Python. The io import is a module for handling data as streams. I/O stands for input/output, which makes sense to me since we’re giving our script a file and expecting it to output something different. We’ll see how this module is used shortly.
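As a toy illustration of what the module gives us (my own example, not from the script), BytesIO wraps raw bytes in an object that can be read like an open file:

import io

# Wrap raw bytes in a file-like object
buffer = io.BytesIO(b'hello')

# read() behaves just as it would on a real file handle
print(buffer.read())  # prints b'hello'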

First, I defined some constants. For the purposes of a quick example, I want to run everything through my desktop, so I figured out the path for that, added the name of my input file, and used open to give the Python script access to the data inside. Note that the second argument in open is rb. Originally I tried this with just r, which stands for read. The open line would work, but my script would fail when I actually tried to run read on the result. Here’s my error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position 10: invalid start byte

There was something in my file that couldn’t be interpreted as utf-8. I never did figure out exactly what it was, but I learned that rb, which reads the file as raw binary without decoding it, doesn’t run into this restriction. When I printed the results, they looked like this:

The incomplete binary string for only two rows of data
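A likely explanation, though I’ll hedge since I never confirmed it at work: .xlsx files are actually zip archives, which are binary through and through, so a utf-8 text decoder was never going to cope. You can spot the zip signature in the first few bytes:

# A .xlsx file is a zip archive under the hood, so it opens with the
# standard zip signature rather than decodable text
with open('/Users/michaeldiaz/Desktop/sample.xlsx', 'rb') as f:
    print(f.read(4))  # prints b'PK\x03\x04'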

Running file.read() gives us access to all the data in our spreadsheet, which we store as bytes in excel_bytes. Now we can use pandas to turn that binary data into DataFrames, the human-readable tables pandas uses to represent data. The read_excel method accepts any file-like object as its first argument, which is why we’re using the io module to wrap our bytes in a BytesIO. sheet_name declares which sheet or sheets from the source file will be read. Remember that excel files can contain multiple spreadsheets, usually presented in separate tabs. This parameter defaults to 0, which will read only the first sheet in a given file. Reports in my workplace often have multiple sheets and we don’t want to ignore any, so we instead use sheet_name=None, which reads every sheet and returns a dictionary mapping each sheet’s name to its DataFrame. Here’s the result:
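Since sheet_name=None hands back a dictionary, a quick way to inspect the structure yourself (using the sample file from above):

# all_sheets maps each sheet's name to its DataFrame
print(all_sheets.keys())                  # dict_keys(['Phone Numbers'])
print(type(all_sheets['Phone Numbers']))  # a pandas DataFrame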

csvs cannot support multiple sheets the way .xlsx can. If we’re potentially dealing with multiple sheets, we’re going to have to create multiple csvs. To account for that, we grab the keys of our dictionary, all_sheets. As we can see above, there’s only one sheet in this case, and the key is the name of the sheet, “Phone Numbers.”

We iterate through those keys and use two more pandas features. read_excel once again accepts a file-like object with our sheet data, but now we’re specifically reading one sheet (in this case, our only sheet). Then we simply use to_csv to turn that DataFrame into a csv. We pass it a full path, which tells our script what to name the file and where to put it. If we have our desktop path right, we’ll have a brand new file waiting for us! And if we open it up, we’ll see that the + is gone from the phone column.
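You don’t need a spreadsheet app to confirm the loss; a quick check like this works too (assuming the sheet is named Phone Numbers, as in my sample):

# The script names each csv after its sheet, with no .csv extension
with open(DESKTOP_PATH + 'Phone Numbers') as f:
    print(f.read())  # the phone values appear without their leading +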

Retaining the Character

If we look at an earlier screenshot, we can see that the + is already gone when we create a dataFrame with read_excel. So what causes it to disappear? read_excel is a powerful method that abstracts a lot of logic, which includes interpretation of data types from a source file. If it interprets the values in the phone column as ints, we’ll lose the +. Instead, we can tell the method that we want the values in that column to be strings:

all_sheets = pd.read_excel(
    io.BytesIO(excel_bytes),
    sheet_name=None,
    dtype={'phone': str}
)

When we print out all_sheets with this change, we see the +!

But when we look at the resulting csv, it’s still missing:

I’ll have to investigate the empty first column in a future post

We use read_excel twice in this script, once to identify the sheets in our file and the second time to actually prepare the data for conversion into csv. But even if I add the dtype argument in both instances, we still lose the +. There’s only one line of code after the second instance of read_excel, so it stands to reason that this is the culprit:

sheet.to_csv(DESKTOP_PATH + sheet_name)

I expect that we once again want to cast some sort of number (int, long, float) as a string. To find an option, I went to the pandas documentation for to_csv and searched the page for “string.” The first option I found was the float_format parameter.

This makes sense — the phone number is being interpreted as a float and we can use this parameter to make it a string! Let’s try it:

sheet.to_csv(DESKTOP_PATH + sheet_name, float_format=str)

And the results?

voila!

Simplifying a Complex Problem

The example here is much more straightforward than what I was reading at work, which gives us a great opportunity to examine the specific parts we don’t understand. Not only have we solved our problem with something that we can apply to the original script, but we can also identify opportunities for refactoring. In this case, we didn’t necessarily need the io import after all. It’s true that read_excel can accept a file-like object as its first parameter, but it can also accept a file path:

sheet = pd.read_excel(
    '/Users/michaeldiaz/Desktop/sample.xlsx',
    sheet_name=sheet_name,
    dtype={'phone': str}
)

This might not work in the original script, which reads from S3 instead of my local machine, but I’m happy I understand the functionality. And I’m looking forward to implementing these new findings in the workplace!
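To close the loop, here’s the whole toy script with both fixes applied and the two read_excel calls collapsed into one. This is my own consolidation, and it assumes the phone column is literally named phone:

import pandas as pd

DESKTOP_PATH = '/Users/michaeldiaz/Desktop/'
FILENAME = 'sample.xlsx'

# One read: every sheet, with the phone column forced to string
all_sheets = pd.read_excel(
    DESKTOP_PATH + FILENAME,
    sheet_name=None,
    dtype={'phone': str}
)

# The dictionary already holds each sheet's DataFrame, so there's no
# need for a second read_excel call
for sheet_name, sheet in all_sheets.items():
    sheet.to_csv(DESKTOP_PATH + sheet_name, float_format=str)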
