What I Learned at Work this Week: pandas Asserts Floats

Mike Diaz
6 min readAug 14, 2022
Photo by Toni Cuenca: https://www.pexels.com/photo/pink-flamingo-inflatable-ring-and-green-inflatable-ring-on-swimming-pool-565999/

Python is my life now. Let that be a lesson to all the boot camp grads (like me) out there: I had never written a single line of Python code when I started my current job. Since then, I’ve taken one Udemy Python course and mostly learned through random docs and osmosis. I still don’t have a classical knowledge of Python (if you asked me to create a class, I’d freeze), but I’m pretty comfortable with it! Key takeaway: it’s okay if you don’t know everything when you start a journey. You’ll never know everything.

Another gap in my Python knowledge is putting data into CSV files. Loyal readers may have followed some of my stumbles through this process and won’t be surprised to learn that I don’t find it intuitive. One exception to that is pandas, which I endorse for ease-of-use in creating datasets, editing them, and dropping them into a CSV.

With that said, I ran into a really pesky situation this week with how pandas' to_csv formats data.

The setup

I’m sometimes hesitant to recommend following along at home because setting up a new project can be tedious and buggy. In this case, the code is so simple that it might be worthwhile. Let’s start here:

import pandas as pddata = [['bananas', 1128553, 2.49]]
columns = ['product_name', 'purchaser_id', 'price']
df = pd.DataFrame(data, columns=columns)df.to_csv('test.csv', index=False)

And as always let’s break this down:

  • Line 1: importing pandas and giving it an alias
  • Line 2: We use pandas to store and manipulate data that came from a database table or would fit into one. Here in our code, we’re defining data as a list of lists. Think of each list as a row in our table. In this case, we just have one since we want a simple example.
  • Line 3: We’ll name our columns so that we understand what the data means. The first element from data is a product name, the second is a purchaser ID, and the third is the price.
  • Line 4: We finally get to use pandas! The DataFrame method can accept a list, dictionary, or iterable and will return…a DataFrame! The pandas documentation describes a DataFrame as a Two-dimensional, size-mutable, potentially heterogeneous tabular data. It’s basically a table that pandas functions will let us easily manipulate. One amazing thing about pandas is that it can take these various data types and turn them all into a DataFrame without issue. Once we’ve got a DataFrame, we’ve got a bunch of functions we can run with it.
  • Line 5 turns our DataFrame into a CSV! The first argument is the name we’ll give to the CSV and the second just tells pandas that we don’t want index numbers in our file. There’s a lot more we can do with to_csv, but we’ll stick with this for now.

The issue

If we run this code, it’ll generate a CSV file that will end up in the same directory we used to run the command:

pandas tries to maintain the data types it receives when writing the CSV. purchaser_id is an int, so it writes an int. price is a float, so we maintain that as well. But in my experience, it’s safest to standardize the data in your CSV so that it isn’t altered when being read by different programs. So we’ll alter our code to wrap everything in the CSV in quotes, so they’ll end up as strings:

df.to_csv(‘test.csv’, quotechar=’”’, index=False, quoting=1)

This will work great, assuming that pandas is putting quotes around data in the format we like. But there’s a quirk to the library that could throw a wrench in our plans. What if we add another row to our DataFrame which, for some reason, doesn’t have a product_name:

df.loc[1] = [None, 123, .89]

The loc method allows us to access a row in our DataFrame. In this case, we’re using it to access index 1, which doesn’t exist yet, and we’re giving it a list which we want to populate that row. When we look at our CSV this time, we’ll notice something unusual:

purchaser_id, which started out as an int, now has a trailing .0. It’s been cast to a float! Don’t believe me? We can use another pandas method to confirm that:

After a lot of searching, I found this Stack Overflow question, which provides a great example of this phenomenon. We don’t get an answer to why int64 changes to float64 when a str row receives an empty value, but we can now recognize a pattern that causes unexpected behavior and there are a few ways we can address it.

The Blanket

In our case, this really only matters when we’re finally dropping our results into our CSV. to_csv includes an argument that allows us to auto-format floats so we can decide how many decimal places to display. It’s just as simple as this:

df.to_csv('test.csv', quotechar='"', index=False, quoting=1, float_format='%.0f')

This means that all of our floats will display zero decimal places. Here’s the result:

Ah…this is a problem because we’ve lost the cents in our price column. float_format's downside is that it affects all the float columns in your DataFrame, so it won’t quite work in this example. Alternatively, we can selectively format and sanitize our one problem column:

df['purchaser_id'] = (
df['purchaser_id'].astype(str).replace('\.0', '', regex=True)
)

If we cast the type from float to str, we can use replace to pull out the .0. In this simple example, this works!

The struggle continues

Rather than converting data back to what it was before the unexpected behavior, we might want to try preventing that behavior before it ever occurs. The issue here stems from pandas having confusion regarding our int64 column. We can set the data type of that column right off the bat:

df = pd.DataFrame(data, columns=columns, dtype=object)

Whatever data is stored in the DataFrame here will be displayed in its original format, regardless of any None values. But if we use loc to add something, it’ll still suffer from the same issue:

first row :D, second row D:

I did a bunch of searching and experimenting and could not find a way to add a new value to the DataFrame without seeing the .0 in that new row. We can once again remove it with the replace function, but setting the column type as object did not save us.

Since I actually have to come up with a solution to this problem at work (where the replace method isn’t working for some reason), I think the best strategy is to regenerate DataFrames with full sets of data rather than trying to add info:

import pandas as pddata = [['bananas', 1128553, 2.49]]
columns = ['product_name', 'purchaser_id', 'price']
df = pd.DataFrame(data, columns=columns, dtype=object)
# The purchaser_ids in this df do not include .0, but if we add another row, we may end up with one.
# convert the df back to a list:
df_values = df.values.tolist()
# update the list
df_values.append([None, 123, .89])
# make a new df now that our list contains all the data we want
new_df = pd.DataFrame(df_values, columns=columns, dtype=object)
new_df.to_csv('test.csv', quotechar='"', index=False, quoting=1)

Finally, we have a CSV with the formatted data we’re looking for.

Sources

--

--