What I Learned at Work this Week: pandas Asserts Floats
Python is my life now. Let that be a lesson to all the boot camp grads (like me) out there: I had never written a single line of Python code when I started my current job. Since then, I’ve taken one Udemy Python course and mostly learned through random docs and osmosis. I still don’t have a classical knowledge of Python (if you asked me to create a class, I’d freeze), but I’m pretty comfortable with it! Key takeaway: it’s okay if you don’t know everything when you start a journey. You’ll never know everything.
Another gap in my Python knowledge is putting data into CSV files. Loyal readers may have followed some of my stumbles through this process and won’t be surprised to learn that I don’t find it intuitive. One exception to that is
pandas, which I endorse for ease-of-use in creating datasets, editing them, and dropping them into a CSV.
With that said, I ran into a really pesky situation this week with how
to_csv formats data.
I’m sometimes hesitant to recommend following along at home because setting up a new project can be tedious and buggy. In this case, the code is so simple that it might be worthwhile. Let’s start here:
import pandas as pddata = [['bananas', 1128553, 2.49]]
columns = ['product_name', 'purchaser_id', 'price']df = pd.DataFrame(data, columns=columns)df.to_csv('test.csv', index=False)
And as always let’s break this down:
- Line 1: importing
pandasand giving it an alias
- Line 2: We use
pandasto store and manipulate data that came from a database table or would fit into one. Here in our code, we’re defining
dataas a list of lists. Think of each list as a row in our table. In this case, we just have one since we want a simple example.
- Line 3: We’ll name our columns so that we understand what the data means. The first element from
datais a product name, the second is a purchaser ID, and the third is the price.
- Line 4: We finally get to use
pandas! The DataFrame method can accept a list, dictionary, or iterable and will return…a DataFrame! The pandas documentation describes a DataFrame as a Two-dimensional, size-mutable, potentially heterogeneous tabular data. It’s basically a table that pandas functions will let us easily manipulate. One amazing thing about
pandasis that it can take these various data types and turn them all into a DataFrame without issue. Once we’ve got a DataFrame, we’ve got a bunch of functions we can run with it.
- Line 5 turns our DataFrame into a CSV! The first argument is the name we’ll give to the CSV and the second just tells
pandasthat we don’t want index numbers in our file. There’s a lot more we can do with
to_csv, but we’ll stick with this for now.
If we run this code, it’ll generate a CSV file that will end up in the same directory we used to run the command:
pandas tries to maintain the data types it receives when writing the CSV.
purchaser_id is an int, so it writes an int.
price is a float, so we maintain that as well. But in my experience, it’s safest to standardize the data in your CSV so that it isn’t altered when being read by different programs. So we’ll alter our code to wrap everything in the CSV in quotes, so they’ll end up as strings:
df.to_csv(‘test.csv’, quotechar=’”’, index=False, quoting=1)
This will work great, assuming that
pandas is putting quotes around data in the format we like. But there’s a quirk to the library that could throw a wrench in our plans. What if we add another row to our DataFrame which, for some reason, doesn’t have a
df.loc = [None, 123, .89]
The loc method allows us to access a row in our DataFrame. In this case, we’re using it to access index 1, which doesn’t exist yet, and we’re giving it a list which we want to populate that row. When we look at our CSV this time, we’ll notice something unusual:
purchaser_id, which started out as an
int, now has a trailing
.0. It’s been cast to a
float! Don’t believe me? We can use another
pandas method to confirm that:
After a lot of searching, I found this Stack Overflow question, which provides a great example of this phenomenon. We don’t get an answer to why
int64 changes to
float64 when a
str row receives an empty value, but we can now recognize a pattern that causes unexpected behavior and there are a few ways we can address it.
In our case, this really only matters when we’re finally dropping our results into our CSV.
to_csv includes an argument that allows us to auto-format floats so we can decide how many decimal places to display. It’s just as simple as this:
df.to_csv('test.csv', quotechar='"', index=False, quoting=1, float_format='%.0f')
This means that all of our floats will display zero decimal places. Here’s the result:
Ah…this is a problem because we’ve lost the cents in our
float_format's downside is that it affects all the
float columns in your DataFrame, so it won’t quite work in this example. Alternatively, we can selectively format and sanitize our one problem column:
df['purchaser_id'] = (
df['purchaser_id'].astype(str).replace('\.0', '', regex=True)
If we cast the type from
str, we can use
replace to pull out the
.0. In this simple example, this works!
The struggle continues
Rather than converting data back to what it was before the unexpected behavior, we might want to try preventing that behavior before it ever occurs. The issue here stems from
pandas having confusion regarding our
int64 column. We can set the data type of that column right off the bat:
df = pd.DataFrame(data, columns=columns, dtype=object)
Whatever data is stored in the DataFrame here will be displayed in its original format, regardless of any
None values. But if we use
loc to add something, it’ll still suffer from the same issue:
I did a bunch of searching and experimenting and could not find a way to add a new value to the DataFrame without seeing the
.0 in that new row. We can once again remove it with the
replace function, but setting the column type as
object did not save us.
Since I actually have to come up with a solution to this problem at work (where the
replace method isn’t working for some reason), I think the best strategy is to regenerate DataFrames with full sets of data rather than trying to add info:
import pandas as pddata = [['bananas', 1128553, 2.49]]
columns = ['product_name', 'purchaser_id', 'price']df = pd.DataFrame(data, columns=columns, dtype=object)
# The purchaser_ids in this df do not include .0, but if we add another row, we may end up with one.# convert the df back to a list:
df_values = df.values.tolist()# update the list
df_values.append([None, 123, .89])# make a new df now that our list contains all the data we want
new_df = pd.DataFrame(df_values, columns=columns, dtype=object)new_df.to_csv('test.csv', quotechar='"', index=False, quoting=1)
Finally, we have a CSV with the formatted data we’re looking for.
- Prevent trailing zero with pandas “to_csv”, Stack Overflow
- Table, Computer Hope
- pandas.DataFrame, pandas documentation
- pandas.DataFrame.loc, pandas documentation
- Stop Pandas from converting int to float due to an insertion in another column, Stack Overflow