Python is my life now. Let that be a lesson to all the boot camp grads (like me) out there: I had never written a single line of Python code when I started my current job. Since then, I’ve taken one Udemy Python course and mostly learned through random docs and osmosis. I still don’t have a classical knowledge of Python (if you asked me to create a class, I’d freeze), but I’m pretty comfortable with it! Key takeaway: it’s okay if you don’t know everything when you start a journey. You’ll never know everything.
Another gap in my Python knowledge is putting data into CSV files. Loyal readers may have followed some of my stumbles through this process and won’t be surprised to learn that I don’t find it intuitive. One exception to that is `pandas`, which I endorse for its ease of use in creating datasets, editing them, and dropping them into a CSV.

With that said, I ran into a really pesky situation this week with how `pandas`’ `to_csv` formats data.
The setup
I’m sometimes hesitant to recommend following along at home because setting up a new project can be tedious and buggy. In this case, the code is so simple that it might be worthwhile. Let’s start here:
```python
import pandas as pd
data = [['bananas', 1128553, 2.49]]
columns = ['product_name', 'purchaser_id', 'price']
df = pd.DataFrame(data, columns=columns)
df.to_csv('test.csv', index=False)
```
And, as always, let’s break this down:

- Line 1: Importing `pandas` and giving it an alias.
- Line 2: We use `pandas` to store and manipulate data that came from a database table or would fit into one. Here in our code, we’re defining `data` as a list of lists. Think of each list as a row in our table. In this case, we just have one, since we want a simple example.
- Line 3: We’ll name our columns so that we understand what the data means. The first element from `data` is a product name, the second is a purchaser ID, and the third is the price.
- Line 4: We finally get to use `pandas`! The `DataFrame` method can accept a list, dictionary, or iterable and will return…a DataFrame! The pandas documentation describes a DataFrame as “two-dimensional, size-mutable, potentially heterogeneous tabular data.” It’s basically a table that pandas functions will let us easily manipulate. One amazing thing about `pandas` is that it can take these various data types and turn them all into a DataFrame without issue. Once we’ve got a DataFrame, we’ve got a bunch of functions we can run with it.
- Line 5: Turns our DataFrame into a CSV! The first argument is the name we’ll give to the CSV, and the second just tells `pandas` that we don’t want index numbers in our file. There’s a lot more we can do with `to_csv`, but we’ll stick with this for now.
The issue
If we run this code, it’ll generate a CSV file that will end up in the same directory we used to run the command:
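If you’d rather not open the file in a spreadsheet tool, here’s a quick sketch that reads the raw output back (the trailing comments show what I get with a recent pandas):

```python
import pandas as pd

# Rebuild the example from the setup section
data = [['bananas', 1128553, 2.49]]
columns = ['product_name', 'purchaser_id', 'price']
df = pd.DataFrame(data, columns=columns)
df.to_csv('test.csv', index=False)

# Read the raw file back to see exactly what pandas wrote
with open('test.csv') as f:
    contents = f.read()
print(contents)
# product_name,purchaser_id,price
# bananas,1128553,2.49
```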
`pandas` tries to maintain the data types it receives when writing the CSV. `purchaser_id` is an int, so it writes an int. `price` is a float, so we maintain that as well. But in my experience, it’s safest to standardize the data in your CSV so that it isn’t altered when being read by different programs. So we’ll alter our code to wrap everything in the CSV in quotes, so each value ends up as a string:
```python
df.to_csv('test.csv', quotechar='"', index=False, quoting=1)
```
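That `quoting=1` is a magic number for `csv.QUOTE_ALL`, which tells pandas to quote every field; a small sketch using the named constant instead:

```python
import csv

import pandas as pd

data = [['bananas', 1128553, 2.49]]
columns = ['product_name', 'purchaser_id', 'price']
df = pd.DataFrame(data, columns=columns)

# quoting=1 is csv.QUOTE_ALL; the named constant reads more clearly
df.to_csv('test.csv', quotechar='"', index=False, quoting=csv.QUOTE_ALL)

with open('test.csv') as f:
    quoted = f.read()
print(quoted)
# "product_name","purchaser_id","price"
# "bananas","1128553","2.49"
```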
This will work great, assuming that `pandas` is putting quotes around data in the format we like. But there’s a quirk to the library that could throw a wrench in our plans. What if we add another row to our DataFrame which, for some reason, doesn’t have a `product_name`:
```python
df.loc[1] = [None, 123, .89]
```
The `loc` method allows us to access a row in our DataFrame. In this case, we’re using it to access index 1, which doesn’t exist yet, and we’re giving it a list to populate that row with. When we look at our CSV this time, we’ll notice something unusual:
`purchaser_id`, which started out as an `int`, now has a trailing `.0`. It’s been cast to a `float`! Don’t believe me? We can use another pandas method to confirm that:
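The check I reach for is the `dtypes` attribute (a property rather than a method, strictly speaking). A sketch, with the caveat that whether the cast happens can depend on your pandas version:

```python
import pandas as pd

df = pd.DataFrame([['bananas', 1128553, 2.49]],
                  columns=['product_name', 'purchaser_id', 'price'])
before = str(df.dtypes['purchaser_id'])  # 'int64'

df.loc[1] = [None, 123, .89]
# On versions where the cast happens, this now reports float64
after = str(df.dtypes['purchaser_id'])

print(before, '->', after)
```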
After a lot of searching, I found this Stack Overflow question, which provides a great example of this phenomenon. We don’t get an answer to why `int64` changes to `float64` when a `str` column receives an empty value, but we can now recognize a pattern that causes unexpected behavior, and there are a few ways we can address it.
The Blanket
In our case, this really only matters when we’re finally dropping our results into our CSV. `to_csv` includes an argument that allows us to auto-format floats, so we can decide how many decimal places to display. It’s as simple as this:
```python
df.to_csv('test.csv', quotechar='"', index=False, quoting=1, float_format='%.0f')
```
This means that all of our floats will display zero decimal places. Here’s the result:
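If the screenshot isn’t visible, here’s a sketch of the same thing; I’m building the post-`loc` DataFrame directly with floats so the unwanted cast has already “happened”:

```python
import csv

import pandas as pd

# purchaser_id is float64 here, mimicking the DataFrame after loc
df = pd.DataFrame([['bananas', 1128553.0, 2.49], [None, 123.0, 0.89]],
                  columns=['product_name', 'purchaser_id', 'price'])

# float_format applies to every float column, price included
df.to_csv('test.csv', quotechar='"', index=False, quoting=csv.QUOTE_ALL,
          float_format='%.0f')

with open('test.csv') as f:
    result = f.read()
print(result)
# "product_name","purchaser_id","price"
# "bananas","1128553","2"
# "","123","1"
```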
Ah…this is a problem, because we’ve lost the cents in our `price` column. `float_format`’s downside is that it affects all the `float` columns in your DataFrame, so it won’t quite work in this example. Alternatively, we can selectively format and sanitize our one problem column:
```python
df['purchaser_id'] = (
    df['purchaser_id'].astype(str).replace(r'\.0', '', regex=True)
)
```
If we cast the type from `float` to `str`, we can use `replace` to pull out the `.0`. In this simple example, this works!
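One nice property of this cleanup is that it’s harmless when the cast never happened: stripping a `.0` that isn’t there is a no-op, so the same result comes out either way. A sketch (the `$` anchor is my addition, so a stray `.0` in the middle of a value would be left alone):

```python
import pandas as pd

df = pd.DataFrame([['bananas', 1128553, 2.49]],
                  columns=['product_name', 'purchaser_id', 'price'])
df.loc[1] = [None, 123, .89]

# Cast to str, then strip a trailing ".0"; the raw string keeps the
# dot from acting as a regex wildcard
df['purchaser_id'] = (
    df['purchaser_id'].astype(str).replace(r'\.0$', '', regex=True)
)
print(df['purchaser_id'].tolist())
# ['1128553', '123']
```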
The struggle continues
Rather than converting data back to what it was before the unexpected behavior, we might want to try preventing that behavior before it ever occurs. The issue here stems from `pandas`’ confusion regarding our `int64` column. We can set the data type of that column right off the bat:
```python
df = pd.DataFrame(data, columns=columns, dtype=object)
```
Whatever data is stored in the DataFrame here will be displayed in its original format, regardless of any `None` values. But if we use `loc` to add something, it’ll still suffer from the same issue:

I did a bunch of searching and experimenting and could not find a way to add a new value to the DataFrame without seeing the `.0` in that new row. We can once again remove it with the `replace` function, but setting the column type as `object` did not save us.
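To see what `dtype=object` buys us (and what it doesn’t), a sketch; since the `loc` behavior varies by pandas version, I only print the frame at the end rather than rely on it:

```python
import pandas as pd

data = [['bananas', 1128553, 2.49]]
columns = ['product_name', 'purchaser_id', 'price']

# Every column is stored as generic Python objects
df = pd.DataFrame(data, columns=columns, dtype=object)
dtypes_before = [str(t) for t in df.dtypes]  # ['object', 'object', 'object']
original_id = df.loc[0, 'purchaser_id']      # 1128553 -- still an int, no .0

# Enlarging with loc can still reintroduce the float cast,
# depending on the pandas version
df.loc[1] = [None, 123, .89]
print(df)
```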
Since I actually have to come up with a solution to this problem at work (where the `replace` method isn’t working for some reason), I think the best strategy is to regenerate DataFrames with full sets of data rather than trying to add info:
```python
import pandas as pd

data = [['bananas', 1128553, 2.49]]
columns = ['product_name', 'purchaser_id', 'price']
df = pd.DataFrame(data, columns=columns, dtype=object)

# The purchaser_ids in this df do not include .0, but if we
# add another row, we may end up with one.

# Convert the df back to a list:
df_values = df.values.tolist()

# Update the list:
df_values.append([None, 123, .89])

# Make a new df now that our list contains all the data we want:
new_df = pd.DataFrame(df_values, columns=columns, dtype=object)
new_df.to_csv('test.csv', quotechar='"', index=False, quoting=1)
```
Finally, we have a CSV with the formatted data we’re looking for.
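To double-check the fix end to end, here’s the same pipeline with the raw file read back at the end:

```python
import pandas as pd

data = [['bananas', 1128553, 2.49]]
columns = ['product_name', 'purchaser_id', 'price']
df = pd.DataFrame(data, columns=columns, dtype=object)

# Rebuild from a plain list instead of mutating with loc
rows = df.values.tolist()
rows.append([None, 123, .89])
new_df = pd.DataFrame(rows, columns=columns, dtype=object)
new_df.to_csv('test.csv', quotechar='"', index=False, quoting=1)

with open('test.csv') as f:
    final = f.read()
print(final)
# "product_name","purchaser_id","price"
# "bananas","1128553","2.49"
# "","123","0.89"
```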
Sources
- Prevent trailing zero with pandas “to_csv”, Stack Overflow
- Table, Computer Hope
- pandas.DataFrame, pandas documentation
- pandas.DataFrame.loc, pandas documentation
- Stop Pandas from converting int to float due to an insertion in another column, Stack Overflow