What I Learned at Work this Week: Pandas Values considered “missing”

Mike Diaz
5 min readSep 25, 2022

--

Photo by Pedro Figueras: https://www.pexels.com/photo/white-framed-glass-window-626158/

The last time I posted, I shared my experience debugging type adjustment with a pandas DataFrame. Around the same time I was writing that article, a friend of mine asked if I could do a presentation for one of our company’s Demo Days. I wanted to help him out and I love attention so I talked about pandas since I had this ready-made story to tell.

It turned out to be a great decision because a few members of the audience immediately recognized my issue and shared some resources for how to solve it! It’s a little embarrassing that the answer was in the pandas documentation all along, but overall I’m just happy to know more about this issue and to be able to share it with anyone else who’s curious.

Values considered “missing”

For those who haven’t read my original article, I was having an issue when creating a DataFrame in pandas when some rows had “missing” data. Here’s an example of the code:

import pandas as pddata = [['bananas', 1128553, 2.49]]
columns = ['product_name', 'purchaser_id', 'price']
df = pd.DataFrame(data, columns=columns)
df.loc[1] = [None, 123, .89]
df.to_csv('test.csv', index=False)

Our initial data set has three values, a string (considered an “object” by pandas), an integer, and a float. When we make it into a DataFrame, those types will be maintained. But if we add a new row to the DataFrame, including None as the first value will cast the integers as floats:

The first paragraph in this pandas documentation explains that NaN (Not a Number) is the library’s default missing value marker and that None will be represented as NaN for the sake of computational speed. In a linked page for Nullable integer data type, we get an explanation that’s very relevant to our issue:

Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot even be represented as floating point numbers.

Now we’re getting somewhere. None becomes NaN, which is a float. Our integers are being cast to floats because of missing data, but there’s still one remaining mystery. Why does a float in the “product_name” column change the type of an integer in the “purchaser_id” column?

Rows and Columns

When I approached this problem, I considered everything from a vertical perspective. That is to say that I was looking for some errant data in the specific column that was generating floats, rather than the other column which ended up causing the problem with missing data. Still, it seems strange that the None here would change the type of an integer in a different row and column:

df.loc[1] = [None, 123, .89]

But what if it’s not the None that’s directly affecting the other value? Let’s do an experiment here:

import pandas as pddata = [["bananas", 1128553, 2.49], [None, 123, .89]]
columns = ['product_name', 'purchaser_id', 'price']
df = pd.DataFrame(data, columns=columns)
print(print(f'THESE ARE THE DATA TYPES:\n{df.dtypes}'))

We’re doing something similar to what we did before, but instead of using loc to add a row, we’re generating two rows from the start. We know that None will be represented as NaN and therefore a float. Which columns will that affect?

None of the columns have been altered despite one of them having a missing value! Now if the first instance of “product_name” were an int instead of an object, we know what to expect:

data = [[1, 1128553, 2.49], [None, 123, .89]]

But that still doesn’t break the purchaser_id column, which remains an int. Let’s go back to the setup we had before:

import pandas as pd
data = [["bananas", 1128553, 2.49], [None, 123, .89]]
columns = ['product_name', 'purchaser_id', 'price']
df = pd.DataFrame(data, columns=columns)
print(f'THESE ARE THE DATA TYPES:\n{df.dtypes}')
df.loc[2] = [None, 123, .89]
print(f'THESE ARE THE NEW DATA TYPES:\n{df.dtypes}')

Note that we changed df.loc[1] to df.loc[2] since our initial DataFrame now has a populated row for index 1. Hare are the results:

Aha! This is what we were seeing before. So what changed? It wasn’t just the addition of the None, but the fact that it was done in conjunction with a loc. By virtue of having a new float in the loc row, pandas got confused and cast the second element (123) to a float, causing the whole column to be converted. If we change None to an object "None", we don’t face the same issue:

df.loc[2] = ["None", 123, .89]

Watching our types

The moral of the story here is that we have to be aware of how we’re manipulating and updating data. We learned last time that we can be explicit about our columns to prevent pandas from making assumptions. Now we know that we should be more thoughtful about what type of data we’re adding to our data sets. None isn’t harmless. It might just end up casting a ton of data if we’re not careful.

Sources

--

--

Mike Diaz