What I Learned on a Side Project This Week: Python CSV Parsing
It’s been a long time since I’ve attempted any kind of side-project. A personal project is very different from my job and can be really exhausting; I’m working as an individual rather than a team, there’s less pre-built infrastructure, and the scope can sometimes be larger than what I’d do at work. On top of that, of course, there’s less time because we might be working 40 hours each week (or more, honestly) doing our paid jobs.
So I don’t want to create the impression that I embarked on some large, ambitious project that integrated a bunch of different frameworks. This weekend, I more or less wrote a fairly straightforward algorithm that integrated the most basic aspects of Python’s csv library. I’m going to try and retrace my steps to guide any readers who could use help figuring out how to get started or getting around blockers when they get stuck.
My Goal
I like to play fantasy football with my friends. For those who aren’t familiar, the basic principles of the game are that a group of players (my friends and I) select a series of NFL (American Football) players to represent our “teams.” Whenever one of the football players does well in their real-life game, they score “fantasy points” and the manager whose players score the most points wins the game.
One member of our league came up with a metric to represent the “Player of the Week,” who is the football player who presented the most value in that week’s games. It’s easy to look up player scores, but this metric is sometimes difficult to report on because:
- Our players are split up into two “divisions” and we want the top scorer from each. But “division” is not a natural attribute of a football player, since it is specific to our league.
- The “Player of the Week” is not the top scorer overall, but instead the player who scored the most relative to a replacement-level player at his position. If player A is a quarterback and scores 50 points, that’s great. But if every quarterback scored at least 40 points that week, it doesn’t make as much of an impact compared to a running back who scored 30 while everyone else scored 10.
If you’re starting to get confused, congratulations! You now understand why it became really tedious to figure out which players should be “Player of the Week.” My goal was to cut out some of the manual work by using a Python script to read the values from a csv file with player data. Fortunately, our fantasy football platform, Fantrax, allows us to download such a csv, so I didn’t have to figure anything out on that end.
Parsing a csv with Python
My first goal was to get the csv data into a format I could manipulate. My understanding is that this is most easily done in Python, so I searched for a tutorial that would walk me through a simple example. I found this video, which explained all I needed to know within the first 10 minutes
I really like videos because they give me more context than plain documentation. I can see the file tree and the full page of code, rather than a snippet. In this case, all I had to do was add a csv file to my project and pass the filename as an argument to the open function that I imported as part of the csv library. That function returns a file object which we can loop through. Each line is like an list, where its various properties can be accessed through indexing. We’ll use my csv as an example:
The first line contains headers, which will become relevant later. For now, we should just understand that if we convert this to a file object via open, the first time we loop through, our iterator will represent a collection that functions like this:
["ID", "Player", "Team", "Position",...]
And so if we want to return “ID”, we could run:
for line in csv_reader:
return line[0]
I ran a test with print statements to make sure this was working correctly. Once I had established that, I moved forward to start writing logic that would select the correct “Players of the Week.”
Sort by Position
The first thing I wanted to do was split the players up based on what position they play, since that’s important for determining their relative value. I wrote a function that would iterate through a collection and return a filtered version that would only include players with a certain value for their position. It looked like this:
def get_position(csv_input, position):
result = [] for line in csv_input:
if line[3] == position and len(result) <=
position_replacement_numbers[position]:
result.append([line[1], line[5], line[10]]) return result
I created an empty list and then iterated through my whole csv, looking for rows where the third element, which notes the player’s position, matches the value passed as the function’s second argument (if I wanted all running backs, I’d pass ‘RB’). I also added this condition:
len(result) <= position_replacement_numbers[position]
The csv I exported from Fantrax contained data on nearly every football player on an NFL roster, but the vast majority of them don’t play enough to be relevant for this exercise. I realized that I only had to collect data until I reached the “replacement level” player, which usually means 10–30 players. The exported csv was already sorted with the highest scores at the top, so I knew that I could stop my loop as soon as I reached that replacement level. I created a dictionary that mapped positions to number of players that I wanted to collect:
position_replacement_numbers = {
'QB': 21,
'RB': 21,
'WR': 31,
'TE': 11,
'K': 11,
'DST': 11
}
In fact, I am realizing right now that my code unnecessarily loops through the whole csv even once the results list is full, but would be more efficient if the loop stopped at that point. Yet another advantage to writing about your code!
Finally, the row had some data that I didn’t need, so I only appended three relevant fields: the player’s name (index 1), the player’s fantasy football team (index 5), and the player’s score for that week (index 10).
Function Issues
This function worked well when I tested it on one position, so I next tried to invoke it six times to create different positional lists, passing a different argument each time. I still don’t understand why, but this never worked after the first list. I tried to debug and found that my for loop wasn’t running when I ran the function with the same csv_input object a second time. So I closed my computer and went to bed.
When I woke up the next morning, I realized that I could refactor the function to populate all positional lists upon being run a single time. Rather than returning a list, I could return a tuple of lists (a tuple is a built-in Python data type that stores multiple elements in a single variable)! Here’s what I came up with:
def get_position(csv_input):
result = ([], [], [], [], [], [])
for line in csv_input:
line_position = line[3]
tuple_index = position_to_index[line_position]
if len(result[tuple_index]) <= position_replacement_numbers[line_position]:
result[tuple_index].append([line[1], line[5], line[10]])
return result
I created another dictionary that mapped position values to indexes in my tuple (I realize this might have been more efficient as a list, but I think that would have been harder to read):
position_to_index = {
'QB': 0,
'RB': 1,
'WR': 2,
'TE': 3,
'K': 4,
'DST': 5
}
Organize by Division
The truth is that I didn’t catch the bug in get_position until I tried to pass the resulting lists into a function that would choose the top performers from each division:
def get_top_players(positional_tuple):
mario_division = ['', '', '', 0]
luigi_division = ['', '', '', 0]
for position in positional_tuple:
top_mario = top_player_by_division(position, 'mario')
top_luigi = top_player_by_division(position, 'luigi')
if top_mario[3] > mario_division[3]:
mario_division = top_mario
if top_luigi[3] > luigi_division[3]:
luigi_division = top_luigi
return (mario_division, luigi_division)
Our divisions are named “Mario” and “Luigi.” Each division contains five teams, so I created yet another dictionary to use as a reference:
division_teams = {
'mario': ['HAR', 'VA', 'CR', 'LIC', 'YON'],
'luigi': ['FLU', 'OAK', 'MARS', 'NE', 'GK']
}
As I had done with my previous function, I set placeholder variables that I would ultimately populate with my results. Here, I iterate through the tuple that contains 6 elements, each a list of players. For each one of these elements, I pick out a top player for the Mario and Luigi divisions using top_player_by_division:
def top_player_by_division(list_of_players, division_name):
last_players_score = list_of_players[-1][2]
for row in list_of_players:
if row[1] in division_teams[division_name]:
return [row[0], row[1], row[2], float(row[2]) - float(last_players_score)]
We know that list_of_players is already sorted from highest-to-lowest scoring, but we can’t just choose the top player because we don’t know which division they represent. In our for loop, we check whether the element’s team name matches one in the list that I’ve defined for that division in the division_teams dictionary. The first time that turns out to be true, we want to short circuit the loop and just return that player, because we know he’s going to be the top scorer.
But we don’t just return the row as-is, because we want to add one more element. This element represents the difference in points scored between the selected player and a replacement-level player. Here’s the logic:
float(row[2]) - float(last_players_score)
We can see that last_players_score is defined at the very beginning of the function:
last_players_score = list_of_players[-1][2]
If you’re a JavaScript fan like me, you might not know (or have forgotten) that a lot of languages have the super handy shorthand for using index -1 to select the last element of a collection. Our limits for the number of players that we collect in get_position means that the last element in each of the resulting lists will be the replacement-level standard for the week. If we subtract that player’s score from any other player’s, we can see their value over replacement. That number will be important later, as we compare the top performers at each position to see which one has the highest value over replacement, and therefore should be “Player of the Week.”
When I first wrote this function, I didn’t include the float() portion, which casts my values as floats (numbers with decimal points). And so I got this error:
I had forgotten that all the values I was pulling from my csv would be strings, and therefore I had to change their data type if I wanted to perform mathematical operations on them. Once I solved that, though, my function was returning a top_performer list that included a value over replacement element. My parent function, get_top_players, ran that for each different position and then checked to see whether the top player was better than the one that had come before it:
if top_mario[3] > mario_division[3]:
mario_division = top_mario
if top_luigi[3] > luigi_division[3]:
luigi_division = top_luigi
The first time the logic ran, it compared the value over replacement element (index 3) to my placeholder, which was set to 0. And so it set those values to the first results. But each subsequent time, it would replace that player with another if their value over replacement was higher. With that done, all I had to do was print out my results:
(top_mario_player, top_luigi_player) = get_top_players(positional_tuple)print(f'Mario Division potw: {top_mario_player[0]}')
print(f'Luigi Division potw: {top_luigi_player[0]}')
Test and Refactor
I ran my code and…immediately got an error:
Remember way back in the beginning of this post where I pointed out that the first row in my csv had headers instead of player values? That turned out to be a problem because, as I tried to iterate through the rows, I was reading values that were not accounted for in my code. I was lucky that the tutorial video I watched covered this situation, noting that it could be easily handled by passing our csv_reader object into a next function. Adding that fixed the problem and exposed my one last error:
Here’s where knowledge of the domain comes in really handy. As soon as I saw this, I knew exactly what was happening because I’ve played fantasy football for a long time: some players fit neatly into positional categories, but others can play more than one position, so they may have a value like ‘RB,WR’ in their positional place. I had to answer two questions about these outliers:
- Do I want classify them as playing one position, both positions, or ignore them completely?
- Once I decide what I want to do, how should I go about it?
These players definitely factor into the game, so I don’t want to ignore them. Likewise, since they can be played at either position, I think they should be included in both positional lists. Here’s how I coded that out:
def get_position(csv_input):
result = ([], [], [], [], [], [])
next(csv_input)
for line in csv_input:
line_position = line[3]
position_list = [line_position] if ',' not in line_position else line_position.split(',')
for position in position_list:
tuple_index = position_to_index[position]
if len(result[tuple_index]) <= position_replacement_numbers[position]:
result[tuple_index].append([line[1], line[5], line[10]])
return result
Originally, I just assigned each row one position within my for loop, but now we’re thinking we want certain rows to be counted twice, if they have two positions. So now each row’s positions will be represented in their own list:
position_list = [line_position] if ',' not in line_position else line_position.split(',')
Once again for the JavaScript natives, this syntax is like a ternary operator for Python. We’re conditionally setting our variable’s value as one thing if our condition is met (that being that there is no comma in our position value, like if the value is ‘RB’), but something else in another case. Without this addition, we would be looking at line_position, which could be:
'RB'
# or
'RB,WR'
Now we’ll be looking at:
['RB']
# or
['RB','WR']
This makes a huge difference, because we can iterate through the latter more naturally. And that’s exactly what I did:
for position in position_list:
tuple_index = position_to_index[position]
if len(result[tuple_index]) <= position_replacement_numbers[position]:
result[tuple_index].append([line[1], line[5], line[10]])
And success!
Iteration
I’ve used the word “iterate” a lot today to talk about moving through a collection in our code, but it’s perhaps more relevant here in relation to examining our process. I wanted to make sure to make it clear that, even for a simple problem, I made mistakes and ran into error messages. I had to iterate through the problem and break it into smaller pieces to make this work, and even so I had to split the work over two days. It’s important to remember that coding is a process and that process always takes time.
If you want to check out my full code, it’s available on GitHub here.
Sources
- Python Tutorial: CSV Module — How to Read, Parse, and Write CSV Files, Corey Schafer
- Reading and Writing CSV Files in Python, Jon Fincher
- Python Tuples, W3 Schools