I was a bit busy with a home improvement project this weekend, so I’m taking an opportunity to live up to some standards that we should all aspire to:
- Take time to do important things that aren’t relevant to my career. There are some people, I’m told, who live to code. I am not one of those people, but it’s easy to get sucked up in work when we spend 40+ hours each week on it. Life and the people in it should come first (for me, anyway).
- Stick to my routine, even if I have to alter the standard a bit. I once heard someone say “a half-assed workout is still better than no workout.” Today’s entry isn’t going to be as long as some of my others, but I’m still going to learn something that’ll make me a better programmer.
I was trying to figure out why a Python script wasn’t returning the results I was looking for, so I started to work my way through the logic. Python is not my first (or my second) language and I regularly struggle with the syntax. I came across a batch function that I had a little trouble following:
def batch(iterable, n):
    length = len(iterable)
    for index in range(0, length, n):
        yield iterable[index : min(index + n, length)]
Based on what I know, batch means to take a collection and group its members in a certain way. But this function is iterating through a range and not the passed collection, so how does it work?
Our function takes two arguments: iterable and n. I checked the script to confirm the data types of these values and learned that iterable is an array (a list, in Python terms) of SQL query results. Since a query returns rows of data, it’s up to us to decide how to present them in Python code. Though it would be intuitive to use a dictionary (object) with keys and values set to column names and values, this script actually passes all the values in a single flat array. So iterable is a very long array because each row from our query result adds 10+ elements to it. The other parameter, n, is a constant set to an integer, in this case 1000.
for index in range
Now that we’ve got an idea of what the arguments might look like when this function is invoked, we can take a look at the logic. The first line sets length to the number of elements in the iterable array. For the sake of illustration, we can say that the query that generated iterable returned 4 rows of results and that the table itself has 10 columns. In that case, our iterable variable would be an array of 40 elements, so length would equal 40.
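To make the shape of that data concrete, here is a small sketch. It assumes each row comes back as a tuple of 10 values and that the rows get flattened into one list; the names rows and flattened and the placeholder values are made up for illustration and are not taken from the real script.

```python
# Hypothetical stand-in for the query result: 4 rows, 10 columns each.
rows = [tuple(f"r{r}c{c}" for c in range(10)) for r in range(4)]

# Flatten every row's values into one long list (an assumption about
# how the script builds the list it passes to batch).
flattened = [value for row in rows for value in row]

length = len(flattened)
print(length)  # 40
```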
Next we use the for keyword, which creates a loop where elements are referred to by the variable name directly after for (index in this case). Next comes in which tells us what we’re iterating through. I know range is going to return a series of numbers, but I forgot that it could accept a third argument. In this case, that argument is n, which we know is set to 1000.
range’s third argument is a step value, which determines the distance between numbers in the range. For example, range(0, 10) produces:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9
But if we add a step value, as in range(0, 10, 2), we get:
0, 2, 4, 6, 8
In our case, step is 1000, which makes me wonder how large our iterable data normally is. If the query truly returned only 4 rows, then we wouldn’t come close to reaching the first step. Our range would simply contain a single number: 0.
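These ranges are easy to check in a Python shell. The sketch below wraps each range in list() so the numbers are visible, since range itself is lazy; the variable names are only for illustration, and 40 is the length from my contrived 4-row example.

```python
# range() is lazy, so wrap it in list() to see the numbers it produces.
no_step = list(range(0, 10))
with_step = list(range(0, 10, 2))
our_case = list(range(0, 40, 1000))  # length 40, step 1000

print(no_step)    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(with_step)  # [0, 2, 4, 6, 8]
print(our_case)   # [0]
```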
But that might be what we’re looking for, so let’s examine the final, and for me the most confusing, line of our code:
yield iterable[index : min(index + n, length)]
First we have yield, which is like an iterative version of return. While return exits our function when run, yield sends a value back to the caller and pauses the function, which picks up where it left off when the next value is requested. Our alternative would be to collect every slice and then return the whole collection, but the advantage of yield is that each batch is only created when it’s asked for, so we never have to hold all of them in memory at once.
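The difference between return and yield is easier to see side by side. This is a toy comparison with made-up function names, not code from the script I was debugging.

```python
def first_three_with_return():
    # Build the whole collection, then hand it back in one go.
    results = []
    for number in range(3):
        results.append(number * 10)
    return results


def first_three_with_yield():
    # Hand back one value at a time; the function pauses after each yield.
    for number in range(3):
        yield number * 10


returned = first_three_with_return()
generator = first_three_with_yield()

first = next(generator)  # runs the function up to the first yield
rest = list(generator)   # resumes and collects whatever is left

print(returned)  # [0, 10, 20]
print(first)     # 0
print(rest)      # [10, 20]
```

Calling first_three_with_yield() doesn’t run any of its body right away. Nothing happens until something asks for a value, for example with next() or a for loop.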
We know that iterable is an array, so my original thought was that the square brackets were referencing a single element in the collection. But something that I forgot about Python is that a colon inside square brackets creates a slice. So instead of returning one element from iterable, we’re returning a slice of it, starting from index (in this case, 0) and ending just before the lesser of index + n (0 + 1000) and length (40). With my example data, that slice covers all 40 elements.
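Here is a quick sketch of indexing versus slicing, using a stand-in list of 40 integers in place of the real query data. One thing it shows is that Python already tolerates a slice end past the end of a list, so the min() call reads as a defensive (or just explicit) choice rather than a requirement.

```python
iterable = list(range(40))  # stand-in for the 40-element list
index, n, length = 0, 1000, len(iterable)

single = iterable[5]                              # one element
chunk = iterable[index : min(index + n, length)]  # elements 0 through 39
unclamped = iterable[index : index + n]           # slice end past the list is clamped

print(single)              # 5
print(len(chunk))          # 40
print(chunk == unclamped)  # True
```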
When I first read this function, I took an educated guess at what “batch” meant. Now it’s a bit clearer: we’re taking a potentially large group and breaking it into pieces. In my contrived example, the collection was small enough that it didn’t have to be split at all, but if it were thousands of entries long, we would have yielded several results. Batching our results likely makes them easier to process and optimizes the script, which makes us all winners.
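To see the splitting actually happen, here is the same function run against a list longer than one batch. The 2,500-element list is invented for the demonstration; I don’t know how large the real query results usually are.

```python
def batch(iterable, n):
    length = len(iterable)
    for index in range(0, length, n):
        yield iterable[index : min(index + n, length)]


big_list = list(range(2500))  # made-up data, larger than one batch
sizes = [len(chunk) for chunk in batch(big_list, 1000)]
print(sizes)  # [1000, 1000, 500]

small_list = list(range(40))  # the contrived 40-element case
small_sizes = [len(chunk) for chunk in batch(small_list, 1000)]
print(small_sizes)  # [40]
```

The last batch is simply whatever is left over, so it can be shorter than n.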