What I Learned at Work this Week: Flush Your Buffer

Mike Diaz
5 min read · Jun 25, 2023


Photo by Mike: https://www.pexels.com/photo/depth-of-field-photography-of-file-arrangement-1181772/

One would think that since I have spent over a year working on a Python script that writes data to CSVs, I would understand the basics of Python file writers. But if you’ve been a loyal reader of this blog, you know that it’s never safe to assume I know the basics of…anything. This week, while I was debugging a refactor of the script, a more experienced engineer asked about my use of flush, and I realized that I didn’t understand where I should put it or what it actually did. That turned out to be a pretty important detail, since failing to flush properly resulted in totally empty files.

Let’s Write a CSV File

I’ve written about writing to a CSV before, but it was a while ago. Let’s write some new code to catch up on the different steps in this process:

import tempfile
import csv

if __name__ == '__main__':
    fp = tempfile.NamedTemporaryFile(
        mode='w+t', prefix='diaz_test_file'
    )
    ...

We’ll start by importing a couple of modules: tempfile, which creates a temporary storage area for our data, and csv, which we can use to read a collection and write comma-separated results to our temporary file. Next we do the classic if __name__ == '__main__' as the start of our Python logic. As we begin our logic, we create a tempfile using NamedTemporaryFile. This is similar to the basic temporary file generated by the tempfile library, but it’s guaranteed to have a visible name in the file system, which we can make even easier to spot using the prefix parameter. The mode parameter indicates what we’d like to be able to do to the file (w for write and t for text mode, as opposed to the default binary mode).
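To see what that “visible name” buys us, here’s a minimal sketch (the prefix is just the one from the example above) showing that NamedTemporaryFile creates a real file on disk, and that the file disappears when we close it, which will matter later:

```python
import os
import tempfile

# NamedTemporaryFile creates a real file on disk with a findable name.
fp = tempfile.NamedTemporaryFile(mode='w+t', prefix='diaz_test_file')

print(fp.name)                  # full path to the temporary file
print(os.path.exists(fp.name))  # True: the file exists while fp is open

# By default, closing a NamedTemporaryFile deletes it.
fp.close()
print(os.path.exists(fp.name))  # False: gone after close
```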

    field_names = ['id', 'food']
    list_of_data = [
        {'id': '1', 'food': 'banana'},
        {'id': '2', 'food': 'grapes'}
    ]
    csv_writer = csv.DictWriter(
        fp,
        fieldnames=field_names,
        quotechar='"',
        quoting=csv.QUOTE_ALL,
        delimiter=',',
    )
    csv_writer.writeheader()
    csv_writer.writerows(list_of_data)

The first part of the code gave us a temporary file we can reference through the variable fp. We’ll need that to invoke our DictWriter, a csv class that writes dictionaries into a file as comma-separated values. The arguments we pass are:

  • the file itself (fp)
  • a list of fieldnames (the names of our columns)
  • quotechar, which specifies what type of quote we’ll use around strings in our csv
  • quoting, used here to make types more consistent. We specify that we want everything to be in quotes, even numbers, so that there’s no confusion if the csv has to be ingested by a program later.
  • the delimiter, a comma by default, but I’ve gotten requests to make pipe → | separated files instead of comma. It’s specified here for clarity.
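If you’re curious what those quoting and delimiter choices actually produce, here’s a small sketch that points the same DictWriter settings at an in-memory buffer (io.StringIO stands in for our temp file) so we can inspect the output directly:

```python
import csv
import io

# Same DictWriter settings as above, but writing to an in-memory buffer
# so we can see exactly what QUOTE_ALL produces.
out = io.StringIO()
writer = csv.DictWriter(
    out,
    fieldnames=['id', 'food'],
    quotechar='"',
    quoting=csv.QUOTE_ALL,
    delimiter=',',
)
writer.writeheader()
writer.writerow({'id': '1', 'food': 'banana'})

print(out.getvalue())
# "id","food"
# "1","banana"
```

Note that everything, including the numeric id, comes out quoted, which is the consistency QUOTE_ALL is buying us.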

The DictWriter, here named csv_writer, is just a writer object, so we have to invoke its methods to write to our temporary file. We invoke writeheader to put the column names in the file, then writerows to populate the two rows I defined at the top of this code block. At this point, we can try to check on our progress:

    print(f'HERE IS THE TEMPORARY FILE: {fp}')
    exit(0)

Unfortunately, the results are abstract:

HERE IS THE TEMPORARY FILE: <tempfile._TemporaryFileWrapper object at 0x10c132c50>

We could print the contents of the file using the open function, but if we’re doing that, we might as well just try to convert what we’ve got to a non-temporary local file and read that:

    filename = 'diaz_file.csv'
    with open(filename, "w+") as output:
        with open(fp.name) as data:
            buffer = data.read()
            output.write(buffer)

The first with open() syntax creates a new file that we can write to (hence w+). We open the existing temporary file within that scope, using fp.name as a reference point, read the data, then write it to the local file. If we’ve done this correctly, we should see this file show up in the directory we’ve run the script from.

And if you’ve done exactly what I’ve done, you can open the file and see….nothing.

flush

Observant readers will notice that I glossed over a variable name in the above code: buffer. I glossed over mentions of buffers while I was debugging as well, which greatly increased my debugging time. According to TutorialsPoint, the purpose of internal buffers, which are created by the runtime, library, and programming language that you’re using, is to speed up operations by avoiding a system call on every write. In our sample code, we’re not actually writing to the file on disk when we run csv_writer.writerows. We write to the buffer, and the buffer only dumps onto the file automatically once it fills up. The buffer would also be flushed if we closed the file we’re writing to, but since NamedTemporaryFile deletes its file on close by default, that won’t work well for our purposes. What we need is to call fp.flush() before we try to read the temporary file.

File Size

Writing this entry helped me understand why I was seeing especially unusual behavior while testing at work: some files would be populated, but others would not be. This perplexed me because I was running the same logic on different data sets, so why would it work inconsistently? Now we can see that the buffer was probably never filling up with the smaller data sets, so it was never flushing itself. And, in fact, the larger data sets were likely not fully written either; they just appeared to be, because I couldn’t tell the difference between 1,000 rows and 1,142 rows.
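We can watch this size effect directly. In the sketch below (the prefix is again just a placeholder), a write that’s much smaller than the buffer stays invisible to a second reader of the same path until we flush; the buffer size printed first is platform-dependent, but is commonly 8192 bytes:

```python
import io
import tempfile

# The default size of the binary-layer buffer; commonly 8192,
# but this is platform-dependent.
print(io.DEFAULT_BUFFER_SIZE)

fp = tempfile.NamedTemporaryFile(mode='w+t', prefix='diaz_size_test')
fp.write('a few bytes')  # far smaller than the buffer

# A second handle on the same path sees nothing yet.
with open(fp.name) as reader:
    print(len(reader.read()))  # 0: the write is still buffered

fp.flush()
with open(fp.name) as reader:
    print(len(reader.read()))  # 11: visible after flushing
```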

I’ll definitely remember that if I ever see some empty files from a writer that I should check on the buffer. Perhaps more importantly, I’ll remember that I should take the time to understand the basic functionalities that I might have taken for granted, or been too embarrassed to admit I didn’t completely grasp.
