This month at work, I’m trying to graduate from understanding individual processes to understanding a full pipeline of data from front end to back end. As a Solutions Engineer, I’m frequently asked to diagnose why data failed to make it all the way through a pipeline, with clues like “we’re seeing a network call on the front end, but no corresponding row in Snowflake on the back end.” I learned that, in my workplace, these questions are often difficult to answer because our back end EventHandler doesn’t actually build rows in a DB. Instead, it sends data through a Kinesis stream, and after some time that data (usually) makes it to the DB on the other end.
Why Data Streams?
Before I even started studying our data pipeline, I knew that “The Kinesis Stream” was a black box. Few knew what logic existed inside or what stops it made on its way to the DB. So my question was…if this process is really hard to understand and might be causing issues like missing data, why use it at all? Can’t we just use an ORM or some other framework to write directly to a DB? That’s what I would have done if I were working on a personal project, but it likely wouldn’t scale on a professional level.
There’s a limit to the number of API calls we can make every second before our entire operation bottlenecks and grinds to a halt. If a company has 10,000 clients and each one is tracking 10,000 events on their site every day, we’ll run into issues. Traditionally, this was addressed with batch processing, which would periodically send large batches of stored data to the DB or wherever it needed to go. But batch processing comes with latency because data isn’t being sent in real time. That’s where a data stream comes in.
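To put rough numbers on that example, here is a quick back-of-the-envelope calculation. It assumes events are spread evenly across the day, which real traffic never is, so actual peaks would be higher than this average.

```python
# Back-of-the-envelope event rate for the example above.
# Assumption: events arrive evenly across the day (real traffic is burstier).
CLIENTS = 10_000
EVENTS_PER_CLIENT_PER_DAY = 10_000
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = CLIENTS * EVENTS_PER_CLIENT_PER_DAY
events_per_second = events_per_day / SECONDS_PER_DAY

print(f"{events_per_day:,} events per day")
print(f"about {events_per_second:,.0f} events per second on average")
```

That works out to 100 million events a day, or a bit over a thousand writes every second if each event became its own DB call.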
Data streams boast the ability to reflect data in real time, or at least closer to real time than batch processing. Even if you’re not a back end or data engineer, you may be familiar with the concept of streaming video, which is actually an example of a data stream. Before streaming video, we would have to download an entire video file before watching; like with batch processing we could not interact with any of the data until all of it was moved. A data stream allows us to watch part of a video while another part is loading. Each new piece of data is immediately accessible.
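The difference is easy to see in a few lines of Python. This is only a toy sketch of the idea, not anything Kinesis-specific, and `video_chunks` is a made-up source standing in for a video arriving piece by piece.

```python
from typing import Iterable, Iterator, List


def load_batch(source: Iterable[str]) -> List[str]:
    """Batch style: nothing is usable until every item has been collected."""
    return list(source)


def load_stream(source: Iterable[str]) -> Iterator[str]:
    """Stream style: each item is handed over as soon as it arrives."""
    for item in source:
        yield item


def video_chunks() -> Iterator[str]:
    """Hypothetical source that produces three chunks of a video."""
    for n in range(1, 4):
        yield f"chunk-{n}"


# Batch: we wait for all three chunks, then start watching.
batch = load_batch(video_chunks())

# Stream: we can watch the first chunk before the rest have been produced.
stream = load_stream(video_chunks())
first_chunk = next(stream)
```

In the batch version the caller gets nothing until the whole list exists. In the stream version `first_chunk` is available while the remaining chunks are still waiting to be generated.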
This is useful outside of video for any domain that values real-time data. Initially this included things like financial data, medical monitoring, or street traffic, but now it appeals to nearly every business out there. The ability to see data more quickly combined with the ease of use thanks to more data stream options is too good to pass up. Surely that’s also why my company uses a Kinesis stream to get information into our database.
Amazon Kinesis is a popular data streaming solution which boasts the capacity to reflect big data in real time. If you’re like me and you’ve heard Big Data a lot but don’t know what it is…it’s basically just a lot of data. A lot of varied data coming in at high velocity, but pretty simply put it’s a lot of data. And Kinesis streams help us handle, process, and move that data in real time.
AWS Kinesis is actually broken up into three services: Kinesis Streams, Kinesis Analytics, and Kinesis Firehose. Streams ingests data, Analytics provides insights and lets us build alerts, and Firehose delivers data to an S3 bucket, a DB, etc. in near real time (not exactly real time, but “near”). This really helpful intro video broke down Streams using a diagram of a Producer feeding a stream made up of shards, with a Consumer reading from the other end.
The Producer is whatever is sending data into the stream. The Consumer is where the stream goes, like Analytics or Firehose. What we see in between is the stream itself, broken up into shards. A shard is basically a partition that we can use to differentiate our data. We can partition our data into as many shards as we like, which of course means AWS bills on a per-shard basis. Shards are populated by records, which are each made up of:
- a Data Blob containing the actual information we want to move through the stream,
- a Partition Key that we provide to indicate which shard our record should be a part of,
- and a Sequence Number, which is assigned by the shard itself to indicate when it ingested the record and to differentiate records. Records are sequenced in the order they are received.
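To make those three pieces concrete, here is a toy in-memory model of a stream. It is a sketch under assumptions, not the real Kinesis API: `ToyStream`, `Record`, and `shard_for` are names I made up. The one real detail it borrows is that Kinesis hashes the key each record carries (it calls this a partition key) with MD5 into a 128-bit number and uses that number to pick a shard. I'm assuming the hash space is split evenly between shards, and I'm using a simple counter for sequence numbers, whereas real sequence numbers are much larger values.

```python
import hashlib
from dataclasses import dataclass
from typing import List

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


@dataclass
class Record:
    data: bytes            # the data blob
    partition_key: str     # chosen by the producer
    sequence_number: int   # assigned on ingestion, not by the producer


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index.

    Assumption: the hash space is split into equal ranges, one per shard.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hashed = int.from_bytes(digest, "big")
    return hashed * shard_count // HASH_SPACE


class ToyStream:
    """In-memory stand-in for a stream; not the real Kinesis API."""

    def __init__(self, shard_count: int) -> None:
        self.shards: List[List[Record]] = [[] for _ in range(shard_count)]
        self._next_sequence = 0  # simplified: a plain counter

    def put(self, data: bytes, partition_key: str) -> Record:
        record = Record(data, partition_key, self._next_sequence)
        self._next_sequence += 1
        self.shards[shard_for(partition_key, len(self.shards))].append(record)
        return record


stream = ToyStream(shard_count=4)
first = stream.put(b'{"event": "page_view"}', partition_key="client-42")
second = stream.put(b'{"event": "click"}', partition_key="client-42")
```

Because both records share the key `client-42`, they land in the same shard, and their sequence numbers preserve the order in which they arrived.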
While data is in the stream, it can be consumed any number of times by any number of consumers. We’ll only lose access to our data after Kinesis has stopped retaining it (24 hours by default, up to 365 days if we choose to pay for longer retention). There are limits to how much data we can send into and out of our stream, but they are parceled out on a per-shard basis so we always have the option to pay for more shards if need be.
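Since the limits are per shard, we can estimate how many shards a workload needs. The sketch below uses the per-shard limits as I understand them from the AWS documentation for provisioned streams (about 1 MB or 1,000 records per second in, about 2 MB per second out), so check the current quotas before relying on it. The function name, the average record size, and the assumption that all consumers share a shard's read throughput are mine.

```python
import math

# Per-shard limits for provisioned streams, as I understand the AWS docs.
WRITE_BYTES_PER_SHARD = 1_000_000   # about 1 MB per second in
WRITE_RECORDS_PER_SHARD = 1_000     # 1,000 records per second in
READ_BYTES_PER_SHARD = 2_000_000    # about 2 MB per second out


def shards_needed(records_per_second: float,
                  avg_record_bytes: int,
                  consumers: int = 1) -> int:
    """Estimate the shard count from whichever limit is hit first.

    Assumption: every consumer reads every record and all consumers
    share each shard's read throughput.
    """
    bytes_in = records_per_second * avg_record_bytes
    bytes_out = bytes_in * consumers
    return max(
        1,
        math.ceil(bytes_in / WRITE_BYTES_PER_SHARD),
        math.ceil(records_per_second / WRITE_RECORDS_PER_SHARD),
        math.ceil(bytes_out / READ_BYTES_PER_SHARD),
    )


# Hypothetical workload: 1,157 records per second at 2 KB each.
one_consumer = shards_needed(1_157, 2_000)
three_consumers = shards_needed(1_157, 2_000, consumers=3)
print(one_consumer, three_consumers)
```

With one consumer the write-bytes limit is the bottleneck, and adding consumers eventually makes the read limit the one that forces more shards.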
A Stream of Knowledge
I also spent some time this week learning about domains and ubiquitous language. Within the domain of my company, and many other engineering organizations, data stream is a ubiquitous term. That is to say that it is frequently used and has a common meaning that’s understood within the domain. It’s critical that everyone working in a domain have a common understanding of ubiquitous terms, so I’m encouraged that I can now more easily have conversations about our data pipeline. And I’m eager to continue my journey into unlocking the many ubiquitous terms of experienced engineers.
- What is Data Streaming, Garrett Alley
- Amazon Kinesis Introduction, Stephane Maarek
- Data Streaming Explained: Pros, Cons & How It Works, Jonathan Johnson