Hey folks!
If you are new to stream processing with Spark, or even if you have used it multiple times but want a better understanding of Spark Structured Streaming, then this article is for you!
Before discussing how to process stream data in Spark, let's first understand what stream data processing is and how it differs from batch data processing. If you are already familiar with these concepts, feel free to skip ahead.
Stream data processing
Stream data processing is the processing of data as and when it arrives, in order to make near-real-time decisions.
For example, detecting fraud as it happens, or spotting an erroneous server by analyzing its error rate.
How does stream processing differ from batch processing?
Batch data processing is processing of data accumulated over a period of time.
These are the normal flows/jobs that you have running at, say, a daily, weekly, or twice-a-day frequency. No matter when the data arrives, it is always processed at fixed, defined intervals.
Another difference worth highlighting is reliability: aggregates produced by batch processing are generally more reliable than those produced by stream processing, especially if the streaming job is not configured properly. On the other hand, stream processing lets you interpret your data much sooner than batch processing does.
Another general difference is that stream processing usually needs less memory than batch processing, since you are processing less data at a time.
Having defined stream data processing, let's dig into how to do it in Spark, i.e. with Spark Structured Streaming.
Spark Structured Streaming
Spark Structured Streaming allows near-real-time computation over streaming data using the Spark SQL engine, generating aggregates or output as per the defined logic.
This streaming data can be read from a file, a socket, or sources such as Kafka.
And the super cool thing about this is that the core processing logic is very close to how you would process the same data in batch mode. Basically, you define data frames and work with them just like you would in a batch job; only the way the data is processed underneath differs.
One important thing to note here is that Structured Streaming does not process data in real time, but in near-real-time. In fact, you will rarely find a pipeline or system that processes data in "real" real time without any delay, but that's a separate discussion.
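To make this concrete, here is a minimal sketch. The socket source, host, and port are placeholders chosen purely for illustration; the point is that the transformation at the end reads exactly like batch DataFrame code.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Read a stream of text lines from a socket source
# (host/port here are placeholders for illustration).
lines = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# From here on, this looks exactly like batch DataFrame code:
# split each line into words and count occurrences of each word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()
```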
Micro-batches
Structured Streaming uses the concept of micro-batches to process data, meaning that not every record is processed the moment it arrives. Instead, records are accumulated in small batches, and these micro (tiny) batches are processed together, i.e. in near real time. You can configure your micro-batch interval and go as low as a few milliseconds. For example:
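A small sketch, continuing with the `word_counts` dataframe from the earlier snippet; the one-second interval and console sink are just for illustration.

```python
# Run the query in micro-batches, triggering a new batch every second.
# The console sink is only for demonstration purposes.
query = (
    word_counts.writeStream
    .outputMode("complete")
    .format("console")
    .trigger(processingTime="1 second")
    .start()
)

query.awaitTermination()
```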
Unbounded table
As mentioned above, you work with data frames in a structured streaming job, just like in a batch one. But think about it: how exactly do you create a data frame over stream data? It's pretty straightforward for batch data: if there are 100 records, it's a data frame of 100 records. But what about a dataframe over stream data that keeps arriving? Take a moment to think about it.
This is where the concept of the unbounded table comes in: as data arrives, rows get appended to the table as micro-batches are processed. Each time new data comes in, the logic defined on the dataframe is applied to the table as it stands at that point. So basically, a dataframe is created over this unbounded table.
A thing to note here is that this unbounded table is more of a "conceptual thing". Spark doesn't keep the whole unbounded table in memory; it keeps writing out results and only maintains the minimal required intermediate state in memory.
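As a sketch of what this looks like in practice, suppose (purely as an assumption for illustration) that transaction events arrive on a Kafka topic named `transactions`, each carrying a `status` field. The aggregation below is conceptually computed over the whole unbounded table, but Spark only keeps one running count per status as state.

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructType

# Hypothetical schema for the transaction events.
txn_schema = StructType().add("txn_id", StringType()).add("status", StringType())

# Broker address and topic name are placeholders; reading from Kafka
# also requires the Spark Kafka connector package on the classpath.
transactions = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
    .select(from_json(col("value").cast("string"), txn_schema).alias("t"))
    .select("t.*")
)

# Conceptually an aggregation over the unbounded input table; in memory,
# Spark only keeps the running count per status, not every input row.
status_counts = transactions.groupBy("status").count()
```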
Triggers
Now, how does Spark know when to generate these micro-batches and append them to the unbounded table?
This mechanism is called triggering. As explained, not every record is processed as it arrives; at a certain interval, called the "trigger" interval, a micro-batch of rows gets appended to the table and processed. This interval is configurable and has different modes; the default is to start the next micro-batch as soon as the previous one finishes.
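A few of the trigger options, sketched on the hypothetical `status_counts` stream from earlier (the intervals are illustrative):

```python
# Default: no trigger set, so the next micro-batch starts as soon as
# the previous one finishes.
default_query = (
    status_counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)

# Fixed-interval micro-batches, e.g. one batch every 30 seconds:
# .trigger(processingTime="30 seconds")

# Process everything available right now, then stop (in recent Spark versions,
# handy for catch-up runs):
# .trigger(availableNow=True)
```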
Output modes
So now we know that at every trigger interval, data gets processed and output is generated as per the defined logic. Let's say you want to compute the total number of successful, failed, and pending transactions; that result is generated at every trigger interval.
For writing this result, or the "output", there are different modes:
- Complete: the entire result table is written as output every time
- Append: only new rows added to the result since the last trigger are written as output
- Update: only rows in the result that were updated since the last trigger are written as output
This output mode is configurable and can be set as per the use case.
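A sketch of setting the mode on the hypothetical `status_counts` query; which modes are allowed depends on the query (for example, plain append only works when result rows never change once written).

```python
# Complete: rewrite all status counts on every trigger.
complete_query = (
    status_counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)

# Update: emit only the status counts that changed in this trigger
# (swap the outputMode above for .outputMode("update")).

# Append: suitable for queries whose result rows never change once written,
# e.g. a simple filter/projection without aggregation.
```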
Fault tolerance
Now, in a perfect world, once you start stream processing, everything runs smoothly; the flow never shuts down or throws any errors (sounds like a dream, doesn't it?).
But sadly, in the real world, pipelines do fail, or have to be shut down or even restarted to apply new changes to the logic. In that scenario, what happens to "stream" data processing, umm?
Structured Streaming handles this with checkpointing and write-ahead logs. In simple terms, Spark remembers, at some location on HDFS (or S3, or whatever the underlying storage layer is), how far the data has been processed. This ensures fault tolerance, so that data is processed exactly once and the pipeline stays idempotent.
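In practice this just means pointing the query at a durable checkpoint directory; a minimal sketch (the path is a placeholder; in production it would live on HDFS/S3):

```python
# Enable fault tolerance by giving the query a durable checkpoint location.
query = (
    status_counts.writeStream
    .outputMode("complete")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/status_counts")
    .start()
)
# If this query fails or is restarted with the same checkpoint location,
# Spark resumes from the last committed progress instead of starting over.
```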
And that basically is the whole summary of the structured streaming. I will stop here, and let you absorb this information.
There's a lot more interesting stuff in stream processing, and in Structured Streaming in particular: stream-stream joins, the small-files issue, late data handling, window-based processing. But so as not to overwhelm you here, I will cover these in a couple of follow-up articles.