
Spark Structured Streaming Simplified

 

Hey folks!

If you are new to stream processing with Spark, or even if you have used it many times but want a better understanding of Spark Structured Streaming, then this article is for you!

Before discussing how to process stream data in Spark, let’s first understand what stream data processing is and how it differs from batch data processing. If you are already familiar with these concepts, please skip ahead.

Stream data processing

Stream data processing is the processing of data as and when it arrives, in order to make near-real-time decisions.

For example, detecting fraud as it happens, spotting a misbehaving server by analyzing its error rate, etc.

How does stream processing differ from batch processing?

Batch data processing is the processing of data accumulated over a period of time.

These are the usual flows/jobs that you have running at, say, a daily, weekly, or twice-a-day frequency. No matter when the data arrives, it is always processed at fixed, predefined intervals.

Another difference I want to highlight is reliability: aggregates produced by batch processing are generally more reliable than those generated via stream processing, especially if the stream processing is not configured properly. On the other hand, stream processing lets you act on data much sooner than batch processing does.

Another general difference is that stream processing usually needs less memory than batch processing, since you are processing less data at a time.

Data processing in batch vs stream (Image by Author)

Having defined stream data processing, let’s dig into how to do it with Spark, i.e. Spark Structured Streaming.

Spark Structured Streaming

Spark Structured Streaming allows near-real-time computation over streaming data on top of the Spark SQL engine, generating aggregates or other output as per the defined logic.

This streaming data can be read from a file, a socket, or sources such as Kafka.
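For illustration, here is a minimal PySpark sketch of creating streaming DataFrames from a socket and from Kafka. The host, port, broker address, and topic name are placeholder assumptions, and the Kafka source additionally needs the spark-sql-kafka connector package on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

# Stream of text lines from a TCP socket -- handy for quick local experiments
socket_df = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")   # placeholder host
    .option("port", 9999)          # placeholder port
    .load()
)

# Stream of records from a Kafka topic (placeholder broker and topic)
kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
)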

And the super cool thing about this is that the core processing logic is very close to how you would process that data in batch mode. Basically, you define data frames and work with them just as you normally would in a batch job; only the way the data is fed through them differs.
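As a quick sketch of that idea, the socket stream from the earlier example can be transformed with exactly the same DataFrame API you would use in a batch job. Treating each incoming line as a transaction status is just an assumption for illustration.

# The socket source exposes a single string column named 'value';
# here we treat each line as a transaction status (an assumption for this example).
status_counts = (
    socket_df
    .withColumnRenamed("value", "status")
    .groupBy("status")
    .count()
)
# The same groupBy/count would look identical on a batch DataFrame read with spark.read.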

One important thing to note here is that Structured Streaming does not process data in real time but in near real time. In fact, you will rarely find a pipeline or system that processes data in “real” real time without any delay, but that’s a separate discussion.

Micro batches

Structured Streaming uses the concept of micro-batches to process the data, meaning that not every record is processed as it comes. Instead, records are accumulated in small batches, and these micro (tiny) batches are processed together, i.e. in near real time. The micro-batch interval is configurable and can go as low as a few ms. For example:

Micro batches (Image by Author)

Unbounded table

As mentioned above, you work with data frames in a structured streaming job, just like in a batch one. But think about it: how exactly does one create a data frame over stream data? It’s pretty straightforward for batch data: if there are 100 records, it’s a data frame of 100 records. But what about a dataframe over stream data that keeps arriving? Take a moment to think about it.

Here comes the concept of the unbounded table: as data arrives, rows get appended to the table as the micro-batches are processed. As new data comes in, the logic defined on the dataframe is applied to the table accumulated so far. So basically, the dataframe is created over this unbounded table.

A thing to note here is that this unbounded table is more of a “conceptual thing”. Spark doesn’t keep the whole unbounded table in memory; it keeps writing out the result and only maintains the minimal required intermediate state in memory.

The concept of unbounded table (Image by Author)

Triggers

Now, how does Spark know when to generate these micro-batches and append them to the unbounded table?

This mechanism is called triggering. As explained, not every record is processed as it arrives; instead, at a certain interval, called the “trigger” interval, a micro-batch of rows gets appended to the table and processed. This interval is configurable and supports different modes; the default mode starts the next micro-batch as soon as the previous one finishes.

The concept of triggering (Image by Author)
An example of triggering (Image by Author)
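As a hedged sketch, the trigger is configured on the writeStream side; the 10-second interval below is just an illustrative value, and socket_df comes from the earlier sketch.

# Process a micro-batch every 10 seconds (illustrative interval)
query = (
    socket_df.writeStream
    .format("console")
    .trigger(processingTime="10 seconds")
    .start()
)

# Omitting .trigger(...) gives the default behaviour described above.
# trigger(availableNow=True) (Spark 3.3+) processes all currently available data
# in micro-batches and then stops.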

Output modes

So now we know that at each trigger interval, a micro-batch of data is processed and output is generated as per the defined logic. Let’s say you want to compute the totals of successful, failed, and pending transactions; at every trigger interval this result is generated.

For writing this result, or the “output”, there are different modes:

  • Complete: The entire result is stored or written as output
  • Append: Only the new records in the result are written as output
  • Update: Only the updated records in the result are written as output

This output mode is configurable and can be set as per the use case. To generate these results, Spark maintains the required intermediate state across micro-batches (the same minimal state mentioned for the unbounded table).
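For example, here is a hedged sketch of writing the per-status counts from the earlier sketch using the complete output mode; switching to “update” or “append” is a one-line change, subject to what the query supports.

# Write the full aggregation result after every trigger
query = (
    status_counts.writeStream
    .outputMode("complete")   # or "update"; "append" requires a watermark for aggregations
    .format("console")
    .start()
)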

Fault tolerance

Now, in a perfect world, once you start stream processing, everything runs smoothly and the flow never shuts down or throws any errors (sounds like a dream, doesn’t it?).

But sadly, in the real world, pipelines do fail, or have to be shut down or even restarted to apply new changes to the logic. In that scenario, what happens to “stream” data processing, hmm?

Structured Streaming handles this with checkpointing and write-ahead logs. In simple terms, it remembers, by storing in some location on HDFS (or S3, or whatever the underlying storage layer is), how far the data has been processed, ensuring fault tolerance. This way each record is processed exactly once, provided the source is replayable and the sink is idempotent.
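A minimal sketch of enabling checkpointing, assuming a hypothetical HDFS path; on restart, Spark reads this checkpoint to resume from where the previous run left off.

# The checkpoint directory stores offsets and state across restarts (path is a placeholder)
query = (
    status_counts.writeStream
    .outputMode("complete")
    .format("console")
    .option("checkpointLocation", "hdfs:///checkpoints/transaction_counts")
    .start()
)

query.awaitTermination()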

And that, basically, is the summary of Structured Streaming. I will stop here and let you absorb this information.

There’s a lot more interesting stuff in stream processing, and in Structured Streaming in particular, like stream-stream joins, small-file issues, late data handling, and window-based processing. But so as not to overwhelm you, I will cover these in a couple of follow-up articles.
