Skip to main content

Delta lake with Spark: What and Why?

 

Let me start by introducing two problems that I have dealt time and again with my experience with Apache Spark:
  1. Data “overwrite” on the same path causing data loss in case of Job Failure.
  2. Updates in the data.

Sometimes I solved above with Design changes, sometimes with the introduction of another layer like Aerospike, or sometimes by maintaining historical incremental data.

Maintaining historical data is mostly an immediate solution but I don’t really like dealing with historical incremental data if it’s not really required as(at least for me) it introduces the pain of backfill in case of failures which may be unlikely but inevitable.

The above two problems are “problems” because Apache Spark does not really support ACID. I know it was never Spark’s use case to work with transactions(hello, you can’t have everything) but sometimes, there might be a scenario(like my two problems above) where ACID compliance would have come in handy.

When I read about Delta Lake and its ACID compliance, I saw it as one of the possible solutions for my two problems. Please read on to find out how the two problems are related to ACID compliance failure and how delta lake can be seen as a savior?

What is Delta Lake?

Delta Lake Documentation introduces Delta lake as:

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

Delta Lake key points:

  • Supports ACID
  • Enables Time travel
  • Enables UPSERT

How Spark fails ACID?

Consider the following piece of code to remove duplicates from a dataset:

# Read from HDFS
df = spark.read.parquet("/path/on/hdfs") # Line 1
# Remove duplicates
df = df.distinct() # Line 2
# Overwrite the data
df.cache() # Line 3
df.write.parquet("/path/on/hdfs", mode="overwrite") # Line 4

For my spark application running above piece of code consider a scenario where it fails on Line 4, that is while writing the data. This may or may not lead to data loss. [Problem #1: As mentioned above].You can replicate the scenario, by creating a test dataset and kill the job when it’s in the Write stage.

Let us try to understand ACID failure in spark with the above scenario.

A in ACID stands for Atomicity,

  • What is Atomicity: Either all changes take place or none, the system is never in halfway state.
  • How spark fails: While writing data, (at Line 4 above), if a failure occurs at a stage where old data is removed and new data is not yet written, data loss occurs. We have lost old data and we were not able to write new data due to job failure, atomicity fails. [It can vary according to file output committer used, please do read about File output committer to see how data writing takes place, the scenario I explained is for v2]

C in ACID stands for Consistency,

  • What is Consistency: Data must be consistent and valid in the system at all times.
  • How Spark fails: As seen above, in the case of failure and data loss, we are left with invalid data in the system, consistency fails.

I in ACID stands for Isolation,

  • What is Isolation: Multiple transactions occur in isolation
  • How spark fails: Consider two jobs running in parallel, one as described above and another which is also using the same dataset, if one job overwrites the dataset while other is still using it, failure might happen, isolation fails.

D in ACID stands for Durability,

  • What is Durability: Changes once made are never lost, even in the case of system failure.
  • How spark might fail: Spark really doesn’t affect the durability, it is mainly governed by the storage layer, but since we are losing data in case of job failures, in my opinion, it is a durability failure.

How Delta Lake supports ACID?

Delta lake maintains a delta log in the path where data is written. Delta Log maintains details like:

  • Metadata like
    - Paths added in the write operation.
    - Paths removed in the write operation.
    - Data size
    - Changes in data
  • Data Schema
  • Commit information like
    - Number of output rows
    - Output bytes
    - Timestamp

Sample log file in _delta_log_ directory created after some operations:

After successful execution, a log file is created in the _delta_log_ directory. The important thing to note is when you save your data as delta, no files once written are removed. The concept is similar to versioning.

By keeping track of paths removed, added and other metadata information in the _delta_log_, Delta lake is ACID-compliant.

Versioning enables time travel property of Delta Lake, which is, I can go back to any state of data because all this information is being maintained in _delta_log_.

How Delta Lake solves my two problems mentioned above?

  • With the support for ACID, if my job fails during the “overwrite” operation, data is not lost, as changes won’t be committed to the log file of _delta_log_ directory. Also, since Delta Lake, does not remove old files in the “overwrite operation”, old state of my data is maintained and there is no data loss. (Yes, I have tested it)
  • Delta lake supports Update operation as mentioned above so it makes dealing with updates in data easier.

Comments

Popular posts from this blog

6 Rules of Thumb for MongoDB Schema Design

“I have lots of experience with SQL and normalized databases, but I’m just a beginner with MongoDB. How do I model a one-to-N relationship?” This is one of the more common questions I get from users attending MongoDB office hours. I don’t have a short answer to this question, because there isn’t just one way, there’s a whole rainbow’s worth of ways. MongoDB has a rich and nuanced vocabulary for expressing what, in SQL, gets flattened into the term “One-to-N.” Let me take you on a tour of your choices in modeling One-to-N relationships. There’s so much to talk about here, In this post, I’ll talk about the three basic ways to model One-to-N relationships. I’ll also cover more sophisticated schema designs, including denormalization and two-way referencing. And I’ll review the entire rainbow of choices, and give you some suggestions for choosing among the thousands (really, thousands) of choices that you may consider when modeling a single One-to-N relationship. Jump the end of the post ...

Apache Spark Discretized Streams (DStreams) with Pyspark

Apache Spark Discretized Streams (DStreams) with Pyspark SPARK STREAMING What is Streaming ? Try to imagine this; in every single second , nearly 9,000 tweets are sent , 1000 photos are uploaded on instagram, over 2,000,000 emails are sent and again nearly 80,000 searches are performed according to Internet Live Stats. So many data is generated without stopping from many sources and sent to another sources simultaneously in small packages. Many applications also generate consistently-updated data like sensors used in robotics, vehicles and many other industrial and electronical devices stream data for monitoring the progress and the performance. That’s why great numbers of generated data in every second have to be processed and analyzed rapidly in real time which means “ Streaming ”. DStreams Spark DStream (Discretized Stream) is the basic concept of Spark Streaming. DStream is a continuous stream of data.The data stream receives input from different kind of sources like Kafka, Kinesis...

How to add your Conda environment to your jupyter notebook in just 4 steps

 In this article I am going to detail the steps, to add the Conda environment to your Jupyter notebook. Step 1: Create a Conda environment. conda create --name firstEnv once you have created the environment you will see, output after you create your environment. Step 2: Activate the environment using the command as shown in the console. After you activate it, you can install any package you need in this environment. For example, I am going to install Tensorflow in this environment. The command to do so, conda install -c conda-forge tensorflow Step 3: Now you have successfully installed Tensorflow. Congratulations!! Now comes the step to set this conda environment on your jupyter notebook, to do so please install ipykernel. conda install -c anaconda ipykernel After installing this, just type, python -m ipykernel install --user --name=firstEnv Using the above command, I will now have this conda environment in my Jupyter notebook. Step 4: Just check your Jupyter Notebook, to se...