IT Blog

Apache Spark vs. Sqoop: Engineering a better data pipeline

When building data pipelines in a modern data platform, one of the most common tasks for a data engineer is extracting data from an OLTP database or data warehouse so that it can be transformed for analytical use cases or for reports that answer business questions. Over the last decade, when Hadoop was the primary compute environment, Apache Sqoop became the de facto tool for ingesting data from these relational databases into HDFS (Hadoop Distributed File System). Once the data has been persisted in HDFS, Hive or Spark can be used to transform it for the target use case. As adoption of Hadoop, Hive and MapReduce slows and Spark usage continues to grow, using Spark to read data directly from relational databases becomes more important. Before we dive into the pros and cons of using Spark over Sqoop, let’s review the basics of each technology: Apache Sqoop is a MapReduce-based utility that uses the JDBC protocol to connect to a database to query and transfer data...
