Popular posts from this blog

How to add your Conda environment to your Jupyter notebook in just 4 steps

In this article I am going to detail the steps to add a Conda environment to your Jupyter notebook.

Step 1: Create a Conda environment: conda create --name firstEnv. Once the command finishes, you will see output confirming that the environment was created.

Step 2: Activate the environment using the command shown in the console. After activating it, you can install any package you need in this environment. For example, I am going to install TensorFlow here: conda install -c conda-forge tensorflow

Step 3: Now you have successfully installed TensorFlow. Congratulations! Next, to make this Conda environment visible to your Jupyter notebook, install ipykernel: conda install -c anaconda ipykernel. After installing it, just run python -m ipykernel install --user --name=firstEnv and the Conda environment will be available in your Jupyter notebook.

Step 4: Just check your Jupyter Notebook, to se...
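As a quick sanity check for Step 4, here is a minimal Python sketch, assuming the jupyter_client package that ships with Jupyter, which lists the registered kernels; the new firstEnv kernel should appear (ipykernel stores kernel names in lowercase):

```python
# Minimal sketch: list the kernels Jupyter knows about and confirm
# that the Conda environment registered above ("firstEnv") is there.
# Assumes jupyter_client, which is installed alongside Jupyter.
from jupyter_client.kernelspec import KernelSpecManager

specs = KernelSpecManager().find_kernel_specs()  # {kernel_name: directory}
for name, path in specs.items():
    print(name, "->", path)

# ipykernel lowercases kernel names on install
assert "firstenv" in specs, "firstEnv kernel not found"
```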

Apache Spark Discretized Streams (DStreams) with Pyspark

SPARK STREAMING. What is Streaming? Try to imagine this: every single second, nearly 9,000 tweets are sent, 1,000 photos are uploaded to Instagram, over 2,000,000 emails are sent and nearly 80,000 searches are performed, according to Internet Live Stats. That much data is generated non-stop from many sources and sent on to other systems simultaneously, in small packages. Many applications also produce continuously updated data: sensors used in robotics, vehicles and many other industrial and electronic devices stream data for monitoring progress and performance. That is why the huge volume of data generated every second has to be processed and analyzed rapidly, in real time, and that is what "Streaming" means. DStreams: Spark DStream (Discretized Stream) is the basic abstraction of Spark Streaming. A DStream is a continuous stream of data. The data stream receives input from different kinds of sources like Kafka, Kinesis...
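To make the idea concrete, here is a minimal PySpark DStream sketch, assuming an illustrative socket source on localhost:9999 and a 5-second batch interval (neither comes from the post):

```python
# Minimal sketch: a DStream word count over a socket text source.
# The host/port and batch interval are illustrative assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamWordCount")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's counts to stdout

ssc.start()
ssc.awaitTermination()
```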

Bet you didn’t know this about Airflow!

We are living in the Airflow era. Almost all of us started our scheduling journey with cron jobs, and the transition to a workflow scheduler like Airflow has given us better handling of complex, inter-dependent pipelines, UI-based scheduling, a retry mechanism, alerts and what not! AWS also recently announced managed Airflow workflows. These are truly exciting times: Airflow has really changed the scheduling landscape, with scheduling configuration as code. Let's dig deeper. Now, coming to a use case where I really dug into Airflow's capabilities. For the use case below, all references to task mean Airflow tasks. The Use case: The above DAG consists of the following operations: Start an AWS EMR cluster: EMR is an AWS-based big data environment. To understand the use case we don't need to deep-dive into how AWS EMR works, but if you want to, you can read more about it here. Airflow task_id for this operation: EMR_start...
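To make the structure concrete, here is a minimal sketch of such a DAG; the dag_id, schedule and callable are hypothetical placeholders, not the post's actual code:

```python
# Minimal sketch of a DAG whose first task starts an EMR cluster.
# dag_id, schedule and the callable body are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def start_emr_cluster(**context):
    # Placeholder: in a real pipeline this would call the EMR API
    # (e.g. via boto3) and push the cluster id to XCom for later tasks.
    pass


with DAG(
    dag_id="emr_pipeline",            # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    start_cluster = PythonOperator(
        task_id="EMR_start_cluster",  # hypothetical task_id
        python_callable=start_emr_cluster,
    )
```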