Part 1 — Getting Started with Apache Airflow

Jyoti Sachdeva
5 min read · May 24, 2021

Hey, welcome to the Airflow series: Part 1 — Getting Started with Apache Airflow.

In this blog we will cover what Airflow is, its pros and cons, why it exists, and when to use it and when not to. We will deep dive into its concepts in later parts of the series.

Without delay, let’s get started.

As widely and correctly said:

Necessity is the mother of invention

If we do not face an issue, we will never try to find its solution. The same happened with Airflow.

In 2015, Airbnb faced a problem. They had a massive amount of data that was gradually increasing. They wanted to automate processes by writing scheduled batch jobs.

Cron had a lot of issues. What were they?

1. Error handling: What if a job fails and we want to retry it a few times at certain intervals?

2. Tracking your jobs: It was difficult to track jobs and to see why one was taking so long. There was no user interface.

3. What if your job has complex dependencies and you want to run one task only after certain other tasks have completed successfully? Dependencies between tasks were hidden.

4. Logs were not organized in a central place. They might be distributed among several servers and applications, and if something went wrong there was no easy way of tracking down the root cause.

5. What if you want to run a historic (backfill) run, say to reprocess data that is a year old? Would you be changing all the environment values by hand?

6. Monitoring your tasks to see how they performed required a separate setup.

That is a lot, isn't it?

To overcome all these challenges, Maxime Beauchemin created Airflow with the idea that it would allow them to quickly author, schedule, and monitor batch data pipelines.

Airflow is not a data streaming solution.

The project joined the Apache Software Foundation’s Incubator program in March 2016 and the Foundation announced Apache Airflow as a Top-Level Project in January 2019.

All the disadvantages of cron, and many more, became the advantages of Airflow.

In short we can say:

Apache Airflow is an open-source platform to define, schedule, and monitor workflows.

We can define a workflow as any sequence of steps taken to achieve a specific goal.

Apache Airflow is written in Python.

Advantages of Airflow:

  • Error handling: With Airflow, we can easily handle errors and apply a retry mechanism, defining how many retries and at what intervals very easily (see the sketch after this list).
  • Job tracking: Airflow provides a nice user interface, and tracking a job through its logs, start and end times, and metadata, as well as sending alerts on success and failure, is very convenient.
  • Dependencies: With Airflow, you can define complex upstream and downstream dependencies very easily.
  • Logs: We can easily see the logs for each task, and Airflow also supports remote logging; for example, the logs can be stored in S3.
  • Historic runs: Airflow provides metadata through the task context, which includes the start date and execution date, so running a historic (backfill) run is quite simple.
  • Scalable: Airflow scales up and down easily. It can be deployed on a single server or scaled up to large deployments with numerous nodes, for example on Kubernetes.
  • Dynamic: Airflow is built in Python and allows us to write custom code.
  • Extensible: It allows us to write custom operators, plugins, and executors. Do not worry, we will discuss these terms later in the series.
  • Configurable: It is designed under the principle of "configuration as code". Airflow has a config file, airflow.cfg, which exposes a lot of configuration options, from controlling parallelism, the logs location, and email configuration to logging redirection and much more. We will discuss this file in detail later.
  • Monitoring abilities: We can view the status of our tasks from the user interface.
  • Community: Airflow has a large and active community.
  • Security: Airflow ships with roles such as admin and viewer that carry different levels of permissions, down to resource-level permissions. It also supports different ways of authenticating users, such as web-level password authentication, LDAP, and Kerberos, and it uses Fernet to encrypt passwords stored in connections and variables.
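To make the first few points concrete, here is a minimal sketch of a DAG that declares retries, failure alerts, a dependency between two tasks, and a callable that reads the execution date from the context. It assumes Airflow 2.x; the task callables, DAG id, and alert address are hypothetical and only for illustration.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(ds, **kwargs):
    # Airflow passes the task context, so `ds` is the run's execution date
    # as a string; this is what makes historic/backfill runs simple.
    print(f"extracting data for {ds}")


def load(**kwargs):
    print("loading data into the target table")


default_args = {
    "retries": 3,                         # retry a failed task three times
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between retries
    "email_on_failure": True,             # alerting (assumes SMTP is configured in airflow.cfg)
    "email": ["alerts@example.com"],      # hypothetical address
}

with DAG(
    dag_id="example_retries_and_dependencies",
    start_date=datetime(2021, 5, 24),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Downstream dependency: load runs only after extract succeeds.
    extract_task >> load_task
```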

With pros come the cons as well.

Disadvantages of Airflow:

1. Airflow has a nice way of handling dependencies, but it does not allow tasks to share data (they can still share small pieces of metadata), which pushes you towards non-atomic tasks.

2. Deploying Airflow for production is not easy. There are a lot of executors (we will discuss them in detail later), but for production scalability either the Celery or the Kubernetes executor is used, and setting them up is not an easy task. It includes setting up proper logging, workers, the scheduler, the webserver, the metadata database, and the airflow.cfg file. We might also need a monitoring system such as Prometheus/Grafana to watch the Airflow cluster. Do not be scared of the terms, we are going to discuss them. Though nowadays we also have some managed workflow services.

3. The scheduler is a bottleneck in itself. It sometimes takes minutes to pick a task from the queue for processing, which results in delays.

4. Some concepts in Airflow can be tricky for a newcomer to understand, such as the start date and schedule interval. Many would believe that if a DAG has a start date of 24-05-2021 at 11 AM, it starts running on 24-05-2021 at 11 AM; in fact, the first run is only triggered once the first schedule interval after the start date has passed.
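A hedged sketch of that gotcha (Airflow 2.x semantics; the DAG id and task are made up): with a daily schedule, the run for a given interval is triggered only after that interval ends, so the DAG below does not run at its start date.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

# With schedule_interval="@daily" and start_date 2021-05-24, the first run
# (execution_date 2021-05-24) is triggered only around midnight on 2021-05-25,
# i.e. after the first full interval has passed, not at the start date itself.
with DAG(
    dag_id="start_date_gotcha",
    start_date=datetime(2021, 5, 24),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    DummyOperator(task_id="noop")
```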

When and when not to use Airflow?

Airflow was not designed to execute workloads directly inside Airflow itself, but to schedule them and keep the heavy execution within external systems.

Airflow is a good fit if your tasks store data to some external system, or are commands that perform an action on an external system, such as a spark-submit job, a Snowflake transformation, or a computation whose result is stored in HDFS/Hive. In short, the data your task works on does not need to be passed to the next task; each task writes its output to some external storage. All Airflow does is issue the correct command at the specified time and in the specified order. Now the main question pops up: what should a single task do?
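As a sketch of that orchestration role (assuming Airflow 2.x with spark-submit available on the worker; the script path and master are hypothetical), a task can simply issue the external command and let the cluster do the heavy lifting:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Airflow only issues the command at the right time and in the right order;
# the actual computation happens on the external Spark cluster.
with DAG(
    dag_id="trigger_spark_job",
    start_date=datetime(2021, 5, 24),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit = BashOperator(
        task_id="spark_submit",
        # Hypothetical job script and master, for illustration only.
        bash_command="spark-submit --master yarn /opt/jobs/transform.py",
    )
```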

A task is basically a step in your workflow. For example, say you want to extract some data from an API and load it into a table. You could have two tasks here: one that extracts the data from the API and another that stores it in the table. But tasks cannot share data. Oops.

Here we can have a single task that does both. We generally design our workflows so that each task is a single self-contained step. For example, your first task takes some data from a Hive table, spark-submits it, and after some computations stores the result in HDFS. The next task then reads the data from HDFS that the first task stored, performs some computations, and writes its output back to HDFS. This makes sense because big data cannot be passed between tasks, and we have a level of dependency: the second task requires data from the first. Until the first task has completed, we should not start the second one.
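For the API example above, here is a hedged sketch of a single task that both extracts and loads, so nothing has to be passed between tasks (the endpoint is hypothetical and the load step is only indicated):

```python
from datetime import datetime

import requests  # assumed to be available in the worker environment

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_load():
    # Extract from a hypothetical API and load in the same task,
    # so no data needs to be passed between tasks.
    rows = requests.get("https://api.example.com/orders", timeout=30).json()
    # A real pipeline would insert `rows` into the target table here;
    # we only indicate where that step goes.
    print(f"would load {len(rows)} rows into the target table")


with DAG(
    dag_id="extract_and_load_single_task",
    start_date=datetime(2021, 5, 24),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)
```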

Also, if the first task fails, Airflow will not execute the second one.

Hope you have enjoyed reading the blog.

For Part 2: Basic Terminologies in Airflow, visit https://jyotisachdeva57.medium.com/part-2-basic-terminologies-in-apache-airflow-1c060a638970

Thank you
