Part 2 — Basic Terminologies in Apache Airflow

Jyoti Sachdeva
5 min read · May 25, 2021

In Part 1 (https://jyotisachdeva57.medium.com/part-1-getting-started-with-apache-airflow-f6c8bd946b91), we introduced what Apache Airflow is.

In this part, we will go through some of the basic terms in Airflow.

Let’s get started.

We will map these concepts to an example: you want to throw a surprise birthday party for a friend, so you will follow a series of steps.

DAG, Task and Operator:

DAG stands for Directed Acyclic Graph. A DAG represents a workflow.

A workflow is a series of steps you follow to achieve a specific goal. Each step of the workflow is a task, and the DAG defines the workflow.

To throw the party, you will invite friends, put up some decorations and order a cake first, and then you will be ready for the party. It is a series of steps you follow in a certain direction to achieve your goal, in this case a birthday party.

Inviting friends is a task, doing the decorations is a task, and ordering a cake is a task. These tasks can run in parallel or in sequence. Only after all of them succeed can the party itself succeed.

Directed means each relationship between tasks has a direction, either upstream or downstream. These relationships are also called dependencies.

invite_friends >> party

or

invite_friends.set_downstream(party)

This is a downstream dependency: only after the invite_friends task completes successfully can party run.

party << invite_friends

or

party.set_upstream(invite_friends)

For party to happen, the invite_friends task must complete successfully.

[invite_friends, decorations, order_cake] >> party

invite_friends, decorations and order_cake can run in parallel, and party happens only after all of them succeed. That is the default Airflow behavior, the all_success trigger rule. It can be changed, for example to require at least one upstream success (one_success).
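Putting this together, here is a minimal sketch of the party workflow as an Airflow 2 DAG. The dag id, task names and callables are illustrative, not part of the original example:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Illustrative callables; in a real DAG these would do actual work.
def _invite_friends():
    print("Inviting friends")

def _decorations():
    print("Putting up decorations")

def _order_cake():
    print("Ordering the cake")

def _party():
    print("Party time!")

with DAG(
    dag_id="birthday_party",
    start_date=datetime(2021, 5, 1),
    schedule_interval=None,  # triggered manually, no schedule
    catchup=False,
) as dag:
    invite_friends = PythonOperator(task_id="invite_friends", python_callable=_invite_friends)
    decorations = PythonOperator(task_id="decorations", python_callable=_decorations)
    order_cake = PythonOperator(task_id="order_cake", python_callable=_order_cake)
    party = PythonOperator(task_id="party", python_callable=_party)

    # party runs only after all three upstream tasks succeed (default trigger rule: all_success)
    [invite_friends, decorations, order_cake] >> party

If party should instead run as soon as at least one upstream task succeeds, you could pass trigger_rule="one_success" to the party operator.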

Acyclic means you can’t create loops or deadlocks. Task1 >> Task2 is fine.

Task1 >> Task2 >> Task1 is not. Airflow will report Cycle detected in DAG. Faulty task: Task2 to Task1. It seems easy to spot with only two tasks, but in reality a workflow can be quite complex.

Graph means tasks are represented as vertices and their relationships as directed edges.

A DAG defines how the tasks are executed, but not what each particular task does. While DAGs define the workflow, operators define the work.

Operators determine what actually gets done by a task; an operator instance becomes a single task in the workflow.

Tasks represent the work done at each step of your workflow, and the actual work they perform is defined by operators.

For example, invite_friends is a task, while the operator defines the work behind it: deciding which friends to invite, finding their contact details and inviting them.

An operator is atomic in the sense that it can be executed on its own and does not need to share data with other operators; invite_friends can run independently of order_cake.

We can also create our own custom operators in Airflow. Operators are of three types (a short sketch follows the list):

1. Sensor: keeps checking at a time interval until a certain condition is met, otherwise it times out.

2. Transfer: moves data from one system to another, for example MySQL to Hive.

3. Action: executes something, such as PythonOperator for running a Python function or BashOperator for bash commands.
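As a rough sketch of the sensor and action types, this is how they might be declared; the file path, connection and commands below are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="operator_types_demo",
    start_date=datetime(2021, 5, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Sensor: checks for the file every poke_interval seconds and
    # fails after timeout seconds if the file never appears.
    # Assumes the default fs_default filesystem connection exists.
    wait_for_guest_list = FileSensor(
        task_id="wait_for_guest_list",
        filepath="/tmp/guest_list.csv",  # placeholder path
        poke_interval=30,
        timeout=10 * 60,
    )

    # Action operators: run a bash command and a Python function.
    order_cake = BashOperator(
        task_id="order_cake",
        bash_command="echo 'ordering cake'",
    )
    invite_friends = PythonOperator(
        task_id="invite_friends",
        python_callable=lambda: print("inviting friends"),
    )

    wait_for_guest_list >> [order_cake, invite_friends]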

Task instance and DAG run: An individual run of a single task is a task instance. An individual execution/run of a DAG is a DAG run.

We define the tasks in a DAG; when the DAG is executed, a DAG run is created. It carries the dag id and the execution date, which together identify each unique run. Each DAG run can consist of multiple tasks, and every run of these tasks is referred to as a task instance.

The states a task instance can go through are listed below (we will discuss the scheduler in a minute).

No Status (scheduler created empty task instance) -> Scheduled (scheduler determined task instance needs to run) -> Queued (scheduler sent the task to the queue — to be run) -> Running (worker picked up a task and is now executing it) -> Success/ Failure/ Up for Retry.

If a task is up for retry, it goes through the cycle again depending on your retry interval: scheduled, queued, running and then Success/Failure/Up for Retry. This repeats until the task succeeds or the maximum retry limit is reached, at which point it is marked as failed.
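Retries are configured per task; a minimal sketch with illustrative values:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="retry_demo",
    start_date=datetime(2021, 5, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Retry up to 3 times, waiting 5 minutes between attempts,
    # before the task instance is finally marked as failed.
    order_cake = BashOperator(
        task_id="order_cake",
        bash_command="echo 'ordering cake'",
        retries=3,
        retry_delay=timedelta(minutes=5),
    )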

Web Server: Airflow’s user interface.

It also allows us to manage users, roles, connections, pools, variables and different configurations for the Airflow setup.

It shows the status of jobs and allows the user to interact with the database and read log files from remote file stores, like S3, Google Cloud Storage, etc.

We will take a complete UI tour in the next part of the series.

Metadata Database: Stores the DAG and task states. Airflow uses SQLAlchemy, an Object Relational Mapper (ORM) written in Python, to connect to the metadata database. It powers how the other components interact. The scheduler polls the metadata database to find out which tasks/DAGs need to be executed and adds them to the queue. The database also stores information like schedule intervals, statistics from each run, and task instances. The webserver reads data from the metadata database to show in the user interface. So all processes read from and write to it.

The state of the DAGs and their associated tasks is saved in the database to ensure the scheduler has up-to-date metadata.

The default metadata database is SQLite; for production we use another metadata database such as PostgreSQL.
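As a rough example, in Airflow 2.0 the metadata database is selected via the SQLAlchemy connection string in airflow.cfg; the user, password, host and database name below are placeholders:

# airflow.cfg (placeholder values)
[core]
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow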

Scheduler, worker and Executor: The scheduler is responsible for scheduling jobs. It is a multithreaded Python process that uses the DAG objects to decide what tasks need to be run, when and where.

The task state is retrieved from the metadata database and updated accordingly. The scheduler keeps polling the dags directory and the metadata database, checks each DAG’s start date and schedule interval, and adds the tasks that are due to the queue. The executor picks up tasks from the queue and decides how the work gets done.

The scheduler also has an internal component called the executor. The executor is responsible for spinning up workers and running the task to completion; workers run the tasks handed over by the executor. There are different types of executors. The default is the SequentialExecutor, which runs only one task at a time, but for production other executors such as the KubernetesExecutor are generally used. With the LocalExecutor (a single-node setup), the workers run on the same node; in a multi-node setup, workers are distributed across all nodes.
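The executor is also chosen in airflow.cfg; a minimal sketch for switching from the default to the LocalExecutor:

# airflow.cfg
[core]
executor = LocalExecutor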

XCom (cross-communication): We said that data cannot be shared between tasks, but metadata can be. But…

XCom provides a way of sharing messages or data between different tasks. At its core, XCom is a table that stores key-value pairs while also keeping track of which pair was provided by which task and DAG. Use it only for small amounts of data or task metadata, for example which file the first task has processed, and only when sharing that data is unavoidable.
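A minimal sketch of pushing and pulling an XCom value between two tasks; the task ids and the filename are illustrative:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def _pick_file(ti):
    # Push a small piece of metadata for downstream tasks.
    ti.xcom_push(key="processed_file", value="guest_list.csv")

def _report(ti):
    # Pull the value pushed by the upstream task.
    filename = ti.xcom_pull(task_ids="pick_file", key="processed_file")
    print(f"First task processed: {filename}")

with DAG(
    dag_id="xcom_demo",
    start_date=datetime(2021, 5, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    pick_file = PythonOperator(task_id="pick_file", python_callable=_pick_file)
    report = PythonOperator(task_id="report", python_callable=_report)
    pick_file >> report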

Hooks: allow us to interface with third-party systems. They’re like building blocks for operators. With hooks, we can connect to external databases and APIs, such as MySQL.
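As a rough sketch, a hook might be used inside a Python callable like this, assuming the MySQL provider package is installed and a connection with id mysql_default is configured in Airflow; the query is a placeholder:

from airflow.providers.mysql.hooks.mysql import MySqlHook

def fetch_guest_count():
    # Uses the Airflow connection registered under "mysql_default".
    hook = MySqlHook(mysql_conn_id="mysql_default")
    records = hook.get_records("SELECT COUNT(*) FROM guests")  # placeholder query
    print(f"Guests in the database: {records[0][0]}")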

For setting up Airflow 2.0, visit https://jyotisachdeva57.medium.com/part-3-setting-up-airflow-2-0-via-docker-360a93c85428

Thanks for reading!
