Part 3: Setting up Airflow 2.0 via Docker

Jyoti Sachdeva · May 28, 2021

Hi, Welcome back.

In the previous part, https://jyotisachdeva57.medium.com/part-2-basic-terminologies-in-apache-airflow-1c060a638970, we discussed the basic terminologies in Airflow.

In this part of the tutorial, we will discuss how to run Airflow 2.0 locally.

There are two ways to install Airflow locally:

1. pip: https://airflow.apache.org/docs/apache-airflow/stable/start/local.html

2. Docker

We will install the latest Airflow 2.0 via Docker.

Airflow ships with SQLite as its default database (not recommended for production, since it does not handle parallel task execution) and the SequentialExecutor (which runs only one task at a time). We will use PostgreSQL and the LocalExecutor instead.

For the Docker route, the prerequisites are Docker (https://docs.docker.com/engine/install/) and Docker Compose (https://docs.docker.com/compose/install/).

Let’s get started:

We will create a directory structure like the one below:
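Roughly, the layout we will end up with looks like this (the name of the top-level folder is just a placeholder; the individual files are described below):

airflow-docker/
├── dags/
│   └── pipeline.py
├── scripts/
│   └── airflow-entrypoint.sh
├── .env
├── docker-compose.yaml
└── Dockerfile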

dags: an important folder; every DAG definition that you place under the dags directory is picked up by the scheduler.

scripts: contains a file called airflow-entrypoint.sh, in which we place the commands that we want to execute when the Airflow container starts.

.env is the file that we will use to supply environment variables.

docker-compose.yaml starts up the multiple containers (webserver, scheduler and metadata database) with the right dependencies between them.

Dockerfile is where we define the base image to pull and the extra libraries to install.

Let’s discuss each one of them in detail.

Dockerfile

FROM apache/airflow
USER root
ARG AIRFLOW_HOME=/opt/airflow
# Copy our local dags directory into the image
ADD dags /opt/airflow/dags
RUN pip install --upgrade pip
RUN chown -R airflow:airflow $AIRFLOW_HOME
# Switch back to the airflow user before installing Python libraries
USER airflow
RUN pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org boto3

apache/airflow is our base image, and the default Airflow home folder is /opt/airflow. We then add the dags directory from our machine to /opt/airflow/dags inside the container. This is also where we can install any Python libraries our DAGs need (boto3 in this example).

.env

AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS=False
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
AIRFLOW__CORE__FERNET_KEY=81HqDtbqAywKSOumSha3BhWNOdQ26slT6K0YaZeZyPs=
AIRFLOW_CONN_METADATA_DB=postgres+psycopg2://airflow:airflow@postgres:5432/airflow
AIRFLOW_VAR__METADATA_DB_SCHEMA=airflow
AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC=5
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW_VAR_SAMPLE=SampleVar

We use the LocalExecutor so that we can see multiple tasks running in parallel.

The scheduler heartbeat interval is 5 seconds. The Fernet key is used to encrypt sensitive values (such as connection passwords) in the metadata database. We have also defined our metadata database connection; Airflow talks to the metadata database through a library called SQLAlchemy. Environment variables with the AIRFLOW_VAR_ prefix (like AIRFLOW_VAR_SAMPLE above) become Airflow Variables, which is how the example DAG later reads the sample variable.
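If you would rather generate your own Fernet key than reuse the one above, a minimal sketch looks like this (it assumes the cryptography package is available, e.g. via pip install cryptography):

from cryptography.fernet import Fernet

# Prints a new random key you can paste into AIRFLOW__CORE__FERNET_KEY
print(Fernet.generate_key().decode())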

scripts/airflow-entrypoint.sh

#!/usr/bin/env bash
# Initialize/upgrade the metadata database schema
airflow db init
airflow db upgrade
# Create an admin user so we can log in to the webserver
airflow users create -r Admin -u admin -e jyotisachdeva8957@gmail.com -f jyoti -l sachdeva -p admin
# Run the scheduler in the background and the webserver in the foreground
airflow scheduler &
airflow webserver

These are the commands that will be executed when the Airflow container starts.

The database is initialized, and to log in to the webserver we need to create our first user.

Then we start the scheduler in the background and, finally, the webserver in the foreground.

docker-compose.yaml

version: "2.1"
services:
postgres:
image: postgres:12
environment:
- POSTGRES_USER=airflow
- POSTGRES_PASSWORD=airflow
- POSTGRES_DB=airflow
ports:
- "5434:5432"
scheduler:
build:
context: .
dockerfile: Dockerfile
restart: on-failure
command: scheduler
entrypoint: ./scripts/airflow-entrypoint.sh
depends_on:
- postgres
env_file:
- .env
ports:
- "8794:8793"
volumes:
- ./dags:/opt/airflow/dags
- ./airflow-logs:/opt/airflow/logs
- ./scripts:/opt/airflow/scripts
healthcheck:
test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
interval: 30s
timeout: 30s
retries: 3
webserver:
build:
context: .
dockerfile: Dockerfile
hostname: webserver
restart: always
depends_on:
- postgres
command: webserver
env_file:
- .env
volumes:
- ./dags:/opt/airflow/dags
- ./scripts:/opt/airflow/scripts
- ./airflow-logs:/opt/airflow/logs
ports:
- "8088:8080"
entrypoint: ./scripts/airflow-entrypoint.sh
healthcheck:
test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
interval: 30s
timeout: 30s
retries: 32

The first service is postgres, which we are using as our metadata database.

We are mapping port 5432 of the postgres container to port 5434 on our local machine.
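If you want to double-check from your machine that the metadata database is reachable on that mapped port, a quick sketch like the one below works (it assumes psycopg2-binary is installed locally; the host, port and credentials come straight from docker-compose.yaml and .env):

import psycopg2

# Connect to the postgres container through the port mapped to the host (5434 -> 5432)
conn = psycopg2.connect(
    host="localhost",
    port=5434,
    user="airflow",
    password="airflow",
    dbname="airflow",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()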

The second service is the scheduler.

It is built from our Dockerfile, and it depends on the postgres service, so the metadata database container is started first. Keep in mind that depends_on only waits for the container to start, not for Postgres to actually accept connections, which is why restart: on-failure is useful here.

The airflow-logs folder will be created on our local machine; all the task logs will be stored there.

Last, we have the webserver. It takes its environment variables from the .env file and executes the commands from airflow-entrypoint.sh when the container starts.

Now, what goes inside the dags folder is not the scope of this blog; we will discuss it later.

I am just pasting an example here, so do not worry about the details of the file; we will go through it in the next blogs. One thing to note: the DAG assigns its tasks to a pool called custom_pool, which needs to be created from the Airflow UI (Admin -> Pools) before those tasks can run.

dags/pipeline.py

import logging
import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.models import Variable
from datetime import timedelta

sample = Variable.get("sample")

logging.basicConfig(format="%(name)s-%(levelname)s-%(asctime)s-%(message)s", level=logging.INFO)
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def create_dag(dag_id):
    default_args = {
        "owner": "jyoti",
        "description": (
            "DAG to explain airflow concepts"
        ),
        "depends_on_past": False,
        "start_date": datetime.datetime(2021, 6, 1),
        "retries": 1,
        "retry_delay": timedelta(minutes=1),
        "provide_context": True,
        "pool": "custom_pool"
    }
    new_dag = DAG(
        dag_id,
        default_args=default_args,
        schedule_interval=timedelta(minutes=60),
    )

    def party(**kwargs):
        logger.info("party")
        print(sample)

    def order_cake(**kwargs):
        logger.info("order_cake")

    def decorations(**kwargs):
        logger.info("decorations")

    def invite_friends(**kwargs):
        logger.info("Invite Friends")
        return "xcom reply"

    with new_dag:
        invite_friends = PythonOperator(task_id='invite_friends',
                                        python_callable=invite_friends,
                                        provide_context=True)
        decorations = PythonOperator(task_id='decorations',
                                     python_callable=decorations,
                                     provide_context=True)
        order_cake = PythonOperator(task_id='order_cake',
                                    python_callable=order_cake,
                                    provide_context=True)
        party = PythonOperator(task_id='party',
                               python_callable=party,
                               provide_context=True)
        [invite_friends, decorations, order_cake] >> party
    return new_dag


dag_id = "birthday"
globals()[dag_id] = create_dag(dag_id)

Now, let’s get the latest Airflow version running.

docker-compose -f docker-compose.yaml up --build

Airflow is up and running!

The Airflow webserver’s default port is 8080, and we are mapping the container’s port 8080 to port 8088 on our machine.

Go to: http://localhost:8088
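You can also sanity-check the webserver from a script. Airflow exposes a /health endpoint that does not require a login; a small sketch using the requests library (assuming it is installed on your machine) could look like this:

import requests

# The container's port 8080 is mapped to 8088 on the host (see docker-compose.yaml)
resp = requests.get("http://localhost:8088/health")
resp.raise_for_status()
# The response reports the status of the metadatabase and the scheduler
print(resp.json())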

Airflow is up. Now we can log in using the user that we created in airflow-entrypoint.sh.

The username is admin and the password is admin.

Hope you have enjoyed reading the blog!

For complete UI tour of Airflow 2.0, visit https://jyotisachdeva57.medium.com/part-4-airflow-2-0-ui-tour-74da1eae711d

Thank you:)
