Thursday 17 October 2024

How to Install Apache Airflow on Docker

 Apache Airflow is a powerful open-source platform for orchestrating complex workflows and data pipelines. With its ability to programmatically author, schedule, and monitor workflows, it has become a go-to tool for data engineers and analysts. Running Airflow on Docker provides an easy way to set up and manage Airflow instances without needing to worry about the underlying infrastructure. In this article, we’ll guide you through the process of installing Apache Airflow on Docker.

Prerequisites

Before you begin, ensure you have the following:

  1. Docker: Make sure you have Docker installed on your machine. You can download and install Docker from Docker's official website.

  2. Docker Compose: This tool is used for defining and running multi-container Docker applications. It usually comes bundled with Docker Desktop installations; a quick way to verify both tools is shown after this list.

  3. Basic Knowledge of Docker: Familiarity with Docker commands and concepts will help you understand the installation process better.
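
To confirm both tools are available, you can check their versions from a terminal:

bash
docker --version
docker-compose --version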

Step 1: Set Up the Airflow Directory

  1. Create a directory for your Airflow installation:

    bash
    mkdir airflow-docker
    cd airflow-docker
  2. Inside this directory, create a docker-compose.yaml file. This file will define the services, networks, and volumes used by your Airflow setup.
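
  3. Create the dags, logs, and plugins subfolders that the compose file in the next step will mount into the Airflow containers:

    bash
    mkdir dags logs plugins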

Step 2: Create the Docker Compose File

Below is a basic docker-compose.yaml configuration to get you started with Apache Airflow:

yaml
version: '3.8'

services:
  airflow-webserver:
    image: apache/airflow:2.6.0
    restart: always
    command: webserver
    environment:
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__CORE__FERNET_KEY=${FERNET_KEY}
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=${SQL_ALCHEMY_CONN}
      - AIRFLOW__WEBSERVER__SECRET_KEY=${SECRET_KEY}
    ports:
      - "8080:8080"
    depends_on:
      - airflow-scheduler
      - airflow-postgres
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./plugins:/opt/airflow/plugins

  airflow-scheduler:
    image: apache/airflow:2.6.0
    restart: always
    command: scheduler
    environment:
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__CORE__FERNET_KEY=${FERNET_KEY}
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=${SQL_ALCHEMY_CONN}
    depends_on:
      - airflow-postgres
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./plugins:/opt/airflow/plugins

  airflow-postgres:
    image: postgres:13
    restart: always
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:

Explanation of the Compose File

  • airflow-webserver: This service runs the Airflow web server, which provides a user interface for monitoring and managing workflows.

  • airflow-scheduler: This service schedules the execution of workflows.

  • airflow-postgres: This service uses PostgreSQL as the backend database for storing metadata.

  • Volumes: The ./dags, ./logs, and ./plugins folders are bind-mounted into both Airflow containers, while a named volume (postgres_data) persists the database across container restarts.
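
Before launching anything, it is worth asking Docker Compose to parse the file and print the resolved configuration; this catches YAML indentation mistakes and unset environment variables early:

bash
docker-compose config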

Step 3: Set Environment Variables

Create a .env file in the same directory as your docker-compose.yaml file to store environment variables:

bash
# .env
FERNET_KEY=your_fernet_key_here
SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@airflow-postgres/airflow
SECRET_KEY=your_secret_key_here

You can generate a Fernet key using the following Python script:

python
from cryptography.fernet import Fernet

print(Fernet.generate_key().decode())
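
If you prefer to work from the shell, the same Fernet call can be run as a one-liner, and Python's secrets module works well for the webserver secret key (both commands assume Python 3 with the cryptography package installed on your host):

bash
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
python -c "import secrets; print(secrets.token_hex(16))"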

Step 4: Start the Airflow Services

With everything set up, you can start the Airflow services using Docker Compose:

bash
docker-compose up -d

This command will pull the required Docker images and start the containers in detached mode.
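
On the first run the image pull can take a few minutes. You can then check that the webserver, scheduler, and database containers are all up, and follow a container's logs if something looks wrong:

bash
docker-compose ps
docker-compose logs -f airflow-webserver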

Step 5: Initialize the Database

Once the services are running, you need to initialize the Airflow database. You can do this with the following command:

bash
docker-compose exec airflow-webserver airflow db init

This command initializes the database with the necessary tables.
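
The minimal compose file used here does not create a web login automatically, so you also need to create an admin account. The username, password, and email below are placeholders chosen to match the credentials used in the next step; change them for anything beyond local experimentation:

bash
docker-compose exec airflow-webserver airflow users create \
    --username airflow \
    --password airflow \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com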

Step 6: Access the Airflow UI

Open your web browser and navigate to http://localhost:8080. You should see the Apache Airflow login page. Log in with the admin account created in Step 5; in this guide that is:

  • Username: airflow
  • Password: airflow

Step 7: Create Your First DAG

To create your first Directed Acyclic Graph (DAG), place your Python scripts in the dags folder that you defined in the docker-compose.yaml file. Here’s a simple example DAG:

python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2024, 1, 1),
}

with DAG('example_dag', default_args=default_args, schedule_interval='@daily') as dag:
    # Two no-op tasks wired into a simple start -> end dependency
    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    start >> end

Place this script in a file named example_dag.py in the dags directory.
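
The scheduler rescans the dags folder periodically, so the new DAG should appear within a minute or so. Note that newly created DAGs start out paused. Assuming you want to unpause example_dag and kick off a manual run right away, you can do so with the Airflow CLI inside the webserver container (unpausing will also backfill the daily schedule back to the start_date, which is harmless here since the tasks are no-ops):

bash
docker-compose exec airflow-webserver airflow dags list
docker-compose exec airflow-webserver airflow dags unpause example_dag
docker-compose exec airflow-webserver airflow dags trigger example_dag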

Step 8: Monitor and Manage Your Workflows

You can now use the Airflow UI to monitor the execution of your DAGs, view logs, and manage tasks. As you become more familiar with Airflow, you can explore its extensive capabilities, including scheduling, retries, and task dependencies.
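
If you prefer the terminal, much of the same information is available from the container logs and the Airflow CLI, for example:

bash
docker-compose logs -f airflow-scheduler
docker-compose exec airflow-webserver airflow dags list-runs -d example_dag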

Conclusion

Installing Apache Airflow on Docker simplifies the process of setting up a powerful workflow orchestration tool. With the steps outlined in this article, you can easily create and manage your Airflow instance and start building your data pipelines. Enjoy orchestrating your workflows with Airflow!
