Thursday, 17 October 2024

How to Install Apache Airflow on Unix (Linux)

 Here's a guide to installing Apache Airflow on a Unix-based system. This guide will walk you through setting up Airflow using Python's pip package manager, configuring Airflow, and starting it to run workflows. Apache Airflow is a platform for orchestrating complex workflows and data pipelines, making it a powerful tool for automation and data processing.

Prerequisites

Before installing Airflow, make sure you have the following prerequisites:

  1. Python: Apache Airflow requires Python 3; the supported minor versions depend on the Airflow release (for example, 3.7, 3.8, or 3.9 for older 2.x releases), so check the release notes for the version you plan to install.
  2. pip: Ensure you have pip installed.
  3. Virtual Environment (recommended): It’s best to use a virtual environment to avoid conflicts with other Python packages on your system.
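
You can quickly confirm these prerequisites from the shell. A minimal check (the package manager commands later in this guide assume a Debian/Ubuntu-style system):

bash
# Confirm Python 3 and pip are available
python3 --version
pip3 --version

# Confirm the venv module is available (provided by python3-venv on Debian/Ubuntu)
python3 -m venv --help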

Step 1: Set Up a Python Virtual Environment

Setting up a virtual environment will help you manage dependencies independently from the system-wide packages.

bash
# Update the package list and install Python and pip if not already installed
sudo apt update
sudo apt install python3 python3-pip python3-venv -y

# Create a directory for Airflow and navigate to it
mkdir airflow-install
cd airflow-install

# Create a virtual environment
python3 -m venv airflow_venv

# Activate the virtual environment
source airflow_venv/bin/activate

Step 2: Install Apache Airflow

Apache Airflow installation requires setting the AIRFLOW_HOME environment variable, which designates where Airflow should store its files (configurations, logs, etc.).

bash
# Set AIRFLOW_HOME environment variable
export AIRFLOW_HOME=~/airflow

# Install Apache Airflow using pip
pip install apache-airflow

# Install additional packages for PostgreSQL and MySQL (optional, as needed)
# Quoting the extras avoids glob errors in shells such as zsh
pip install 'apache-airflow[postgres]'
pip install 'apache-airflow[mysql]'

Note: You can also install other providers or packages based on your use case, such as for AWS, Google Cloud, or Kubernetes. Use pip install apache-airflow[extra] where extra is the desired integration.
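
If pip runs into dependency resolution problems, the Airflow project also publishes constraint files that pin known-good dependency versions. Here is a sketch of that approach; the Airflow and Python versions below are only placeholders, so substitute the versions you are actually installing:

bash
# Example: install a specific Airflow version with the official constraints file
# (2.9.3 and 3.8 are placeholder versions -- adjust both to your setup)
AIRFLOW_VERSION=2.9.3
PYTHON_VERSION=3.8
pip install "apache-airflow==${AIRFLOW_VERSION}" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"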

Step 3: Initialize Airflow Database

Airflow uses a database to keep track of metadata, task instances, and other details. By default, it uses SQLite for development.

bash
# Initialize the Airflow database
airflow db init

This command creates the necessary tables in the database specified by the configuration.
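
SQLite is fine for trying Airflow out, but for anything more serious you would normally point Airflow at PostgreSQL or MySQL before initializing. A minimal sketch, assuming a local PostgreSQL database and placeholder credentials (on recent Airflow versions the setting lives under [database]; on older releases it is under [core]):

bash
# Example: point Airflow at PostgreSQL via an environment variable
# (user, password, host, and database name below are placeholders)
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow"

# Initialize against the new database
airflow db init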

Step 4: Create an Admin User for Airflow Web UI

To access the Airflow web UI, you need to create an admin user.

bash
# Create an Airflow admin user
airflow users create \
    --username admin \
    --firstname FIRST_NAME \
    --lastname LAST_NAME \
    --role Admin \
    --email admin@example.com

You will be prompted to enter a password. Once completed, you can use this account to log into the Airflow UI.
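
To confirm the account was created, you can list the registered users:

bash
# List Airflow users
airflow users list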

Step 5: Start the Airflow Services

You need to start the Airflow web server and scheduler for Airflow to run. These are the two essential components of Airflow:

  1. Web Server: Provides the Airflow UI.
  2. Scheduler: Handles scheduling tasks according to DAGs (Directed Acyclic Graphs).

bash
# Start the Airflow web server
airflow webserver --port 8080

Open a new terminal window, activate the virtual environment, and start the scheduler:

bash
# Start the scheduler
airflow scheduler

The Airflow web server should now be running on http://localhost:8080, and you can log in with the credentials you set up earlier.
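
If you would rather not keep two terminals open, both services can also be run in the background with the -D (daemon) flag available in Airflow 2.x; PID files are then written under $AIRFLOW_HOME:

bash
# Run the web server and scheduler as background daemons
airflow webserver --port 8080 -D
airflow scheduler -D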

Step 6: Create and Run Your First DAG

  1. Navigate to $AIRFLOW_HOME/dags (create the dags folder if it does not exist yet) and create a new Python file for your DAG.
  2. Use the Airflow UI to monitor the DAG’s execution status and logs.

Example DAG

Here’s a quick example of a DAG that prints "Hello World":

python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta


def print_hello():
    print("Hello World")


default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 10, 17),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'hello_world',
    default_args=default_args,
    description='A simple Hello World DAG',
    schedule_interval=timedelta(days=1),
)

hello_task = PythonOperator(
    task_id='hello_task',
    python_callable=print_hello,
    dag=dag,
)

Place this Python file in the dags folder, and it will automatically appear in the Airflow UI once Airflow detects the file.
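
You can also exercise the DAG from the command line before relying on the scheduler. For example (the date below is just an illustrative logical date):

bash
# Confirm the DAG file was parsed
airflow dags list

# Run a single task once, outside the scheduler, for a given logical date
airflow tasks test hello_world hello_task 2023-10-17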

Step 7: Stop Airflow

When you are done, you can stop the Airflow services by pressing Ctrl + C in the terminal where the web server and scheduler are running.
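
If you started the services with the -D flag instead, they keep running in the background. A sketch of stopping them via the PID files Airflow writes to $AIRFLOW_HOME (the file names can vary slightly between versions):

bash
# Stop daemonized services using their PID files
kill $(cat $AIRFLOW_HOME/airflow-webserver.pid)
kill $(cat $AIRFLOW_HOME/airflow-scheduler.pid)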

Troubleshooting Tips

  • Database Errors: Ensure SQLite or the chosen database service is running correctly and is accessible.
  • Dependency Conflicts: Installing Airflow in a virtual environment helps prevent dependency issues.
  • Port Conflicts: The default Airflow web server runs on port 8080. If this is in use, specify a different port using --port (see the example below).
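
For example, to see what is already listening on port 8080 and start the web server on another port (8081 is an arbitrary choice):

bash
# Check what is using port 8080, then pick a free port
lsof -i :8080
airflow webserver --port 8081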

Conclusion

Now you have a running instance of Apache Airflow on your Unix-based system. You can start building and scheduling workflows, automate data pipelines, and integrate with various external systems. Remember to secure the web server appropriately if deploying in a production environment.
