Apache Airflow introduction and best practices


Introduction to Apache Airflow

What is Airflow?

# Originally developed at 'Airbnb'; now an Apache incubator project.
# An Airflow pipeline is a 'Python script' that defines an Airflow DAG object. This object is then used in Python to code the ETL process (see the sketch below).
# Airflow uses 'Jinja Templating', which provides built-in parameters and macros (Jinja is a templating language for Python, modelled after Django's templates).
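
A minimal sketch of such a script (assuming a 1.x-era Airflow, in line with the incubator links below; the DAG name and command are made up for illustration):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # The DAG object ties tasks, schedule and dependencies together.
    dag = DAG(
        dag_id='example_etl',                 # hypothetical DAG name
        start_date=datetime(2017, 1, 1),
        schedule_interval=timedelta(days=1),  # run once a day
    )

    # '{{ ds }}' is a built-in Jinja macro: the execution date as YYYY-MM-DD.
    extract = BashOperator(
        task_id='extract',
        bash_command='echo "extracting data for {{ ds }}"',
        dag=dag,
    )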

DAGs

# Hooks (Connections)
# Operators (tasks)
# Schedule
# Dependencies (the order in which tasks run; see the sketch after this list)
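
A sketch of how these pieces fit together (assuming a 1.x-era Airflow, 1.8+ for the >> syntax, and a Postgres connection configured in the UI; 'my_postgres' and the query are hypothetical):

    from datetime import datetime

    from airflow import DAG
    from airflow.hooks.postgres_hook import PostgresHook
    from airflow.operators.python_operator import PythonOperator

    dag = DAG('hook_demo', start_date=datetime(2017, 1, 1), schedule_interval='@daily')

    def count_events():
        # A Hook wraps a Connection configured in the Airflow UI.
        hook = PostgresHook(postgres_conn_id='my_postgres')
        return hook.get_first('SELECT count(*) FROM events')

    extract = PythonOperator(task_id='extract', python_callable=count_events, dag=dag)
    report = PythonOperator(task_id='report', python_callable=lambda: print('done'), dag=dag)

    # Dependencies: 'report' runs only after 'extract' succeeds.
    extract >> report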

Airflow Components

1. UI (check the current status of running DAGs, restart a DAG, clear task state)
2. Scheduler (periodically wakes up, sees what needs to run, and tells the executor to run it)
3. Executor(s) (execute the tasks)
    # Sequential (runs only one task at a time; don't use it in production, change the Airflow config first; see the config sketch after this list)
    # Local (allows tasks to run in parallel on a single machine)
    # Celery (a small Python library for building distributed task queues on top of a message broker such as RabbitMQ)
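
Switching executors is a one-line change in airflow.cfg (a sketch; values are illustrative and the broker URL is a placeholder):

    [core]
    # SequentialExecutor -> LocalExecutor -> CeleryExecutor,
    # in increasing order of parallelism
    executor = LocalExecutor

    [celery]
    # only needed for the Celery executor
    broker_url = amqp://guest:guest@localhost:5672//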

Airflow process pipelines

1. ETL Pipelines
2. Machine Learning Pipelines
3. Predictive Data Pipelines
    # Fraud Detection, Scoring/Ranking, Classification, Recommender Systems
4. General Job Scheduling (e.g. as a cron replacement; see the sketch after this list)
    # DB Back-ups, Scheduled code/config deployment
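
An illustration of the cron-style scheduling in item 4 (a sketch; the backup command and paths are hypothetical):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # Standard cron syntax works as a schedule: 02:00 every night.
    dag = DAG('nightly_db_backup', start_date=datetime(2017, 1, 1),
              schedule_interval='0 2 * * *')

    backup = BashOperator(
        task_id='backup',
        # '{{ ds }}' stamps the dump file with the run date
        bash_command='pg_dump mydb > /backups/mydb_{{ ds }}.sql',
        dag=dag,
    )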

Alternatives to Airflow

1. 'Apache Oozie' is a workflow scheduler system to manage Apache Hadoop jobs. Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie coordinator jobs are recurrent Oozie workflow jobs triggered by time (frequency) and data availability.
2. 'Azkaban' is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.
3. 'Luigi' by Spotify is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in (see the sketch after this list).
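
For comparison, a minimal Luigi pipeline looks like this (a sketch; task names and file paths are made up):

    import luigi

    class Extract(luigi.Task):
        def output(self):
            return luigi.LocalTarget('data/raw.csv')

        def run(self):
            with self.output().open('w') as f:
                f.write('id,value\n1,42\n')

    class Transform(luigi.Task):
        # Dependency resolution: Luigi runs Extract first if its output is missing.
        def requires(self):
            return Extract()

        def output(self):
            return luigi.LocalTarget('data/clean.csv')

        def run(self):
            with self.input().open() as src, self.output().open('w') as dst:
                dst.write(src.read().upper())

    if __name__ == '__main__':
        luigi.build([Transform()], local_scheduler=True)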

Comparing Airflow with Oozie and Azkaban

1. Airflow is written in Python, and DAGs are defined in Python as well.
2. Oozie and Azkaban are pretty tightly coupled with Hadoop (Azkaban somewhat less so), whereas Airflow can be used with 'Google Cloud' or 'Amazon Redshift' as the data warehousing solution.
3. Airflow's ETL features are unique and powerful, in particular the way it models execution time and individual runs.
4. The community around Airflow is really strong.

Advanced features

1. 'XCom' allows you to persist small pieces of state between tasks and runs, a useful step up from 'Azkaban' (see the sketch after this list).
2. 'Variables' are set via the UI, like environment variables, and can be accessed from DAGs (also usable in distributed mode).
3. 'Branching' toggles the flow of a DAG to change its logic.
4. 'Pools' restrict groups of tasks (e.g. no more than 10 such tasks may run at the same time).
5. 'Queues' to route tasks to specific workers.
6. 'SLAs': the ability to define 'if this task takes more than an hour, let me know'.
7. 'Triggers' (sensors) that wait for data to arrive.
8. 'Backfill' runs a task over historical data (e.g. run this task over the last three years of data to get a daily snapshot).
9. Hooks/Operators
10. 'Templating' uses the 'Jinja' Python templating system.
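
A sketch tying a few of these together, XCom, Variables, Branching and an SLA (assuming a 1.x-era Airflow; the task names, the 'row_threshold' Variable and the numbers are made up):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.models import Variable
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.python_operator import BranchPythonOperator, PythonOperator

    dag = DAG('advanced_demo', start_date=datetime(2017, 1, 1), schedule_interval='@daily')

    def push_row_count(**context):
        # XCom: persist a small piece of state for downstream tasks.
        context['ti'].xcom_push(key='row_count', value=1000)

    def choose_path(**context):
        # Variables: read a value set via the UI; Branching: return the id of the next task.
        threshold = int(Variable.get('row_threshold', default_var=500))
        rows = context['ti'].xcom_pull(task_ids='count_rows', key='row_count')
        return 'heavy_path' if rows > threshold else 'light_path'

    count = PythonOperator(task_id='count_rows', python_callable=push_row_count,
                           provide_context=True, dag=dag,
                           sla=timedelta(hours=1))  # SLA: flag the run if it takes over an hour
    branch = BranchPythonOperator(task_id='branch', python_callable=choose_path,
                                  provide_context=True, dag=dag)
    heavy = DummyOperator(task_id='heavy_path', dag=dag)
    light = DummyOperator(task_id='light_path', dag=dag)

    count >> branch
    branch >> heavy
    branch >> light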

Helpful links

1. https://wecode.wepay.com/posts/wepays-data-warehouse-bigquery-airflow
2. https://wecode.wepay.com/posts/airflow-wepay
3. https://airflow.incubator.apache.org/
4. https://github.com/apache/incubator-airflow
