In previous posts, I have explained the basics of Airflow and how to set up Airflow on Azure. I haven't, however, covered the considerations you should weigh when using Airflow.
I see five primary considerations to keep in mind when using Airflow:
These considerations will dictate how you and your team will be using Airflow and how it will be managed.
Setting up and maintaining Airflow isn't so easy. If you need to set it up yourself, you will most likely need quite a bit more than the base image:
For the simplest use cases, it is possible to rely solely on the Local executor. Still, once real processing needs arise, more distributed computation is required, and managing the infrastructure becomes more complicated.
These distributed setups also require more resources to run than a Local executor setup, where the worker, scheduler, and web server can live in the same container:
The higher number of components raises complexity and makes it harder to maintain the platform and debug problems, requiring that you understand how the Celery executor works with Airflow or how to interact with Kubernetes.
Managed versions of Airflow exist: Google Cloud offers Cloud Composer, Astronomer.io offers managed deployments, and Qubole offers Airflow as part of its data platform. Where applicable, it is more than recommended to go for a managed version rather than setting up and managing this infrastructure yourself.
Depending on your use case, you might want to use certain sensors, hooks, or operators. While Airflow has decent support for the most common operators, and good support for Google Cloud, a more uncommon use case will probably require you to check the list of user-contributed operators or develop your own.
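To give an idea of what developing your own looks like, here is a minimal sketch of a custom operator. The internal export API it wraps is purely hypothetical, and the import paths and `apply_defaults` decorator reflect Airflow 1.x:

```python
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class InternalApiExportOperator(BaseOperator):
    """Hypothetical operator exporting a table to an internal API."""

    # Fields rendered through Jinja templating before execution
    template_fields = ("table", "export_date")

    @apply_defaults
    def __init__(self, table, export_date, *args, **kwargs):
        super(InternalApiExportOperator, self).__init__(*args, **kwargs)
        self.table = table
        self.export_date = export_date

    def execute(self, context):
        # Replace this with calls to your own client or hook; this is only a sketch.
        self.log.info("Exporting %s for %s", self.table, self.export_date)
```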
Understanding how to use operators also depends on your particular company setup. Some take a radical stance on which operators should be used, but in reality the choice of operators needs to be made in the context of your company.
Selecting your operator setup is not one-size-fits-all.
There are quite a few ways to architect your DAGs in Airflow, but as a general rule, it is good to keep them simple. Keep within a DAG only the tasks that are truly dependent on each other; when dealing with dependencies across multiple DAGs, abstract them into another DAG and file.
When dealing with a lot of data sources and interdependencies, things can get messy. Setting up DAGs as self-contained files, kept as simple as possible, can go a long way toward making your code maintainable. The external task sensor helps separate DAGs and their dependencies into multiple self-contained DAGs.
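As a sketch, a downstream DAG can wait on a task in a separate, self-contained DAG with the external task sensor. The DAG and task ids here are made up, import paths reflect Airflow 1.x, and by default the sensor waits on the same execution date in the upstream DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

dag = DAG(
    dag_id="reporting",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Wait for a task that lives in a separate, self-contained DAG file.
wait_for_ingest = ExternalTaskSensor(
    task_id="wait_for_ingest",
    external_dag_id="ingestion",
    external_task_id="load_events",
    dag=dag,
)

build_report = BashOperator(
    task_id="build_report",
    bash_command="python build_report.py --date {{ ds }}",
    dag=dag,
)

wait_for_ingest >> build_report
```

This keeps the reporting DAG file free of ingestion logic while still respecting the cross-DAG dependency.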
As in most distributed systems, it is important to make operations as idempotent as possible, at least within a DAG run. Certain operations that span DAG runs may rely on the depends_on_past setting.
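A small sketch of both points, using a hypothetical load script that overwrites the same daily partition, and the depends_on_past flag set in default_args (import paths reflect Airflow 1.x):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "depends_on_past": True,  # a task instance runs only if its previous run succeeded
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="daily_partition_load",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Idempotent by construction: re-running for the same {{ ds }} overwrites the
# same daily partition rather than appending duplicates (the script is hypothetical).
load_partition = BashOperator(
    task_id="load_partition",
    bash_command="python load_partition.py --date {{ ds }} --mode overwrite",
    dag=dag,
)
```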
Sub-DAGs should be used sparingly, for the same reason of code maintainability. For me, one of the only valid reasons for using Sub-DAGs is the creation of dynamic DAGs.
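For dynamic DAGs, an alternative to Sub-DAGs is the common pattern of generating one DAG per item of a configuration list at parse time. This is only a sketch, with made-up source names and script:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

SOURCES = ["orders", "customers", "events"]

for source in SOURCES:
    dag_id = "ingest_{}".format(source)

    dag = DAG(
        dag_id=dag_id,
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
    )

    BashOperator(
        task_id="ingest",
        bash_command="python ingest.py --source {} --date {{{{ ds }}}}".format(source),
        dag=dag,
    )

    # Registering each DAG in the module's global namespace makes it visible
    # to the Airflow scheduler.
    globals()[dag_id] = dag
```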
Communication between tasks, although possible with XCom, should be minimized in favor of self-contained functions/operators. Self-contained tasks keep the code more legible and stateless; unless you need the ability to re-run only one part of the operation, there is little justification for XCom. Dynamic DAGs are one of the notable exceptions to this.
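For completeness, here is what minimal XCom usage looks like with the PythonOperator: the upstream task's return value is pushed automatically, and the downstream task pulls it (provide_context and the import paths reflect Airflow 1.x):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="xcom_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)


def extract(**context):
    # The return value is pushed to XCom. XCom lives in the metadata
    # database, so keep payloads small (counts, paths, ids).
    return 42


def report(**context):
    row_count = context["ti"].xcom_pull(task_ids="extract")
    print("rows extracted: {}".format(row_count))


extract_task = PythonOperator(
    task_id="extract",
    python_callable=extract,
    provide_context=True,
    dag=dag,
)

report_task = PythonOperator(
    task_id="report",
    python_callable=report,
    provide_context=True,
    dag=dag,
)

extract_task >> report_task
```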
Airflow leverages Jinja for templating. Commands such as Bash or SQL commands can easily be templated for execution, with variables filled in or computed from the context. Templates can provide a more readable alternative to direct string manipulation in Python (e.g., through a format call). Jinja is the default templating engine familiar to most Flask developers, and it can also provide a good bridge for Python web developers getting into data.
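For example, a Bash command can be templated with the execution date, built-in macros, and user-supplied params (the table name here is made up, and import paths reflect Airflow 1.x):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="templating_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Rendered by Jinja at runtime: {{ ds }} is the execution date,
# macros.ds_add shifts it, and params carries user-supplied values.
templated_command = """
echo "run date: {{ ds }}"
echo "seven days earlier: {{ macros.ds_add(ds, -7) }}"
echo "loading table: {{ params.table }}"
"""

templated_task = BashOperator(
    task_id="templated_task",
    bash_command=templated_command,
    params={"table": "events"},
    dag=dag,
)
```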
Macros provide a way to take further advantage of templating by exposing objects and functions to the templating engine. Users can leverage a set of default macros, or define their own at a global or DAG level.
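A short sketch of a DAG-level custom macro, assuming a hypothetical data_path helper:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator


def data_path(ds):
    # Hypothetical helper exposed to templates as a custom macro.
    return "/data/raw/{}".format(ds)


dag = DAG(
    dag_id="custom_macro_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    user_defined_macros={"data_path": data_path},
)

list_files = BashOperator(
    task_id="list_files",
    bash_command="ls {{ data_path(ds) }}",
    dag=dag,
)
```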
Using templated code does, however, take you away from vanilla Python and adds one more layer of complexity for engineers who typically already need to juggle a large array of technologies and APIs.
Whether or not you choose to leverage templates is a team or personal choice. There are more traditional ways to obtain the same results, for example by wrapping the same logic in Python format calls, but templating can make the code more legible.
Airflow's REST API allows for the creation of event-driven workflows. The key feature of the API is to let you trigger DAG runs with a specific configuration:
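For instance, a DAG run with a configuration payload can be triggered with a simple HTTP call. The endpoint below is the pre-2.0 experimental API, and the DAG id and conf values are made up; in Airflow 2.x the stable API (POST /api/v1/dags/&lt;dag_id&gt;/dagRuns, with authentication) replaces it:

```python
import json

import requests

AIRFLOW_URL = "http://localhost:8080"  # adjust to your deployment
dag_id = "process_uploaded_file"       # hypothetical DAG

# Trigger a DAG run with a custom configuration payload.
response = requests.post(
    "{}/api/experimental/dags/{}/dag_runs".format(AIRFLOW_URL, dag_id),
    data=json.dumps({"conf": {"file_path": "/uploads/2019-06-01/events.csv"}}),
    headers={"Content-Type": "application/json"},
)
response.raise_for_status()
print(response.json())
```

The conf payload is then available to the DAG's tasks through the dag_run context, so the same DAG code can serve both scheduled and event-driven runs.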
The REST API allows for building data product applications on top of Airflow, with use cases such as:
Leveraging the REST API allows for the construction of complex asynchronous processing patterns while reusing the same architecture, platform, and possibly code that are used for more traditional data processing.