In previous posts, I have explained the basics of Airflow and how to set up Airflow on Azure. I haven't, however, covered the considerations you should weigh when using Airflow.
I see five primary considerations to keep in mind when using Airflow:
These considerations will dictate how you and your team will be using Airflow and how it will be managed.
Setting up and maintaining Airflow isn't so easy. If you need to set it up yourself, you will most likely need quite a bit more than the base image:
For the simplest use cases, it is possible to rely solely on the Local executor. Still, once real processing needs arise, more distributed computation is required, and managing the infrastructure becomes more complicated.
These distributed setups also require more resources to run than a Local executor setup, where the worker, scheduler, and web server can live in the same container:
The higher number of components raises complexity and makes it harder to maintain the platform and debug problems, requiring that you understand how the Celery executor works with Airflow or how to interact with Kubernetes.
Managed versions of Airflow exist: Google Cloud offers Cloud Composer, Astronomer.io offers managed deployments, and Qubole offers Airflow as part of its data platform. Where applicable, it is more than recommended to go for a managed version rather than setting up and managing this infrastructure yourself.
Depending on your use case, you might want to use certain sensors, hooks, or operators. While Airflow has decent support for the most common operators, and good support for Google Cloud, a more uncommon use case will probably require you to check the list of user-contributed operators or develop your own.
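To give an idea of what developing your own looks like, here is a minimal sketch of a custom operator. The internal export API it wraps is purely hypothetical, and the import paths and `apply_defaults` decorator reflect Airflow 1.x:

```python
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class InternalApiExportOperator(BaseOperator):
    """Hypothetical operator exporting a table to an internal API."""

    # Fields rendered through Jinja templating before execution
    template_fields = ("table", "export_date")

    @apply_defaults
    def __init__(self, table, export_date, *args, **kwargs):
        super(InternalApiExportOperator, self).__init__(*args, **kwargs)
        self.table = table
        self.export_date = export_date

    def execute(self, context):
        # Replace this with calls to your own client or hook; this is only a sketch.
        self.log.info("Exporting %s for %s", self.table, self.export_date)
```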
Understanding how to use operators also depends on your particular company setup. Some take a radical stance on which operators should be used, but in reality the choice of operators needs to be made in the context of your company.
Selecting your operator setup is not one-size-fits-all.
There are quite a few ways to architect your DAGs in Airflow, but as a general rule, it is good to keep them simple. Keep within a DAG only the tasks that are truly dependent on each other; when dealing with dependencies across multiple DAGs, abstract them into another DAG and file.
When dealing with a lot of data sources and interdependencies, things can get messy. Setting up DAGs as self-contained files, kept as simple as possible, can go a long way toward making your code maintainable. The external task sensor helps separate DAGs and their dependencies into multiple self-contained DAGs.
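As a sketch, a downstream DAG can wait on a task in a separate, self-contained DAG with the external task sensor. The DAG and task ids here are made up, import paths reflect Airflow 1.x, and by default the sensor waits on the same execution date in the upstream DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

dag = DAG(
    dag_id="reporting",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Wait for a task that lives in a separate, self-contained DAG file.
wait_for_ingest = ExternalTaskSensor(
    task_id="wait_for_ingest",
    external_dag_id="ingestion",
    external_task_id="load_events",
    dag=dag,
)

build_report = BashOperator(
    task_id="build_report",
    bash_command="python build_report.py --date {{ ds }}",
    dag=dag,
)

wait_for_ingest >> build_report
```

This keeps the reporting DAG file free of ingestion logic while still respecting the cross-DAG dependency.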
As in most distributed systems, it is important to make operations as idempotent as possible, at least within a DAG run. Certain operations that span DAG runs may rely on the depends_on_past setting.
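A small sketch of both points, using a hypothetical load script that overwrites the same daily partition, and the depends_on_past flag set in default_args (import paths reflect Airflow 1.x):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "depends_on_past": True,  # a task instance runs only if its previous run succeeded
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="daily_partition_load",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Idempotent by construction: re-running for the same {{ ds }} overwrites the
# same daily partition rather than appending duplicates (the script is hypothetical).
load_partition = BashOperator(
    task_id="load_partition",
    bash_command="python load_partition.py --date {{ ds }} --mode overwrite",
    dag=dag,
)
```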
Sub-DAGs should be used sparingly, for the same reason of code maintainability. For me, one of the only valid reasons for using Sub-DAGs is the creation of dynamic DAGs.
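For dynamic DAGs, an alternative to Sub-DAGs is the common pattern of generating one DAG per item of a configuration list at parse time. This is only a sketch, with made-up source names and script:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

SOURCES = ["orders", "customers", "events"]

for source in SOURCES:
    dag_id = "ingest_{}".format(source)

    dag = DAG(
        dag_id=dag_id,
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
    )

    BashOperator(
        task_id="ingest",
        bash_command="python ingest.py --source {} --date {{{{ ds }}}}".format(source),
        dag=dag,
    )

    # Registering each DAG in the module's global namespace makes it visible
    # to the Airflow scheduler.
    globals()[dag_id] = dag
```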
Communication between tasks, although possible with XCom, should be minimized in favor of self-contained functions/operators. Self-contained tasks keep the code more legible and stateless; unless you need the ability to re-run only one part of the operation, there is little justification for XCom. Dynamic DAGs are one of the notable exceptions to this.
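For completeness, here is what minimal XCom usage looks like with the PythonOperator: the upstream task's return value is pushed automatically, and the downstream task pulls it (provide_context and the import paths reflect Airflow 1.x):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="xcom_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)


def extract(**context):
    # The return value is pushed to XCom. XCom lives in the metadata
    # database, so keep payloads small (counts, paths, ids).
    return 42


def report(**context):
    row_count = context["ti"].xcom_pull(task_ids="extract")
    print("rows extracted: {}".format(row_count))


extract_task = PythonOperator(
    task_id="extract",
    python_callable=extract,
    provide_context=True,
    dag=dag,
)

report_task = PythonOperator(
    task_id="report",
    python_callable=report,
    provide_context=True,
    dag=dag,
)

extract_task >> report_task
```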
Airflow leverages Jinja for templating. Commands such as Bash or SQL commands can easily be templated for execution, with variables filled in or computed from the context. Templates can provide a more readable alternative to direct string manipulation in Python (e.g., through a format call). Jinja is the default templating engine familiar to most Flask developers, and it can also provide a good bridge for Python web developers getting into data.
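For example, a Bash command can be templated with the execution date, built-in macros, and user-supplied params (the table name here is made up, and import paths reflect Airflow 1.x):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="templating_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Rendered by Jinja at runtime: {{ ds }} is the execution date,
# macros.ds_add shifts it, and params carries user-supplied values.
templated_command = """
echo "run date: {{ ds }}"
echo "seven days earlier: {{ macros.ds_add(ds, -7) }}"
echo "loading table: {{ params.table }}"
"""

templated_task = BashOperator(
    task_id="templated_task",
    bash_command=templated_command,
    params={"table": "events"},
    dag=dag,
)
```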
Macros provide a way to take further advantage of templating by exposing objects and functions to the templating engine. Users can leverage a set of default macros, or define their own at a global or DAG level.
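A short sketch of a DAG-level custom macro, assuming a hypothetical data_path helper:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator


def data_path(ds):
    # Hypothetical helper exposed to templates as a custom macro.
    return "/data/raw/{}".format(ds)


dag = DAG(
    dag_id="custom_macro_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    user_defined_macros={"data_path": data_path},
)

list_files = BashOperator(
    task_id="list_files",
    bash_command="ls {{ data_path(ds) }}",
    dag=dag,
)
```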
Using templated code does, however, take you away from vanilla Python and adds one more layer of complexity for engineers who typically already need to juggle a large array of technologies and APIs.
Whether or not you choose to leverage templates is a team or personal choice. There are more traditional ways to obtain the same results, for example by wrapping the same logic in Python format calls, but templating can make the code more legible.
Airflow's REST API allows for the creation of event-driven workflows. The key feature of the API is to let you trigger DAG runs with a specific configuration:
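For instance, a DAG run with a configuration payload can be triggered with a simple HTTP call. The endpoint below is the pre-2.0 experimental API, and the DAG id and conf values are made up; in Airflow 2.x the stable API (POST /api/v1/dags/&lt;dag_id&gt;/dagRuns, with authentication) replaces it:

```python
import json

import requests

AIRFLOW_URL = "http://localhost:8080"  # adjust to your deployment
dag_id = "process_uploaded_file"       # hypothetical DAG

# Trigger a DAG run with a custom configuration payload.
response = requests.post(
    "{}/api/experimental/dags/{}/dag_runs".format(AIRFLOW_URL, dag_id),
    data=json.dumps({"conf": {"file_path": "/uploads/2019-06-01/events.csv"}}),
    headers={"Content-Type": "application/json"},
)
response.raise_for_status()
print(response.json())
```

The conf payload is then available to the DAG's tasks through the dag_run context, so the same DAG code can serve both scheduled and event-driven runs.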
The REST API allows for building data product applications on top of Airflow, with use cases such as:
Leveraging the REST API allows for the construction of complex asynchronous processing patterns while reusing the same architecture, platform, and possibly code that are used for more traditional data processing.