Apache Airflow.
DAG.
Task.
Operator.
Sensor.
How Airflow works.
Apache Airflow is an open-source tool for authoring, scheduling, and monitoring data processing workflows.
It is widely used in large data processing systems to automate complex pipelines.
Airflow introduces concepts such as "DAG" (Directed Acyclic Graph), "Task," "Operator," and "Sensor" to describe data processing workflows.
In Airflow, workflows are defined as Directed Acyclic Graphs (DAGs).
A DAG is made up of the tasks to be executed together with their dependencies.
Every DAG represents a group of tasks you want to run, and it also shows the relationships between those tasks in the Apache Airflow user interface.
Some definitions:
Directed: Dependencies between tasks have a direction; each dependency connects a specific upstream task to a downstream task.
Acyclic: Tasks are not allowed to depend on themselves, directly or through a cycle. This rules out the possibility of an infinite loop.
Graph: Tasks sit in a logical structure with precisely defined relationships to the other tasks.
A DAG defines how the tasks will be executed, not what the individual tasks actually do.
When a DAG is executed, it is called a DAG run.
Suppose you have a DAG scheduled to run every hour. Every instantiation of that schedule creates a DAG run, and several DAG runs of the same DAG can be active at the same time.
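As a minimal sketch of how this looks in code, here is a DAG with two Bash tasks (the dag_id tutorial and the sleep task mirror the CLI examples at the end of this section; the BashOperator import path shown is the Airflow 1.x one, airflow.operators.bash in 2.x):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# The DAG only declares the schedule and the ordering of tasks;
# what each task does is defined by its operator.
with DAG(
    dag_id="tutorial",
    start_date=datetime(2022, 12, 1),
    schedule_interval="@hourly",  # every scheduled instantiation becomes a DAG run
    catchup=False,
) as dag:
    print_date = BashOperator(task_id="print_date", bash_command="date")
    sleep = BashOperator(task_id="sleep", bash_command="sleep 5")

    print_date >> sleep  # print_date must finish before sleep starts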
Tasks vary in complexity and are instantiations of operators. Think of them as units of work, represented by nodes in the DAG. They mark the work completed at each step of the workflow, while the actual work is defined by the operators.
In Apache Airflow, operators define the work. An operator is much like a class or a template for executing a specific task. All operators derive from BaseOperator. There are operators for a variety of common tasks, such as:
PythonOperator
BashOperator
MySqlOperator
EmailOperator
These operators specify actions to be executed with Python, Bash, MySQL, and email, respectively. Apache Airflow has three primary types of operators:
Operators that run until a specific condition is fulfilled (these are Airflow's sensors; see the sketch after this list)
Operators that execute an action or request a different system to execute an action
Operators that move data from one system to another
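As a rough sketch of the first two categories, here is a hypothetical DAG that pairs a sensor with an action operator (the DAG id, file path, and schedule are made up for illustration; the import paths are the Airflow 2.x ones and differ slightly in 1.x):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor

def process_file():
    # Placeholder action: a real pipeline would parse or load the file here.
    print("processing /tmp/incoming/data.csv")

with DAG(
    dag_id="sensor_example",
    start_date=datetime(2022, 12, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # A sensor: keeps checking until the file exists, then succeeds.
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/tmp/incoming/data.csv",
        poke_interval=60,  # re-check every 60 seconds
    )

    # An action operator: executes a Python callable.
    process = PythonOperator(task_id="process_file", python_callable=process_file)

    wait_for_file >> process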
Hooks enable Airflow to interface with third-party systems. With them, you can easily connect to outside APIs and databases such as Hive, MySQL, GCS, and many more. Hooks are essentially building blocks for operators. They contain no secured information themselves; credentials are stored in Airflow's encrypted metadata database.
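For illustration, a task's callable might use the MySQL hook roughly like this (the connection id mysql_default and the query are hypothetical; the import path shown is the Airflow 1.x one, while Airflow 2.x moves the hook into the MySQL provider package, airflow.providers.mysql.hooks.mysql):

from airflow.hooks.mysql_hook import MySqlHook

def fetch_recent_orders():
    # The hook looks up host, login, and password from the Airflow connection
    # named "mysql_default", stored in the encrypted metadata database,
    # so no credentials appear in the DAG file itself.
    hook = MySqlHook(mysql_conn_id="mysql_default")
    # The table and query are made up for illustration.
    return hook.get_records("SELECT id, total FROM orders LIMIT 10")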
Airflow excels at defining complex relationships between tasks. Say you want to specify that task T1 should be executed before T2. There are several equivalent statements you can use to define this relationship:
T2 << T1
T1 >> T2
T1.set_downstream(T2)
T2.set_upstream(T1)
To understand how Apache Airflow works, you need to know the four major components that make up this scalable and robust workflow scheduling platform:
Scheduler: The scheduler monitors all DAGs and their tasks. When a task's dependencies are met, the scheduler initiates it; it periodically checks for tasks that are ready to start.
Web Server: This is Airflow's user interface. It displays the status of tasks and lets the user interact with the metadata database and read log files from remote file stores, such as Google Cloud Storage, S3, Microsoft Azure blobs, and more.
Database: The state of the DAGs and their tasks is saved in the database so that the scheduler does not lose this metadata. Airflow uses SQLAlchemy and Object Relational Mapping (ORM) to connect to the metadata database. The scheduler evaluates all of the DAGs and stores important information, such as schedule intervals, statistics from each run, and task instances.
Executor: The executor determines how the work actually gets done. There are different executors for different use cases, such as the SequentialExecutor, LocalExecutor, CeleryExecutor, and KubernetesExecutor.
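The executor is selected through the executor setting in the [core] section of airflow.cfg (or the corresponding environment variable). As a small sketch, you can check the active value from Python:

from airflow.configuration import conf

# Prints the configured executor, e.g. SequentialExecutor, LocalExecutor,
# CeleryExecutor, or KubernetesExecutor.
print(conf.get("core", "executor"))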
Airflow evaluates all DAGs in the background at a fixed interval. This interval is set by the processor_poll_interval config option and defaults to one second. Once a DAG file is evaluated, DAG runs are created according to the scheduling parameters. Task instances are then instantiated for the tasks that need to run, and their status is set to SCHEDULED in the metadata database.
Next, the scheduler queries the database, retrieves the tasks that are in the SCHEDULED state, and distributes them to the executors; the tasks' state then changes to QUEUED. The executors pull the queued tasks from the queue, and when that happens, the task's status changes to RUNNING.
Once a task finishes, it is marked as either successful or failed, and the scheduler updates the final status in the metadata database.
To install Apache Airflow: pip install apache-airflow
To run the sleep task: airflow run tutorial sleep 2022-12-13
To list tasks in the tutorial DAG: airflow list_tasks tutorial
To pause the DAG: airflow pause tutorial
To unpause the tutorial: airflow unpause tutorial