What Is Airflow?
Let me be honest about how I first encountered Airflow. My team had a bunch of Python and SQL scripts running on cron jobs. One script would download data, another would load it into a staging database, a third one would transform it, and a fourth one would ingest it into our data warehouse. The scripts were chained together with some fragile shell glue. When things worked, nobody noticed. When things broke (and they broke often), figuring out where it broke and why was a nightmare.
Someone on the team said "we should use Airflow." I nodded like I understood, then spent the next three evenings reading documentation.
Here's what I wish someone had told me then: Airflow is a platform for authoring, scheduling, and monitoring workflows, and you define those workflows as Python code. That's it. No magic. You write a Python file describing what tasks to run and in what order, and Airflow takes care of the scheduling, retries, logging, and visibility.
The thing that took me a while to internalize is that Airflow is not a data processing engine. It doesn't move your data around. It orchestrates the things that do. It's the conductor, not the orchestra.
The DAG: Everything Starts Here
Before diving into the architecture, you need to understand what a DAG is, because the whole system revolves around it.
DAG stands for Directed Acyclic Graph. Sounds fancy. In practice it just means: a set of tasks with dependencies between them, where you can't have circular dependencies (task A depends on task B, which depends on task A: that would be a cycle, and Airflow won't allow it).
A simple DAG might look like this:
extract_data → transform_data → load_to_warehouse
You define this in a Python file. Each step is a Task: a single unit of work, like running a Python function, executing a SQL query, or calling an external API. The arrows between them are dependencies. Airflow reads your Python file, figures out the task ordering, and then knows how and when to execute things.
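To make the "directed acyclic" part concrete, here is a minimal sketch that uses only Python's standard library (graphlib). It is not Airflow code. The task names come from the example above, and the mapping format (task name to the set of tasks it depends on) is just an assumption for illustration. It shows how an execution order falls out of the dependencies, and how a cycle gets rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; the value is the set of tasks it depends on (its upstreams).
dependencies = {
    "extract_data": set(),
    "transform_data": {"extract_data"},
    "load_to_warehouse": {"transform_data"},
}

# A valid DAG always has at least one order that respects every dependency.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)

# Making the first task depend on the last one creates a cycle, so it is no longer a DAG.
cyclic = dict(dependencies, extract_data={"load_to_warehouse"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```

Airflow does its own validation when it parses a DAG file, but the idea is the same: dependencies must produce an order, and anything circular is an error.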
In Airflow 3.x, DAG files still live in a dags/ folder, but there have been some meaningful changes to how they're parsed and distributed. We'll get to that when we talk about the DAG Processor.
The Architecture: A Bird's Eye View
Before we look at each component individually, here's a high-level view of how they fit together:
Every component reads from or writes to the Metadata Database in some way. That database is the single source of truth for the entire system. Let's now go through each piece.
Why Understanding the Architecture Matters
If you're running Airflow locally, many of these components may live on the same machine and feel invisible.
So why spend time learning the architecture?
Because most real-world Airflow problems come down to understanding which component is responsible for what.
For example:
- DAG not showing up in the UI? Look at DAG parsing.
- Tasks stuck in queued state? Check the executor and workers.
- No new tasks are starting? Investigate the scheduler.
- UI errors or failed API requests? Check the API Server.
- Everything seems broken? The metadata database is often the first place to look.
You don't need to memorize every component. But having a mental model of how they fit together makes troubleshooting much easier as you move beyond simple local deployments.
The Metadata Database
I'm starting here because once you understand this, everything else makes sense.
The metadata database is a relational database, typically Postgres in production (please use Postgres, not SQLite outside of local dev). It stores:
- All your DAG definitions and their current state
- Every DAG Run: every time a DAG has executed or is scheduled to execute
- Every Task Instance: every individual task run, with its state (queued, running, success, failed), pointers to logs, and timing
- User information, connections, variables, and more
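Every other component in Airflow depends on this database, directly or indirectly. The Scheduler reads from it to decide what to run. Task state changes reported by Workers end up in it (in Airflow 3.x, workers report through the API Server rather than connecting to the database themselves). The API Server reads from it to give the UI everything it shows you. The database is the glue that holds the distributed system together.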
One practical implication: if your database is slow or down, your entire Airflow cluster is effectively crippled. Invest in your database setup.
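As a rough mental model of "the database is the shared state", the sketch below uses an in-memory SQLite table. The table and column names are a simplified, hypothetical stand-in, not Airflow's real schema (which has many more tables and columns), and the two helper functions are made up for this example. The point is that one piece of code records a state change and another, unrelated piece reads it, without the two ever talking to each other directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance ("
    " dag_id TEXT, task_id TEXT, run_id TEXT, state TEXT,"
    " PRIMARY KEY (dag_id, task_id, run_id))"
)

def set_state(dag_id, task_id, run_id, state):
    """Writer side: insert one task instance's state, or overwrite the previous one."""
    conn.execute(
        "INSERT OR REPLACE INTO task_instance VALUES (?, ?, ?, ?)",
        (dag_id, task_id, run_id, state),
    )

def states_for_run(dag_id, run_id):
    """Reader side: the kind of query a UI or scheduler needs to see a run's progress."""
    rows = conn.execute(
        "SELECT task_id, state FROM task_instance WHERE dag_id = ? AND run_id = ?",
        (dag_id, run_id),
    )
    return dict(rows.fetchall())

set_state("etl", "extract_data", "run_1", "queued")
set_state("etl", "extract_data", "run_1", "success")
set_state("etl", "transform_data", "run_1", "scheduled")
print(states_for_run("etl", "run_1"))
```

If this one table becomes slow or unreachable, both the writer and the reader are stuck, which is the toy version of why the database deserves the investment.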
The Scheduler
The Scheduler is the brain of the operation. It's a long-running process that continuously checks the metadata database, figures out which tasks are ready to run, and sends them off to be executed.
Here's a rough version of what it does in a loop:
- Check all DAG schedules: is there a DAG that was supposed to run in the last interval and hasn't started yet?
- Create DAG Runs and Task Instances in the metadata database for anything that needs to run
- Look at all Task Instances that are in a "scheduled" state and evaluate their dependencies: are the upstream tasks done?
- Move eligible tasks to "queued" state and submit them to the executor
The Scheduler doesn't run tasks itself. It decides that tasks should run and tells the executor to handle the actual execution. That's an important distinction.
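The dependency check in that loop can be sketched in a few lines of plain Python. This is a toy model under simplifying assumptions: in-memory dicts instead of the database, only the states named in this article, and a hypothetical function name (scheduling_pass). The real scheduler deals with much more, such as pools, priorities, and trigger rules.

```python
def scheduling_pass(states, upstream):
    """Move every 'scheduled' task whose upstreams all succeeded to 'queued'.

    states:   dict of task_id -> state string
    upstream: dict of task_id -> set of task_ids it depends on
    Returns the list of task_ids that were queued in this pass.
    """
    newly_queued = []
    for task_id, state in states.items():
        if state != "scheduled":
            continue
        if all(states[dep] == "success" for dep in upstream.get(task_id, set())):
            newly_queued.append(task_id)
    # Apply the changes after the scan so a pass never reacts to its own writes.
    for task_id in newly_queued:
        states[task_id] = "queued"
    return newly_queued


states = {
    "extract_data": "success",
    "transform_data": "scheduled",
    "load_to_warehouse": "scheduled",
}
upstream = {
    "transform_data": {"extract_data"},
    "load_to_warehouse": {"transform_data"},
}
print(scheduling_pass(states, upstream))
```

Notice that the function only changes states. Nothing in it runs a task, which mirrors the distinction above: deciding is the Scheduler's job, executing is someone else's.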
The Scheduler is also built for high availability. You can run multiple schedulers in an active-active setup (supported since Airflow 2.0), so if one scheduler process dies, another picks up the work.
One gotcha with the Scheduler: it needs up-to-date DAG metadata in order to schedule work. In Airflow 3.x, producing that metadata is the job of a dedicated DAG Processor component. More on that below.
The DAG Processor (New in Airflow 3.x)
This is one of the more significant architectural changes in Airflow 3.x, and one that beginners often don't realize exists.
In older versions of Airflow, the Scheduler also handled parsing your DAG files by default. This was problematic. If a DAG file had a slow import or a bug that caused it to hang, it could slow down the Scheduler itself. Not great.
In Airflow 3.x, DAG parsing has been extracted into its own DAG Processor component. It runs as a separate process, continuously scanning your dags/ directory, importing the Python files, and extracting DAG metadata (schedule, tasks, dependencies) into the metadata database.
The Scheduler then just reads from the database. It doesn't need to parse Python files directly anymore. Cleaner separation of concerns.
To be precise, a standalone DAG Processor already existed as an optional component in later 2.x releases. What changed in 3.x is that the scheduler no longer parses DAG files at all, so the DAG Processor runs as its own process even in a basic single-machine setup. The important idea is that, in 3.x, user DAG code stays out of the core scheduler process, and you can scale DAG parsing independently when you need to.
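Here is a small sketch of the contract a DAG processor fulfills: go through the files, record metadata for the ones that load, and record an error for the ones that don't, without letting one bad file stop the rest. Everything in it is hypothetical. The DAG_DEFINITION variable is a made-up convention for this example (real DAG files declare DAG objects), and a real processor parses files in separate processes with timeouts, while this runs everything in-process.

```python
import json

def parse_dag_source(filename, source):
    """Execute one DAG file's source and return (serialized_metadata, error_message)."""
    namespace = {}
    try:
        exec(compile(source, filename, "exec"), namespace)
        definition = namespace["DAG_DEFINITION"]  # hypothetical convention for this sketch
    except Exception as exc:
        # A broken file becomes a recorded error instead of a crash.
        return None, f"{type(exc).__name__}: {exc}"
    return json.dumps(definition, sort_keys=True), None


files = {
    "good_dag.py": "DAG_DEFINITION = {'dag_id': 'etl', 'schedule': '@daily', 'tasks': ['extract_data']}",
    "broken_dag.py": "import module_that_does_not_exist_anywhere",
}

serialized, import_errors = {}, {}
for name, source in files.items():
    result, error = parse_dag_source(name, source)
    if error is None:
        serialized[name] = result
    else:
        import_errors[name] = error

print(serialized)
print(import_errors)
```

The scheduler's side of the deal is that it only ever reads the serialized output, so a broken or slow file is the processor's problem, not the scheduler's.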
The API Server
Airflow 3.x introduced a proper standalone API Server. In earlier versions, the REST API was bundled into the Webserver.
The API Server exposes Airflow's REST API. It's what the UI uses to fetch DAG information, trigger runs, clear tasks, etc. It's also what you'd use if you want to trigger DAG runs programmatically from another system. CI/CD pipelines kicking off DAGs, external services triggering workflows: all of that goes through the API Server.
What changed from 2.x is that the REST API is now a first-class interface rather than a secondary feature attached to the Webserver. This separation gives Airflow a cleaner architecture and makes it easier to evolve the UI and API independently over time.
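To give a feel for what "triggering a DAG run from another system" looks like, here is a hedged sketch that builds the HTTP request without sending it. The path (/api/v2/dags/{dag_id}/dagRuns), the payload fields, and the bearer-token header reflect my understanding of the Airflow 3 REST API, so treat them as assumptions and check the API reference for your version. The base URL, DAG id, token, and the helper name build_trigger_request are all placeholders.

```python
import json
import urllib.request

def build_trigger_request(base_url, dag_id, token, conf=None):
    """Build (but do not send) a POST request asking Airflow to start a DAG run."""
    # Assumed payload shape: a null logical_date plus optional run configuration.
    body = {"logical_date": None, "conf": conf or {}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v2/dags/{dag_id}/dagRuns",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


request = build_trigger_request(
    "http://localhost:8080", "etl", token="<your-token>", conf={"source": "ci"}
)
print(request.get_method(), request.full_url)
# Sending it for real would be: urllib.request.urlopen(request)
```

A CI/CD job would do essentially this at the end of a deploy, with the token coming from its secret store.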
The Webserver (Airflow UI)
In Airflow 3.x, the traditional standalone webserver role has largely been replaced by a UI that communicates with the API Server.
You use the Airflow UI to:
- See all your DAGs and their last run status
- View the graph of a DAG: which tasks ran, which failed, which are running right now
- Look at logs for individual task runs
- Manually trigger DAG runs or clear failed tasks to retry them
- Manage connections and variables
In Airflow 3.x, the UI has been significantly revamped. It's faster and cleaner, and the Grid View (which replaced the old tree view back in the 2.x line) is much more useful for understanding the history of a DAG across many runs.
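The Airflow UI is stateless with respect to actual scheduling. It doesn't schedule anything. It's purely a read/write interface to the metadata database (via the API Server). If the UI is unavailable, your DAGs keep running: your visibility goes away, but the work continues. The API Server itself is a different story in 3.x, because workers also report task state through it, so an API Server outage can stall task execution as well.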
The Executor
The Executor is one of the most misunderstood Airflow concepts.
The Executor is not a separate process you run. It's a component within the Scheduler that determines how tasks actually get executed. Think of it as the strategy the Scheduler uses for dispatching work, not something you run independently.
There are a few executor types:
LocalExecutor: Tasks run as subprocesses on the same machine as the Scheduler. Simple, works well for small setups. Not suitable for real scale because everything shares one machine.
CeleryExecutor: Tasks are sent to a Celery task queue (backed by Redis or RabbitMQ), and picked up by Worker processes that can run on different machines. This is the classic horizontally scalable setup.
KubernetesExecutor: Each task spins up a new Kubernetes Pod to run in, and the pod is destroyed when the task finishes. Very clean isolation, great for containerized environments. More overhead per task but excellent for bursty workloads.
Some deployments use hybrid execution models that combine Celery workers with Kubernetes-based execution.
The choice of executor has massive implications for how you deploy and scale Airflow. Start with LocalExecutor if you're just learning. Move to CeleryExecutor or KubernetesExecutor as you grow.
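One way to picture "the executor is a strategy, not a service" is the toy sketch below. The class names are hypothetical and deliberately don't match Airflow's executor classes. They only illustrate that the dispatching code calls the same method either way, while where the work actually runs is different. I'm using a thread pool to keep the example portable; LocalExecutor itself uses subprocesses.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

class RunLocallyStrategy:
    """Runs work on the scheduler's own machine (in the spirit of LocalExecutor)."""

    def __init__(self, parallelism=2):
        self._pool = ThreadPoolExecutor(max_workers=parallelism)
        self.futures = []

    def execute(self, task_callable):
        self.futures.append(self._pool.submit(task_callable))


class SendToQueueStrategy:
    """Hands work to a queue for separate workers (in the spirit of CeleryExecutor)."""

    def __init__(self):
        self.task_queue = queue.Queue()

    def execute(self, task_callable):
        self.task_queue.put(task_callable)


def dispatch(executor, tasks):
    """The dispatching side: identical no matter which strategy is plugged in."""
    for task in tasks:
        executor.execute(task)


local = RunLocallyStrategy()
dispatch(local, [lambda: "extracted", lambda: "transformed"])
print([future.result() for future in local.futures])

remote = SendToQueueStrategy()
dispatch(remote, [lambda: "extracted"])
print(remote.task_queue.qsize())
```

With the first strategy the work is already done by the time you look. With the second, it just sits in a queue until some worker shows up, which is exactly why the queue-based executors need the Workers described next.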
The Workers
Workers are the things that actually execute your task code. They're separate processes (or pods, in the Kubernetes case) that pull tasks from the queue and run them. They only exist as separate processes if you're using CeleryExecutor or KubernetesExecutor. With LocalExecutor, there isn't a separate worker service: tasks run as subprocesses on the scheduler machine.
When a task runs on a Worker:
- The Worker picks up the task from the queue
- It executes your Python function (or Bash command, or Spark job, or whatever the operator does)
- It reports the result, success or failure, which is recorded in the metadata database (via the API Server in Airflow 3.x)
- It ships logs to wherever logs are configured to go (local filesystem, S3, GCS, etc.)
Workers don't need to know about your full DAG structure. They just need to know "run this task." All the DAG context they need is passed along with the task message.
One important thing: in a multi-worker setup, every worker machine needs access to the same DAG files. Otherwise, the worker won't be able to find the code it's supposed to run. This is commonly solved by mounting a shared filesystem or using Git Sync to pull DAGs onto every worker.
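The worker steps listed above boil down to a small loop. The sketch below is a toy version with hypothetical names (worker_loop, record_state) and an in-process queue standing in for Redis or RabbitMQ. A real worker blocks and waits for more work, while this one stops when the queue is empty so the example terminates.

```python
import queue

def worker_loop(task_queue, record_state):
    """Pull task messages until the queue is empty, run each one, and report the outcome."""
    while True:
        try:
            task_id, task_callable = task_queue.get_nowait()
        except queue.Empty:
            return
        record_state(task_id, "running")
        try:
            task_callable()
        except Exception:
            record_state(task_id, "failed")
        else:
            record_state(task_id, "success")


def failing_task():
    raise RuntimeError("source API is down")


history = []

def record_state(task_id, state):
    """Stand-in for reporting state back to Airflow; here it just appends to a list."""
    history.append((task_id, state))


task_queue = queue.Queue()
task_queue.put(("extract_data", lambda: None))
task_queue.put(("call_flaky_api", failing_task))

worker_loop(task_queue, record_state)
print(history)
```

Note that the worker never looks at dependencies. It gets a message, runs it, and reports back, and a failure in one task doesn't stop it from picking up the next one.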
The Triggerer
One component I haven't mentioned yet: the Triggerer. It's an optional process that handles deferred tasks, meaning tasks that are waiting on some external event (like a file landing in S3, or a sensor waiting for a condition) without occupying a worker slot while they wait. It runs their triggers in an asyncio event loop, which is far more efficient than a worker sitting idle. If you're not using deferrable operators, you don't strictly need it, but most production setups run it.
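If the asyncio part sounds abstract, this small sketch shows why it's efficient. It is plain asyncio, not Airflow's trigger API, and the function names are made up. A hundred "waits" share a single event loop, and none of them ties up a thread or a worker slot while it is idle.

```python
import asyncio

async def wait_for_event(name, delay_seconds):
    """Stand-in for a trigger: it waits without blocking the other waiters."""
    await asyncio.sleep(delay_seconds)
    return f"{name} fired"


async def run_triggers():
    # All waits run concurrently on one event loop, so total time is roughly
    # the longest single delay rather than the sum of all delays.
    waits = [wait_for_event(f"trigger_{i}", 0.01) for i in range(100)]
    return await asyncio.gather(*waits)


results = asyncio.run(run_triggers())
print(len(results), results[0])
```

Doing the same thing with a hundred sensors each holding a worker slot would need a hundred slots. That difference is the whole reason the Triggerer exists.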
Connections and Variables
These aren't really "architecture" in the strict sense, but they're part of the runtime infrastructure and worth understanding early.
Connections are how Airflow stores credentials and connection info for external systems: your database host/port/user/password, your AWS credentials, your Snowflake account info. When you use an operator or a hook (like SQLExecuteQueryOperator or S3Hook), it looks up a named connection from the metadata database, so you don't hardcode credentials in your DAG.
Variables are just key-value pairs stored in the metadata database. Useful for config values you want to change without editing DAG code.
Both connections and variables can be managed through the UI, the API, or environment variables (the last option being preferred in production from a secrets management perspective).
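For the environment-variable route, Airflow looks for variables named AIRFLOW_CONN_ followed by the upper-cased connection id, and AIRFLOW_VAR_ followed by the upper-cased variable key. The sketch below builds a connection URI and sets both kinds of variable. The connection id (warehouse_db), variable key (batch_size), host, and credentials are all made-up values, and connection_uri is a hypothetical helper. In a real deployment you would set these in the container or shell environment rather than from Python.

```python
import os
from urllib.parse import quote

def connection_uri(conn_type, login, password, host, port, schema):
    """Build a connection URI, percent-encoding credentials that contain special characters."""
    return (
        f"{conn_type}://{quote(login, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{schema}"
    )


# Hypothetical connection id "warehouse_db" and variable key "batch_size".
os.environ["AIRFLOW_CONN_WAREHOUSE_DB"] = connection_uri(
    "postgres", "etl_user", "p@ss/word", "db.example.com", 5432, "analytics"
)
os.environ["AIRFLOW_VAR_BATCH_SIZE"] = "500"

print(os.environ["AIRFLOW_CONN_WAREHOUSE_DB"])
```

The percent-encoding matters more than it looks: a password with an @ or a / in it will otherwise produce a URI that parses into the wrong host or path.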
Putting It All Together
Let's see what actually happens when a scheduled DAG runs, so all these pieces connect:
Your DAG file lives in the dags/ folder. The DAG Processor picks it up, parses it, and writes the DAG structure to the Metadata Database.
The Scheduler wakes up (it's running in a loop, typically every few seconds). It checks the database and sees that your DAG is scheduled to run at 2:00 AM. It's now 2:00 AM. The Scheduler creates a DAG Run record and individual Task Instance records in the database.
The Scheduler evaluates task dependencies. Task A has no upstream dependencies, so it becomes eligible. The Scheduler flips its state to "queued" and tells the Executor to execute it.
The Executor places Task A into the task queue (Redis/RabbitMQ when using CeleryExecutor).
An available Worker picks up Task A from the queue. It executes your Python function. If it succeeds, the Worker marks the Task Instance as "success" in the database.
Back to the Scheduler: on the next loop, it sees Task A is done. Task B was waiting on Task A. Now Task B's dependencies are satisfied, so it becomes eligible. The cycle repeats.
Throughout all of this, the API Server is answering requests from your browser, showing you the DAG Run and its task states in real time.
Everything is written to the Metadata Database along the way.
What if one of the components is down?
| Component | What breaks if it's down? |
|---|---|
| Scheduler | No new tasks start |
| Workers | Tasks stop executing |
| Metadata DB | Entire platform affected |
| API Server | UI/API operations fail; in 3.x, task state reporting is affected too |
| Triggerer | Deferrable tasks stop progressing |
A Note on Airflow 3.x Specifically
Airflow 3.x brought a bunch of changes, and if you're reading older tutorials, some things will look different. The key differences to be aware of:
The DAG authoring syntax got cleaner: the @dag and @task decorators (from the TaskFlow API, which dates back to 2.0) are first-class citizens and the recommended way to write most DAGs, and in 3.x you import them from airflow.sdk. The old Operator-heavy style still works, but new code should prefer TaskFlow.
The UI was rebuilt from scratch: it's a modern React app now, not the old Flask/Jinja UI. Grid view is the default and it's significantly better.
The Scheduler and DAG Processor are now clearly separated, as described above. Relevant if you're configuring a production deployment.
The REST API is now the proper interface for programmatic access: no more hacking around with CLI commands for automation.
Where to Go From Here
If you've followed this far, you understand the fundamental architecture. The next things worth digging into:
- Write your first DAG using the TaskFlow API (@dag, @task decorators)
- Understand what Operators are: they're the building blocks of tasks (PythonOperator, BashOperator, and so on)
- Set up a local Airflow environment with Docker Compose (the official docker-compose.yaml from the Airflow docs is a good start)
- Learn about XComs: that's how tasks pass data to each other
- Understand task retries and the on_failure_callback: essential for production use
The architecture seems complex at first, but once you've run Airflow a few times, you start to develop intuition for which component is causing problems when things go wrong. And things will go wrong. That's part of the fun. Good luck.


