Intro
Last year, I joined a team responsible for a mature data platform with several tools and data products already running in production. From day one, the data stack was easy to understand: each layer had a dedicated tool and a clear responsibility. Data ingestion, transformations with dbt, orchestration with Airflow, and Metabase for analytics and reporting.
This separation of concerns made it much easier to understand the architecture without getting lost in unnecessary details during my onboarding process.
Overall, the platform was working properly, but I quickly noticed a few issues that kept bothering me:
- A significant number of Airflow DAGs were consistently missing their SLA targets. Due to the dependency chain between DAGs, a failure or delay in an upstream workflow could impact multiple downstream workflows, even when the dependency wasn't strictly required. The same applied to SLA monitoring: a delayed DAG could cause dependent DAGs to miss their SLAs regardless of the actual state of the upstream process.
- Analytical queries taking up to two minutes to complete. Even simple Metabase dashboards required a noticeable amount of time to refresh after applying a filter.
- Increasing Snowflake credit consumption. Usage was growing month over month without a clear explanation, making it difficult to justify infrastructure costs.
One improvement at a time
The first thing I did was put all these issues on the table. Seeing the full picture helps me prioritize work based on overall impact rather than treating each problem in isolation.
For example, Snowflake credit consumption was probably the most urgent concern from a budget perspective. However, if data is not reaching the warehouse because missed SLAs are breaking DAG chains, data freshness is delayed by several hours, and dashboards take minutes to load, the problem goes beyond technical debt and becomes an organizational pain point.
With that context, the first priority became clear: improve the reliability of our Airflow pipelines and reduce operational friction before tackling optimization efforts elsewhere.
One line, three times faster
This helper function defines a DAG task that depends on a task from another DAG:
def externalDependency(dag, ext_dag, task_id, timeout=5400):
return ExternalTaskSensor(
dag=dag,
task_id="ext_{}".format(task_id),
external_dag_id=ext_dag,
external_task_id=task_id,
timeout=timeout,
mode="poke",
poke_interval=60,
execution_timeout=timedelta(seconds=timeout)
)
This was one of the core dependencies in a DAG that unlocked a large portion of the dbt model graph powering our analytical workloads.
At first glance, everything looked fine. However, while investigating the SLA issues, I noticed that many tasks were spending an unusual amount of time in the queue waiting for an available worker. Looking deeper into the execution patterns, a significant portion of the workers were occupied by sensors simply waiting for upstream DAGs to finish.
The real problem was hidden in a single configuration parameter: mode="poke".
When an ExternalTaskSensor runs in poke mode, it continuously occupies a worker slot while checking whether the target task has reached the expected state. If the timeout is set to one hour, that worker remains busy for up to an hour, waking up periodically to perform the check.
To put this into perspective, imagine having 10 available workers and 11 tasks waiting to run. One task will have to wait simply because all worker slots are occupied, even if some of them are just waiting for external conditions.
In many environments this may not be a significant issue. However, when a critical DAG unlocks dozens of downstream DAGs, worker saturation can quickly become a bottleneck, especially when combined with strict SLAs.
If DAG A fails, DAG B will either fail as well or never get triggered. As more dependencies are added to the graph, your data warehouse may go hours without receiving fresh data.
In the best-case scenario, someone fixes DAG A, clears the dependency chain, and everything resumes normally. The downside is that you've already lost hours of data freshness and operational time.
To mitigate this, I changed a single line:
def externalDependency(dag, ext_dag, task_id, timeout=5400):
return ExternalTaskSensor(
dag=dag,
task_id="ext_{}".format(task_id),
external_dag_id=ext_dag,
external_task_id=task_id,
timeout=timeout,
mode="reschedule",
poke_interval=60,
execution_timeout=timedelta(seconds=timeout)
)
At first glance, reschedule mode looks very similar to poke, but the behavior is fundamentally different.
While poke keeps the worker occupied for the entire waiting period, reschedule releases the worker after each check and schedules itself to run again later. This allows other tasks and DAGs to use the worker slot in the meantime.
In practice, this means that delays or temporary issues in upstream workflows no longer block the entire execution environment.
The impact was immediate:
- The complete DAG graph execution time dropped from up to 3 hours to typically 30–45 minutes.
- End-to-end data latency was reduced from roughly 2 hours to less than 45 minutes.
The real win wasn't reducing execution time. It was restoring predictable data freshness across the organization.
When analytical teams know when data will arrive, dashboards become more reliable, troubleshooting becomes easier, and operational trust in the platform improves.
Ultimately, these improvements translated into fresher data reaching the warehouse much faster.
There were other options available as well. For example, creating dedicated Airflow Pools and assigning different priority_weight values could help prioritize critical workloads. Increasing parallelism or adding more workers was also possible, but scaling infrastructure wasn't the goal in our case.
Today, deferrable operators would probably be my first choice, since they achieve a similar outcome more efficiently by offloading waiting tasks to the Triggerer process.
Unfortunately, our Airflow version was too old to support them, and upgrading the platform was a larger project on its own.
Two birds, one stone: reducing query time and Snowflake costs
Now that data was reaching the warehouse on time, it was time to tackle the next challenge:
- Why were seemingly simple dashboards taking so long to refresh?
- Why was Snowflake credit consumption growing month over month?
As it turned out, both problems shared the same root cause.
Several of our most important models, consumed by dashboards and downstream transformations across the organization, were materialized as views.
At first, this made sense. These models sat at the center of the dependency graph, and changes in upstream data could affect the final result. Views guaranteed that every query always returned the latest version of the data.
The downside was that every dashboard query triggered the view computation again.
And again.
And again.
As more joins, columns, and upstream dependencies were added over time, these views became increasingly expensive to execute. Every Metabase question paid the full cost of rebuilding a significant portion of the transformation graph before returning a result.
The impact showed up in two places:
- Dashboard queries taking several minutes to complete.
- Snowflake credit consumption steadily increasing month after month.
My solution wasn't particularly fancy, but it was highly effective.
First, I changed the materialization strategy of the core models from view to table.
This allowed analytical queries to read precomputed data instead of recomputing the same transformations every time a user opened a dashboard or applied a filter.
The obvious concern was data freshness.
To solve that, I introduced an additional Airflow DAG whose sole responsibility was keeping these core models updated on a frequent schedule.
This introduced a small freshness delay — typically no more than 5 to 10 minutes — but it was a tradeoff we were comfortable making considering the gains:
- Up to 44% reduction in monthly Snowflake credit consumption.
- Average query execution time reduced from several minutes to under 10 seconds.
- Approximately 95% reduction in p95 query latency.
- Faster dashboard refresh times and a significantly better user experience in Metabase.
Most importantly, we solved both problems with the same change.
Queries became dramatically faster, and Snowflake stopped spending credits recomputing the same logic for every dashboard request.
Depending on your Snowflake contract and workload size, improvements like these can easily translate into thousands of dollars saved per month.
Final thoughts
There are certainly better solutions and more sophisticated approaches to the challenges I described here.
The goal of this post wasn't to present groundbreaking engineering work, but to share a reminder that sometimes the highest-impact improvements come from understanding the problem deeply rather than introducing new technology.
In this case, a handful of relatively small changes helped reduce operational overhead, improve data freshness, lower infrastructure costs, and deliver a much better experience for data consumers.
Most importantly, they allowed the team to spend less time fighting recurring issues and more time building new things.
For example, one of the next initiatives on my list involved building infrastructure to support AI-driven applications. But that's a story for another post.
What's next?
Unfortunately, that chapter came to an end after a company-wide restructuring earlier this month.
Even so, working on problems like these reinforced what I enjoy most: understanding how systems behave, identifying bottlenecks, and finding pragmatic solutions that create measurable impact.
I'm currently looking for my next opportunity in Data Engineering, Analytics Engineering, or related platform-focused roles.
I enjoy working across the full lifecycle of data products, from infrastructure and orchestration to modeling, performance optimization, and enabling better decisions through data.
If this article resonated with you, or if you're working on similar challenges, feel free to reach out. I'd be happy to connect and exchange ideas.














