From Chaos to Clarity: How We Orchestrated ~1,800 Databricks Workflows with Apache Airflow
TL;DR
- We migrated from a third-party orchestration solution to Apache Airflow on Google Cloud Composer
- We started governing and triggering ~1,800 already existing Databricks jobs/workflows under a unified model
- Orchestration cost dropped by ~50% compared to the previous year
- A daily routine that used to consume hours of senior engineers' time now takes minutes
The Scale Problem No One Warns You About
Two years ago, the problem was not getting jobs to run. It was finding out, fast enough, why they had stopped, who would be affected, and how much engineering time would be drained before the platform was healthy again.
On bad days, support consumed a disproportionate share of the most experienced engineers’ attention. The work was not solving a clear bug. It was rebuilding context: correlating logs, understanding implicit dependencies, figuring out whether the failure was transient, identifying downstream impact, and deciding who needed to act. The real cost did not show up only in infrastructure. It showed up in engineering time that could no longer be invested in evolving the platform.
That became even more critical because of the scale we operate at. CERC maintains the infrastructure of the Brazilian financial market for registering financial assets, a system that has already registered more than R$5 trillion in financial assets and processes more than 500 million transactions per day. Our DataLake holds more than 3 PB of data, distributed across more than 15 registration systems and more than 8,000 transactional tables, with millions of new records arriving every day.
Hundreds of Databricks jobs already deployed, spread across multiple teams, ingest, transform, and serve this data to consumers ranging from internal risk models to regulatory reporting.
First, it is worth clarifying the solution topology: the data workloads already existed as jobs deployed on Databricks. The problem we needed to solve was not rewriting those jobs, but building a reliable orchestration layer to trigger them, chain dependencies, apply governance, and operate all of that at scale.
At that scale, orchestration is not plumbing. It is the nervous system of the entire platform. And ours was failing.
The third-party tool we used had been enough when the platform was smaller. As volume grew and more teams started depending on it, what had once been tolerable became a daily operational liability. The main pain points were concentrated in four areas:
- Low programmability: retry logic, error handling, and dependencies required proprietary configuration, not Python.
- Limited observability: when a job failed, the context did not come with it. Root cause analysis depended on manual correlation between logs and tribal memory.
- Weak governance: changes happened through multiple flows, with no single source of truth for deployment and operation.
- Excessive external dependency: adapting orchestration to the platform's needs required going through a vendor, slowing the team's autonomy.
These were not growing pains to tolerate. They were architectural signals: the orchestration layer had become a liability.
Why Airflow — And Why Not Something Else
Before talking about the solution, it is worth making the decision criteria clear. We did not simply need to swap one tool for another. We needed an orchestration layer that the team could program, version, operate, and evolve with autonomy.
We evaluated three alternatives against keeping the current vendor:
| Tool | Reason Considered | Reason Rejected |
|---|---|---|
| Keep current vendor | Familiar, no migration cost | Root cause of the problem; patching was not viable |
| Databricks Workflows (native) | Native integration, no extra infra | No dependency graph across jobs; limited to Databricks workloads |
| Prefect / Dagster | Modern API, good observability | Smaller ecosystem, fewer production references at our scale; steeper learning curve |
| Apache Airflow on Cloud Composer | ✅ Python-native, widely established standard, mature Databricks integration, managed infrastructure | — |
Apache Airflow won on three decisive criteria. First, it treats pipelines as code: DAGs are Python, versioned, and reviewable. Second, the Airflow Datasets feature (introduced in version 2.4) gave us an explicit way to model data dependencies without polling hacks. Third, Google Cloud Composer delivered what we wanted operationally: a managed, production-ready Airflow environment, without turning the orchestration engine itself into one more problem for the team.
The remaining variable was human capital. We had a senior engineer with deep Airflow knowledge and a clear mandate to decide quickly. That was enough to move from comparison into execution.
The Architecture: Convention Over Configuration at Scale
The design philosophy of the new system can be summarized in one sentence: make the right thing the easy thing. That idea guided everything that came after. Instead of trusting that every engineer would manually repeat the right pattern, we designed the platform to apply that pattern by construction.
The DAG Factory: YAML In, Validated DAGs Out
The central mechanism behind this shift was the DAG Factory: a code generation layer that converts human-readable YAML specifications into validated, structurally consistent Airflow DAGs.
Before it, creating a new pipeline meant writing a Python DAG from scratch, reinterpreting platform conventions, and hoping the end result aligned with expectations around operation, retry, observability, and access. In any team of meaningful size, that inevitably creates too many variations. The factory reverses the equation: the engineer declares what should run, and the platform defines how it will run.
A pipeline specification in practice follows this pattern. The DAG name is the root key, and the schema expresses business context, dependencies, and trigger rules:
```yaml
# 1) Extraction from the transactional source — triggered by cron
landing-databricks-workflow-name-1:
  folder_application: folder-where-this-workflow-belongs
  folder_sub_application: ''
  date_start: '2025-03-01'
  owner: responsible-team
  schedule_america_sp: 30 3 * * * # America/Sao_Paulo time zone
  tags:
    - transient
    - {source}
    - etc
  access:
    - group-that-needs-to-see-this-workflow

# 2) Bronze/silver layer — triggered by dataset (when the transient upstream finishes)
bronze-silver-databricks-workflow-name-2:
  folder_application: folder-where-this-workflow-belongs
  folder_sub_application: ''
  date_start: '2025-03-01'
  owner: responsible-team
  dependencies:
    - landing-databricks-workflow-name-1
  tags:
    - bronze
    - silver
    - {system}
    - {domain}
    - etc
  access:
    - group-that-needs-to-see-this-workflow

# 3) Gold layer — depends on multiple upstreams and triggers parallel stages
gold-databricks-workflow-name-3:
  folder_application: folder-where-this-workflow-belongs
  folder_sub_application: ''
  date_start: '2025-03-01'
  owner: responsible-team
  dependencies:
    - bronze-silver-databricks-workflow-name-2
    - another-databricks-workflow
  tags:
    - gold
    - registry
    - {system}
    - {domain}
    - etc
  access:
    - group-that-needs-to-see-this-workflow
```
The important point is that there is no orchestration Python for each team to write. Before any DAG is generated, a Pydantic validation layer checks the schema, required fields, and value constraints. Invalid specs die in CI, not during a critical operational window.
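To make the validation step concrete, here is a minimal sketch of what such a Pydantic schema could look like for the spec format above. The field names mirror the YAML keys shown earlier; the actual model, validators, and loading code in the factory are more extensive, so treat this as an illustration rather than the real implementation.

```python
from datetime import date
from typing import List, Optional

import yaml
from pydantic import BaseModel, Field, ValidationError

class WorkflowSpec(BaseModel):
    """Illustrative schema for one workflow entry in the YAML spec."""
    folder_application: str
    folder_sub_application: str = ""
    date_start: date
    owner: str
    schedule_america_sp: Optional[str] = None  # cron in America/Sao_Paulo; None when dataset-driven
    dependencies: List[str] = Field(default_factory=list)
    tags: List[str] = Field(default_factory=list)
    access: List[str] = Field(default_factory=list)

def validate_specs(path: str) -> dict[str, WorkflowSpec]:
    """Parse and validate every workflow in a spec file, failing CI on the first invalid entry."""
    with open(path) as f:
        raw = yaml.safe_load(f)
    specs = {}
    for dag_name, body in raw.items():
        try:
            specs[dag_name] = WorkflowSpec(**body)
        except ValidationError as err:
            raise SystemExit(f"Invalid spec for '{dag_name}':\n{err}")
    return specs
```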
DAG Factory Flow
1. YAML specification
2. Validation with Pydantic (errors die in CI/CD, not in production)
3. DAG generation
4. Deploy to Google Cloud Composer (automatic registration of the generated DAG)
Every DAG produced by the factory shares the same structural skeleton: standardized task naming, platform retry policies, alert hooks, and access conventions. The cognitive cost of “doing it right” dropped drastically.
More importantly, the platform stopped depending on manual discipline to remain consistent.
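For intuition, a stripped-down factory loop could look like the sketch below: it reuses the hypothetical validate_specs helper from the previous snippet and registers one module-level DAG per spec, choosing between a cron schedule and Dataset triggers. The real factory applies far more conventions (task naming, retry policies, callbacks, access metadata), so this is only the skeleton of the idea.

```python
import pendulum
from airflow import DAG
from airflow.datasets import Dataset

def build_dag(dag_name: str, spec: "WorkflowSpec") -> DAG:
    """Build one standardized DAG from a validated spec (simplified sketch)."""
    # Cron-triggered when a schedule is declared, otherwise triggered by upstream Datasets
    schedule = (
        spec.schedule_america_sp
        if spec.schedule_america_sp
        else [Dataset(f"workflow://{dep}") for dep in spec.dependencies]
    )
    return DAG(
        dag_id=dag_name,
        schedule=schedule,
        start_date=pendulum.parse(str(spec.date_start), tz="America/Sao_Paulo"),
        catchup=False,
        tags=spec.tags,
        default_args={"owner": spec.owner, "retries": 0},  # platform policy, not a per-team choice
    )

# Register one module-level DAG per spec so the Airflow parser picks them all up
for name, spec in validate_specs("pipelines.yaml").items():
    globals()[name] = build_dag(name, spec)
```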
Scheduling: Cron-Based and Event-Driven
A fundamental tension in any large data platform is that not all pipelines should run on a clock. Time-based scheduling assumes upstream data will be ready at a predictable time, an assumption that breaks under upstream delays, retries, or SLA failures. The downstream job runs anyway, consuming compute to produce stale or incorrect data.
Our architecture supports two scheduling models, selectable per pipeline:
- Cron-based scheduling — for pipelines with genuinely time-dependent sources
- Airflow Datasets — for pipelines that should run only after the upstream completes, because if the upstream is still running, the downstream cannot produce a correct result
Airflow Datasets provides a first-class data dependency primitive. When a producer DAG completes and marks its output Dataset as updated, all registered consumer DAGs are triggered automatically. Dependencies are declared in code, versioned, and auditable, not inferred from time gaps between cron expressions.
The practical effect was simple and powerful: pipelines started when data was ready, not when a cron expression fired in the hope that everything had already worked.
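A minimal sketch of the two models side by side, with illustrative names and URIs: the landing DAG runs on cron and declares a Dataset as its outlet, and the bronze/silver DAG uses that Dataset as its schedule, so it only starts once the producer has actually completed.

```python
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

landing_done = Dataset("workflow://landing-databricks-workflow-name-1")  # illustrative URI

@dag(schedule="30 3 * * *",  # cron-based: the source is genuinely time-dependent
     start_date=pendulum.datetime(2025, 3, 1, tz="America/Sao_Paulo"),
     catchup=False)
def landing_databricks_workflow_name_1():
    @task(outlets=[landing_done])
    def run_landing_job():
        ...  # trigger the Databricks landing job

    run_landing_job()

@dag(schedule=[landing_done],  # dataset-driven: no cron, no timing assumption
     start_date=pendulum.datetime(2025, 3, 1, tz="America/Sao_Paulo"),
     catchup=False)
def bronze_silver_databricks_workflow_name_2():
    @task
    def run_bronze_silver_job():
        ...  # trigger the downstream Databricks job

    run_bronze_silver_job()

landing_databricks_workflow_name_1()
bronze_silver_databricks_workflow_name_2()
```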
Reliable Execution: A Custom Operator for Databricks
Airflow’s native integration with Databricks is solid, but it does not cover every operational nuance of our platform. We built CercDatabricksRunNowOperator, an operator that extends the provider’s standard Databricks operator and adds the layers our platform requires:
- Deferrable execution: it uses Airflow's asynchronous model (`deferrable=True`), freeing the worker while it waits for the Databricks job. At scale, this significantly reduces worker slot consumption.
- Guaranteed idempotency: it generates an MD5 token from `dag_id | task_id | run_id` and passes it as a parameter to the Databricks job, preventing duplicate executions if Airflow retries.
- Rich execution context: it automatically injects into the job's `notebook_params` the dag_id, task_id, owner, schedule, Airflow run URL, and environment (stg/prd), all available for logging and traceability inside the notebook itself.
- Observability metrics: it sends series to Google Cloud Monitoring at the end of each execution, recording whether automatic repairs happened, which becomes the basis for alerts and platform health dashboards.
- Integrated callback: `CercCallbackHandler` triggers Slack notification and JiraOps ticket creation on failure, but only in production, ensuring every failure leaves a formal and actionable trail.
This operator was the point where the integration stopped being merely functional and became operationally reliable at scale.
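The operator itself is internal, but its core ideas can be sketched as a thin subclass of the provider operator. The snippet below is an approximation, not the actual CercDatabricksRunNowOperator; in particular, it assumes the provider operator keeps its run payload in self.json, a detail that varies across provider versions.

```python
import hashlib
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

class IdempotentDatabricksRunNowOperator(DatabricksRunNowOperator):
    """Sketch: deferrable execution plus idempotency token and execution-context injection."""

    def __init__(self, **kwargs):
        kwargs.setdefault("deferrable", True)  # free the worker slot while Databricks runs
        super().__init__(**kwargs)

    def execute(self, context):
        ti = context["ti"]
        # Token derived from the Airflow run identity; the Databricks job uses it to
        # refuse duplicate executions when Airflow retries or a task is re-run.
        token = hashlib.md5(
            f"{ti.dag_id}|{ti.task_id}|{context['run_id']}".encode()
        ).hexdigest()
        extra = {
            "idempotency_token": token,
            "airflow_dag_id": ti.dag_id,
            "airflow_task_id": ti.task_id,
            "airflow_run_url": ti.log_url,
            "environment": "prd",  # in practice resolved from the deployment environment
        }
        # Assumption: the provider stores notebook_params inside self.json
        self.json.setdefault("notebook_params", {}).update(extra)
        return super().execute(context)
```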
Retry Policy: Less Is More
One of the decisions with the greatest operational impact was to simplify, deliberately, the repair policy.
Most platforms do the opposite: automatic retry on every failure, with aggressive backoff, hoping the problem resolves itself. The predictable result is an overloaded Databricks environment full of clusters restarting on errors that will not disappear with repeated attempts, plus an alert queue nobody takes seriously anymore.
We reversed the logic: by default, there is no automatic retry. The operator keeps an explicit list of known errors, cataloged and maintained by the platform team, that authorizes automatic repair through the Databricks API. Anything outside that list fails immediately and creates a JiraOps ticket.
- Known errors: quota exceeded, resource stockout, cluster startup failure, OOM, and network timeouts. These get automatic repair with 3ⁿ-second backoff, capped at 5 attempts.
- Unknown errors: any failure outside the explicit list of recoverable problems. These fail immediately, leave a formal trail in JiraOps, and require human intervention with full context.
This counterintuitive approach (less automation in retries) was one of the changes that most reduced daily operational load. Instead of masking symptoms, it forced the platform to distinguish recoverable failures from failures that actually required intervention.
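In code, the policy reduces to an explicit allow-list plus a small repair loop. The sketch below is illustrative: the error signatures and the repair_fn callback are simplified stand-ins for the real catalog and for the call to the Databricks jobs repair API.

```python
import time

# Explicit allow-list of failure signatures known to be transient (maintained by the platform team)
RECOVERABLE_PATTERNS = (
    "QUOTA_EXCEEDED",
    "RESOURCE_EXHAUSTED",          # resource stockout
    "CLUSTER_LAUNCH_FAILURE",
    "java.lang.OutOfMemoryError",
    "Connection timed out",
)

MAX_ATTEMPTS = 5

def is_recoverable(error_message: str) -> bool:
    return any(pattern in error_message for pattern in RECOVERABLE_PATTERNS)

def repair_with_backoff(run_id: int, error_message: str, repair_fn) -> None:
    """Repair only cataloged errors, with 3**n-second backoff, capped at 5 attempts."""
    if not is_recoverable(error_message):
        # Unknown error: fail immediately and let the callback open a JiraOps ticket
        raise RuntimeError(f"Unrecoverable failure for run {run_id}")
    for attempt in range(1, MAX_ATTEMPTS + 1):
        time.sleep(3 ** attempt)           # 3, 9, 27, 81, 243 seconds
        if repair_fn(run_id):              # e.g. a call to the Databricks jobs repair endpoint
            return
    raise RuntimeError(f"Repair exhausted after {MAX_ATTEMPTS} attempts for run {run_id}")
```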
Observability: From Failure to Context in Seconds
Failure without context is just noise. In a platform with hundreds of workflows, knowing a job failed is the minimum; what matters is shortening the path between failure, understanding, and action.
This was an important turning point in the project. Instead of treating observability as a finishing touch, we treated it as part of the architecture from the beginning. The goal was simple: the right person needed to receive the right context, with no manual triage.
Layer 1: Structured Incidents, Not Alert Noise
Our observability layer integrates directly with JiraOps to create structured incident tickets when pipeline failures cross severity thresholds. Each ticket is filled automatically with:
- The failed DAG and task identifier, with direct links to Airflow logs
- The Databricks job run URL and the cluster ID for immediate debugging
- The downstream datasets, annotated with potential impact
- The on-call owner resolved from team metadata
That turns alerts into work items with defined scope and ownership. On top of that, custom dashboards aggregate failure rates, SLA attainment, and cluster utilization across all ~1,800 workflows, giving team leads a single view of platform health without switching between Airflow, Databricks, and cloud consoles.
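As an illustration of what the failure callback assembles, the sketch below builds a ticket payload from the Airflow callback context. The field names, the XCom key, and the way ownership is resolved are assumptions for the example, not the actual JiraOps integration.

```python
def build_incident_payload(context: dict) -> dict:
    """Assemble a structured incident from an Airflow failure callback context (illustrative)."""
    ti = context["ti"]
    task = context["task"]
    dag = context["dag"]
    return {
        "summary": f"[{ti.dag_id}] task {ti.task_id} failed",
        "airflow_log_url": ti.log_url,
        # Assumes the Databricks operator pushed its run page URL to XCom under this key
        "databricks_run_url": ti.xcom_pull(task_ids=ti.task_id, key="run_page_url"),
        "owner": dag.default_args.get("owner", "unknown"),
        # Downstream Datasets declared as outlets hint at the potential impact radius
        "downstream_datasets": [d.uri for d in getattr(task, "outlets", [])],
    }
```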
Layer 2: Surgical Observability Where Generic Logic Is Not Enough
Automated observability covers the most common case well: the job failed, the alert fired. But there is a class of problems that failure callbacks do not capture: jobs that complete successfully, but take much longer than they should.
A workflow that normally runs in 40 minutes and suddenly starts taking 18 hours will not create a JiraOps ticket. It will block downstream pipelines, consume cluster resources indefinitely, and only be noticed when someone happens to look at Airflow at the right moment.
For those cases, we built manually written monitoring DAGs, deliberately outside the DAG Factory. The DAG Factory is excellent for large-scale standardization, but some critical workflows deserve customized monitoring logic: specific duration thresholds, tolerance windows adjusted to the historical behavior of that job, and alerts segmented by delay severity.
A typical monitoring DAG queries execution history through the Airflow API, calculates the current runtime, and triggers the notification flow when the job exceeds its threshold, for example, more than 18 hours for workflows that historically finish within 2 hours. The alert arrives with context: current duration versus historical average, number of attempts, and a direct link to the run in Databricks.
We also have other scenario-specific monitors built the same way; because it is all plain Python, each one can encode exactly the check its workflow needs.
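A minimal sketch of such a monitoring DAG, with an illustrative target DAG name and threshold: it lists the runs of the monitored workflow that are still executing and flags any that have exceeded the duration limit. The real monitors also pull historical averages and route the alert through the Slack/JiraOps flow.

```python
from datetime import datetime, timezone

import pendulum
from airflow.decorators import dag, task
from airflow.models import DagRun
from airflow.utils.state import DagRunState

MONITORED_DAG = "gold-databricks-workflow-name-3"  # illustrative
MAX_RUNTIME_HOURS = 18                             # illustrative threshold

@dag(schedule="*/30 * * * *",
     start_date=pendulum.datetime(2025, 3, 1, tz="America/Sao_Paulo"),
     catchup=False,
     tags=["monitoring"])
def long_running_job_monitor():
    @task
    def check_runtime():
        running = DagRun.find(dag_id=MONITORED_DAG, state=DagRunState.RUNNING)
        now = datetime.now(timezone.utc)
        for run in running:
            elapsed_hours = (now - run.start_date).total_seconds() / 3600
            if elapsed_hours > MAX_RUNTIME_HOURS:
                # In production this calls the Slack/JiraOps notification flow with full context
                print(f"{MONITORED_DAG} run {run.run_id} running for {elapsed_hours:.1f}h")

    check_runtime()

long_running_job_monitor()
```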
That combination closed an important gap: explicit failures stopped being the only observable event. Silent abnormalities also started generating context and action.
Layer 3: Faster Diagnosis with Generative AI
Knowing a job failed and having a JiraOps ticket is already a major step. But there is a step beyond that: reaching the error with a diagnostic hypothesis before even opening the log.
We integrated Google Gemini into the observability flow for exactly that. When an error occurs in a pipeline, the failure callback not only creates the JiraOps ticket but also triggers Google Gemini, which analyzes the error message and sends an automated response to Slack along with the failure notification.
The Google Gemini response includes:
- Interpretation of the error message in natural language
- The most likely root-cause hypotheses
- Suggested remediation actions
The practical result is that the engineer who receives the alert starts with a hypothesis instead of starting from zero. In a platform with dozens of weekly failures, that significantly reduces diagnosis time.
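The wiring is straightforward. Below is a hedged sketch of the diagnosis step using the google-generativeai client: the model name, prompt, and key handling are illustrative, and in our flow the returned text is attached to the Slack notification by the failure callback.

```python
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # key sourced from a secret backend in practice

def diagnose_failure(error_message: str) -> str:
    """Ask Gemini for an interpretation, likely root causes, and remediation suggestions."""
    model = genai.GenerativeModel("gemini-1.5-pro")  # model name is illustrative
    prompt = (
        "You are assisting a data platform on-call engineer.\n"
        f"A Databricks job orchestrated by Airflow failed with this message:\n{error_message}\n\n"
        "Explain the error in plain language, list the most likely root causes, "
        "and suggest concrete remediation steps."
    )
    return model.generate_content(prompt).text
```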
Governance and Team Autonomy
When operations became more predictable, the next natural requirement appeared: give autonomy back to teams without giving up governance.
Access Control by Team
With ~1,800 workflows spread across multiple teams and distinct data domains, a natural operational challenge appears: how do you give each team autonomy to manage its own pipelines without giving unrestricted access to the orchestration environment?
We built an access control model based on DAG groups, configured through access_dag_groups.json. Each team has visibility and action permissions only over DAGs within its domain. The DAG Factory respects those settings when generating deployment artifacts, ensuring access isolation is declarative, versioned, and auditable, not dependent on manual configuration in the Airflow UI.
That separation allowed teams from different domains (ingestion, transformation, and data services) to operate with real independence without creating a new bottleneck in the platform team.
Deployment: Simplicity as a Principle
The deployment pipeline was designed to be as simple as possible, and that simplicity is not accidental; it is an architectural decision.
Google Cloud Composer manages all Airflow infrastructure: workers, scheduler, webserver, and metadata database. On our side, deployment is reduced to a single operation: syncing the dags/ and plugins/ directories with a bucket in Google Cloud Storage. Google Cloud Composer detects the changes and applies them automatically. There is no service restart, no maintenance window, and no manual procedure.
The CD process runs through Azure Pipelines and works like this:
- A PR is approved and merged into the main repository
- The CI pipeline validates the YAML specs with Pydantic and runs the DAG Factory, generating the DAG `.py` files
- The CD pipeline performs the `rsync` between the repository and the Google Cloud Storage bucket
- Google Cloud Composer detects the changes and syncs them, and the new DAGs appear in the UI within seconds
The Git repository is the source of truth. Any DAG that exists in Google Cloud Composer must exist in the repository. Any change goes through the pipeline; there is no manual DAG editing in production. That restriction eliminated an entire class of problems that used to consume too much energy: inconsistent deployments, environment drift, and the recurring question, “which version is running in production?”
Smart Databricks Workflow Launcher
Have you ever run a workflow, seen it succeed, and still not had the data updated? The job ran against a transactional table that had not been refreshed that day, and nobody noticed until someone looked at downstream data. That is wasted compute and the risk of silently producing stale results.
The freshness-aware launcher is a task in the DAG template that works as a pre-flight gate before every Databricks job trigger. It evaluates data recency against a configurable threshold and skips the job if transactional data was not updated within the expected window.
That pattern prevents unnecessary cluster startups across the platform. In a load of ~1,800 jobs, even a modest fraction of skipped executions multiplies into relevant monthly savings. Cost awareness at the execution layer, where the decision actually happens, generates immediate impact.
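Conceptually, the gate is a short-circuit task in front of the Databricks trigger. The sketch below is an approximation: the freshness lookup is a hypothetical helper (a control table or ingestion metadata service in practice), and the threshold is illustrative.

```python
from datetime import datetime, timedelta, timezone

from airflow.operators.python import ShortCircuitOperator

FRESHNESS_THRESHOLD = timedelta(hours=24)  # illustrative; configurable per pipeline in practice

def get_last_source_update(table: str) -> datetime:
    """Hypothetical helper: return the last refresh timestamp of the transactional source."""
    raise NotImplementedError

def source_is_fresh(**_) -> bool:
    last_update = get_last_source_update("transactional-source-table")
    return datetime.now(timezone.utc) - last_update <= FRESHNESS_THRESHOLD

# Returning False short-circuits the run: downstream tasks (the Databricks trigger) are skipped,
# so no cluster is started for data that has not changed.
freshness_gate = ShortCircuitOperator(
    task_id="freshness_gate",
    python_callable=source_is_fresh,
)
# freshness_gate >> run_databricks_job
```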
Continuous Documentation from Code
Documentation debt is endemic in data platforms. By the time a pipeline’s behavior is finally documented accurately, the code has already changed. Our architecture eliminates that problem structurally: documentation is generated from the same YAML specification that defines the pipeline, making divergence impossible.
Each YAML spec includes structured metadata (owner, description, upstream datasets, SLA expectations, downstream consumers) that the platform’s documentation engine renders into a browsable data catalog. That catalog is regenerated on every deployment, always reflecting the platform’s current state.
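A simplified version of that rendering step, reusing the hypothetical WorkflowSpec model from the validation sketch; the output layout and field set are illustrative, not the catalog's actual format.

```python
def render_catalog_entry(dag_name: str, spec: "WorkflowSpec") -> str:
    """Render one catalog entry from the same spec that generates the DAG (illustrative)."""
    deps = ", ".join(spec.dependencies) or "none (cron-triggered)"
    return "\n".join([
        f"## {dag_name}",
        f"- Owner: {spec.owner}",
        f"- Schedule (America/Sao_Paulo): {spec.schedule_america_sp or 'dataset-driven'}",
        f"- Upstream dependencies: {deps}",
        f"- Tags: {', '.join(spec.tags)}",
        f"- Access groups: {', '.join(spec.access)}",
    ])

# Regenerated on every deploy, so the catalog cannot drift from the code.
```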
In addition, we integrated an LLM-based documentation assistant that enriches machine-generated catalog entries with natural-language summaries and usage guidance. The result is documentation that is both technically precise, because it derives from code, and human-readable, because it is enhanced by language models.
The Results: When the Platform Becomes Predictable
Every decision described so far had the same goal: take the platform out of reactive mode and put it into a predictable operating regime. The numbers below are the evidence that it worked:
| Metric | Before | After |
|---|---|---|
| Daily operational support | ~16 hrs (2 senior engineers) | ~30 min (1 junior engineer) |
| Orchestration cost (YoY) | Baseline | ~50% reduction (while also adding staging and homologation environments) |
| Workflows under governance | Fragmented, inconsistent | ~1,800 (unified model) |
| Deployment consistency | Variable by team | Standardized via DAG Factory |
| Failure traceability | Manual, slow, tribal | Automated via JiraOps |
| Data dependency model | Implicit (timing assumptions) | Explicit (Airflow Datasets) |
| Documentation freshness | Always stale | Regenerated on every deploy |
The most revealing metric is the support load. Dropping from 16 hours of daily coverage by senior engineers to 30 minutes managed by a junior engineer does not mean the platform became simpler. It means it became predictable. A predictable system is one where failures follow known patterns, alerts contain the information needed to act, and the platform’s behavior matches its specification. That is operable. Chaos is not.
And our mission is to reduce operational support load to zero, not because we want to eliminate engineering work, but because we want engineers to spend their time building new things, not extinguishing old and known fires. Automating support is the path to continuous innovation and to a platform that truly enables data teams to deliver value rather than merely keep the lights on.
What We Got Wrong (And What We Learned)
We do not tell this story as a clean success. The architecture worked, but the migration took a real toll, both technical and organizational. These are the honest lessons:
We underestimated the YAML migration surface. Translating ~1,800 existing workflow definitions into YAML specifications was the longest phase of the project, not the engineering. Governance and data quality of the input specs matter as much as the quality of the generation engine. We invested time mapping which workflows were lower-risk candidates for the initial migration, and that accelerated the process. We performed the migration in waves, with many PRs and easy rollback. Some errors reached production, normal for a migration at this scale, but they were quickly corrected.
Strong opinions require organizational buy-in, not just technical enforcement. The DAG Factory works because teams adopted it. Getting teams to surrender their custom DAG patterns required more stakeholder management than we anticipated. The technical design was the easy part.
Airflow Datasets adoption is a journey, not a switch. We migrated the most critical pipelines first to Dataset-based scheduling. Many pipelines still run on cron. Deprecating implicit timing assumptions is ongoing work, not a completed migration.
Build observability first, even if it ships last. We designed JiraOps integration and dashboards into the architecture from the first week, but they were the last components to stabilize fully in production. In retrospect, we should have used a simpler incident mechanism as a fast path while the full system matured.
Lessons for Platform Teams
Distilled into their most portable form, these are the principles we would carry into the next platform project:
- Convention over configuration scales; freedom does not. Standardizing through the DAG Factory reduced cognitive overhead for every team using the platform;
- Declare dependencies or pay the cost of assumptions. Every implicit timing gap in a pipeline is a latent bug. Airflow Datasets provides the vocabulary to eliminate them;
- Cost awareness belongs in the execution layer. Freshness gates built into the operator, not into a monthly review, change the cost trajectory from the start;
- One expert, clear mandate, four weeks. Speed comes from empowered individuals making decisions, not from large teams building consensus. Trust your most experienced engineers to move quickly;
- Observability is architecture, not a feature. A platform without structured failure handling and automatic incident routing will route those failures to your senior engineers’ calendars;
What Comes Next
The system described here has been in production since March 2025, governing ~1,800 Databricks workflows. The platform is stable. Our next investments:
- LLM-based cost optimization agent: identifying compute waste patterns across the entire workflow catalog, generating proactive cluster right-sizing recommendations;
- Broader Airflow Datasets adoption: eliminating the remaining cron-based pipelines that still depend on timing assumptions;
- Self-service provisioning: enabling data teams to deploy new workflows end-to-end without platform team involvement, using the DAG Factory as the self-service interface;
The foundation is solid. The architecture is proven at scale. More importantly, it gave engineering time back to build, not just support. That is the clearest sign that the platform left chaos behind and entered a regime of predictability.
Technologies
| Layer | Technology |
|---|---|
| Compute | Databricks (Jobs, Workflows, Clusters) |
| Orchestration | Apache Airflow 2.x (Datasets, Callbacks, Custom Operators) |
| Managed Infrastructure | Google Cloud Composer |
| Validation | Python + Pydantic |
| Pipeline Specification | YAML |
| Incident Management | JiraOps |
| CI/CD | Automated DAG validation and deployment pipeline |
| LLM (Google Gemini) | Error analysis with diagnosis in Slack, catalog documentation generation |
CERC operates the Brazilian financial market infrastructure for receivables registration, a system where correctness, scale, and reliability are not optional. We built the data platform on which the financial system runs. If you want to work on problems like this, real scale, real consequences, and autonomy to design the right solution, we’re hiring.
This post was written by CERC’s Data Engineering team: Davi Campos, André Tayer, and Guilherme Oliveira.