Democratizing Financial Data: How GenAI Transformed Analytics Adoption at CERC

TL;DR — CERC operates a 7 PB financial data platform with ~2,000 transactional tables. Databricks adoption stagnated below 15% — not because the platform was broken, but because users couldn’t find or understand the data. We built an AI-first cataloging layer using Dataplex Universal Catalog, Cloud Asset Inventory, and Gemini to auto-discover, enrich, and govern metadata. Data owners approve AI-generated catalogs in minutes; GenAI then auto-generates complete ingestion pipelines from that metadata. The outcome: 400% increase in monthly active users, 70% of CERC now doing self-service analytics on Databricks, and cataloging time down from 2–3 weeks to 2 days. The technical lift was manageable. The operational challenge was not — and that is what this post is actually about.


The Adoption Problem Nobody Talks About

Two years ago, CERC’s Databricks environment was technically sound and operationally underused. We had invested in infrastructure, onboarded teams, and built out a Delta Lake architecture on top of a 7 PB platform. Adoption sat at 15%.

The failure mode was not what we expected. Engineers were not avoiding Databricks because it was hard to use. They were avoiding it because they could not answer a simpler question first: what data is available, where does it live, and what does it mean?

CERC’s platform spans ~2,000 transactional tables across Google Cloud Spanner, Cloud SQL (PostgreSQL and SQL Server), and BigQuery — each maintained by different teams, documented at different levels of quality, and cataloged manually when cataloged at all. Manual cataloging took two to three weeks per source. At that pace, coverage could never keep up with the platform’s growth. The result was a data catalog that was always incomplete, often stale, and never trusted.

Adoption stagnates when users cannot self-serve. They cannot self-serve when they cannot find the data. And they cannot find the data when the catalog is a best-effort side project maintained by whoever had spare time last quarter.


Why We Went AI-First — And Why We Stayed GCP-Native

The solution space for data cataloging is crowded. We evaluated approaches ranging from enhanced manual processes with better tooling, to third-party catalog products, to a fully custom metadata pipeline built in-house.

| Approach | Reason Considered | Reason Rejected |
| --- | --- | --- |
| Enhanced manual cataloging | Low tooling investment | Doesn't scale; bottleneck is human time, not tooling |
| Third-party catalog (Collibra, Alation) | Mature products, proven governance features | Integration cost with GCP-native stack; additional vendor surface; licensing overhead |
| Custom metadata pipeline | Full control | High build cost; LLM integration requires significant prompt engineering infrastructure |
| Dataplex + Gemini (GCP-native) | ✅ Native integration across our entire stack; single control plane; no data egress | (selected) |

The decision to stay GCP-native was straightforward given where our data already lives. Dataplex Universal Catalog has first-class connectors to Spanner, Cloud SQL, and BigQuery — the three systems that make up our transactional layer. Cloud Asset Inventory gives us GCP project metadata without a separate integration. And Gemini operates within the same security perimeter as our data, which matters in a regulated financial environment where data residency and access control are not optional.

Choosing Gemini over other models was not a pure capability decision. It was an architecture decision: keeping the enrichment pipeline inside GCP eliminated an entire class of compliance questions about what data leaves our environment and where it goes.


The Architecture: Four Layers, One Catalog

The system we built has four distinct layers, each solving a different part of the coverage problem.

[Pipeline architecture diagram — four layers]
  • Sources: Cloud Asset Inventory, Dataplex API, IAM policies
  • Exporters (Airflow): 3 daily DAGs at 3 AM BRT (Asset Exporter, Dataplex Exporter, IAM Exporter) → BigQuery staging
  • Merger pipeline (data-aware scheduling): YAML repository clone → merge CAI + Dataplex + YAML → diff and orphan detection → batches to Vertex AI (Gemini) → COALESCE wrk › gem › prd → YAML generated/updated
  • Publishing: pull request (Azure DevOps, human review) → Dataplex/BigQuery production catalog → Unity Catalog sync (Databricks, scheduled job)
  • Automatic classification by column: has_pii_data, has_confidential_data, is_primary_key, reviewed (human protection)

Layer 1 — Automatic Discovery (Dataplex Universal Catalog)

Dataplex Universal Catalog continuously scans all registered data sources — Spanner instances, Cloud SQL databases, and BigQuery datasets — and extracts complete technical metadata: schemas, column data types, nullability, and cardinality estimates. Critically, it also runs PII classification automatically, flagging columns that contain sensitive data based on predefined DLP templates.

Before this layer, technical metadata existed in isolation in each source system. After, it exists in a single queryable catalog — updated on a schedule, not on human initiative.

The scanning is run by three independent Airflow DAGs, scheduled daily at 3 AM (Brasília time). Each DAG writes to its own staging tables in BigQuery with individually configured timeouts. The separation into independent modules provides resilience: if the Dataplex exporter fails due to an API issue, the other two continue normally — no cascading failure.
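The isolation described above can be sketched as a simple pattern: each exporter runs independently, and a failure in one is recorded rather than propagated. This is an illustrative sketch only — in production these are three separate Airflow DAGs with their own staging tables and timeouts, not a single loop:

```python
# Sketch of the exporter-isolation pattern. Names and stub exporters are
# illustrative; the real exporters are independent Airflow DAGs writing
# to BigQuery staging tables.
from typing import Callable, Dict


def run_exporters(exporters: Dict[str, Callable[[], int]]) -> Dict[str, str]:
    """Run each exporter independently; a failure in one never stops the rest."""
    results: Dict[str, str] = {}
    for name, export in exporters.items():
        try:
            rows = export()
            results[name] = f"ok ({rows} rows staged)"
        except Exception as exc:  # e.g. a Dataplex API timeout
            results[name] = f"failed: {exc}"  # recorded, not cascaded
    return results
```

If the Dataplex exporter raises, the asset and IAM exporters still complete and stage their rows — exactly the no-cascading-failure property the DAG separation buys.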

Layer 2 — Ownership Mapping (Cloud Asset Inventory)

Knowing what a table contains is not enough. Users also need to know who owns it and who to contact when something is wrong. Cloud Asset Inventory automatically maps data owners and stewards from GCP project metadata — the same metadata that already governs access control and billing allocation.

This layer required zero manual input from data teams. Ownership was already implicit in our GCP project structure; we made it explicit in the catalog.

Beyond owners and stewards, the exporter captures business labels already present in each GCP project — such as business_unit, team, and domain — making them searchable in the catalog without any additional manual input. A dedicated IAM exporter complements this mapping by analyzing permissions per resource and identifying who holds read access to each table, a dataset that feeds quarterly compliance reviews.
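The core of the IAM exporter is an inversion: from policy bindings to a per-resource set of principals with read access. The sketch below uses a simplified binding shape and role list as assumptions — real Cloud Asset Inventory output also carries conditions, inherited policies, and many more roles:

```python
# Sketch: invert IAM bindings into "who can read each resource".
# READ_ROLES and the binding shape are simplified assumptions.
from typing import Dict, List, Set

READ_ROLES = {"roles/bigquery.dataViewer", "roles/bigquery.dataEditor"}


def readers_by_resource(bindings: List[dict]) -> Dict[str, Set[str]]:
    """Map each resource to the set of members holding a read role on it."""
    readers: Dict[str, Set[str]] = {}
    for binding in bindings:
        if binding["role"] in READ_ROLES:
            readers.setdefault(binding["resource"], set()).update(binding["members"])
    return readers
```

A dataset like this — resource, reader set — is what feeds the quarterly compliance reviews mentioned above.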

Layer 3 — Business Enrichment (Gemini + Confluence)

Technical metadata tells you what a column is. It does not tell you what it means in the context of CERC’s business domain. A column named op_type means something specific to the receivables registration business — and that meaning lives in Confluence, not in the database schema.

We gave Gemini access to our internal Confluence corpus and built a pipeline that generates business-layer descriptions for every table and column lacking documentation. The prompt context includes the table schema, existing documentation from related entities, and domain glossaries maintained by our business teams. The result is a description that is grounded in our actual domain — not a generic inference from column names.
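As a hedged sketch of how that prompt context might be assembled — the field names, glossary shape, and wording here are assumptions, not CERC's actual prompt:

```python
# Hypothetical prompt assembly for business enrichment: schema + glossary +
# related documentation, so the model is grounded in the domain rather than
# guessing from column names.
from typing import Dict, List


def build_enrichment_prompt(table: dict, glossary: Dict[str, str],
                            related_docs: List[str]) -> str:
    cols = "\n".join(f"- {c['name']} ({c['type']})" for c in table["columns"])
    terms = "\n".join(f"- {t}: {m}" for t, m in sorted(glossary.items()))
    docs = "\n---\n".join(related_docs) if related_docs else "(none)"
    return (
        f"You are documenting the table `{table['name']}` for a receivables "
        f"registration platform.\n\n"
        f"Schema:\n{cols}\n\n"
        f"Domain glossary:\n{terms}\n\n"
        f"Related documentation:\n{docs}\n\n"
        "Write a concise business description for the table and each column, "
        "grounded only in the context above."
    )
```

The point of the structure is the last line: the model is instructed to stay inside the supplied context, which is what keeps descriptions domain-grounded.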

Generated descriptions are not published automatically. They enter a human-in-the-loop approval workflow where data owners review and approve or edit before the enriched metadata goes live.

The model used is Gemini 2.5 Flash via Vertex AI, at temperature 0.0 for deterministic responses. Assets are sent in batches of 100, with up to 5 concurrent requests and automatic retry on failure.
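The batch-and-retry behavior can be sketched as follows. The batch size, concurrency limit, and retry are taken from the description above; `send` is a stand-in for the actual Vertex AI request:

```python
# Sketch of batched submission: batches of 100, up to 5 concurrent
# requests, retry with exponential backoff. `send` is illustrative.
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

BATCH_SIZE = 100
MAX_WORKERS = 5
MAX_RETRIES = 3


def chunked(items: List, size: int) -> List[List]:
    return [items[i:i + size] for i in range(0, len(items), size)]


def send_with_retry(send: Callable, batch: List, backoff: float = 1.0):
    for attempt in range(MAX_RETRIES):
        try:
            return send(batch)
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise
            time.sleep(backoff * 2 ** attempt)  # exponential backoff


def enrich_all(assets: List, send: Callable) -> List:
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(lambda b: send_with_retry(send, b),
                             chunked(assets, BATCH_SIZE)))
```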

Before invoking the model, the pipeline applies filters to avoid unnecessary processing: assets with reviewed: true and no structural changes are skipped; directories with a __base.yaml template generate metadata from the template without calling the AI; and an orphan detector automatically removes YAML files whose assets have been deleted from the sources.
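Two of those filters reduce to small pure functions. The field names below are assumptions based on the description above:

```python
# Sketch of the pre-model filters: skip reviewed assets with no structural
# changes, and detect orphaned YAMLs whose source asset was deleted.
from typing import Set


def should_enrich(asset: dict) -> bool:
    """Skip assets marked reviewed unless their schema changed."""
    if asset.get("reviewed") and not asset.get("schema_changed"):
        return False
    return True


def find_orphans(yaml_assets: Set[str], live_assets: Set[str]) -> Set[str]:
    """YAML files whose source asset no longer exists in any source."""
    return yaml_assets - live_assets
```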

After generation, a hierarchical merge combines three layers via COALESCE:

  1. wrk — human edits in the current YAML (highest priority)
  2. gem — Gemini-generated description (fills empty fields)
  3. prd — existing values in production BigQuery (baseline)

Manual edits are never overwritten by AI in future runs.
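The COALESCE rule above is small enough to sketch directly. Because `wrk` holds top priority, a human edit always survives future runs:

```python
# Sketch of the hierarchical merge: wrk (human) > gem (AI) > prd (baseline).
def coalesce(wrk, gem, prd):
    """Return the first non-empty value in priority order."""
    for value in (wrk, gem, prd):
        if value not in (None, ""):
            return value
    return None


def merge_asset(wrk: dict, gem: dict, prd: dict) -> dict:
    """Field-by-field COALESCE across the three layers."""
    keys = set(wrk) | set(gem) | set(prd)
    return {k: coalesce(wrk.get(k), gem.get(k), prd.get(k)) for k in keys}
```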

The review flow is implemented as an automatic pull request on Azure DevOps: the pipeline generates the YAMLs, opens the PR, and the Data Governance team reviews the diff before merging. Setting reviewed: true in a YAML field protects it from any subsequent automatic overwrite.

```yaml
description: "Table of registered receivables with originator information."
reviewed: true    # protected — AI will not overwrite in future runs
has_pii_data: true
has_confidential_data: true
columns:
  - name: "originator_tax_id"
    description: "Tax ID of the receivable originator."
    has_pii_data: true
    has_confidential_data: false
    is_primary_key: false
  - name: "face_value"
    description: "Face value of the receivable in BRL."
    has_pii_data: false
    has_confidential_data: true
    is_primary_key: false
```

Layer 4 — Pipeline Generation

Once a table is cataloged and approved, GenAI auto-generates the complete ingestion pipeline — type mappings from the source system’s native types to Delta Lake types, partitioning strategies based on column cardinality and query patterns, and optimization hints for the target Databricks environment. What previously required a data engineer to read the schema, map the types, and write the pipeline by hand now takes minutes.
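To make the generation concrete, here is an illustrative sketch of its two simplest pieces: mapping source-native types to Delta Lake types and deriving a partition hint. The mapping table and heuristic are assumptions for illustration, not CERC's actual generated rules:

```python
# Illustrative schema-to-pipeline sketch: PostgreSQL-to-Delta type mapping
# plus a naive partition-hint heuristic. Both are assumptions.
from typing import List, Optional

POSTGRES_TO_DELTA = {
    "bigint": "BIGINT", "integer": "INT", "numeric": "DECIMAL(38,18)",
    "text": "STRING", "varchar": "STRING", "boolean": "BOOLEAN",
    "date": "DATE", "timestamp with time zone": "TIMESTAMP",
}


def delta_type(source_type: str) -> str:
    """Map a source-native type to a Delta Lake type (STRING as fallback)."""
    return POSTGRES_TO_DELTA.get(source_type.lower(), "STRING")


def partition_hint(columns: List[dict]) -> Optional[str]:
    """Prefer a date column for partitioning, a common low-cardinality choice."""
    for col in columns:
        if col["type"].lower() == "date":
            return col["name"]
    return None
```

The real generator also weighs column cardinality and observed query patterns; this sketch only shows the shape of the decision.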


The Results

The catalog went live incrementally, source by source. Adoption followed the coverage — as more tables became discoverable and understandable, more users engaged with Databricks for the first time.

| Metric | Before | After |
| --- | --- | --- |
| Databricks monthly active users | Baseline | +400% increase |
| Databricks adoption across CERC | ~15% | 70% |
| Cataloging time per source | 2–3 weeks | 2 days |
| Genie data room effectiveness | Low (poor metadata) | High (accurate metadata) |
| PII classification coverage | Manual, incomplete | Automated, continuous |

The most meaningful number is the 70% adoption figure. That is not a metric about the catalog — it is a metric about trust. When users can find data, understand what it means, know who owns it, and see that it is classified and governed, they use it. The catalog was not the destination. Self-service analytics was. The catalog was what made the destination reachable.


What We Got Wrong: The Operational Reality

The technical architecture was not the hard part.

Building the discovery and enrichment pipeline took less time than we anticipated. Dataplex and Cloud Asset Inventory integrate naturally; the Gemini enrichment pipeline, once the prompt engineering was stabilized, runs reliably. The infrastructure is not complex.

The human-in-the-loop workflow is where the operational complexity lives.

Every AI-generated description requires a data owner to review and approve it. At 2,000 tables, that is 2,000 approval decisions distributed across dozens of teams with different levels of engagement, different interpretations of “good enough,” and competing priorities. Some data owners approve quickly and thoroughly. Others let the queue grow. A few pushed back on the entire concept — they were not comfortable with an AI generating the authoritative description of data they were responsible for.

We underestimated how much change management the approval workflow required. The system works when data owners engage. When they don’t, tables remain in a pending state — technically discovered but not enriched, which means they appear in search results without business context. A partially cataloged table that surfaces in a search can be worse than no result at all, because it creates the impression of coverage without the substance.

The lessons we carry:

  • Approval SLAs need teeth. Without an escalation path for stale approvals, the queue fills up and the catalog coverage promise breaks.
  • Engagement varies by team culture, not just by workload. Teams with a data ownership culture approved quickly. Teams where data responsibility was diffuse needed more active facilitation.
  • The AI-generated description quality matters more than you expect. When Gemini produced a description that was clearly generic or slightly wrong, data owners lost confidence in the whole system — even though the fix was a single edit. Prompt quality is not a nice-to-have; it is the trust baseline.

What Comes Next

The catalog is now stable and growing. Our next investments:

  • Automated SLA enforcement for the approval workflow — surfacing stale approvals to team leads automatically, with escalation paths
  • Active metadata quality scoring — a per-table metric that reflects coverage, recency, and approval status, visible to both data consumers and owners
  • Extending pipeline generation to handle schema evolution automatically — today, schema changes require a manual review of the generated pipeline; this should be automated
  • Expanding Genie data room adoption — the jump in metadata quality has made Genie significantly more effective; structured enablement is the next lever

Technologies

| Layer | Technology |
| --- | --- |
| Metadata Discovery | Dataplex Universal Catalog |
| Ownership Mapping | Cloud Asset Inventory |
| AI Enrichment | Gemini 2.5 Flash via Vertex AI |
| PII Classification | Cloud DLP (integrated with Dataplex) |
| Transactional Sources | Spanner, Cloud SQL (PostgreSQL, SQL Server) |
| Analytical Target | Databricks (Unity Catalog, Delta Lake, Genie Data Rooms) |
| Pipeline Generation | GenAI (schema-to-pipeline from metadata) |
| Orchestration | Apache Airflow (3 daily DAGs, Data-Aware Scheduling) |
| Human Review | Azure DevOps (automatic pull requests) |

CERC operates Brazil’s financial market infrastructure for receivables registration — a system where data quality, governance, and auditability are regulatory requirements, not engineering choices. If you want to work on problems where the data platform is the product — we are hiring.


This post was written by the CERC Data Engineering team: Davi Campos, André Tayer, Guilherme Oliveira, and Robson Sampaio.