
How LakeSentry Works

LakeSentry transforms raw Databricks system table data into cost insights through a sequential data pipeline. Understanding this pipeline helps you reason about data freshness, troubleshoot missing data, and know when to expect insights after connecting your account.

Data flows through five stages, each building on the previous one:

Extraction → Ingestion → Transformation → Analysis → Execution

Extraction runs in your Databricks workspace; the remaining stages run in LakeSentry.
| Stage | Where it runs | What it does | Output |
| --- | --- | --- | --- |
| Extraction | Your Databricks workspace | Collector reads system tables | Raw data pushed to LakeSentry |
| Ingestion | LakeSentry | Validates, deduplicates, stores | Raw event tables |
| Transformation | LakeSentry | Normalizes into canonical model | Ledger tables (workspaces, clusters, jobs, costs) |
| Analysis | LakeSentry | Computes metrics, detects anomalies and waste | Metrics tables, insights |
| Execution | LakeSentry → Databricks | Runs approved optimization actions | Actions applied to your infrastructure |

Each stage writes only to its own layer. Raw data is never modified after ingestion. The ledger is built entirely from raw data. Metrics are computed from the ledger. This means the entire pipeline can be rebuilt from raw data if needed.

The collector is a lightweight Python job deployed in your Databricks workspace. It runs on a schedule (every 15 minutes by default) and reads from Databricks system tables.

What it reads:

| Category | Tables | Scope |
| --- | --- | --- |
| Billing | system.billing.usage, system.billing.list_prices | Account-wide |
| Compute | system.compute.clusters, system.compute.warehouse_events, system.compute.node_timeline | Regional |
| Jobs & Pipelines | system.lakeflow.jobs, system.lakeflow.job_run_timeline, system.lakeflow.pipelines | Regional |
| Queries | system.query.history | Regional |
| ML | system.serving.served_entities, system.serving.endpoint_usage | Regional |
| Metadata | system.access.workspaces_latest | Account-wide |

The collector uses checkpoint-based incremental extraction. Each run picks up where the last one left off using watermark columns (like usage_start_time for billing or start_time for queries). Small reference tables (like list prices and node types) are extracted as full snapshots.
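
As a rough illustration, one incremental run looks like the sketch below. The function, checkpoint handling, and return shape are assumptions for illustration; only the system table and its watermark column come from the table above.

```python
# Sketch of one incremental billing extraction inside a Databricks job.
# The function and return shape are illustrative, not the collector's
# actual code; only the system table and watermark column are real.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def extract_billing_increment(last_watermark):
    """Read billing rows newer than the checkpointed watermark."""
    batch = (
        spark.table("system.billing.usage")
        .where(F.col("usage_start_time") > F.lit(last_watermark))
    )
    # Advance the watermark to the newest row read; re-running from the
    # same checkpoint re-sends the same rows, which is safe because
    # ingestion deduplicates on extraction ID.
    newest = batch.agg(F.max("usage_start_time")).first()[0]
    return batch, newest or last_watermark
```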

After extracting, the collector pushes the data to LakeSentry over HTTPS. Each push includes an extraction ID for deduplication — re-running from the same checkpoint is safe.
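
The push itself is conceptually simple. The endpoint URL, payload fields, and auth scheme below are placeholders rather than the documented API; the point is that the extraction ID is derived from the checkpoint, so a retry produces the same ID and can be deduplicated on the receiving side.

```python
# Illustrative HTTPS push of one extracted batch. Endpoint, payload
# fields, and auth header are placeholders, not the documented API.
import hashlib
import json
import requests

def push_batch(records, source_table, checkpoint, token):
    body = json.dumps(records).encode("utf-8")
    # Deterministic extraction ID: re-running from the same checkpoint
    # yields the same ID, so the server can reject the duplicate.
    extraction_id = hashlib.sha256(f"{source_table}:{checkpoint}".encode()).hexdigest()
    resp = requests.post(
        "https://ingest.lakesentry.example/v1/push",   # placeholder URL
        json={
            "extraction_id": extraction_id,
            "source_table": source_table,
            "checkpoint": checkpoint,
            "checksum": hashlib.sha256(body).hexdigest(),
            "records": records,
        },
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
```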

When LakeSentry receives a data push from the collector, it:

  1. Authenticates the collector token
  2. Validates the record schema and checksum
  3. Inserts records into raw event tables (extraction ID deduplication rejects duplicates)
  4. Enqueues transformation jobs

Raw tables are append-only — records are inserted but never updated or deleted. This creates an immutable audit trail of everything the collector has sent.
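
A minimal model of that behavior, assuming deduplication is keyed on the extraction ID; in practice this would be a unique constraint on the raw table rather than an in-memory set.

```python
# Toy model of append-only ingestion with extraction-ID dedup. In a real
# store this is a unique constraint on the raw table, not a Python set.
_seen_extraction_ids = set()

def ingest(extraction_id, records, raw_table):
    """Append records exactly once per extraction ID."""
    if extraction_id in _seen_extraction_ids:
        return False                       # duplicate push: rejected, no-op
    raw_table.extend(                      # append-only: never update/delete
        {"extraction_id": extraction_id, **rec} for rec in records
    )
    _seen_extraction_ids.add(extraction_id)
    return True
```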

Background workers transform raw events into the ledger — LakeSentry’s canonical business model. This is where raw Databricks data becomes structured cost intelligence.

Key transformations:

| Raw source | Ledger target | What happens |
| --- | --- | --- |
| billing.usage | usage_line_item | Maps billing records to cost line items with price lookups |
| lakeflow.jobs | work_unit | Creates canonical job entities |
| lakeflow.job_run_timeline | work_unit_run | Maps job runs to work units with extracted metrics |
| compute.clusters | cluster | Tracks cluster configuration changes over time |
| query.history | query_history | Maps queries to work units where linkable |
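
The first row of that table, for example, amounts to joining each usage record to the list price in effect and multiplying out the cost. A hedged sketch in Spark SQL terms follows; the real transform runs against LakeSentry's ingested raw copies, and the output columns and simplified join keys are assumptions.

```python
# Conceptual shape of the billing.usage → usage_line_item transform:
# join each usage record to the list price in effect at its start time.
# Output column names and the simplified join keys are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

usage_line_items = spark.sql("""
    SELECT
        u.workspace_id,
        u.sku_name,
        u.usage_start_time,
        u.usage_end_time,
        u.usage_quantity,
        u.usage_quantity * p.pricing.default AS list_cost
    FROM system.billing.usage AS u
    JOIN system.billing.list_prices AS p
      ON  u.sku_name = p.sku_name
      AND u.usage_start_time >= p.price_start_time
      AND (p.price_end_time IS NULL OR u.usage_start_time < p.price_end_time)
""")
```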

Transforms run in dependency order — reference tables (workspaces, warehouses, SKU prices) are built first since other tables depend on them. Then identity mappings are resolved. Finally, derived tables like usage line items and work unit runs are populated.
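
One way to picture that ordering is as a topological sort over a transform dependency graph; the job names and edges below are illustrative examples, not the actual worker graph.

```python
# Illustrative dependency-ordered scheduling of transform jobs using a
# topological sort. Job names and edges are examples only.
from graphlib import TopologicalSorter

transform_deps = {
    "workspace":       [],                                   # reference tables
    "sku_price":       [],
    "warehouse":       ["workspace"],
    "cluster":         ["workspace"],
    "work_unit":       ["workspace"],                         # identity mappings
    "work_unit_run":   ["work_unit", "cluster"],              # derived tables
    "usage_line_item": ["sku_price", "cluster", "work_unit_run"],
}

for job in TopologicalSorter(transform_deps).static_order():
    print("run transform:", job)    # references first, derived tables last
```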

Once the ledger is up to date, analysis workers compute metrics and generate insights. This stage runs on a schedule:

| Job | Frequency | What it produces |
| --- | --- | --- |
| Cost attribution | After each transform | Assigns costs to teams using attribution rules |
| Metric aggregation | After each transform / daily | Pre-computed cost rollups, utilization summaries, query analytics |
| Anomaly detection | Hourly | Flags unusual cost spikes using Z-score analysis (sketched below) |
| Waste detection | Hourly | Identifies idle clusters, orphaned resources, oversized compute |
| Significance scoring | Daily | Ranks work units by cost impact, frequency, and reliability |
| Action plan generation | Every 30 minutes | Creates optimization recommendations from insights |
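
As a concrete example of the anomaly check, a Z-score test flags a day whose cost sits several standard deviations above its trailing baseline. The threshold, window, and function below are illustrative, not the production detector.

```python
# Minimal Z-score spike check of the kind the hourly anomaly job runs.
# Threshold, window, and function name are illustrative.
from statistics import mean, stdev

def is_cost_spike(history, today, threshold=3.0):
    """Flag `today` if it is more than `threshold` standard deviations
    above the mean of the trailing history."""
    if len(history) < 2:
        return False                      # not enough data for a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today > mu                 # flat history: any increase stands out
    return (today - mu) / sigma > threshold

# Example: a steady ~$100/day baseline, then a $260 day.
print(is_cost_spike([98.0, 102.0, 99.5, 101.0, 100.2], 260.0))  # True
```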

Analysis produces two types of output:

  • Metrics — Pre-aggregated tables that power dashboards and reports. These make queries fast by pre-computing common aggregations (cost by team per day, cluster utilization, query performance); one such rollup is sketched after this list.
  • Insights — Actionable findings about your Databricks spend. Each insight has a type, severity, and evidence explaining why it was flagged. See Anomaly Detection and Waste Detection for details.
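
For example, a "cost by team per day" metric is just a pre-computed rollup over the ledger. The table and column names below are assumptions about the ledger schema, not its real shape.

```python
# Illustrative "cost by team per day" rollup from the ledger into a
# metrics table. Table and column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

daily_cost_by_team = (
    spark.table("ledger.usage_line_item")
    .groupBy(F.to_date("usage_start_time").alias("usage_date"), "team")
    .agg(F.sum("list_cost").alias("total_cost"))
)
daily_cost_by_team.write.mode("overwrite").saveAsTable("metrics.daily_cost_by_team")
```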

When insights have associated action plans, you can approve and execute them. Execution calls Databricks APIs to apply changes — terminating idle clusters, adjusting autoscaling, or configuring warehouse schedules.

This stage is entirely opt-in. LakeSentry runs read-only by default. See Action Plans & Automation for the safety model.

After connecting your account, here’s what to expect:

| Milestone | Typical timing |
| --- | --- |
| First raw data appears | Within 15 minutes of collector’s first run |
| Ledger populated, dashboard shows cost data | Within 30 minutes |
| Initial insights (anomalies, waste) | Within a few hours |
| Significance scores and baselines | After first daily computation (overnight) |

Ongoing data freshness depends on the collector schedule. With the default 15-minute cycle:

  • Cost data is at most ~30 minutes old (15 minutes for extraction + processing time)
  • Insights are recomputed hourly from the latest ledger data
  • Significance scores and baselines refresh daily

You can check data freshness on the Connectors page, which shows the last ingestion time and health status for each region connector.

LakeSentry’s pipeline is strictly sequential — each layer builds from the previous layer, with no parallel writes across layers. This is a deliberate design choice:

  • Debuggability — When something looks wrong in a metric, you can trace it back through the ledger to the raw data.
  • Rebuildability — The entire pipeline can be reconstructed from raw data. Truncate a metric table and re-run the worker to get identical results.
  • Consistency — No race conditions between writers. The ledger is always consistent because only transform workers write to it.