How LakeSentry Works
LakeSentry transforms raw Databricks system table data into cost insights through a sequential data pipeline. Understanding this pipeline helps you reason about data freshness, troubleshoot missing data, and know when to expect insights after connecting your account.
The data pipeline
Data flows through five stages, each building on the previous one:
Extraction (your Databricks) → Ingestion (LakeSentry) → Transformation (LakeSentry) → Analysis (LakeSentry) → Execution (LakeSentry)

| Stage | Where it runs | What it does | Output |
|---|---|---|---|
| Extraction | Your Databricks workspace | Collector reads system tables | Raw data pushed to LakeSentry |
| Ingestion | LakeSentry | Validates, deduplicates, stores | Raw event tables |
| Transformation | LakeSentry | Normalizes into canonical model | Ledger tables (workspaces, clusters, jobs, costs) |
| Analysis | LakeSentry | Computes metrics, detects anomalies and waste | Metrics tables, insights |
| Execution | LakeSentry → Databricks | Runs approved optimization actions | Actions applied to your infrastructure |
Each stage writes only to its own layer. Raw data is never modified after ingestion. The ledger is built entirely from raw data. Metrics are computed from the ledger. This means the entire pipeline can be rebuilt from raw data if needed.
Stage 1: Extraction
The collector is a lightweight Python job deployed in your Databricks workspace. It runs on a schedule (every 15 minutes by default) and reads from Databricks system tables.
What it reads:
| Category | Tables | Scope |
|---|---|---|
| Billing | system.billing.usage, system.billing.list_prices | Account-wide |
| Compute | system.compute.clusters, system.compute.warehouse_events, system.compute.node_timeline | Regional |
| Jobs & Pipelines | system.lakeflow.jobs, system.lakeflow.job_run_timeline, system.lakeflow.pipelines | Regional |
| Queries | system.query.history | Regional |
| ML | system.serving.served_entities, system.serving.endpoint_usage | Regional |
| Metadata | system.access.workspaces_latest | Account-wide |
The collector uses checkpoint-based incremental extraction. Each run picks up where the last one left off using watermark columns (like usage_start_time for billing or start_time for queries). Small reference tables (like list prices and node types) are extracted as full snapshots.
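For illustration, the checkpoint logic looks roughly like the sketch below. This is a minimal sketch assuming PySpark and hypothetical `load_checkpoint`/`save_checkpoint` helpers; it is not the collector's actual code.

```python
# Hypothetical watermark-based incremental pull from a system table.
# `load_checkpoint` / `save_checkpoint` are illustrative helpers supplied by the caller.

def extract_billing_usage(spark, load_checkpoint, save_checkpoint):
    # Last watermark seen for this table; None on the very first run.
    last_watermark = load_checkpoint("system.billing.usage")

    query = "SELECT * FROM system.billing.usage"
    if last_watermark is not None:
        # Only pull rows newer than the previous run's high-water mark.
        query += f" WHERE usage_start_time > '{last_watermark}'"

    batch = spark.sql(query)

    # Advance the checkpoint to the newest usage_start_time in this batch.
    new_watermark = batch.agg({"usage_start_time": "max"}).collect()[0][0]
    if new_watermark is not None:
        save_checkpoint("system.billing.usage", new_watermark)

    return batch
```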
After extracting, the collector pushes the data to LakeSentry over HTTPS. Each push includes an extraction ID for deduplication — re-running from the same checkpoint is safe.
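A push might look like the sketch below. The endpoint URL, payload shape, and checksum header are assumptions for the example, not the documented collector API.

```python
import hashlib
import json
import uuid

import requests

def push_batch(records, token, endpoint="https://ingest.lakesentry.example/v1/batches"):
    extraction_id = str(uuid.uuid4())  # lets the server reject duplicate re-sends
    body = json.dumps(
        {"extraction_id": extraction_id, "source": "system.billing.usage", "records": records},
        default=str,
    )
    checksum = hashlib.sha256(body.encode()).hexdigest()

    resp = requests.post(
        endpoint,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "X-Checksum-SHA256": checksum,  # validated during ingestion (Stage 2)
        },
        timeout=30,
    )
    resp.raise_for_status()
    return extraction_id
```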
Stage 2: Ingestion
When LakeSentry receives a data push from the collector, it:
- Authenticates the collector token
- Validates the record schema and checksum
- Inserts records into raw event tables (extraction ID deduplication rejects duplicates)
- Enqueues transformation jobs
Raw tables are append-only — records are inserted but never updated or deleted. This creates an immutable audit trail of everything the collector has sent.
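A minimal sketch of that deduplicating, append-only insert, using a sqlite3-style connection and illustrative table names rather than LakeSentry's actual schema:

```python
import json

def ingest_batch(conn, extraction_id, records):
    with conn:  # one transaction per push
        seen = conn.execute(
            "SELECT 1 FROM raw_extractions WHERE extraction_id = ?", (extraction_id,)
        ).fetchone()
        if seen:
            return 0  # duplicate push: rejected, nothing is updated or deleted

        conn.execute(
            "INSERT INTO raw_extractions (extraction_id) VALUES (?)", (extraction_id,)
        )
        conn.executemany(
            "INSERT INTO raw_billing_usage (extraction_id, payload) VALUES (?, ?)",
            [(extraction_id, json.dumps(r)) for r in records],
        )
        return len(records)
```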
Stage 3: Transformation
Background workers transform raw events into the ledger — LakeSentry’s canonical business model. This is where raw Databricks data becomes structured cost intelligence.
Key transformations:
| Raw source | Ledger target | What happens |
|---|---|---|
| billing.usage | usage_line_item | Maps billing records to cost line items with price lookups |
| lakeflow.jobs | work_unit | Creates canonical job entities |
| lakeflow.job_run_timeline | work_unit_run | Maps job runs to work units with extracted metrics |
| compute.clusters | cluster | Tracks cluster configuration changes over time |
| query.history | query_history | Maps queries to work units where linkable |
Transforms run in dependency order — reference tables (workspaces, warehouses, SKU prices) are built first since other tables depend on them. Then identity mappings are resolved. Finally, derived tables like usage line items and work unit runs are populated.
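A hedged sketch of that ordering, with illustrative step names standing in for the real transform jobs:

```python
def run_transforms(steps):
    """Run ledger transforms strictly in dependency order.

    `steps` is an ordered list of (name, callable): reference tables first,
    then identity mappings, then derived tables.
    """
    for name, step in steps:
        step()  # each step reads raw + earlier ledger tables and writes only its own table


# Example ordering mirroring the tables above (callables are hypothetical stubs):
LEDGER_STEPS = [
    ("workspaces",      lambda: None),  # reference tables first
    ("sku_prices",      lambda: None),
    ("identity_map",    lambda: None),  # then identity resolution
    ("usage_line_item", lambda: None),  # then derived tables
    ("work_unit_run",   lambda: None),
]

run_transforms(LEDGER_STEPS)
```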
Stage 4: Analysis
Once the ledger is up to date, analysis workers compute metrics and generate insights. This stage runs on a schedule:
| Job | Frequency | What it produces |
|---|---|---|
| Cost attribution | After each transform | Assigns costs to teams using attribution rules |
| Metric aggregation | After each transform / daily | Pre-computed cost rollups, utilization summaries, query analytics |
| Anomaly detection | Hourly | Flags unusual cost spikes using Z-score analysis |
| Waste detection | Hourly | Identifies idle clusters, orphaned resources, oversized compute |
| Significance scoring | Daily | Ranks work units by cost impact, frequency, and reliability |
| Action plan generation | Every 30 minutes | Creates optimization recommendations from insights |
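As a rough illustration of the Z-score approach used by the anomaly detection job, consider the sketch below; the 30-day window and 3.0 threshold are illustrative values, not LakeSentry's tuned parameters.

```python
from statistics import mean, stdev

def detect_cost_spikes(daily_costs, threshold=3.0, window=30):
    """Return (index, z-score) for days whose cost deviates more than
    `threshold` standard deviations from the trailing window's mean."""
    anomalies = []
    for i in range(window, len(daily_costs)):
        baseline = daily_costs[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            continue  # flat baseline: no meaningful z-score
        z = (daily_costs[i] - mu) / sigma
        if z > threshold:
            anomalies.append((i, round(z, 2)))
    return anomalies
```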
Analysis produces two types of output:
- Metrics — Pre-aggregated tables that power dashboards and reports. These make queries fast by pre-computing common aggregations (cost by team per day, cluster utilization, query performance).
- Insights — Actionable findings about your Databricks spend. Each insight has a type, severity, and evidence explaining why it was flagged. See Anomaly Detection and Waste Detection for details.
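For illustration only, an insight record might be shaped like this; the field names are assumptions based on the description above, not LakeSentry's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Insight:
    insight_type: str  # e.g. "cost_anomaly" or "idle_cluster"
    severity: str      # e.g. "low", "medium", "high"
    evidence: dict = field(default_factory=dict)  # why it was flagged (metrics, thresholds)
```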
Stage 5: Execution
When insights have associated action plans, you can approve and execute them. Execution calls Databricks APIs to apply changes — terminating idle clusters, adjusting autoscaling, or configuring warehouse schedules.
This stage is entirely opt-in. LakeSentry runs read-only by default. See Action Plans & Automation for the safety model.
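As an example of what an approved action could look like at the API level, the sketch below terminates a cluster through the Databricks Clusters API (POST /api/2.0/clusters/delete); the wrapper itself is illustrative, and approval checks and error handling are omitted.

```python
import requests

def terminate_idle_cluster(host, token, cluster_id):
    # Databricks "delete" terminates the cluster; it does not remove its configuration.
    resp = requests.post(
        f"{host}/api/2.0/clusters/delete",
        headers={"Authorization": f"Bearer {token}"},
        json={"cluster_id": cluster_id},
        timeout=30,
    )
    resp.raise_for_status()
```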
Data freshness
After connecting your account, here’s what to expect:
| Milestone | Typical timing |
|---|---|
| First raw data appears | Within 15 minutes of collector’s first run |
| Ledger populated, dashboard shows cost data | Within 30 minutes |
| Initial insights (anomalies, waste) | Within a few hours |
| Significance scores and baselines | After first daily computation (overnight) |
Ongoing data freshness depends on the collector schedule. With the default 15-minute cycle:
- Cost data is at most ~30 minutes old (up to 15 minutes until the next collector run, plus processing time)
- Insights are recomputed hourly from the latest ledger data
- Significance scores and baselines refresh daily
You can check data freshness on the Connectors page, which shows the last ingestion time and health status for each region connector.
Why sequential?
LakeSentry’s pipeline is strictly sequential — each layer builds from the previous layer, with no parallel writes across layers. This is a deliberate design choice:
- Debuggability — When something looks wrong in a metric, you can trace it back through the ledger to the raw data.
- Rebuildability — The entire pipeline can be reconstructed from raw data. Truncate a metric table and re-run the worker to get identical results (see the sketch after this list).
- Consistency — No race conditions between writers. The ledger is always consistent because only transform workers write to it.
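A minimal sketch of that rebuild property, with hypothetical table and function names; `recompute_from_ledger` stands in for a worker that reads only ledger tables:

```python
def rebuild_daily_cost_rollup(conn, recompute_from_ledger):
    # Wipe only the derived metrics layer; raw and ledger tables are untouched.
    conn.execute("DELETE FROM metrics_daily_cost_rollup")
    for row in recompute_from_ledger():  # e.g. aggregates over usage_line_item
        conn.execute(
            "INSERT INTO metrics_daily_cost_rollup (day, team, cost) VALUES (?, ?, ?)",
            (row["day"], row["team"], row["cost"]),
        )
    conn.commit()  # because only this worker writes here, the result is identical
```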
Next steps
- Cost Attribution & Confidence Tiers — How costs are assigned to teams
- Anomaly Detection — How cost spikes are identified
- Metrics & Aggregations — How metrics are computed and refreshed