
How LakeSentry Works

LakeSentry transforms raw Databricks system table data into cost insights through a sequential data pipeline. Understanding this pipeline helps you reason about data freshness, troubleshoot missing data, and know when to expect insights after connecting your account.

Data flows through five stages, each building on the previous one:

Extraction → Ingestion → Transformation → Analysis → Execution

Extraction runs in your Databricks workspace; the remaining stages run in LakeSentry.
| Stage | Where it runs | What it does | Output |
| --- | --- | --- | --- |
| Extraction | Your Databricks workspace | Collector reads system tables | Raw data pushed to LakeSentry |
| Ingestion | LakeSentry | Validates, deduplicates, stores | Raw event tables |
| Transformation | LakeSentry | Normalizes into canonical model | Ledger tables (workspaces, clusters, jobs, costs) |
| Analysis | LakeSentry | Computes metrics, detects anomalies and waste | Metrics tables, insights |
| Execution | LakeSentry → Databricks | Runs approved optimization actions | Actions applied to your infrastructure |

Each stage writes only to its own layer. Raw data is never modified after ingestion. The ledger is built entirely from raw data. Metrics are computed from the ledger. This means the entire pipeline can be rebuilt from raw data if needed.

The collector is a lightweight Python job deployed in your Databricks workspace. It runs on a schedule (every 15 minutes by default) and reads from Databricks system tables.

What it reads:

| Category | Tables | Scope |
| --- | --- | --- |
| Billing | system.billing.usage, system.billing.list_prices | Account-wide |
| Compute | system.compute.clusters, system.compute.warehouse_events, system.compute.node_timeline | Regional |
| Jobs & Pipelines | system.lakeflow.jobs, system.lakeflow.job_run_timeline, system.lakeflow.pipelines | Regional |
| Queries | system.query.history | Regional |
| ML | system.serving.served_entities, system.serving.endpoint_usage | Regional |
| Metadata | system.access.workspaces_latest | Account-wide |

The collector uses checkpoint-based incremental extraction. Each run picks up where the last one left off using watermark columns (like usage_start_time for billing or start_time for queries). Small reference tables (like list prices and node types) are extracted as full snapshots.
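
As a rough illustration, one incremental run looks like the sketch below. The function, checkpoint handling, and return shape are assumptions for illustration; only the system table and its watermark column come from the table above.

```python
# Sketch of one incremental billing extraction inside a Databricks job.
# The function and return shape are illustrative, not the collector's
# actual code; only the system table and watermark column are real.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def extract_billing_increment(last_watermark):
    """Read billing rows newer than the checkpointed watermark."""
    batch = (
        spark.table("system.billing.usage")
        .where(F.col("usage_start_time") > F.lit(last_watermark))
    )
    # Advance the watermark to the newest row read; re-running from the
    # same checkpoint re-sends the same rows, which is safe because
    # ingestion deduplicates on extraction ID.
    newest = batch.agg(F.max("usage_start_time")).first()[0]
    return batch, newest or last_watermark
```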

After extracting, the collector pushes the data to LakeSentry over HTTPS. Each push includes an extraction ID for deduplication — re-running from the same checkpoint is safe.
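
The push itself is conceptually simple. The endpoint URL, payload fields, and auth scheme below are placeholders rather than the documented API; the point is that the extraction ID is derived from the checkpoint, so a retry produces the same ID and can be deduplicated on the receiving side.

```python
# Illustrative HTTPS push of one extracted batch. Endpoint, payload
# fields, and auth header are placeholders, not the documented API.
import hashlib
import json
import requests

def push_batch(records, source_table, checkpoint, token):
    body = json.dumps(records).encode("utf-8")
    # Deterministic extraction ID: re-running from the same checkpoint
    # yields the same ID, so the server can reject the duplicate.
    extraction_id = hashlib.sha256(f"{source_table}:{checkpoint}".encode()).hexdigest()
    resp = requests.post(
        "https://ingest.lakesentry.example/v1/push",   # placeholder URL
        json={
            "extraction_id": extraction_id,
            "source_table": source_table,
            "checkpoint": checkpoint,
            "checksum": hashlib.sha256(body).hexdigest(),
            "records": records,
        },
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
```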

When LakeSentry receives a data push from the collector, it:

  1. Authenticates the collector token
  2. Validates the record schema and checksum
  3. Inserts records into raw event tables (extraction ID deduplication rejects duplicates)
  4. Enqueues transformation jobs

Raw tables are append-only — records are inserted but never updated or deleted. This creates an immutable audit trail of everything the collector has sent.
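
A minimal model of that behavior, assuming deduplication is keyed on the extraction ID; in practice this would be a unique constraint on the raw table rather than an in-memory set.

```python
# Toy model of append-only ingestion with extraction-ID dedup. In a real
# store this is a unique constraint on the raw table, not a Python set.
_seen_extraction_ids = set()

def ingest(extraction_id, records, raw_table):
    """Append records exactly once per extraction ID."""
    if extraction_id in _seen_extraction_ids:
        return False                       # duplicate push: rejected, no-op
    raw_table.extend(                      # append-only: never update/delete
        {"extraction_id": extraction_id, **rec} for rec in records
    )
    _seen_extraction_ids.add(extraction_id)
    return True
```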

Background workers transform raw events into the ledger — LakeSentry’s canonical business model. This is where raw Databricks data becomes structured cost intelligence.

Key transformations:

| Raw source | Ledger target | What happens |
| --- | --- | --- |
| billing.usage | usage_line_item | Maps billing records to cost line items with price lookups |
| lakeflow.jobs | work_unit | Creates canonical job entities |
| lakeflow.job_run_timeline | work_unit_run | Maps job runs to work units with extracted metrics |
| compute.clusters | cluster | Tracks cluster configuration changes over time |
| query.history | query_history | Maps queries to work units where linkable |
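
The first row of that table, for example, amounts to joining each usage record to the list price in effect and multiplying out the cost. A hedged sketch in Spark SQL terms follows; the real transform runs against LakeSentry's ingested raw copies, and the output columns and simplified join keys are assumptions.

```python
# Conceptual shape of the billing.usage → usage_line_item transform:
# join each usage record to the list price in effect at its start time.
# Output column names and the simplified join keys are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

usage_line_items = spark.sql("""
    SELECT
        u.workspace_id,
        u.sku_name,
        u.usage_start_time,
        u.usage_end_time,
        u.usage_quantity,
        u.usage_quantity * p.pricing.default AS list_cost
    FROM system.billing.usage AS u
    JOIN system.billing.list_prices AS p
      ON  u.sku_name = p.sku_name
      AND u.usage_start_time >= p.price_start_time
      AND (p.price_end_time IS NULL OR u.usage_start_time < p.price_end_time)
""")
```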

Transforms run in dependency order — reference tables (workspaces, warehouses, SKU prices) are built first since other tables depend on them. Then identity mappings are resolved. Finally, derived tables like usage line items and work unit runs are populated.
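
One way to picture that ordering is as a topological sort over a transform dependency graph; the job names and edges below are illustrative examples, not the actual worker graph.

```python
# Illustrative dependency-ordered scheduling of transform jobs using a
# topological sort. Job names and edges are examples only.
from graphlib import TopologicalSorter

transform_deps = {
    "workspace":       [],                                   # reference tables
    "sku_price":       [],
    "warehouse":       ["workspace"],
    "cluster":         ["workspace"],
    "work_unit":       ["workspace"],                         # identity mappings
    "work_unit_run":   ["work_unit", "cluster"],              # derived tables
    "usage_line_item": ["sku_price", "cluster", "work_unit_run"],
}

for job in TopologicalSorter(transform_deps).static_order():
    print("run transform:", job)    # references first, derived tables last
```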

Once the ledger is up to date, analysis workers compute metrics and generate insights. This stage runs on a schedule:

| Job | Frequency | What it produces |
| --- | --- | --- |
| Cost attribution | After each transform | Assigns costs to teams using attribution rules |
| Metric aggregation | After each transform / daily | Pre-computed cost rollups, utilization summaries, query analytics |
| Anomaly detection | Hourly | Flags unusual cost spikes using Z-score analysis (sketched below) |
| Waste detection | Hourly | Identifies idle clusters, orphaned resources, oversized compute |
| Significance scoring | Daily | Ranks work units by cost impact, frequency, and reliability |
| Action plan generation | Every 30 minutes | Creates optimization recommendations from insights |
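
As a concrete example of the anomaly check, a Z-score test flags a day whose cost sits several standard deviations above its trailing baseline. The threshold, window, and function below are illustrative, not the production detector.

```python
# Minimal Z-score spike check of the kind the hourly anomaly job runs.
# Threshold, window, and function name are illustrative.
from statistics import mean, stdev

def is_cost_spike(history, today, threshold=3.0):
    """Flag `today` if it is more than `threshold` standard deviations
    above the mean of the trailing history."""
    if len(history) < 2:
        return False                      # not enough data for a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today > mu                 # flat history: any increase stands out
    return (today - mu) / sigma > threshold

# Example: a steady ~$100/day baseline, then a $260 day.
print(is_cost_spike([98.0, 102.0, 99.5, 101.0, 100.2], 260.0))  # True
```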

Analysis produces two types of output:

  • Metrics — Pre-aggregated tables that power dashboards and reports. These make queries fast by pre-computing common aggregations (cost by team per day, cluster utilization, query performance); one such rollup is sketched after this list.
  • Insights — Actionable findings about your Databricks spend. Each insight has a type, severity, and evidence explaining why it was flagged. See Anomaly Detection and Waste Detection for details.
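
For example, a "cost by team per day" metric is just a pre-computed rollup over the ledger. The table and column names below are assumptions about the ledger schema, not its real shape.

```python
# Illustrative "cost by team per day" rollup from the ledger into a
# metrics table. Table and column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

daily_cost_by_team = (
    spark.table("ledger.usage_line_item")
    .groupBy(F.to_date("usage_start_time").alias("usage_date"), "team")
    .agg(F.sum("list_cost").alias("total_cost"))
)
daily_cost_by_team.write.mode("overwrite").saveAsTable("metrics.daily_cost_by_team")
```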

When insights have associated action plans, you can approve and execute them. Execution calls Databricks APIs to apply changes — terminating idle clusters, adjusting autoscaling, or configuring warehouse schedules.

This stage is entirely opt-in. LakeSentry runs read-only by default. See Action Plans & Automation for the safety model.

After connecting your account, here’s what to expect:

| Milestone | Typical timing |
| --- | --- |
| First raw data appears | Within 15 minutes of collector’s first run |
| Ledger populated, dashboard shows cost data | Within 30 minutes |
| Initial insights (anomalies, waste) | Within a few hours |
| Significance scores and baselines | After first daily computation (overnight) |

Ongoing data freshness depends on the collector schedule. With the default 15-minute cycle:

  • Cost data is at most ~30 minutes old (15 minutes for extraction + processing time)
  • Insights are recomputed hourly from the latest ledger data
  • Significance scores and baselines refresh daily

You can check data freshness on the Connectors page, which shows the last ingestion time and health status for each region connector.

LakeSentry’s pipeline is strictly sequential — each layer builds from the previous layer, with no parallel writes across layers. This is a deliberate design choice:

  • Debuggability — When something looks wrong in a metric, you can trace it back through the ledger to the raw data.
  • Rebuildability — The entire pipeline can be reconstructed from raw data. Truncate a metric table and re-run the worker to get identical results.
  • Consistency — No race conditions between writers. The ledger is always consistent because only transform workers write to it.