Waste Detection & Insights

LakeSentry continuously scans your Databricks environment for wasted spend — resources that are running but not doing useful work, infrastructure that’s oversized for its actual load, and configurations that cost more than they need to.

LakeSentry looks for waste across several categories:

Interactive clusters that are in a RUNNING state but haven’t executed any jobs for an extended period. This is the most common type of Databricks waste — clusters left running after a developer finishes their work, or all-purpose clusters with auto-termination disabled.

| Idle duration | Severity |
|---|---|
| 2+ hours | Medium |
| 6+ hours | High |
| 12+ hours | Critical |

Each idle cluster insight includes an estimated waste amount in dollars, calculated from the cluster’s hourly cost multiplied by the idle duration.
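The estimate and the severity mapping above can be sketched in a few lines. This is an illustrative sketch, not LakeSentry's actual code; the function name and return shape are assumptions:

```python
def estimate_idle_waste(hourly_cost_usd: float, idle_hours: float) -> dict:
    """Estimate wasted spend for an idle cluster and map idle time to
    severity, mirroring the thresholds in the table above (illustrative)."""
    if idle_hours >= 12:
        severity = "critical"
    elif idle_hours >= 6:
        severity = "high"
    elif idle_hours >= 2:
        severity = "medium"
    else:
        severity = None  # below the detection threshold; no insight created
    return {
        "estimated_waste_usd": round(hourly_cost_usd * idle_hours, 2),
        "severity": severity,
    }
```

For example, a cluster costing $4/hour that has sat idle for 8 hours yields roughly $32 of estimated waste at high severity.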

Fixed-size clusters with more workers than their actual utilization requires. LakeSentry analyzes CPU and memory utilization over time and recommends a reduced worker count.

The detection uses a median-based approach rather than simple averaging, which handles bursty workloads better. A cluster that spikes to 100% CPU for 5 minutes per hour but sits at 10% the rest of the time doesn’t need to be sized for the peak.

The algorithm:

  1. For each time interval, calculate the minimum workers needed to keep CPU below 85% and memory below 90%
  2. Take the median across all intervals
  3. Compare against the current worker count
| Excess workers | Severity |
|---|---|
| 3+ workers | High |
| 2 workers | Medium |
| 1 worker | Low |
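The three steps above can be sketched as follows. This is a simplified illustration, assuming load redistributes evenly when the cluster is resized; the function and its signature are not LakeSentry's actual API:

```python
import math
from statistics import median

def recommend_workers(samples, current_workers,
                      cpu_limit=0.85, mem_limit=0.90):
    """Median-based right-sizing sketch (illustrative).

    `samples` is a list of (cpu_util, mem_util) tuples in [0, 1], one per
    time interval, measured across the cluster's current workers.
    """
    needed = []
    for cpu, mem in samples:
        # Step 1: minimum workers that keep CPU below 85% and memory
        # below 90% in this interval, assuming even load redistribution.
        n = max(
            math.ceil(current_workers * cpu / cpu_limit),
            math.ceil(current_workers * mem / mem_limit),
            1,
        )
        needed.append(n)
    # Step 2: median across all intervals, so brief spikes don't dominate.
    recommended = int(median(needed))
    # Step 3: compare against the current worker count.
    excess = current_workers - recommended
    severity = ("high" if excess >= 3 else "medium" if excess == 2
                else "low" if excess == 1 else None)
    return recommended, severity
```

An 8-worker cluster that idles near 10–20% utilization except for one brief spike gets a recommendation of 2 workers: the spike interval demands many workers, but the median ignores it.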

Non-production workspaces with significant spend during weekends. This flags environments where dev/staging clusters could be shut down when nobody is working.

Model serving endpoints that haven’t received any inference requests in 90+ days. These endpoints incur cost even without traffic, and may have been left running after a model was retired.

Interactive clusters with auto-termination disabled or set above 120 minutes. Clusters with auto-termination disabled are flagged as critical severity since they will run indefinitely until manually stopped. Long timeout values (above 120 minutes) mean clusters stay running (and billing) long after the last user disconnects.
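A minimal sketch of this hygiene check, assuming the Databricks convention that `autotermination_minutes = 0` means auto-termination is disabled. The medium severity for long timeouts is an assumption; the document only specifies critical for disabled auto-termination:

```python
def auto_termination_insight(autotermination_minutes: int):
    """Classify a cluster's auto-termination setting (illustrative sketch)."""
    if autotermination_minutes == 0:
        # Disabled: the cluster runs indefinitely until manually stopped.
        return "critical"
    if autotermination_minutes > 120:
        # Assumed severity for long timeouts; not specified in the docs.
        return "medium"
    return None  # within policy; no insight created
```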

ON_DEMAND clusters running workloads that could tolerate spot/preemptible instances. Spot instances typically cost 60–90% less than on-demand pricing for fault-tolerant workloads.

Clusters with 1 worker that could run in single-node mode. Single-node clusters avoid the overhead of a separate driver and worker, reducing costs for workloads that don’t need distributed compute.

Clusters running non-current Databricks Runtime versions. Newer runtimes often include performance improvements that can reduce cost for the same workload.

Waste detection runs on a schedule — some detections (idle clusters, zombie models, weekend waste) run hourly, while most hygiene and optimization detections (overprovisioned workers, auto-termination, spot candidates, single-node candidates, outdated runtime) run daily. Each detection algorithm queries the latest ledger and metrics data, evaluates conditions, and creates insights for any resources that meet the criteria.

To reduce noise, cluster-related insights (auto-termination, spot candidates, outdated runtime) are only generated for clusters that:

  • Had activity in the last 30 days, OR
  • Were created in the last 7 days

This prevents LakeSentry from generating insights for long-dormant clusters that nobody cares about.
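The eligibility filter amounts to a simple OR of two recency checks. A sketch, with illustrative names and timestamps assumed to be timezone-aware:

```python
from datetime import datetime, timedelta, timezone

def is_insight_eligible(last_activity, created_at, now=None):
    """Noise filter sketch: a cluster qualifies for hygiene insights only
    if it was active in the last 30 days OR created in the last 7 days.
    `last_activity` may be None for clusters with no recorded activity."""
    now = now or datetime.now(timezone.utc)
    recently_active = (last_activity is not None
                       and now - last_activity <= timedelta(days=30))
    recently_created = now - created_at <= timedelta(days=7)
    return recently_active or recently_created
```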

If LakeSentry already has an active insight for the same resource and issue type, it won’t create a duplicate. Existing insights are updated with fresh evidence (like a new idle duration) rather than replaced.
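The dedup behavior is an upsert keyed on resource and issue type. A sketch with an in-memory dict standing in for the insight store (all names illustrative):

```python
def upsert_insight(active_insights, resource_id, issue_type, evidence):
    """Refresh an existing active insight instead of creating a duplicate.

    `active_insights` maps (resource_id, issue_type) -> insight dict;
    this stands in for whatever store LakeSentry actually uses.
    """
    key = (resource_id, issue_type)
    if key in active_insights:
        # Existing insight: update with fresh evidence, e.g. a new idle duration.
        active_insights[key]["evidence"] = evidence
        active_insights[key]["updated"] = True
    else:
        active_insights[key] = {"evidence": evidence, "updated": False}
    return active_insights[key]
```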

Waste insights use the same severity scale as anomalies:

| Severity | Meaning |
|---|---|
| Critical | Large financial impact or long-running waste (e.g., 12+ hour idle cluster) |
| High | Significant waste worth addressing soon (e.g., 6+ hour idle, 3+ excess workers) |
| Medium | Moderate waste (e.g., 2+ hour idle, 2 excess workers) |
| Low | Minor optimization opportunity |
| Info | Informational finding, no immediate action needed |

Like anomalies, waste insights carry a confidence score based on the amount of data available:

| Data quality | Confidence | Context |
|---|---|---|
| 100+ utilization samples | 95% | Strong data — recommendation is highly reliable |
| 50–99 samples | 90% | Good data — recommendation is reliable |
| 20–49 samples | 75% | Moderate data — recommendation is likely correct |
| Fewer than 20 samples | 60% | Limited data — take recommendation with caution |
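The mapping above is a step function of sample count. A direct transcription (function name illustrative):

```python
def confidence_from_samples(sample_count: int) -> float:
    """Map utilization-sample count to a confidence score,
    following the tiers in the table above."""
    if sample_count >= 100:
        return 0.95
    if sample_count >= 50:
        return 0.90
    if sample_count >= 20:
        return 0.75
    return 0.60
```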

Where possible, LakeSentry calculates estimated savings for each waste insight. These estimates are based on:

  • Current resource cost (from billing data)
  • The nature of the waste (idle time, excess workers, on-demand vs. spot pricing)
  • Historical utilization patterns

Estimated savings appear in the insight detail view and in action plans generated from the insight.
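For a spot-candidate insight, a back-of-the-envelope savings estimate might look like the following. The 60% discount is an assumed conservative value from the 60–90% range quoted earlier; LakeSentry's actual savings model is not described here:

```python
def estimate_spot_savings(monthly_on_demand_cost: float,
                          spot_discount: float = 0.60) -> float:
    """Illustrative monthly savings if a workload moved to spot instances.

    `spot_discount` defaults to the conservative end of the 60-90%
    discount range; this is an assumption, not LakeSentry's model.
    """
    return round(monthly_on_demand_cost * spot_discount, 2)
```

A cluster spending $1,000/month on demand would show roughly $600/month in estimated savings under this conservative assumption.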

Waste insights are automatically resolved when the condition is no longer true. If you terminate an idle cluster, the insight resolves on the next detection cycle. If a cluster’s utilization increases to match its provisioning, the overprovisioned-workers insight resolves.

Snooze an insight if you’re aware of the issue but can’t address it right now. Snoozed insights automatically become active again after the snooze period expires.

Dismiss an insight if it’s not actionable for your situation — maybe the cluster needs to stay running for operational reasons, or the cost is acceptable. You can also set up auto-dismiss rules to automatically dismiss insights matching certain patterns.

Many waste insights have associated action plans that can be executed directly from LakeSentry. For example, an idle cluster insight may offer a “Terminate cluster” action plan.