Anomaly Detection

LakeSentry automatically detects unusual cost patterns in your Databricks spend. When a job suddenly costs 3x its normal amount, or a workspace’s daily spend spikes overnight, LakeSentry flags it as an anomaly with evidence explaining why.

LakeSentry uses Z-score analysis — a statistical method that compares recent values against a historical baseline. The idea is straightforward: if a job’s recent cost is far enough from its average cost, something unusual is happening.

z_score = (recent_value - baseline_average) / baseline_standard_deviation

A higher Z-score means the recent value is further from what’s normal. LakeSentry flags a value as anomalous when the Z-score exceeds a threshold (default: 2.0 for cost spikes, 2.5 for duration anomalies).

To put that in context with a normal distribution:

| Z-score | What it means | Probability of occurring naturally |
|---------|---------------|-------------------------------------|
| 2.0 | Notably higher than average (cost spike threshold) | ~2.3% |
| 2.5 | Unusually high (duration anomaly threshold) | ~0.6% |
| 3.0 | Very unusual | ~0.13% |
| 4.0 | Extremely unusual | ~0.003% |
| 5.0+ | Almost certainly not random variation | ~0.00003% |

The baseline is computed from the previous 30 days of data. LakeSentry requires at least 5 data points to establish a valid baseline — without enough history, it can’t tell what’s “normal” versus what’s a spike.
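
For illustration, here is a minimal sketch of that check in Python. The function and constant names are ours, not LakeSentry's; it only mirrors the formula, the 5-data-point minimum, and the default thresholds described above.

```python
from statistics import mean, stdev

# Illustrative defaults mirroring the documentation; not LakeSentry internals.
MIN_BASELINE_POINTS = 5        # minimum history needed for a valid baseline
COST_SPIKE_Z_THRESHOLD = 2.0   # default threshold for cost spikes
DURATION_Z_THRESHOLD = 2.5     # default threshold for duration anomalies

def z_score(recent_value: float, baseline: list[float]) -> float | None:
    """Z-score of recent_value against a historical baseline, or None if
    the baseline is too small or has no variance."""
    if len(baseline) < MIN_BASELINE_POINTS:
        return None
    baseline_std = stdev(baseline)
    if baseline_std == 0:
        return None
    return (recent_value - mean(baseline)) / baseline_std

# Example: a job that normally costs ~$100 per day suddenly costs $115.
baseline_costs = [95.0, 102.0, 98.0, 110.0, 99.0, 101.0, 97.0]
z = z_score(115.0, baseline_costs)
if z is not None and z > COST_SPIKE_Z_THRESHOLD:
    print(f"Cost spike: z = {z:.1f}")   # prints roughly z = 3.0
```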

For cost spike detection specifically, LakeSentry uses a dual-trigger approach:

  • Z-score trigger — Z-score exceeds 2.0
  • Multiplier trigger — Recent average cost exceeds 2.5x the baseline average

Either trigger is sufficient to flag an anomaly. The multiplier trigger catches cases where the Z-score might be low due to high variance in the baseline.
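
A sketch of that dual trigger, with constant names of our own choosing:

```python
# Either condition alone flags the work unit (illustrative, not LakeSentry's code).
COST_SPIKE_Z_THRESHOLD = 2.0
COST_MULTIPLIER_THRESHOLD = 2.5

def is_cost_spike(z: float, recent_avg: float, baseline_avg: float) -> bool:
    z_trigger = z > COST_SPIKE_Z_THRESHOLD
    multiplier_trigger = recent_avg > COST_MULTIPLIER_THRESHOLD * baseline_avg
    return z_trigger or multiplier_trigger
```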

To avoid flagging insignificant fluctuations, anomaly detection applies minimum thresholds:

  • Minimum baseline cost: $10 — Work units with a baseline under $10 aren’t evaluated
  • Minimum cost delta: $50 — The absolute cost increase must be at least $50

These thresholds mean LakeSentry focuses on anomalies that matter financially, not statistical noise on low-cost workloads.
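
In code, these pre-filters might look like the sketch below; the dollar values match the documented defaults, while the function name is hypothetical.

```python
# Illustrative materiality filters; work units failing either check are skipped.
MIN_BASELINE_COST = 10.0   # baselines under $10 aren't evaluated
MIN_COST_DELTA = 50.0      # absolute increase must be at least $50

def is_material(baseline_avg: float, recent_avg: float) -> bool:
    return (baseline_avg >= MIN_BASELINE_COST
            and recent_avg - baseline_avg >= MIN_COST_DELTA)
```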

Each cost spike anomaly is assigned a severity based on the Z-score and cost multiplier (whichever is more extreme):

| Severity | Z-score or multiplier | What it suggests |
|----------|-----------------------|------------------|
| Critical | Z-score > 5.0 or multiplier > 5x | Extreme deviation — likely a configuration change, runaway job, or billing error |
| High | Z-score > 4.0 or multiplier > 4x | Major deviation from normal — warrants immediate investigation |
| Medium | Below the above thresholds | Notable increase — worth reviewing but may resolve on its own |
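
A sketch of that mapping, assuming the thresholds in the table:

```python
def severity(z: float, multiplier: float) -> str:
    """Severity from whichever signal is more extreme (illustrative only)."""
    if z > 5.0 or multiplier > 5.0:
        return "critical"
    if z > 4.0 or multiplier > 4.0:
        return "high"
    return "medium"
```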

LakeSentry assigns a confidence score to each anomaly based on the strength of the signal. For cost spike anomalies, confidence is derived from the Z-score and cost multiplier — a stronger statistical signal or a higher cost ratio yields higher confidence, capped at 90%. For other anomaly types (duration anomalies, failure rate spikes, warehouse spend), confidence is calculated using type-specific formulas tied to the magnitude of the deviation.

A low-confidence anomaly isn’t necessarily wrong, but it means the statistical signal was weaker. New jobs or recently changed jobs that haven’t established a strong baseline will naturally produce weaker signals.
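
The exact confidence formula isn't documented here, but conceptually it behaves like the following sketch: a stronger Z-score or a higher cost ratio raises confidence, and the result is capped at 90%. The saturation points are assumptions.

```python
def cost_spike_confidence(z: float, multiplier: float) -> float:
    """Illustrative confidence mapping, not LakeSentry's actual formula."""
    z_signal = min(z / 5.0, 1.0)               # assume saturation at z = 5
    ratio_signal = min(multiplier / 5.0, 1.0)  # assume saturation at 5x baseline
    return round(min(0.9, 0.9 * max(z_signal, ratio_signal)), 2)
```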

When LakeSentry detects an anomaly, it creates an insight with evidence that includes:

  • Baseline average cost — What the work unit normally costs
  • Recent average cost — What it cost in the detection window
  • Cost delta — The absolute dollar increase
  • Cost multiplier — How many times higher than normal (e.g., 3.2x)
  • Z-score — The statistical measure of how unusual this is
  • Recent runs — How many runs occurred in the detection window

This evidence helps you quickly assess whether the anomaly needs investigation or is expected (like a planned capacity increase).
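
As a rough illustration, the evidence can be pictured as a record like this; the field names are assumptions based on the list above, not LakeSentry's schema.

```python
from dataclasses import dataclass

@dataclass
class CostSpikeEvidence:
    baseline_avg_cost: float   # what the work unit normally costs
    recent_avg_cost: float     # what it cost in the detection window
    cost_delta: float          # absolute dollar increase
    cost_multiplier: float     # e.g. 3.2 means 3.2x the baseline
    z_score: float             # how statistically unusual the spike is
    recent_runs: int           # runs observed in the detection window
```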

LakeSentry monitors for anomalies across multiple dimensions:

| Anomaly type | What it detects |
|--------------|-----------------|
| Work unit cost spike | A job or pipeline’s per-run cost is significantly higher than its baseline |
| Duration anomaly | A work unit’s run duration is significantly longer than its baseline |
| Failure rate spike | A work unit’s failure rate has increased significantly over the baseline |
| Warehouse spend spike | A SQL warehouse’s spend has increased significantly compared to the prior period |
| Serving endpoint spike | A serving endpoint’s spend has increased significantly week-over-week |
| Attribution declining | The percentage of unattributed cost is increasing week-over-week |
| Budget risk | Projected spend is on track to exceed a configured budget |

Beyond individual anomalies, LakeSentry computes a significance score (0–100) for every work unit. This helps you focus on what matters most, not just what spiked recently.

The significance score combines three factors:

| Factor | Weight | What it measures |
|--------|--------|------------------|
| Cost impact | 40% | Month-to-date spend relative to the highest spender (log-normalized) |
| Execution frequency | 35% | How often the work unit runs relative to others |
| Failure rate | 25% | How often runs fail (higher failure rate = higher significance) |

Based on the composite score, work units are categorized:

| Score range | Category | Meaning |
|-------------|----------|---------|
| 90–100 | Top Spender | Top 10% by cost — always worth monitoring |
| 70–89 | High Impact | Significant cost or frequency |
| 40–69 | Medium Impact | Average cost and frequency |
| 0–39 | Low Impact | Rarely runs or low cost |

Significance scores refresh daily and appear as badges in the work unit list, helping you prioritize which anomalies to investigate first.
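
Putting the weights and ranges together, a composite like this could be computed roughly as follows; the weights and category cutoffs come from the tables above, while the normalization details are assumptions.

```python
import math

def significance_score(mtd_cost: float, max_mtd_cost: float,
                       run_count: int, max_run_count: int,
                       failure_rate: float) -> float:
    """Composite 0-100 score from cost impact, frequency, and failure rate."""
    # Log-normalize month-to-date spend relative to the highest spender.
    cost_impact = (math.log1p(mtd_cost) / math.log1p(max_mtd_cost)
                   if max_mtd_cost > 0 else 0.0)
    frequency = run_count / max_run_count if max_run_count > 0 else 0.0
    return round(100 * (0.40 * cost_impact + 0.35 * frequency + 0.25 * failure_rate), 1)

def category(score: float) -> str:
    if score >= 90:
        return "Top Spender"
    if score >= 70:
        return "High Impact"
    if score >= 40:
        return "Medium Impact"
    return "Low Impact"
```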

Anomaly insights follow a lifecycle:

  1. Active — A new anomaly was detected and needs attention.
  2. Snoozed — You’ve acknowledged it but want to revisit later. Auto-unsnoozes after the snooze period.
  3. Resolved — Either the condition is no longer true (auto-resolved), you executed an action to fix it (resolved by action), or you manually marked it resolved.
  4. Dismissed — You’ve determined it’s not actionable. LakeSentry tracks dismissals to improve detection.
  5. Superseded — A newer anomaly for the same resource replaced this one.
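
One way to picture these states is as a simple enumeration; the names below are ours, not LakeSentry's API.

```python
from enum import Enum

class InsightStatus(Enum):
    ACTIVE = "active"          # new anomaly, needs attention
    SNOOZED = "snoozed"        # acknowledged, auto-unsnoozes later
    RESOLVED = "resolved"      # auto-resolved, resolved by action, or manual
    DISMISSED = "dismissed"    # judged not actionable
    SUPERSEDED = "superseded"  # replaced by a newer anomaly for the same resource
```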

LakeSentry also supports auto-dismiss rules — configurable patterns that automatically dismiss insights matching certain criteria. This is useful for known exceptions (like monthly batch jobs that always spike).
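
The rule schema isn't documented here, but conceptually a rule pairs a match pattern with the anomaly type it applies to, roughly like this hypothetical example:

```python
# Hypothetical auto-dismiss rule; the real configuration format may differ.
monthly_batch_rule = {
    "anomaly_type": "work_unit_cost_spike",
    "match": {"work_unit_name": "monthly_close_*"},
    "reason": "Expected month-end batch spike",
}
```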