# Anomaly Detection
LakeSentry automatically detects unusual cost patterns in your Databricks spend. When a job suddenly costs 3x its normal amount, or a workspace’s daily spend spikes overnight, LakeSentry flags it as an anomaly with evidence explaining why.
## How anomalies are detected

LakeSentry uses Z-score analysis — a statistical method that compares recent values against a historical baseline. The idea is straightforward: if a job’s recent cost is far enough from its average cost, something unusual is happening.
### The Z-score formula

```
z_score = (recent_value - baseline_average) / baseline_standard_deviation
```

A higher Z-score means the recent value is further from what’s normal. LakeSentry flags a value as anomalous when the Z-score exceeds a threshold (default: 2.0 for cost spikes, 2.5 for duration anomalies).
To put that in context with a normal distribution:
| Z-score | What it means | Probability of occurring naturally |
|---|---|---|
| 2.0 | Notably higher than average (cost spike threshold) | ~2.3% |
| 2.5 | Unusually high (duration anomaly threshold) | ~0.6% |
| 3.0 | Very unusual | ~0.13% |
| 4.0 | Extremely unusual | ~0.003% |
| 5.0+ | Almost certainly not random variation | ~0.00003% |
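If you want to reproduce the calculation yourself, here is a minimal sketch in Python. The function and variable names are illustrative, not LakeSentry’s internal code:

```python
from statistics import mean, stdev

def z_score(recent_value: float, baseline: list[float]) -> float:
    """Compare a recent value against a historical baseline.

    `baseline` is a list of historical observations (e.g., daily costs).
    Returns how many standard deviations `recent_value` sits above
    (positive) or below (negative) the baseline average.
    """
    baseline_average = mean(baseline)
    baseline_std = stdev(baseline)   # sample standard deviation
    if baseline_std == 0:
        return 0.0                   # flat history: no meaningful deviation to measure
    return (recent_value - baseline_average) / baseline_std

# Example: a job that normally costs ~$100/day suddenly costs $160
history = [95, 102, 98, 105, 100, 99, 101]
print(z_score(160, history))         # far above the 2.0 cost-spike threshold
```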
### Baseline computation

The baseline is computed from the previous 30 days of data. LakeSentry requires at least 5 data points to establish a valid baseline — without enough history, it can’t tell what’s “normal” versus what’s a spike.
For cost spike detection specifically, LakeSentry uses a dual-trigger approach:
- Z-score trigger — Z-score exceeds 2.0
- Multiplier trigger — Recent average cost exceeds 2.5x the baseline average
Either trigger is sufficient to flag an anomaly. The multiplier trigger catches cases where the Z-score might be low due to high variance in the baseline.
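As a rough sketch of how the two triggers combine (the function signature and defaults here are assumptions for illustration, not LakeSentry’s internal API):

```python
def is_cost_spike(recent_avg: float, baseline_avg: float, baseline_std: float,
                  z_threshold: float = 2.0, multiplier_threshold: float = 2.5) -> bool:
    """Return True if either the Z-score trigger or the multiplier trigger fires."""
    z = (recent_avg - baseline_avg) / baseline_std if baseline_std > 0 else 0.0
    multiplier = recent_avg / baseline_avg if baseline_avg > 0 else 0.0
    # Z-score trigger OR multiplier trigger: either alone is enough to flag.
    return z > z_threshold or multiplier > multiplier_threshold
```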
### Minimum thresholds

To avoid flagging insignificant fluctuations, anomaly detection applies minimum thresholds:
- Minimum baseline cost: $10 — Work units with a baseline under $10 aren’t evaluated
- Minimum cost delta: $50 — The absolute cost increase must be at least $50
These thresholds mean LakeSentry focuses on anomalies that matter financially, not statistical noise on low-cost workloads.
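Continuing the sketch above, the gates would sit in front of the trigger check. The $10 and $50 values come from this section; the function itself is illustrative:

```python
MIN_BASELINE_COST = 10.0   # dollars: skip work units with a tiny baseline
MIN_COST_DELTA = 50.0      # dollars: require a meaningful absolute increase

def worth_evaluating(baseline_avg: float, recent_avg: float) -> bool:
    """Apply the minimum-threshold gates before running the anomaly triggers."""
    if baseline_avg < MIN_BASELINE_COST:
        return False                               # baseline too small to matter financially
    if (recent_avg - baseline_avg) < MIN_COST_DELTA:
        return False                               # increase exists but isn't significant
    return True
```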
## Severity levels

Each cost spike anomaly is assigned a severity based on the Z-score and cost multiplier (whichever is more extreme):
| Severity | Z-score or multiplier | What it suggests |
|---|---|---|
| Critical | Z-score > 5.0 or multiplier > 5x | Extreme deviation — likely a configuration change, runaway job, or billing error |
| High | Z-score > 4.0 or multiplier > 4x | Major deviation from normal — warrants immediate investigation |
| Medium | Below the above thresholds | Notable increase — worth reviewing but may resolve on its own |
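A minimal sketch of the mapping in the table above (the function is illustrative; only the thresholds and labels come from this section):

```python
def severity(z: float, multiplier: float) -> str:
    """Map a cost spike's Z-score and cost multiplier to a severity level.

    Uses whichever signal is more extreme, mirroring the table above.
    """
    if z > 5.0 or multiplier > 5.0:
        return "critical"
    if z > 4.0 or multiplier > 4.0:
        return "high"
    return "medium"
```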
## Confidence scoring

LakeSentry assigns a confidence score to each anomaly based on the strength of the signal. For cost spike anomalies, confidence is derived from the Z-score and cost multiplier — a stronger statistical signal or a higher cost ratio yields higher confidence, capped at 90%. For other anomaly types (duration anomalies, failure rate spikes, warehouse spend), confidence is calculated using type-specific formulas tied to the magnitude of the deviation.
A low-confidence anomaly isn’t necessarily wrong, but it means the statistical signal was weaker. New jobs or recently changed jobs that haven’t established a strong baseline will naturally produce weaker signals.
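The exact confidence formulas aren’t documented here, so the following is purely an illustrative sketch of the cost-spike case, assuming confidence scales with whichever of the Z-score or cost multiplier is stronger and is capped at 90%:

```python
def cost_spike_confidence(z: float, multiplier: float) -> float:
    """Illustrative only: stronger signals yield higher confidence, capped at 0.90."""
    z_component = min(z / 5.0, 1.0)               # saturates at Z = 5
    ratio_component = min(multiplier / 5.0, 1.0)  # saturates at 5x the baseline
    return round(0.90 * max(z_component, ratio_component), 2)
```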
## What anomalies look like

When LakeSentry detects an anomaly, it creates an insight with evidence that includes:
- Baseline average cost — What the work unit normally costs
- Recent average cost — What it cost in the detection window
- Cost delta — The absolute dollar increase
- Cost multiplier — How many times higher than normal (e.g., 3.2x)
- Z-score — The statistical measure of how unusual this is
- Recent runs — How many runs occurred in the detection window
This evidence helps you quickly assess whether the anomaly needs investigation or is expected (like a planned capacity increase).
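If you work with insights programmatically, the evidence corresponds roughly to a record like the one below. The field names are illustrative, not a documented schema:

```python
from dataclasses import dataclass

@dataclass
class CostSpikeEvidence:
    """Illustrative shape of the evidence attached to a cost spike insight."""
    baseline_avg_cost: float   # what the work unit normally costs per run
    recent_avg_cost: float     # average cost in the detection window
    cost_delta: float          # absolute dollar increase
    cost_multiplier: float     # e.g. 3.2 means 3.2x the baseline
    z_score: float             # statistical strength of the deviation
    recent_runs: int           # number of runs in the detection window
```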
## Types of anomalies detected

LakeSentry monitors for cost anomalies across multiple dimensions:
| Anomaly type | What it detects |
|---|---|
| Work unit cost spike | A job or pipeline’s per-run cost is significantly higher than its baseline |
| Duration anomaly | A work unit’s run duration is significantly longer than its baseline |
| Failure rate spike | A work unit’s failure rate has increased significantly over the baseline |
| Warehouse spend spike | A SQL warehouse’s spend has increased significantly compared to the prior period |
| Serving endpoint spike | A serving endpoint’s spend has increased significantly week-over-week |
| Attribution declining | The percentage of unattributed cost is increasing week-over-week |
| Budget risk | Projected spend is on track to exceed a configured budget |
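For reference, these anomaly types can be represented as a simple enumeration. The names below are descriptive placeholders, not LakeSentry identifiers:

```python
from enum import Enum

class AnomalyType(Enum):
    WORK_UNIT_COST_SPIKE = "work_unit_cost_spike"
    DURATION_ANOMALY = "duration_anomaly"
    FAILURE_RATE_SPIKE = "failure_rate_spike"
    WAREHOUSE_SPEND_SPIKE = "warehouse_spend_spike"
    SERVING_ENDPOINT_SPIKE = "serving_endpoint_spike"
    ATTRIBUTION_DECLINING = "attribution_declining"
    BUDGET_RISK = "budget_risk"
```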
## Significance scoring

Beyond individual anomalies, LakeSentry computes a significance score (0–100) for every work unit. This helps you focus on what matters most, not just what spiked recently.
The significance score combines three factors:
| Factor | Weight | What it measures |
|---|---|---|
| Cost impact | 40% | Month-to-date spend relative to highest spender (log-normalized) |
| Execution frequency | 35% | How often the work unit runs relative to others |
| Failure rate | 25% | How often runs fail (higher failure rate = higher significance) |
Based on the composite score, work units are categorized:
| Score range | Category | Meaning |
|---|---|---|
| 90–100 | Top Spender | Top 10% by cost — always worth monitoring |
| 70–89 | High Impact | Significant cost or frequency |
| 40–69 | Medium Impact | Average cost and frequency |
| 0–39 | Low Impact | Rarely runs or low cost |
Significance scores refresh daily and appear as badges in the work unit list, helping you prioritize which anomalies to investigate first.
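A simplified sketch of how the three weighted factors and the category thresholds could combine, assuming the frequency and failure-rate inputs are already normalized to 0–1 and using a basic log normalization for cost (the real normalization is LakeSentry-internal):

```python
import math

def significance_score(mtd_cost: float, max_mtd_cost: float,
                       normalized_frequency: float, failure_rate: float) -> int:
    """Combine cost impact (40%), execution frequency (35%), and failure rate (25%)."""
    # Log-normalize month-to-date spend relative to the highest spender (simplified).
    cost_impact = math.log1p(mtd_cost) / math.log1p(max_mtd_cost) if max_mtd_cost > 0 else 0.0
    score = 100 * (0.40 * cost_impact + 0.35 * normalized_frequency + 0.25 * failure_rate)
    return round(min(100, max(0, score)))

def category(score: int) -> str:
    """Map a composite score to the categories in the table above."""
    if score >= 90:
        return "Top Spender"
    if score >= 70:
        return "High Impact"
    if score >= 40:
        return "Medium Impact"
    return "Low Impact"
```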
## Insight lifecycle

Anomaly insights follow a lifecycle:
- Active — A new anomaly was detected and needs attention.
- Snoozed — You’ve acknowledged it but want to revisit later. Auto-unsnoozes after the snooze period.
- Resolved — Either the condition is no longer true (auto-resolved), you executed an action to fix it (resolved by action), or you manually marked it resolved.
- Dismissed — You’ve determined it’s not actionable. LakeSentry tracks dismissals to improve detection.
- Superseded — A newer anomaly for the same resource replaced this one.
LakeSentry also supports auto-dismiss rules — configurable patterns that automatically dismiss insights matching certain criteria. This is useful for known exceptions (like monthly batch jobs that always spike).
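The statuses map naturally onto an enumeration. The auto-unsnooze helper below is a sketch of the behavior described above, assuming a hypothetical `snoozed_until` timestamp; it isn’t LakeSentry’s implementation:

```python
from datetime import datetime, timezone
from enum import Enum

class InsightStatus(Enum):
    ACTIVE = "active"
    SNOOZED = "snoozed"
    RESOLVED = "resolved"
    DISMISSED = "dismissed"
    SUPERSEDED = "superseded"

def effective_status(status: InsightStatus, snoozed_until: datetime | None) -> InsightStatus:
    """Treat a snoozed insight as active again once its snooze period has elapsed."""
    if status is InsightStatus.SNOOZED and snoozed_until is not None:
        if datetime.now(timezone.utc) >= snoozed_until:
            return InsightStatus.ACTIVE
    return status
```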
## Next steps

- Waste Detection & Insights — How idle resources and waste are identified
- Insights & Actions — Viewing and acting on anomalies in the UI
- Overview Dashboard — Where anomaly highlights appear