Anomaly Detection

LakeSentry automatically detects unusual cost patterns in your Databricks spend. When a job suddenly costs 3x its normal amount, or a workspace’s daily spend spikes overnight, LakeSentry flags it as an anomaly with evidence explaining why.

LakeSentry uses Z-score analysis — a statistical method that compares recent values against a historical baseline. The idea is straightforward: if a job’s recent cost is far enough from its average cost, something unusual is happening.

z_score = (recent_value - baseline_average) / baseline_standard_deviation

A higher Z-score means the recent value is further from what’s normal. LakeSentry flags a value as anomalous when the Z-score exceeds a threshold (default: 2.0 for cost spikes, 2.5 for duration anomalies).

To put that in context with a normal distribution:

| Z-score | What it means | Probability of occurring naturally |
|---------|---------------|-------------------------------------|
| 2.0 | Notably higher than average (cost spike threshold) | ~2.3% |
| 2.5 | Unusually high (duration anomaly threshold) | ~0.6% |
| 3.0 | Very unusual | ~0.13% |
| 4.0 | Extremely unusual | ~0.003% |
| 5.0+ | Almost certainly not random variation | ~0.00003% |

The baseline is computed from the previous 30 days of data. LakeSentry requires at least 5 data points to establish a valid baseline — without enough history, it can’t tell what’s “normal” versus what’s a spike.
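
For illustration, here is a minimal sketch of that check in Python. The function and constant names are ours, not LakeSentry's; it only mirrors the formula, the 5-data-point minimum, and the default thresholds described above.

```python
from statistics import mean, stdev

# Illustrative defaults mirroring the documentation; not LakeSentry internals.
MIN_BASELINE_POINTS = 5        # minimum history needed for a valid baseline
COST_SPIKE_Z_THRESHOLD = 2.0   # default threshold for cost spikes
DURATION_Z_THRESHOLD = 2.5     # default threshold for duration anomalies

def z_score(recent_value: float, baseline: list[float]) -> float | None:
    """Z-score of recent_value against a historical baseline, or None if
    the baseline is too small or has no variance."""
    if len(baseline) < MIN_BASELINE_POINTS:
        return None
    baseline_std = stdev(baseline)
    if baseline_std == 0:
        return None
    return (recent_value - mean(baseline)) / baseline_std

# Example: a job that normally costs ~$100 per day suddenly costs $115.
baseline_costs = [95.0, 102.0, 98.0, 110.0, 99.0, 101.0, 97.0]
z = z_score(115.0, baseline_costs)
if z is not None and z > COST_SPIKE_Z_THRESHOLD:
    print(f"Cost spike: z = {z:.1f}")   # prints roughly z = 3.0
```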

For cost spike detection specifically, LakeSentry uses a dual-trigger approach:

  • Z-score trigger — Z-score exceeds 2.0
  • Multiplier trigger — Recent average cost exceeds 2.5x the baseline average

Either trigger is sufficient to flag an anomaly. The multiplier trigger catches cases where the Z-score might be low due to high variance in the baseline.
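
A sketch of that dual trigger, with constant names of our own choosing:

```python
# Either condition alone flags the work unit (illustrative, not LakeSentry's code).
COST_SPIKE_Z_THRESHOLD = 2.0
COST_MULTIPLIER_THRESHOLD = 2.5

def is_cost_spike(z: float, recent_avg: float, baseline_avg: float) -> bool:
    z_trigger = z > COST_SPIKE_Z_THRESHOLD
    multiplier_trigger = recent_avg > COST_MULTIPLIER_THRESHOLD * baseline_avg
    return z_trigger or multiplier_trigger
```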

To avoid flagging insignificant fluctuations, anomaly detection applies minimum thresholds:

  • Minimum baseline cost: $10 — Work units with a baseline under $10 aren’t evaluated
  • Minimum cost delta: $50 — The absolute cost increase must be at least $50

These thresholds mean LakeSentry focuses on anomalies that matter financially, not statistical noise on low-cost workloads.
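
In code, these pre-filters might look like the sketch below; the dollar values match the documented defaults, while the function name is hypothetical.

```python
# Illustrative materiality filters; work units failing either check are skipped.
MIN_BASELINE_COST = 10.0   # baselines under $10 aren't evaluated
MIN_COST_DELTA = 50.0      # absolute increase must be at least $50

def is_material(baseline_avg: float, recent_avg: float) -> bool:
    return (baseline_avg >= MIN_BASELINE_COST
            and recent_avg - baseline_avg >= MIN_COST_DELTA)
```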

Each cost spike anomaly is assigned a severity based on the Z-score and cost multiplier (whichever is more extreme):

| Severity | Z-score or multiplier | What it suggests |
|----------|-----------------------|------------------|
| Critical | Z-score > 5.0 or multiplier > 5x | Extreme deviation — likely a configuration change, runaway job, or billing error |
| High | Z-score > 4.0 or multiplier > 4x | Major deviation from normal — warrants immediate investigation |
| Medium | Below the above thresholds | Notable increase — worth reviewing but may resolve on its own |
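
A sketch of that mapping, assuming the thresholds in the table:

```python
def severity(z: float, multiplier: float) -> str:
    """Severity from whichever signal is more extreme (illustrative only)."""
    if z > 5.0 or multiplier > 5.0:
        return "critical"
    if z > 4.0 or multiplier > 4.0:
        return "high"
    return "medium"
```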

LakeSentry assigns a confidence score to each anomaly based on the strength of the signal. For cost spike anomalies, confidence is derived from the Z-score and cost multiplier — a stronger statistical signal or a higher cost ratio yields higher confidence, capped at 90%. For other anomaly types (duration anomalies, failure rate spikes, warehouse spend), confidence is calculated using type-specific formulas tied to the magnitude of the deviation.

A low-confidence anomaly isn’t necessarily wrong, but it means the statistical signal was weaker. New jobs or recently changed jobs that haven’t established a strong baseline will naturally produce weaker signals.
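
The exact confidence formula isn't documented here, but conceptually it behaves like the following sketch: a stronger Z-score or a higher cost ratio raises confidence, and the result is capped at 90%. The saturation points are assumptions.

```python
def cost_spike_confidence(z: float, multiplier: float) -> float:
    """Illustrative confidence mapping, not LakeSentry's actual formula."""
    z_signal = min(z / 5.0, 1.0)               # assume saturation at z = 5
    ratio_signal = min(multiplier / 5.0, 1.0)  # assume saturation at 5x baseline
    return round(min(0.9, 0.9 * max(z_signal, ratio_signal)), 2)
```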

When LakeSentry detects an anomaly, it creates an insight with evidence that includes:

  • Baseline average cost — What the work unit normally costs
  • Recent average cost — What it cost in the detection window
  • Cost delta — The absolute dollar increase
  • Cost multiplier — How many times higher than normal (e.g., 3.2x)
  • Z-score — The statistical measure of how unusual this is
  • Recent runs — How many runs occurred in the detection window

This evidence helps you quickly assess whether the anomaly needs investigation or is expected (like a planned capacity increase).
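
As a rough illustration, the evidence can be pictured as a record like this; the field names are assumptions based on the list above, not LakeSentry's schema.

```python
from dataclasses import dataclass

@dataclass
class CostSpikeEvidence:
    baseline_avg_cost: float   # what the work unit normally costs
    recent_avg_cost: float     # what it cost in the detection window
    cost_delta: float          # absolute dollar increase
    cost_multiplier: float     # e.g. 3.2 means 3.2x the baseline
    z_score: float             # how statistically unusual the spike is
    recent_runs: int           # runs observed in the detection window
```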

LakeSentry monitors for anomalies across multiple dimensions:

| Anomaly type | What it detects |
|--------------|-----------------|
| Work unit cost spike | A job or pipeline’s per-run cost is significantly higher than its baseline |
| Duration anomaly | A work unit’s run duration is significantly longer than its baseline |
| Failure rate spike | A work unit’s failure rate has increased significantly over the baseline |
| Warehouse spend spike | A SQL warehouse’s spend has increased significantly compared to the prior period |
| Serving endpoint spike | A serving endpoint’s spend has increased significantly week-over-week |
| Attribution declining | The percentage of unattributed cost is increasing week-over-week |
| Budget risk | Projected spend is on track to exceed a configured budget |

Beyond individual anomalies, LakeSentry computes a significance score (0–100) for every work unit. This helps you focus on what matters most, not just what spiked recently.

The significance score combines three factors:

| Factor | Weight | What it measures |
|--------|--------|------------------|
| Cost impact | 40% | Month-to-date spend relative to the highest spender (log-normalized) |
| Execution frequency | 35% | How often the work unit runs relative to others |
| Failure rate | 25% | How often runs fail (higher failure rate = higher significance) |

Based on the composite score, work units are categorized:

| Score range | Category | Meaning |
|-------------|----------|---------|
| 90–100 | Top Spender | Top 10% by cost — always worth monitoring |
| 70–89 | High Impact | Significant cost or frequency |
| 40–69 | Medium Impact | Average cost and frequency |
| 0–39 | Low Impact | Rarely runs or low cost |

Significance scores refresh daily and appear as badges in the work unit list, helping you prioritize which anomalies to investigate first.
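
Putting the weights and ranges together, a composite like this could be computed roughly as follows; the weights and category cutoffs come from the tables above, while the normalization details are assumptions.

```python
import math

def significance_score(mtd_cost: float, max_mtd_cost: float,
                       run_count: int, max_run_count: int,
                       failure_rate: float) -> float:
    """Composite 0-100 score from cost impact, frequency, and failure rate."""
    # Log-normalize month-to-date spend relative to the highest spender.
    cost_impact = (math.log1p(mtd_cost) / math.log1p(max_mtd_cost)
                   if max_mtd_cost > 0 else 0.0)
    frequency = run_count / max_run_count if max_run_count > 0 else 0.0
    return round(100 * (0.40 * cost_impact + 0.35 * frequency + 0.25 * failure_rate), 1)

def category(score: float) -> str:
    if score >= 90:
        return "Top Spender"
    if score >= 70:
        return "High Impact"
    if score >= 40:
        return "Medium Impact"
    return "Low Impact"
```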

Anomaly insights follow a lifecycle:

  1. Active — A new anomaly was detected and needs attention.
  2. Snoozed — You’ve acknowledged it but want to revisit later. Auto-unsnoozes after the snooze period.
  3. Resolved — Either the condition is no longer true (auto-resolved), you executed an action to fix it (resolved by action), or you manually marked it resolved.
  4. Dismissed — You’ve determined it’s not actionable. LakeSentry tracks dismissals to improve detection.
  5. Superseded — A newer anomaly for the same resource replaced this one.
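
One way to picture these states is as a simple enumeration; the names below are ours, not LakeSentry's API.

```python
from enum import Enum

class InsightStatus(Enum):
    ACTIVE = "active"          # new anomaly, needs attention
    SNOOZED = "snoozed"        # acknowledged, auto-unsnoozes later
    RESOLVED = "resolved"      # auto-resolved, resolved by action, or manual
    DISMISSED = "dismissed"    # judged not actionable
    SUPERSEDED = "superseded"  # replaced by a newer anomaly for the same resource
```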

LakeSentry also supports auto-dismiss rules — configurable patterns that automatically dismiss insights matching certain criteria. This is useful for known exceptions (like monthly batch jobs that always spike).
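
The rule schema isn't documented here, but conceptually a rule pairs a match pattern with the anomaly type it applies to, roughly like this hypothetical example:

```python
# Hypothetical auto-dismiss rule; the real configuration format may differ.
monthly_batch_rule = {
    "anomaly_type": "work_unit_cost_spike",
    "match": {"work_unit_name": "monthly_close_*"},
    "reason": "Expected month-end batch spike",
}
```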