Cost Attribution & Confidence Tiers

Cost attribution is how LakeSentry connects Databricks spend to the teams and people responsible for it. Rather than forcing every dollar into a bucket, LakeSentry uses confidence tiers to tell you how much to trust each attribution — so your chargeback numbers hold up under scrutiny.

LakeSentry uses a dual-axis model for cost allocation:

  • Vertical axis (Accountability) — Who is financially responsible? This maps to your organizational hierarchy: org units, departments, and teams.
  • Horizontal axis (Context) — Why was the cost incurred? This uses optional categories like projects or shared infrastructure buckets.

Every usage line item gets evaluated against attribution rules in priority order. The first matching rule wins. If no rules match, a waterfall fallback determines attribution based on user identity and resource ownership.
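
The first-match-wins evaluation can be pictured with a short sketch. Everything below is illustrative; the Rule fields and the attribute function are hypothetical stand-ins, not LakeSentry's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Rule:
    priority: int                      # lower number = higher precedence
    matches: Callable[[dict], bool]    # predicate over one billing record
    team: str

def attribute(record: dict, rules: list[Rule]) -> Optional[str]:
    """Return the team of the first matching rule, or None if none match."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.matches(record):
            return rule.team
    return None  # caller falls through to the waterfall fallback

# An exact rule for one known cluster, plus a broader pattern rule.
rules = [
    Rule(10, lambda r: r.get("resource_id") == "0123-456789-abcdef", "ml"),
    Rule(100, lambda r: str(r.get("resource_name", "")).startswith("prod-"), "platform"),
]
print(attribute({"resource_name": "prod-etl-01"}, rules))  # -> platform
```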

Each cost allocation carries a confidence tier that tells you how reliably the attribution was determined:

| Tier | What it means | How it’s determined |
| --- | --- | --- |
| Exact | Direct identifier links cost to a specific workload | job_run_id or similar identifier in the billing metadata directly maps to a known work unit |
| Strong | Explicit linkage through query or session metadata | Query source metadata links to a job, plus clear compute mapping |
| Estimated | Time-overlap correlation with limited candidates | Multiple possible attributions; allocated based on time overlap and compute usage proportions |
| Unattributed | No reliable linkage found | Could not determine who or what caused this cost |

Attribution rules are the primary mechanism for mapping costs to owners. You create rules that match billing records and assign them to teams. Rules are evaluated in priority order — lower priority number means higher precedence.

| Type | Use case | Example |
| --- | --- | --- |
| Exact | Known high-cost resources | “Cluster 0123-456789-abcdef belongs to the ML team” |
| Pattern | Categories of resources | “All clusters matching prod-* belong to Platform” |
| Proportional | Platform overhead | “Distribute NETWORKING and DATABASE costs across teams by their compute spend” |

Exact rules match a specific resource by type and ID. Use them for resources whose owner you already know, such as a dedicated training cluster or a specific production job.

Exact rules always require a workspace (since resource IDs are workspace-scoped).
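
For illustration, an exact rule might carry fields like the ones below. This is a hypothetical shape, not LakeSentry's actual schema; note the required workspace alongside the resource type and ID:

```python
# Hypothetical exact-rule shape, for illustration only.
exact_rule = {
    "type": "exact",
    "priority": 10,
    "workspace_id": "1234567890123456",   # required: resource IDs are workspace-scoped
    "resource_type": "cluster",
    "resource_id": "0123-456789-abcdef",
    "assign": {"team": "ml-platform"},    # the team that receives 100% of the cost
}
```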

Pattern rules match resources by criteria. All conditions use AND logic: every specified condition must match, and conditions you don’t specify are treated as “match anything.”

Available match criteria:

| Criterion | What it matches |
| --- | --- |
| Resource type | cluster, warehouse, job, pipeline, endpoint, app |
| Resource pattern | Regex against resource name or ID (e.g., ^prod-.*) |
| Principal domain | Email domain suffix of the user (e.g., @analytics.company.com) |
| Tags | Databricks custom tags — key-value pairs that must all match |

Pattern rules can be global (apply across all workspaces) by leaving the workspace unset. This is useful for organization-wide tag mappings.
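
The AND matching described above can be pictured with a short sketch. The field names and helper function are illustrative, not LakeSentry's matching code:

```python
import re

def pattern_matches(rule: dict, record: dict) -> bool:
    """AND logic: every specified criterion must match; unspecified ones match anything."""
    if "resource_type" in rule and record.get("resource_type") != rule["resource_type"]:
        return False
    if "resource_pattern" in rule and not re.search(rule["resource_pattern"],
                                                    record.get("resource_name", "")):
        return False
    if "principal_domain" in rule and not record.get("principal", "").endswith(rule["principal_domain"]):
        return False
    for key, value in rule.get("tags", {}).items():     # all tag pairs must match
        if record.get("tags", {}).get(key) != value:
            return False
    return True

rule = {"resource_type": "cluster", "resource_pattern": r"^prod-.*", "tags": {"team": "analytics"}}
record = {"resource_type": "cluster", "resource_name": "prod-etl-01",
          "tags": {"team": "analytics", "env": "prod"}}
print(pattern_matches(rule, record))  # True
```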

Proportional rules distribute platform overhead costs across teams based on their compute spend. Use these for costs that don’t belong to any single team, such as networking, database (Delta storage), and predictive optimization.

The distribution is proportional: if Team A accounts for 60% of compute spend and Team B accounts for 40%, a proportional rule splits the overhead 60/40.
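
As a worked example of that split, with made-up spend figures:

```python
compute_spend = {"team_a": 600.0, "team_b": 400.0}   # 60% / 40% of compute spend
overhead = 250.0                                      # e.g. a NETWORKING line item

total = sum(compute_spend.values())
allocation = {team: overhead * spend / total for team, spend in compute_spend.items()}
print(allocation)  # {'team_a': 150.0, 'team_b': 100.0}
```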

When a rule matches, it uses one of these attribution modes:

Direct mode assigns 100% of the cost to a single team, optionally with a category. Direct mode rules can also mark a resource as shared infrastructure: you can assign a shared bucket label (like shared:platform:analytics) to group related shared costs together for reporting.

Split mode distributes cost across multiple teams by percentage. Percentages must sum to 100%. Each allocation can optionally include a category.

You can have up to 20 splits per rule. The UI provides a “Distribute evenly” helper to auto-balance percentages.
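
A small sketch of how the percentage check and the “Distribute evenly” helper could behave. The function names are hypothetical; LakeSentry performs this in the UI:

```python
def validate_splits(splits: list[dict]) -> None:
    if len(splits) > 20:
        raise ValueError("at most 20 splits per rule")
    total = sum(s["percent"] for s in splits)
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"split percentages must sum to 100, got {total}")

def distribute_evenly(teams: list[str]) -> list[dict]:
    share = round(100.0 / len(teams), 2)
    splits = [{"team": t, "percent": share} for t in teams]
    splits[-1]["percent"] = round(100.0 - share * (len(teams) - 1), 2)  # absorb rounding
    return splits

splits = distribute_evenly(["data-eng", "analytics", "ml"])
validate_splits(splits)   # passes: 33.33 + 33.33 + 33.34 = 100
```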

Proportional mode is used with proportional rules to distribute platform overhead costs across teams. The distribution is based on each team’s compute spend as a proportion of total compute cost.

When a billing record arrives, LakeSentry evaluates it through this sequence:

  1. Session-based attribution — For shared compute (SQL Serverless warehouses and ALL_PURPOSE clusters), if session allocations exist, split costs among the actual users proportionally, based on query duration or command count. If a session allocation applies, use it and stop.
  2. Proportional rules — For overhead categories (networking, database, predictive optimization), match proportional rules by SKU pattern. If a rule matches, distribute the cost to teams by compute spend and stop.
  3. Exact and pattern rules (priority order) — All non-proportional rules are evaluated together in priority order (lower priority number = higher precedence). Exact rules match by resource type + resource ID; pattern rules match by tags, resource pattern, or principal domain. The first match wins and evaluation stops.
  4. Waterfall fallback — If nothing matched, fall through a priority chain (sketched after this list):
    1. Is the resource marked as shared? → attribute to the owner’s team as shared
    2. Does the user have a team mapping? → attribute via user
    3. Does the resource owner have a team mapping? → attribute via owner
    4. Does the user exist but have no team? → attribute to user (no team)
    5. None of the above → unattributed (workspace-level)
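
The fallback chain in step 4 could be sketched roughly like this. The function and field names are hypothetical, for illustration only:

```python
from typing import Optional

def waterfall(record: dict, resource: dict,
              user_team: Optional[str], owner_team: Optional[str]) -> dict:
    """Apply the fallback chain when no attribution rule matched."""
    if resource.get("is_shared") and owner_team:
        return {"team": owner_team, "shared": True}                  # 1. shared resource -> owner's team
    if user_team:
        return {"team": user_team, "via": "user"}                    # 2. user's team mapping
    if owner_team:
        return {"team": owner_team, "via": "owner"}                  # 3. resource owner's team mapping
    if record.get("principal"):
        return {"user": record["principal"], "team": None}           # 4. known user, no team
    return {"unattributed": True, "workspace": record.get("workspace_id")}  # 5. workspace-level

print(waterfall({"principal": "eve@company.com"}, {"is_shared": False}, None, "data-eng"))
# -> {'team': 'data-eng', 'via': 'owner'}
```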

Some Databricks resources are shared by multiple users within the same billing period. SQL Serverless warehouses serve queries from many users, and ALL_PURPOSE clusters run commands from different notebooks.

For these cases, LakeSentry splits costs proportionally among actual users:

| Resource type | Attribution metric | How it works |
| --- | --- | --- |
| SQL Serverless warehouses | Query duration | Users running longer queries get a larger share |
| ALL_PURPOSE clusters | Command count | Users running more commands get a larger share |

Sessions are detected using a 2-hour gap rule — a gap of more than 2 hours between consecutive billing records starts a new session. Within each session, user activity is summed and proportionally allocated.
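
Roughly, the sessionization and the proportional split might look like the sketch below (illustrative structures, not LakeSentry's internal code):

```python
from datetime import timedelta

GAP = timedelta(hours=2)   # the 2-hour gap rule

def sessions(records: list[dict]) -> list[list[dict]]:
    """Group time-ordered billing records; a gap of more than 2 hours starts a new session."""
    records = sorted(records, key=lambda r: r["ts"])
    out, current = [], []
    for rec in records:
        if current and rec["ts"] - current[-1]["ts"] > GAP:
            out.append(current)
            current = []
        current.append(rec)
    if current:
        out.append(current)
    return out

def split_session_cost(session_cost: float, activity: dict) -> dict:
    """activity maps user -> query duration (warehouses) or command count (clusters)."""
    total = sum(activity.values())
    return {user: session_cost * amount / total for user, amount in activity.items()}

print(split_session_cost(120.0, {"alice": 90, "bob": 30}))  # {'alice': 90.0, 'bob': 30.0}
```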

A recommended convention for assigning rule priorities:

| Priority range | Recommended use |
| --- | --- |
| 1–50 | Critical exact matches for known high-cost resources |
| 51–100 | Specific pattern rules |
| 101–200 | General pattern rules |
| 201+ | Catch-all and fallback rules |

When multiple rules share the same priority, workspace-specific rules take precedence over global rules.
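
One way to express that tie-break, assuming each rule carries a priority and an optional workspace (illustrative only):

```python
rules = [
    {"priority": 100, "workspace_id": None, "team": "platform"},                  # global
    {"priority": 100, "workspace_id": "1234567890123456", "team": "analytics"},   # workspace-scoped
]
# Sort by priority first; within a tie, False sorts before True, so
# workspace-scoped rules (workspace_id is not None) come first.
rules.sort(key=lambda r: (r["priority"], r["workspace_id"] is None))
print(rules[0]["team"])  # analytics
```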

Use the Simulate tab on the Attribution page to test a rule against historical data before activating it. The simulation shows how many resources would match, the record count, and the total cost affected over the period you specify.

Rules can have optional start and end dates. Use these for:

  • Temporary overrides — “Attribute all ML training to R&D during Q4”
  • Migrations — Old rule valid until Dec 31, new rule starts Jan 1
  • Retroactive corrections — Set the start date in the past to fix historical attribution

The Tags tab on the Attribution page provides a shortcut for the most common pattern rule scenario: mapping a Databricks tag key+value to a team. Select a tag key, see all values and their costs, then assign a team from a dropdown. LakeSentry creates a global pattern rule for you.
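
The rule it creates would look roughly like the pattern rule below: global (no workspace) with a single tag condition. The field names and values are illustrative, not LakeSentry's actual schema:

```python
tag_mapping_rule = {
    "type": "pattern",
    "priority": 150,
    "workspace_id": None,                    # unset, so the rule applies to all workspaces
    "tags": {"cost_center": "analytics"},    # the tag key + value you picked
    "assign": {"team": "analytics"},         # the team chosen from the dropdown
}
```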

If a significant portion of your costs is unattributed, here are ways to improve coverage:

  1. Check the Unallocated Costs page — See which resources and cost categories aren’t matched by any rule.
  2. Create exact rules for your top unattributed resources — a few rules can cover a large portion of spend.
  3. Use tag-based mapping — If your Databricks resources have consistent tags (cost_center, team, env), map those tags to teams.
  4. Set up identity mappings — Map Databricks principals (user emails, service principals) to teams in the Organizational Hierarchy.
  5. Add pattern rules for naming conventions — If your clusters follow patterns like prod-analytics-*, create pattern rules to match them.