Model Serving

The Model Serving page tracks costs for Databricks model serving endpoints — the infrastructure that hosts your ML models and foundation models for real-time inference. As organizations deploy more models and adopt foundation model APIs, serving costs can grow quickly and unpredictably. This page helps you understand where that spend is going.

The main view shows all serving endpoints with their metrics:

| Column | What it shows |
| --- | --- |
| Endpoint | Endpoint name with workspace shown below (falls back to endpoint ID if name is unknown) |
| Type | Entity type: Foundation Model, Custom Model, External Model, or Feature Spec |
| Requests | Number of inference requests for the selected period |
| Latency | Average request latency in milliseconds |
| Tokens | Input and output token counts for the selected period |
| Cost | DBU cost and cloud cost for the selected period |
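
If you want to cross-check the table against what is actually deployed, the Databricks SDK for Python can enumerate serving endpoints directly. A minimal sketch; `serving_endpoints.list()` is a real SDK call, but treat the exact attributes read off each endpoint as version-dependent:

```python
from databricks.sdk import WorkspaceClient

# Authenticates via the standard Databricks environment variables or config profile.
w = WorkspaceClient()

# Enumerate every serving endpoint in the workspace with basic metadata.
for ep in w.serving_endpoints.list():
    # ep.state reflects readiness and config-update status; ep.creator is the owner.
    print(ep.name, ep.creator, ep.state)
```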

LakeSentry tracks four types of served entities:

| Type | What it is |
| --- | --- |
| Foundation Model | Databricks-hosted foundation models (pay-per-token) |
| Custom Model | Your own models registered in Unity Catalog |
| External Model | Models hosted by external providers, routed through Databricks |
| Feature Spec | Feature serving endpoints for online feature stores |
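
An endpoint's type can usually be inferred from its served-entity configuration. The sketch below shows that mapping in plain Python over a config dict; the key names (`foundation_model`, `external_model`, `feature_spec_name`) are illustrative assumptions, not a guaranteed schema:

```python
def classify_served_entity(entity: dict) -> str:
    """Map a served-entity config to one of the four LakeSentry types.

    The key names here are hypothetical; check your SDK version's schema.
    """
    if entity.get("foundation_model"):
        return "Foundation Model"
    if entity.get("external_model"):
        return "External Model"
    if entity.get("feature_spec_name"):
        return "Feature Spec"
    # Anything else that names a Unity Catalog model is a custom model.
    return "Custom Model"
```
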
The endpoint list can be filtered by:

| Filter | Options |
| --- | --- |
| Workspace | Specific Databricks workspace |
| Entity type | Foundation Model, Custom Model, External Model, Feature Spec |
| Search | Search by endpoint name, endpoint ID, or owner |
| Time range | Analysis period |

Endpoints are sorted by Request count (descending), so the most active endpoints appear first.

The page header shows aggregate stats across all visible endpoints:

| Metric | What it shows |
| --- | --- |
| Endpoints | Total number of serving endpoints |
| Requests | Total inference requests for the selected period |
| Avg Latency | Weighted average request latency in milliseconds |
| Cost | Aggregate spend for the selected period (DBU + cloud cost) |
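
The aggregate latency is weighted by request volume rather than being a plain mean of per-endpoint averages, so busy endpoints dominate the figure. A small sketch of that computation over hypothetical per-endpoint records:

```python
def weighted_avg_latency_ms(endpoints: list[dict]) -> float:
    """Request-weighted average latency; field names are illustrative."""
    total_requests = sum(ep["requests"] for ep in endpoints)
    if total_requests == 0:
        return 0.0
    weighted = sum(ep["requests"] * ep["avg_latency_ms"] for ep in endpoints)
    return weighted / total_requests

# The busy endpoint pulls the average toward its own latency:
eps = [
    {"requests": 9000, "avg_latency_ms": 120.0},
    {"requests": 1000, "avg_latency_ms": 500.0},
]
print(weighted_avg_latency_ms(eps))  # 158.0, not the unweighted 310.0
```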

A stacked bar chart shows request volume over time for the top endpoints by request count. The chart supports day, week, and month granularity.
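
To reproduce this kind of view outside the UI, bucket request timestamps by period and keep only the top endpoints. A pandas sketch over a hypothetical request log (the `endpoint` and `ts` column names are assumptions):

```python
import pandas as pd

def request_volume(log: pd.DataFrame, top_n: int = 5, freq: str = "D") -> pd.DataFrame:
    """Pivot a request log into per-period counts for the top-N endpoints.

    Expects columns 'endpoint' and 'ts' (timestamps); freq is 'D', 'W', or 'MS'.
    """
    top = log["endpoint"].value_counts().nlargest(top_n).index
    trimmed = log[log["endpoint"].isin(top)]
    return (
        trimmed.groupby([pd.Grouper(key="ts", freq=freq), "endpoint"])
        .size()
        .unstack(fill_value=0)  # one column per endpoint, ready for stacked bars
    )
```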

Click any endpoint in the table to see its top requesters. For endpoints shared by multiple users or applications, the requester breakdown shows who is consuming the endpoint:

| Column | What it shows |
| --- | --- |
| Requester | User or service principal making requests |
| Requests | Number of requests from this requester |
| Tokens in/out | Input and output tokens consumed by this requester |

This is particularly useful for shared foundation model endpoints where multiple teams or applications submit requests.
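
If you want the same breakdown straight from Databricks, the serving usage system table can be aggregated per requester. A sketch assuming the `system.serving.endpoint_usage` system table; verify the table and column names (`requester`, `input_token_count`, `output_token_count`, `served_entity_id`) against your workspace before relying on them:

```python
# Run inside a Databricks notebook or job, where `spark` is already defined.
query = """
SELECT
  requester,
  COUNT(*)                AS requests,
  SUM(input_token_count)  AS tokens_in,
  SUM(output_token_count) AS tokens_out
FROM system.serving.endpoint_usage
WHERE served_entity_id = :entity_id
  AND request_time >= :start_ts
GROUP BY requester
ORDER BY requests DESC
"""

# Named parameter markers (:entity_id, :start_ts) require Spark 3.4+.
df = spark.sql(query, args={"entity_id": "YOUR_ENTITY_ID", "start_ts": "2025-01-01"})
df.show()
```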

LakeSentry’s waste detection system monitors serving endpoints for:

| Pattern | What it means |
| --- | --- |
| Zombie model | Endpoint with no inference requests for 90+ days but still incurring cost. Endpoints with over $100 in cost during the inactive period are flagged. |

Zombie model findings appear on the Insights & Actions page with estimated savings from terminating the unused endpoint.
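
The rule itself is simple enough to express directly. A sketch of the flagging logic described above; the 90-day and $100 thresholds come from the table, everything else is illustrative:

```python
from datetime import datetime, timedelta

INACTIVE_DAYS = 90    # no inference requests for this long
MIN_COST_USD = 100.0  # minimum spend during the inactive period to flag

def is_zombie(last_request_at: datetime | None,
              cost_during_inactivity: float,
              now: datetime | None = None) -> bool:
    """Flag endpoints idle for 90+ days that still cost more than $100."""
    now = now or datetime.utcnow()  # naive UTC timestamps assumed throughout
    never_used = last_request_at is None
    inactive = never_used or (now - last_request_at) >= timedelta(days=INACTIVE_DAYS)
    return inactive and cost_during_inactivity > MIN_COST_USD
```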

LakeSentry records endpoint configuration changes over time:

  • Endpoint creation and deletion dates
  • Configuration version changes
  • Entity version updates (model version deployments)

This history helps you correlate cost changes with configuration events — for example, a cost increase that coincides with a new model version deployment or a scaling configuration change.
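
One way to do that correlation programmatically is to annotate a daily cost series with the most recent preceding configuration event. A pandas sketch over hypothetical cost and event frames (column names are assumptions):

```python
import pandas as pd

def annotate_costs(costs: pd.DataFrame, events: pd.DataFrame) -> pd.DataFrame:
    """Attach the latest config event at or before each daily cost row.

    costs:  columns 'date' (datetime) and 'cost_usd'
    events: columns 'date' (datetime) and 'change' (e.g. 'entity version bump')
    """
    costs = costs.sort_values("date")    # merge_asof requires sorted keys
    events = events.sort_values("date")
    return pd.merge_asof(costs, events, on="date", direction="backward")
```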

To review which endpoints drive spend:

  1. Endpoints are sorted by Request count (descending) by default.
  2. Note the entity type — foundation models and custom models have different cost drivers.
  3. For foundation models, check the requester breakdown to see who’s driving token consumption.
  4. Review the cost columns (DBU cost, cloud cost) to identify the most expensive endpoints.

To investigate a cost spike:

  1. Check the top endpoints trend chart to see whether request volume changed.
  2. Click into the endpoint showing the spike to load its requester breakdown.
  3. Check the requester breakdown for any new or unusually active requesters.
  4. If traffic is flat but cost changed, investigate configuration history for recent changes.

To find endpoints that may be safe to retire:

  1. Check the Insights & Actions page for zombie model findings.
  2. Look for endpoints with zero request counts over the selected time range.
  3. Review with endpoint owners before taking action.