# Model Serving
The Model Serving page tracks costs for Databricks model serving endpoints — the infrastructure that hosts your ML models and foundation models for real-time inference. As organizations deploy more models and adopt foundation model APIs, serving costs can grow quickly and unpredictably. This page helps you understand where that spend is going.
## Endpoint list

The main view shows all serving endpoints with their metrics:
| Column | What it shows |
|---|---|
| Endpoint | Endpoint name with workspace shown below (falls back to endpoint ID if name is unknown) |
| Type | Entity type: Foundation Model, Custom Model, External Model, or Feature Spec |
| Requests | Number of inference requests for the selected period |
| Latency | Average request latency in milliseconds |
| Tokens | Input and output token counts for the selected period |
| Cost | DBU cost and cloud cost for the selected period |
## Entity types

LakeSentry tracks four types of served entities:
| Type | What it is |
|---|---|
| Foundation Model | Databricks-hosted foundation models (pay-per-token) |
| Custom Model | Your own models registered in Unity Catalog |
| External Model | Models hosted by external providers, routed through Databricks |
| Feature Spec | Feature serving endpoints for online feature stores |
## Filtering

| Filter | Options |
|---|---|
| Workspace | Specific Databricks workspace |
| Entity type | Foundation Model, Custom Model, External Model, Feature Spec |
| Search | Search by endpoint name, endpoint ID, or owner |
| Time range | Analysis period |
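As a rough sketch of how these filters might combine (assuming AND semantics across active filters, and that search matches name, ID, or owner as the table states — the `Endpoint` shape and `filter_endpoints` helper here are illustrative, not LakeSentry's actual code):

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    endpoint_id: str
    owner: str
    workspace: str
    entity_type: str  # "Foundation Model", "Custom Model", "External Model", "Feature Spec"

def filter_endpoints(endpoints, workspace=None, entity_type=None, search=None):
    """Apply the page filters; all active filters are ANDed together (an assumption)."""
    results = []
    for ep in endpoints:
        if workspace and ep.workspace != workspace:
            continue
        if entity_type and ep.entity_type != entity_type:
            continue
        if search:
            q = search.lower()
            # Search matches endpoint name, endpoint ID, or owner, per the table above.
            if (q not in ep.name.lower()
                    and q not in ep.endpoint_id.lower()
                    and q not in ep.owner.lower()):
                continue
        results.append(ep)
    return results
```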
## Sorting

- Request count (descending) — find the most active endpoints
## Page-level summary

The page header shows aggregate stats across all visible endpoints:
| Metric | What it shows |
|---|---|
| Endpoints | Total number of serving endpoints |
| Requests | Total inference requests for the selected period |
| Avg Latency | Weighted average request latency in milliseconds |
| Cost | Aggregate spend for the selected period (DBU + cloud cost) |
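"Weighted average" here means each endpoint's latency counts in proportion to its request volume, so a busy endpoint moves the aggregate more than an idle one. A minimal sketch of that computation (the dict shape is an assumption for illustration):

```python
def weighted_avg_latency_ms(endpoints):
    """Aggregate latency across endpoints, weighting each endpoint's average
    latency by its request count; endpoints with zero requests contribute nothing."""
    total_requests = sum(e["requests"] for e in endpoints)
    if total_requests == 0:
        return 0.0
    weighted_sum = sum(e["requests"] * e["avg_latency_ms"] for e in endpoints)
    return weighted_sum / total_requests
```

For example, an endpoint at 50 ms averaging 100 requests and another at 10 ms with 300 requests yield a 20 ms weighted average, not the 30 ms a naive mean would give.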
## Top endpoints trend

A stacked bar chart showing request volume over time for the top endpoints by request count. The chart supports day, week, and month granularity.
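The day/week/month granularities amount to snapping each request's timestamp to a bucket boundary before summing. A sketch of that bucketing under assumed conventions (week buckets snap to the ISO week's Monday, month buckets to the first of the month — these boundaries are assumptions, not documented LakeSentry behavior):

```python
from collections import defaultdict
from datetime import date, timedelta

def bucket_requests(events, granularity="day"):
    """Group (date, request_count) events into chart buckets at the
    given granularity and return {bucket_start_date: total_requests}."""
    buckets = defaultdict(int)
    for day, count in events:
        if granularity == "day":
            key = day
        elif granularity == "week":
            key = day - timedelta(days=day.weekday())  # snap to Monday
        elif granularity == "month":
            key = day.replace(day=1)  # snap to first of month
        else:
            raise ValueError(f"unknown granularity: {granularity}")
        buckets[key] += count
    return dict(buckets)
```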
## Requester breakdown

Click any endpoint in the table to see its top requesters. For endpoints shared by multiple users or applications, the requester breakdown shows who is consuming the endpoint:
| Column | What it shows |
|---|---|
| Requester | User or service principal making requests |
| Requests | Number of requests from this requester |
| Tokens in/out | Input and output tokens consumed by this requester |
This is particularly useful for shared foundation model endpoints where multiple teams or applications submit requests.
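Conceptually, the breakdown is a rollup of per-request records to the requester level — request count plus input/output token totals, ordered by request count. A sketch under an assumed record shape (the field names are illustrative):

```python
from collections import defaultdict

def requester_breakdown(requests):
    """Roll per-request records up to the requester level, tracking request
    count and input/output token totals, sorted by request count descending."""
    totals = defaultdict(lambda: {"requests": 0, "tokens_in": 0, "tokens_out": 0})
    for r in requests:
        row = totals[r["requester"]]
        row["requests"] += 1
        row["tokens_in"] += r.get("tokens_in", 0)
        row["tokens_out"] += r.get("tokens_out", 0)
    return sorted(totals.items(), key=lambda kv: kv[1]["requests"], reverse=True)
```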
## Waste detection

LakeSentry’s waste detection system monitors serving endpoints for:
| Pattern | What it means |
|---|---|
| Zombie model | Endpoint with no inference requests for 90+ days but still incurring cost. Endpoints with over $100 in cost during the inactive period are flagged. |
Zombie model findings appear on the Insights & Actions page with estimated savings from terminating the unused endpoint.
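The zombie rule described above reduces to two conditions, both of which must hold. A minimal sketch (the function name and parameters are illustrative, but the 90-day and $100 thresholds come straight from the table):

```python
def is_zombie_model(days_since_last_request, cost_during_inactive_period):
    """Flag an endpoint as a zombie model: no inference requests for 90+ days
    AND more than $100 of cost accrued over that inactive period."""
    return days_since_last_request >= 90 and cost_during_inactive_period > 100.0
```

An endpoint idle for 120 days with $250 of spend is flagged; the same idle period with only $50 of spend is not, and neither is a $500 endpoint that served traffic 30 days ago.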
## Configuration tracking

LakeSentry records endpoint configuration changes over time:
- Endpoint creation and deletion dates
- Configuration version changes
- Entity version updates (model version deployments)
This history helps you correlate cost changes with configuration events — for example, a cost increase that coincides with a new model version deployment or a scaling configuration change.
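One way to picture that correlation: given a date where cost jumped, pull the configuration events that landed close to it. A sketch under assumed shapes (the 7-day window and event dict are illustrative choices, not LakeSentry settings):

```python
from datetime import date, timedelta

def config_events_near(spike_date, events, window_days=7):
    """Return configuration events (creation, config version change, entity
    version update) recorded within window_days of a cost spike date."""
    window = timedelta(days=window_days)
    return [e for e in events if abs(e["date"] - spike_date) <= window]
```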
## Common workflows

### Finding active endpoints

- Endpoints are sorted by Request count (descending) by default.
- Note the entity type — foundation models and custom models have different cost drivers.
- For foundation models, check the requester breakdown to see who’s driving token consumption.
- Review the cost columns (DBU cost, cloud cost) to identify the most expensive endpoints.
### Investigating a cost spike

- Check the top endpoints trend chart to see whether request volume changed.
- Click into the endpoint showing the spike to load its requester breakdown.
- Check the requester breakdown for any new or unusually active requesters.
- If traffic is flat but cost changed, investigate configuration history for recent changes.
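The decision in the last step can be sketched as a simple classifier: compare period-over-period cost and request changes, and route flat-traffic cost increases to the configuration history. The tolerance and threshold values here are illustrative assumptions, not product defaults:

```python
def diagnose_spike(prev, curr, traffic_tolerance=0.1, cost_threshold=0.2):
    """Classify a period-over-period cost change for one endpoint.
    prev/curr are dicts with 'requests' and 'cost' for adjacent periods."""
    if prev["cost"] == 0 or prev["requests"] == 0:
        return "insufficient history"
    cost_change = (curr["cost"] - prev["cost"]) / prev["cost"]
    traffic_change = abs(curr["requests"] - prev["requests"]) / prev["requests"]
    if cost_change <= cost_threshold:
        return "no significant cost change"
    if traffic_change <= traffic_tolerance:
        # Cost rose while traffic stayed flat: look at configuration history.
        return "check configuration history"
    return "traffic-driven"
```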
### Identifying idle endpoints

- Check the Insights & Actions page for zombie model findings.
- Look for endpoints with zero request counts over the selected time range.
- Review with endpoint owners before taking action.
## Next steps

- MLflow — ML experiment and training cost tracking
- Compute (Clusters & Warehouses) — Underlying compute for custom model serving
- Insights & Actions — Zombie endpoint and optimization findings
- Waste Detection & Insights — How waste patterns are detected