# Model Serving
The Model Serving page tracks costs for Databricks model serving endpoints — the infrastructure that hosts your ML models and foundation models for real-time inference. As organizations deploy more models and adopt foundation model APIs, serving costs can grow quickly and unpredictably. This page helps you understand where that spend is going.
## Endpoint list

The main view shows all serving endpoints with their metrics:
| Column | What it shows |
|---|---|
| Endpoint | Endpoint name with workspace shown below (falls back to endpoint ID if name is unknown) |
| Type | Entity type: Foundation Model, Custom Model, External Model, or Feature Spec |
| Requests | Number of inference requests for the selected period |
| Latency | Average request latency in milliseconds |
| Tokens | Input and output token counts for the selected period |
| Cost | DBU cost and cloud cost for the selected period |
## Entity types

LakeSentry tracks four types of served entities:
| Type | What it is |
|---|---|
| Foundation Model | Databricks-hosted foundation models (pay-per-token) |
| Custom Model | Your own models registered in Unity Catalog |
| External Model | Models hosted by external providers, routed through Databricks |
| Feature Spec | Feature serving endpoints for online feature stores |
## Filtering

| Filter | Options |
|---|---|
| Workspace | Specific Databricks workspace |
| Entity type | Foundation Model, Custom Model, External Model, Feature Spec |
| Search | Search by endpoint name, endpoint ID, or owner |
| Time range | Analysis period |
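As a rough sketch of how these filters might combine (assuming AND semantics across active filters, and that search matches name, ID, or owner as the table states — the `Endpoint` shape and `filter_endpoints` helper here are illustrative, not LakeSentry's actual code):

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    endpoint_id: str
    owner: str
    workspace: str
    entity_type: str  # "Foundation Model", "Custom Model", "External Model", "Feature Spec"

def filter_endpoints(endpoints, workspace=None, entity_type=None, search=None):
    """Apply the page filters; all active filters are ANDed together (an assumption)."""
    results = []
    for ep in endpoints:
        if workspace and ep.workspace != workspace:
            continue
        if entity_type and ep.entity_type != entity_type:
            continue
        if search:
            q = search.lower()
            # Search matches endpoint name, endpoint ID, or owner, per the table above.
            if (q not in ep.name.lower()
                    and q not in ep.endpoint_id.lower()
                    and q not in ep.owner.lower()):
                continue
        results.append(ep)
    return results
```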
## Sorting

- Request count (descending) — find the most active endpoints
## Page-level summary

The page header shows aggregate stats across all visible endpoints:
| Metric | What it shows |
|---|---|
| Endpoints | Total number of serving endpoints |
| Requests | Total inference requests for the selected period |
| Avg Latency | Weighted average request latency in milliseconds |
| Cost | Aggregate spend for the selected period (DBU + cloud cost) |
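"Weighted average" here means each endpoint's latency counts in proportion to its request volume, so a busy endpoint moves the aggregate more than an idle one. A minimal sketch of that computation (the dict shape is an assumption for illustration):

```python
def weighted_avg_latency_ms(endpoints):
    """Aggregate latency across endpoints, weighting each endpoint's average
    latency by its request count; endpoints with zero requests contribute nothing."""
    total_requests = sum(e["requests"] for e in endpoints)
    if total_requests == 0:
        return 0.0
    weighted_sum = sum(e["requests"] * e["avg_latency_ms"] for e in endpoints)
    return weighted_sum / total_requests
```

For example, an endpoint at 50 ms averaging 100 requests and another at 10 ms with 300 requests yield a 20 ms weighted average, not the 30 ms a naive mean would give.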
## Top endpoints trend

A stacked bar chart showing request volume over time for the top endpoints by request count. The chart supports day, week, and month granularity.
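The day/week/month granularities amount to snapping each request's timestamp to a bucket boundary before summing. A sketch of that bucketing under assumed conventions (week buckets snap to the ISO week's Monday, month buckets to the first of the month — these boundaries are assumptions, not documented LakeSentry behavior):

```python
from collections import defaultdict
from datetime import date, timedelta

def bucket_requests(events, granularity="day"):
    """Group (date, request_count) events into chart buckets at the
    given granularity and return {bucket_start_date: total_requests}."""
    buckets = defaultdict(int)
    for day, count in events:
        if granularity == "day":
            key = day
        elif granularity == "week":
            key = day - timedelta(days=day.weekday())  # snap to Monday
        elif granularity == "month":
            key = day.replace(day=1)  # snap to first of month
        else:
            raise ValueError(f"unknown granularity: {granularity}")
        buckets[key] += count
    return dict(buckets)
```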
## Requester breakdown

Click any endpoint in the table to see its top requesters. For endpoints shared by multiple users or applications, the requester breakdown shows who is consuming the endpoint:
| Column | What it shows |
|---|---|
| Requester | User or service principal making requests |
| Requests | Number of requests from this requester |
| Tokens in/out | Input and output tokens consumed by this requester |
This is particularly useful for shared foundation model endpoints where multiple teams or applications submit requests.
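Conceptually, the breakdown is a rollup of per-request records to the requester level — request count plus input/output token totals, ordered by request count. A sketch under an assumed record shape (the field names are illustrative):

```python
from collections import defaultdict

def requester_breakdown(requests):
    """Roll per-request records up to the requester level, tracking request
    count and input/output token totals, sorted by request count descending."""
    totals = defaultdict(lambda: {"requests": 0, "tokens_in": 0, "tokens_out": 0})
    for r in requests:
        row = totals[r["requester"]]
        row["requests"] += 1
        row["tokens_in"] += r.get("tokens_in", 0)
        row["tokens_out"] += r.get("tokens_out", 0)
    return sorted(totals.items(), key=lambda kv: kv[1]["requests"], reverse=True)
```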
## Waste detection

LakeSentry’s waste detection system monitors serving endpoints for:
| Pattern | What it means |
|---|---|
| Zombie model | Endpoint with no inference requests for 90+ days but still incurring cost. Endpoints with over $100 in cost during the inactive period are flagged. |
Zombie model findings appear on the Insights & Actions page with estimated savings from terminating the unused endpoint.
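The zombie rule described above reduces to two conditions, both of which must hold. A minimal sketch (the function name and parameters are illustrative, but the 90-day and $100 thresholds come straight from the table):

```python
def is_zombie_model(days_since_last_request, cost_during_inactive_period):
    """Flag an endpoint as a zombie model: no inference requests for 90+ days
    AND more than $100 of cost accrued over that inactive period."""
    return days_since_last_request >= 90 and cost_during_inactive_period > 100.0
```

An endpoint idle for 120 days with $250 of spend is flagged; the same idle period with only $50 of spend is not, and neither is a $500 endpoint that served traffic 30 days ago.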
## Configuration tracking

LakeSentry records endpoint configuration changes over time:
- Endpoint creation and deletion dates
- Configuration version changes
- Entity version updates (model version deployments)
This history helps you correlate cost changes with configuration events — for example, a cost increase that coincides with a new model version deployment or a scaling configuration change.
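One way to picture that correlation: given a date where cost jumped, pull the configuration events that landed close to it. A sketch under assumed shapes (the 7-day window and event dict are illustrative choices, not LakeSentry settings):

```python
from datetime import date, timedelta

def config_events_near(spike_date, events, window_days=7):
    """Return configuration events (creation, config version change, entity
    version update) recorded within window_days of a cost spike date."""
    window = timedelta(days=window_days)
    return [e for e in events if abs(e["date"] - spike_date) <= window]
```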
## Common workflows

### Finding active endpoints

- Endpoints are sorted by Request count (descending) by default.
- Note the entity type — foundation models and custom models have different cost drivers.
- For foundation models, check the requester breakdown to see who’s driving token consumption.
- Review the cost columns (DBU cost, cloud cost) to identify the most expensive endpoints.
### Investigating a cost spike

- Check the top endpoints trend chart to see whether request volume changed.
- Click into the endpoint showing the spike to load its requester breakdown.
- Check the requester breakdown for any new or unusually active requesters.
- If traffic is flat but cost changed, investigate configuration history for recent changes.
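The decision in the last step can be sketched as a simple classifier: compare period-over-period cost and request changes, and route flat-traffic cost increases to the configuration history. The tolerance and threshold values here are illustrative assumptions, not product defaults:

```python
def diagnose_spike(prev, curr, traffic_tolerance=0.1, cost_threshold=0.2):
    """Classify a period-over-period cost change for one endpoint.
    prev/curr are dicts with 'requests' and 'cost' for adjacent periods."""
    if prev["cost"] == 0 or prev["requests"] == 0:
        return "insufficient history"
    cost_change = (curr["cost"] - prev["cost"]) / prev["cost"]
    traffic_change = abs(curr["requests"] - prev["requests"]) / prev["requests"]
    if cost_change <= cost_threshold:
        return "no significant cost change"
    if traffic_change <= traffic_tolerance:
        # Cost rose while traffic stayed flat: look at configuration history.
        return "check configuration history"
    return "traffic-driven"
```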
### Identifying idle endpoints

- Check the Insights & Actions page for zombie model findings.
- Look for endpoints with zero request counts over the selected time range.
- Review with endpoint owners before taking action.
## Next steps

- MLflow — ML experiment and training cost tracking
- Compute (Clusters & Warehouses) — Underlying compute for custom model serving
- Insights & Actions — Zombie endpoint and optimization findings
- Waste Detection & Insights — How waste patterns are detected