# Collector Troubleshooting
This page covers diagnosing and resolving issues with the LakeSentry collector. For initial setup, see Collector Deployment.
## Quick diagnosis

Start here when something looks wrong:
| Symptom | Likely cause | Jump to |
|---|---|---|
| Region connector stuck on “Pending” | Collector never ran or failed on first run | First run failures |
| Region health shows “Warning” (yellow) | Collector hasn’t pushed data in 2+ hours | Missed runs |
| Region health shows “Error” (red) | No data in 30+ hours, or validation error | Connection errors |
| Job run fails in Databricks | Collector error during extraction | Extraction errors |
| Data appears stale or incomplete | Checkpoint issues or missing table permissions | Checkpoint issues |
| Some tables missing from “Tables received” | Permission not granted for those tables | Missing tables |
## First run failures

If the region connector never moves past Pending status:
- Check the Databricks job — Go to Workflows > Jobs and verify the collector job exists and has run at least once.
- Check the job run status — Click the most recent run. If it failed, read the error output.
- Verify the connection string — The connection string is consumed during `configure`. If configuration failed, generate a new connection string and re-run the `configure` step.
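If the `configure` step keeps rejecting the string, you can sanity-check it from a notebook before re-running. This is a minimal sketch that assumes the connection string is base64-encoded JSON; the field names it checks are illustrative, not a documented schema:

```python
import base64
import json

def inspect_connection_string(conn_str: str) -> None:
    """Rough sanity check for a LakeSentry connection string.

    Assumes the string is base64-encoded JSON; the field names below
    are examples only, not the documented schema.
    """
    try:
        decoded = base64.b64decode(conn_str, validate=True)
    except Exception as exc:
        print(f"Not valid base64: {exc}")
        return
    try:
        payload = json.loads(decoded)
    except json.JSONDecodeError as exc:
        print(f"Decoded, but not valid JSON: {exc}")
        return
    # Example fields only; check whatever your error message says is missing.
    for field in ("api_url", "connector_id", "token"):
        print(f"{field}: {'present' if field in payload else 'MISSING'}")
```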
Common first-run errors:
| Error | Cause | Fix |
|---|---|---|
| Connection string is not valid base64 | Malformed or truncated connection string | Generate a new connection string from the region connector |
| Connection string is missing required fields | Incomplete or expired connection string | Generate a new connection string |
| LakeSentry API is not reachable | Network connectivity issue or invalid API URL | Verify the workspace has outbound HTTPS access to LakeSentry |
| Module not found: lakesentry_collector | Wheel not installed or wrong path | Check the wheel file path and re-install |
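For the module-not-found error specifically, a quick check from a notebook attached to the collector's cluster confirms whether the wheel is actually importable:

```python
import importlib.util

# Returns None if the package is not installed on this cluster.
spec = importlib.util.find_spec("lakesentry_collector")
if spec is None:
    print("lakesentry_collector is not installed on this cluster; "
          "re-install the wheel and restart the cluster.")
else:
    print(f"lakesentry_collector found at {spec.origin}")
```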
## Connection errors

### Network connectivity

The collector needs outbound HTTPS access to api.lakesentry.io (or your tenant’s API endpoint). If your Databricks workspace uses a private network or firewall:
- Add `*.lakesentry.io` to the allowlist for outbound HTTPS (port 443).
- If using a proxy, configure the `HTTPS_PROXY` environment variable in the cluster configuration.
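A quick reachability check from a notebook on the collector's cluster can confirm whether outbound HTTPS works. The URL below is the default endpoint; substitute your tenant's API URL if it differs:

```python
import os
import urllib.error
import urllib.request

# Default endpoint; substitute your tenant's API URL if it differs.
api_url = "https://api.lakesentry.io"

# If the cluster routes through a proxy, urllib honors HTTPS_PROXY automatically.
print("HTTPS_PROXY =", os.environ.get("HTTPS_PROXY", "<not set>"))

try:
    with urllib.request.urlopen(api_url, timeout=10) as resp:
        print(f"Reachable: HTTP {resp.status}")
except urllib.error.HTTPError as exc:
    # An HTTP error response still proves the endpoint is reachable.
    print(f"Reachable: HTTP {exc.code}")
except Exception as exc:
    print(f"Not reachable from this cluster: {exc}")
```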
### Authentication failures

If the collector was working and suddenly fails authentication:
- Connection string was regenerated — Someone generated a new connection string, invalidating the old token. Reconfigure the collector with the new string.
- Region connector was deleted and re-created — The connector ID changed. Reconfigure with the new connection string.
## Extraction errors

Extraction errors occur when the collector fails to read from Databricks system tables.
### Permission denied

Error: `[TABLE_OR_VIEW_NOT_FOUND] Table or view not found: system.compute.clusters`

or

Error: `User does not have permission to SELECT on table system.compute.clusters`

The service principal is missing SELECT access to the specified table. Grant permissions as described in Account & Connector Setup.
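As an illustration, a table-level grant in Unity Catalog SQL looks like the following. The service principal name and table list here are placeholders; follow Account & Connector Setup for the exact grants LakeSentry needs:

```python
# Run from a notebook on a Unity Catalog-enabled cluster, as a user allowed
# to manage grants on the system catalog. Uses the notebook's ambient `spark`.
# "lakesentry-collector-sp" is a placeholder for your service principal.
service_principal = "lakesentry-collector-sp"

for table in ("system.compute.clusters", "system.billing.usage"):
    spark.sql(f"GRANT SELECT ON TABLE {table} TO `{service_principal}`")
```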
### Unity Catalog not enabled

Error: `Unity Catalog is not enabled for this workspace`

System tables require Unity Catalog. Enable it on your Databricks account, or use a workspace that already has it enabled.
### Table not available

Some system tables are only available in certain Databricks pricing tiers or regions. If a table doesn’t exist:
- The collector logs a warning and skips the table.
- LakeSentry marks the corresponding feature as unavailable.
- Other tables are still extracted normally.
This is expected behavior — the collector continues with whatever tables are accessible.
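The skip-and-continue behavior can be pictured roughly like this. It is a simplified sketch, not the collector's actual code, and the `spark` session is the notebook's ambient one:

```python
from pyspark.sql.utils import AnalysisException

# Tables named in this guide; the real collector works from its own table list.
tables = ["system.billing.usage", "system.compute.clusters", "system.query.history"]

for table in tables:
    try:
        record_count = spark.table(table).count()
    except AnalysisException as exc:
        # Missing table or missing SELECT grant: warn and move on.
        print(f"WARN Table not available in workspace, skipping: {table} ({exc})")
        continue
    print(f"INFO Extracted table={table} records={record_count} status=success")
```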
## Missed runs

If the collector misses scheduled runs:
- Check the Databricks job schedule — Verify the job is still enabled and the schedule hasn’t been modified.
- Check cluster availability — If the job cluster fails to start (e.g., cloud provider capacity issues), the run fails before the collector code executes.
- Check for long-running previous runs — If a run takes longer than the schedule interval, Databricks may skip or queue subsequent runs depending on the concurrency setting.
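For reference, the schedule and concurrency live in the job definition. In Databricks Jobs API terms they look roughly like this; the field names are from the Jobs API, and the values are examples only:

```python
# Illustrative Jobs API 2.1 settings for the collector job (values are examples).
job_settings = {
    "schedule": {
        "quartz_cron_expression": "0 0 * * * ?",   # hourly
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",                # "PAUSED" disables scheduled runs
    },
    # With max_concurrent_runs = 1, a run that overlaps the next scheduled start
    # may be skipped or queued, depending on the job's queueing configuration.
    "max_concurrent_runs": 1,
}
```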
## Checkpoint and resumability

The collector uses checkpoint-based incremental extraction. Each table has a watermark that tracks the last successfully extracted position.
### How checkpoints work

| Table type | Watermark column | Strategy |
|---|---|---|
| system.billing.usage | usage_start_time | Incremental — extract rows after the watermark |
| system.query.history | start_time | Incremental |
| system.compute.clusters | create_time | Incremental |
| system.lakeflow.jobs | change_time | Incremental |
| system.compute.node_types | — | Full snapshot (reference table) |
Incremental tables use the watermark to avoid re-extracting data. Full-snapshot tables are extracted completely on each run (they’re small reference tables).
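A simplified sketch of watermark-based incremental extraction, for intuition only (the collector's actual implementation may differ):

```python
from datetime import datetime, timezone
from pyspark.sql import functions as F

def extract_incremental(spark, table, watermark_col, last_watermark):
    """Watermark-based incremental read: only rows newer than the checkpoint.
    (Illustrative sketch, not the collector's actual implementation.)"""
    df = spark.table(table).where(F.col(watermark_col) > F.lit(last_watermark))
    row = df.agg(F.max(watermark_col).alias("wm")).collect()[0]
    new_watermark = row["wm"] or last_watermark   # unchanged if no new rows
    return df, new_watermark

# Example: pick up system.billing.usage rows written since the last checkpoint.
# df, wm = extract_incremental(spark, "system.billing.usage", "usage_start_time",
#                              datetime(2024, 1, 1, tzinfo=timezone.utc))
```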
### Checkpoint recovery

If a run fails partway through extraction:
- Completed tables — Their checkpoints are already updated. The next run skips them.
- In-progress table — The checkpoint was not updated. The next run re-extracts from the last successful watermark.
- Remaining tables — Not yet attempted. The next run extracts them normally.
This means the collector is crash-safe — you can safely stop a run at any point and the next run recovers automatically. There’s no need to manually reset checkpoints in normal operation.
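Conceptually, the per-table loop looks like the sketch below. The important detail is that a table's checkpoint is advanced only after its batch has been pushed successfully, which is what makes a failed run safe to retry (again, illustrative only):

```python
def run_collection(tables, checkpoints, extract, push_batch):
    """Sketch of a crash-safe collection loop.

    tables:      iterable of (table_name, watermark_column) pairs
    checkpoints: dict of table_name -> last watermark
    extract:     callable(table, watermark_col, last) -> (batch, new_watermark)
    push_batch:  callable(table, batch) that raises on failure
    """
    for table, watermark_col in tables:
        last = checkpoints.get(table)
        batch, new_watermark = extract(table, watermark_col, last)
        push_batch(table, batch)              # if this raises, the checkpoint stays put
        checkpoints[table] = new_watermark    # reached only after a successful push
    return checkpoints
```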
### Resetting checkpoints

In rare cases, you may need to re-extract historical data (e.g., after granting access to a previously inaccessible table). To reset checkpoints:
- In LakeSentry, go to the region connector detail page.
- Click Reset Checkpoints for the specific table, or Reset All for a full re-extraction.
- The next collector run starts from the beginning for those tables.
## Missing tables

If the “Tables received” list on the region connector doesn’t include expected tables:
| Missing table category | Check |
|---|---|
| Billing tables | These are account-level. They may only appear for one region if you have multiple regions. |
| MLflow tables | Requires optional permissions. See optional tables. |
| Serving tables | Requires optional permissions. |
| Audit tables | Requires optional permissions. |
| All regional tables | Verify the service principal has workspace-level access in the target region. |
## Log interpretation

Collector logs are available in the Databricks job run output. Key log entries to look for:
| Log level | Message pattern | Meaning |
|---|---|---|
| INFO | Starting collection | Normal — extraction cycle beginning |
| INFO | Extracted (with table, records, status fields) | Normal — successful table extraction |
| INFO | Extraction batch pushed successfully | Normal — data sent successfully |
| WARN | Table not available in workspace, skipping | Permission not granted or table does not exist |
| INFO | No records found | Normal — no data in the extraction window |
| ERROR | LakeSentry API is not reachable | Connection string is invalid or API endpoint is down |
| ERROR | API error response (with status_code) | LakeSentry API rejected the push |
| ERROR | Extraction failed / Table extraction failed | Query error on a specific table |
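If you are scanning a long run output, a small filter for the WARN and ERROR entries above can help. The file path here is just an example; save the run output wherever is convenient:

```python
# Filter a saved copy of the job run output for WARN/ERROR entries.
levels = ("WARN", "ERROR")

with open("collector_run_output.log") as fh:   # example path
    for line in fh:
        if any(level in line for level in levels):
            print(line.rstrip())
```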
## Common issues by symptom

### Data shows in LakeSentry but appears incomplete

- Some workspaces missing — The workspace may be in a region without a region connector. Check the region mapping.
- Recent data missing — The collector may have a stuck checkpoint. Check the extraction checkpoints on the region connector detail page.
- Specific data type missing — Check if the corresponding system table is in the “Tables received” list.
### Collector runs succeed but no data in LakeSentry

- Check the LakeSentry ingestion pipeline — Data flows through extraction → ingestion → transformation. If extraction succeeds but data doesn’t appear in dashboards, the issue may be downstream. Check the Connectors page for ingestion errors.
- Check for deduplication — If the same data was already ingested (e.g., from a previous run with the same extraction ID), it’s deduplicated silently. This is expected.
### Cost data lags behind real-time

Databricks system tables themselves have latency:
| Table | Typical Databricks latency |
|---|---|
| system.billing.usage | 1-4 hours behind real-time |
| system.compute.clusters | Near real-time |
| system.query.history | Minutes to 1 hour |
| system.lakeflow.job_run_timeline | Minutes to 1 hour |
LakeSentry can only show data that Databricks has made available in system tables. If cost data appears delayed, the lag is often at the Databricks level rather than the collector level.
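To see how fresh the billing system table currently is, you can query its latest usage_end_time directly from a notebook (requires SELECT on system.billing.usage; uses the ambient `spark` session):

```python
# Check how far behind the billing system table is right now.
latest = spark.sql(
    "SELECT max(usage_end_time) AS latest FROM system.billing.usage"
).collect()[0]["latest"]
print(f"Most recent usage record ends at: {latest}")
```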
## Getting help

If you’ve worked through this guide and the issue persists:
- Gather the Databricks job run logs from the failed run.
- Note the region connector status and last ingestion time from the LakeSentry Connectors page.
- Contact LakeSentry support with these details.
## Next steps

- Collector Deployment — Setup and configuration reference
- Region Connectors — Managing regional data collection
- Data Freshness & Pipeline Status — Understanding end-to-end data latency