Collector Deployment
The LakeSentry collector is a lightweight Python package that runs as a Databricks job in your workspace. It reads system tables, extracts the data, and pushes it to LakeSentry over HTTPS. You need one collector per region where you operate Databricks workspaces.
For background on why collectors are regional, see Region Connectors.
How the collector works
The collector runs on a schedule (every 15 minutes by default) and performs incremental extraction:
- Reads system tables — Queries `system.billing.*`, `system.compute.*`, `system.lakeflow.*`, `system.query.*`, and other configured tables.
- Applies checkpoints — Uses watermark columns (e.g., `usage_start_time` for billing, `start_time` for queries) to extract only data since the last run. Small reference tables (like price lists and node types) are extracted as full snapshots.
- Pushes to LakeSentry — Sends the extracted data over HTTPS with an extraction ID for deduplication.
- Updates checkpoints — Saves the new watermark positions so the next run picks up where this one left off.
Each cycle typically takes about 5 minutes, depending on data volume. Re-running from the same checkpoint is safe — LakeSentry deduplicates based on extraction IDs.
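The following is a minimal sketch of that cycle, for illustration only; it is not the collector's actual code, and the checkpoint dictionary, ingest endpoint, and function names are assumptions. It shows the watermark filter and the extraction ID that makes re-runs safe to deduplicate:

```python
# Illustrative sketch of one collector cycle (not the shipped implementation).
# The checkpoint store, ingest endpoint, and helper names are hypothetical;
# the tables and watermark columns come from the list above.
import uuid
import requests

WATERMARKS = {
    "system.billing.usage": "usage_start_time",
    "system.query.history": "start_time",
}

def run_cycle(spark, api_url: str, token: str, checkpoints: dict) -> None:
    extraction_id = str(uuid.uuid4())  # lets LakeSentry deduplicate re-runs
    for table, watermark_col in WATERMARKS.items():
        last_seen = checkpoints.get(table, "1970-01-01 00:00:00")
        # Incremental read: only rows newer than the stored watermark.
        rows = (spark.table(table)
                     .where(f"{watermark_col} > '{last_seen}'")
                     .toPandas())
        if rows.empty:
            continue
        # Push over HTTPS, tagged with the extraction ID for deduplication.
        requests.post(
            f"{api_url}/ingest",
            headers={"Authorization": f"Bearer {token}"},
            json={"extraction_id": extraction_id,
                  "table": table,
                  "rows": rows.astype(str).to_dict(orient="records")},
            timeout=60,
        ).raise_for_status()
        # Advance the checkpoint only after a successful push.
        checkpoints[table] = str(rows[watermark_col].max())
```

Advancing the checkpoint only after a successful push is what makes a failed or repeated run safe: the same window is simply extracted again and deduplicated by extraction ID.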
Prerequisites
- A region connector configured in LakeSentry with a connection string (see Account & Connector Setup)
- A Databricks workspace in the target region with Unity Catalog enabled
- Permissions to create jobs and upload files in the workspace
- The service principal from your account connector must have access to system tables (see granting permissions)
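If you still need to set up the system-table grants mentioned in the last prerequisite, a minimal sketch from a Databricks notebook follows. It assumes you hold the privileges required to grant on the `system` catalog; the application ID and schema list are placeholders to adjust:

```python
# Hedged sketch: granting a service principal read access to system tables
# from a Databricks notebook (where `spark` is predefined). Replace the
# application ID and adjust the schema list to the tables you collect.
principal = "00000000-0000-0000-0000-000000000000"   # placeholder application ID
schemas = ["billing", "compute", "lakeflow", "query"]

spark.sql(f"GRANT USE CATALOG ON CATALOG system TO `{principal}`")
for schema in schemas:
    spark.sql(f"GRANT USE SCHEMA ON SCHEMA system.{schema} TO `{principal}`")
    spark.sql(f"GRANT SELECT ON SCHEMA system.{schema} TO `{principal}`")
```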
Step 1: Upload the collector
Download the latest collector wheel from LakeSentry:
- In LakeSentry, go to Settings > Connectors.
- Click Download Collector to get the `.whl` file.
Upload it to your Databricks workspace:
- In Databricks, navigate to Workspace > your preferred directory (e.g., `/Shared/lakesentry/`).
- Upload the `.whl` file.
Alternatively, upload to a Unity Catalog volume:
```
/Volumes/<catalog>/<schema>/lakesentry/lakesentry_collector-<version>-py3-none-any.whl
```
Step 2: Configure the collector
Run the configuration command to set up the collector environment. You can do this from a Databricks notebook or a one-time job:
```
%pip install /Workspace/Shared/lakesentry/lakesentry_collector-<version>-py3-none-any.whl
lakesentry-collector configure --connection-string "LAKESENTRY://..."
```
The configure command:
- Validates the connection string
- Verifies LakeSentry API connectivity
- Stores configuration in a `.env` file for the collector to use
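If you run the configure step from a notebook rather than a terminal, one option is to shell out to the documented command after the `%pip install`. This is a sketch that assumes the `lakesentry-collector` console script is on the PATH of the notebook environment:

```python
# Hedged sketch: invoking the documented configure command from a notebook
# cell after the %pip install above. The connection string placeholder comes
# from your region connector in LakeSentry (Settings > Connectors).
import subprocess

connection_string = "LAKESENTRY://..."  # placeholder; use your real connection string
subprocess.run(
    ["lakesentry-collector", "configure", "--connection-string", connection_string],
    check=True,  # raise if validation or connectivity checks fail
)
```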
Step 3: Create the Databricks job
Create a scheduled job that runs the collector:
- In Databricks, go to Workflows > Jobs.
- Click Create Job.
- Configure the job:
| Setting | Value |
|---|---|
| Task name | lakesentry-collector |
| Type | Python wheel |
| Package name | lakesentry_collector |
| Entry point | lakesentry-collector |
| Cluster | Use an existing shared cluster, or create a single-node job cluster |
| Schedule | Every 15 minutes |
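If you prefer to create the job from code, the settings in the table map onto the Databricks Jobs API roughly as shown below. This is a sketch rather than LakeSentry-provided tooling; the workspace host, token, cluster ID, and wheel path are placeholders:

```python
# Hedged sketch: creating the collector job via the Databricks Jobs API (2.1).
# Values mirror the settings table above; host, token, cluster ID, and wheel
# path are placeholders for your environment.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<databricks-personal-access-token>"

job_spec = {
    "name": "lakesentry-collector",
    "tasks": [{
        "task_key": "lakesentry-collector",
        "python_wheel_task": {
            "package_name": "lakesentry_collector",
            "entry_point": "lakesentry-collector",
        },
        # The uploaded wheel is attached as a task library.
        "libraries": [{"whl": "/Workspace/Shared/lakesentry/lakesentry_collector-<version>-py3-none-any.whl"}],
        # Either reference an existing shared cluster, or swap this for a
        # "new_cluster" spec (see the single-node sketch under Cluster sizing).
        "existing_cluster_id": "<cluster-id>",
    }],
    "schedule": {
        "quartz_cron_expression": "0 0/15 * * * ?",  # every 15 minutes
        "timezone_id": "UTC",
    },
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job_spec, timeout=30)
resp.raise_for_status()
print("Created job", resp.json()["job_id"])
```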
Cluster sizing
The collector is lightweight. A single-node cluster with minimal resources is sufficient:
| Setting | Recommendation |
|---|---|
| Node type | Smallest available (e.g., Standard_DS3_v2 on Azure, m5.large on AWS) |
| Workers | 0 (single-node / driver-only) |
| Autoscaling | Off |
| Auto-termination | 10 minutes (cluster terminates between runs) |
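As a rough illustration of these recommendations, a single-node job cluster spec (usable as the `new_cluster` block in the Jobs API sketch above) might look like this; the runtime version and node type are placeholders:

```python
# Hedged sketch: a driver-only job cluster matching the sizing table above.
single_node_cluster = {
    "spark_version": "<latest-LTS-runtime>",
    "node_type_id": "Standard_DS3_v2",   # or m5.large on AWS
    "num_workers": 0,                    # driver-only, no autoscaling
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```

A job cluster like this terminates on its own when the run finishes; the 10-minute auto-termination setting matters only if you point the job at an all-purpose cluster instead.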
Alternative: Notebook task
If you prefer a notebook-based approach:
- Create a notebook with:

```
%pip install /Workspace/Shared/lakesentry/lakesentry_collector-<version>-py3-none-any.whl

from lakesentry_collector.cli import main
main()
```
- Schedule the notebook as a job (every 15 minutes).
Step 4: Start the schedule
- On the job page, ensure the schedule is enabled.
- Optionally, click Run Now to trigger the first collection immediately.
- Monitor the first run to confirm it completes without errors.
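If you would rather trigger that first run from code than from the UI, a minimal sketch against the Databricks Jobs API run-now endpoint (host, token, and job ID are placeholders) looks like this:

```python
# Hedged sketch: triggering the first collection run via the Jobs API.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<databricks-personal-access-token>"
job_id = 123456789   # the job created in Step 3

resp = requests.post(f"{host}/api/2.1/jobs/run-now",
                     headers={"Authorization": f"Bearer {token}"},
                     json={"job_id": job_id}, timeout=30)
resp.raise_for_status()
print("Started run", resp.json()["run_id"])
```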
Verifying the deployment
After the first successful run, verify in both Databricks and LakeSentry:
In Databricks
- The job run shows Succeeded status.
- Run duration is approximately 3-7 minutes.
- No errors in the task output logs.
In LakeSentry
Go to Settings > Connectors and check the region connector:
| Indicator | Expected state |
|---|---|
| Status | OK (green) |
| Last ingestion | Recent timestamp matching the job run |
| Tables received | Lists the system tables that were extracted |
If the region stays in Pending status, see Collector Troubleshooting.
Monitoring collector health
LakeSentry monitors collector health automatically:
- Connector health checks run hourly. If no data is received for 30+ hours, a health alert is triggered (sent via email to admins and via webhook if configured).
- Collector runs history is visible on the region connector detail page, showing each run’s status, duration, tables extracted, and rows pushed.
- Extraction checkpoints show the current watermark position for each table, so you can see how far behind the collector is if it’s been paused.
Updating the collector
When a new collector version is available:
- Download the updated `.whl` file from LakeSentry.
- Upload it to the same location in your Databricks workspace (overwriting the previous version).
- The next scheduled run uses the updated collector automatically.
No reconfiguration is needed — the collector version is independent of the connection string and configuration.
Schedule recommendations
| Scenario | Schedule | Trade-off |
|---|---|---|
| Standard (default) | Every 15 minutes | Good balance of freshness and cost |
| Near-real-time | Every 5 minutes | Fresher data, higher compute cost |
| Cost-conscious | Every 30-60 minutes | Lower cost, longer data lag |
The collector schedule directly affects data freshness. With a 15-minute schedule, cost data is at most ~30 minutes old (15 minutes for extraction plus processing time).
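For reference, these schedules correspond to Quartz cron expressions along these lines when you configure the job trigger (a sketch; pick the timezone that suits you):

```python
# Quartz cron expressions matching the schedule recommendations above.
schedules = {
    "near-real-time (5 min)":  "0 0/5 * * * ?",
    "standard (15 min)":       "0 0/15 * * * ?",
    "cost-conscious (30 min)": "0 0/30 * * * ?",
    "cost-conscious (60 min)": "0 0 * * * ?",   # top of every hour
}
```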
Multiple collectors per region
In most cases, one collector per region is sufficient. However, if you have a very high volume of system table data (thousands of active jobs, heavy query history), you can deploy multiple collectors in the same region with different table assignments. Contact LakeSentry support for guidance on partitioned collection.
Uninstalling the collector
To stop data collection for a region:
- Disable or delete the Databricks job.
- Remove the collector files from the workspace (optional).
- Note that once no data has been received for 30+ hours, a health alert is sent to admins via email (and via webhook if configured).
To fully disconnect, also delete the region connector in LakeSentry. See Region Connectors for details.
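If you manage the job from code, one way to disable it is to pause its schedule through the Jobs API (or delete the job outright); the host, token, and job ID below are placeholders:

```python
# Hedged sketch: pausing the collector job's schedule via the Jobs API.
# To delete the job instead, POST {"job_id": job_id} to /api/2.1/jobs/delete.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<databricks-personal-access-token>"
job_id = 123456789

resp = requests.post(
    f"{host}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": job_id,
          "new_settings": {"schedule": {"quartz_cron_expression": "0 0/15 * * * ?",
                                        "timezone_id": "UTC",
                                        "pause_status": "PAUSED"}}},
    timeout=30,
)
resp.raise_for_status()
```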
Next steps
- Collector Troubleshooting — Diagnosing and fixing collector issues
- Region Connectors — Managing multi-region deployments
- How LakeSentry Works — Understanding the data pipeline from extraction to insights