Collector Deployment
The LakeSentry collector is a lightweight Python package that runs as a Databricks job in your workspace. It reads system tables, extracts the data, and pushes it to LakeSentry over HTTPS. You need one collector per region where you operate Databricks workspaces.
For background on why collectors are regional, see Region Connectors.
How the collector works
The collector runs on a schedule (every 15 minutes by default) and performs incremental extraction:
- Reads system tables — Queries `system.billing.*`, `system.compute.*`, `system.lakeflow.*`, `system.query.*`, and other configured tables.
- Applies checkpoints — Uses watermark columns (e.g., `usage_start_time` for billing, `start_time` for queries) to extract only data since the last run. Small reference tables (like price lists and node types) are extracted as full snapshots.
- Pushes to LakeSentry — Sends the extracted data over HTTPS with an extraction ID for deduplication.
- Updates checkpoints — Saves the new watermark positions so the next run picks up where this one left off.
Each cycle typically takes about 5 minutes, depending on data volume. Re-running from the same checkpoint is safe — LakeSentry deduplicates based on extraction IDs.
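The following is a minimal sketch of that cycle, for illustration only; it is not the collector's actual code, and the checkpoint dictionary, ingest endpoint, and function names are assumptions. It shows the watermark filter and the extraction ID that makes re-runs safe to deduplicate:

```python
# Illustrative sketch of one collector cycle (not the shipped implementation).
# The checkpoint store, ingest endpoint, and helper names are hypothetical;
# the tables and watermark columns come from the list above.
import uuid
import requests

WATERMARKS = {
    "system.billing.usage": "usage_start_time",
    "system.query.history": "start_time",
}

def run_cycle(spark, api_url: str, token: str, checkpoints: dict) -> None:
    extraction_id = str(uuid.uuid4())  # lets LakeSentry deduplicate re-runs
    for table, watermark_col in WATERMARKS.items():
        last_seen = checkpoints.get(table, "1970-01-01 00:00:00")
        # Incremental read: only rows newer than the stored watermark.
        rows = (spark.table(table)
                     .where(f"{watermark_col} > '{last_seen}'")
                     .toPandas())
        if rows.empty:
            continue
        # Push over HTTPS, tagged with the extraction ID for deduplication.
        requests.post(
            f"{api_url}/ingest",
            headers={"Authorization": f"Bearer {token}"},
            json={"extraction_id": extraction_id,
                  "table": table,
                  "rows": rows.astype(str).to_dict(orient="records")},
            timeout=60,
        ).raise_for_status()
        # Advance the checkpoint only after a successful push.
        checkpoints[table] = str(rows[watermark_col].max())
```

Advancing the checkpoint only after a successful push is what makes a failed or repeated run safe: the same window is simply extracted again and deduplicated by extraction ID.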
Prerequisites
- A region connector configured in LakeSentry with a connection string (see Account & Connector Setup)
- A Databricks workspace in the target region with Unity Catalog enabled
- Permissions to create jobs and upload files in the workspace
- The service principal from your account connector must have access to system tables (see granting permissions)
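If you still need to set up the system-table grants mentioned in the last prerequisite, a minimal sketch from a Databricks notebook follows. It assumes you hold the privileges required to grant on the `system` catalog; the application ID and schema list are placeholders to adjust:

```python
# Hedged sketch: granting a service principal read access to system tables
# from a Databricks notebook (where `spark` is predefined). Replace the
# application ID and adjust the schema list to the tables you collect.
principal = "00000000-0000-0000-0000-000000000000"   # placeholder application ID
schemas = ["billing", "compute", "lakeflow", "query"]

spark.sql(f"GRANT USE CATALOG ON CATALOG system TO `{principal}`")
for schema in schemas:
    spark.sql(f"GRANT USE SCHEMA ON SCHEMA system.{schema} TO `{principal}`")
    spark.sql(f"GRANT SELECT ON SCHEMA system.{schema} TO `{principal}`")
```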
Step 1: Upload the collector
Download the latest collector wheel from LakeSentry:
- In LakeSentry, go to Settings > Connectors.
- Click Download Collector to get the `.whl` file.
Upload it to your Databricks workspace:
- In Databricks, navigate to Workspace > your preferred directory (e.g., `/Shared/lakesentry/`).
- Upload the `.whl` file.
Alternatively, upload to a Unity Catalog volume:
```
/Volumes/<catalog>/<schema>/lakesentry/lakesentry_collector-<version>-py3-none-any.whl
```
Step 2: Configure the collector
Run the configuration command to set up the collector environment. You can do this from a Databricks notebook or a one-time job:
```
%pip install /Workspace/Shared/lakesentry/lakesentry_collector-<version>-py3-none-any.whl
lakesentry-collector configure --connection-string "LAKESENTRY://..."
```
The configure command:
- Validates the connection string
- Verifies LakeSentry API connectivity
- Stores configuration in a `.env` file for the collector to use
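If you run the configure step from a notebook rather than a terminal, one option is to shell out to the documented command after the `%pip install`. This is a sketch that assumes the `lakesentry-collector` console script is on the PATH of the notebook environment:

```python
# Hedged sketch: invoking the documented configure command from a notebook
# cell after the %pip install above. The connection string placeholder comes
# from your region connector in LakeSentry (Settings > Connectors).
import subprocess

connection_string = "LAKESENTRY://..."  # placeholder; use your real connection string
subprocess.run(
    ["lakesentry-collector", "configure", "--connection-string", connection_string],
    check=True,  # raise if validation or connectivity checks fail
)
```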
Step 3: Create the Databricks job
Create a scheduled job that runs the collector:
- In Databricks, go to Workflows > Jobs.
- Click Create Job.
- Configure the job:
| Setting | Value |
|---|---|
| Task name | lakesentry-collector |
| Type | Python wheel |
| Package name | lakesentry_collector |
| Entry point | lakesentry-collector |
| Cluster | Use an existing shared cluster, or create a single-node job cluster |
| Schedule | Every 15 minutes |
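If you prefer to create the job from code, the settings in the table map onto the Databricks Jobs API roughly as shown below. This is a sketch rather than LakeSentry-provided tooling; the workspace host, token, cluster ID, and wheel path are placeholders:

```python
# Hedged sketch: creating the collector job via the Databricks Jobs API (2.1).
# Values mirror the settings table above; host, token, cluster ID, and wheel
# path are placeholders for your environment.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<databricks-personal-access-token>"

job_spec = {
    "name": "lakesentry-collector",
    "tasks": [{
        "task_key": "lakesentry-collector",
        "python_wheel_task": {
            "package_name": "lakesentry_collector",
            "entry_point": "lakesentry-collector",
        },
        # The uploaded wheel is attached as a task library.
        "libraries": [{"whl": "/Workspace/Shared/lakesentry/lakesentry_collector-<version>-py3-none-any.whl"}],
        # Either reference an existing shared cluster, or swap this for a
        # "new_cluster" spec (see the single-node sketch under Cluster sizing).
        "existing_cluster_id": "<cluster-id>",
    }],
    "schedule": {
        "quartz_cron_expression": "0 0/15 * * * ?",  # every 15 minutes
        "timezone_id": "UTC",
    },
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job_spec, timeout=30)
resp.raise_for_status()
print("Created job", resp.json()["job_id"])
```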
Cluster sizing
The collector is lightweight. A single-node cluster with minimal resources is sufficient:
| Setting | Recommendation |
|---|---|
| Node type | Smallest available (e.g., Standard_DS3_v2 on Azure, m5.large on AWS) |
| Workers | 0 (single-node / driver-only) |
| Autoscaling | Off |
| Auto-termination | 10 minutes (cluster terminates between runs) |
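As a rough illustration of these recommendations, a single-node job cluster spec (usable as the `new_cluster` block in the Jobs API sketch above) might look like this; the runtime version and node type are placeholders:

```python
# Hedged sketch: a driver-only job cluster matching the sizing table above.
single_node_cluster = {
    "spark_version": "<latest-LTS-runtime>",
    "node_type_id": "Standard_DS3_v2",   # or m5.large on AWS
    "num_workers": 0,                    # driver-only, no autoscaling
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```

A job cluster like this terminates on its own when the run finishes; the 10-minute auto-termination setting matters only if you point the job at an all-purpose cluster instead.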
Alternative: Notebook task
If you prefer a notebook-based approach:
- Create a notebook with:

```
%pip install /Workspace/Shared/lakesentry/lakesentry_collector-<version>-py3-none-any.whl

from lakesentry_collector.cli import main
main()
```
- Schedule the notebook as a job (every 15 minutes).
Step 4: Start the schedule
- On the job page, ensure the schedule is enabled.
- Optionally, click Run Now to trigger the first collection immediately.
- Monitor the first run to confirm it completes without errors.
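If you would rather trigger that first run from code than from the UI, a minimal sketch against the Databricks Jobs API run-now endpoint (host, token, and job ID are placeholders) looks like this:

```python
# Hedged sketch: triggering the first collection run via the Jobs API.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<databricks-personal-access-token>"
job_id = 123456789   # the job created in Step 3

resp = requests.post(f"{host}/api/2.1/jobs/run-now",
                     headers={"Authorization": f"Bearer {token}"},
                     json={"job_id": job_id}, timeout=30)
resp.raise_for_status()
print("Started run", resp.json()["run_id"])
```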
Verifying the deployment
After the first successful run, verify in both Databricks and LakeSentry:
In Databricks
- The job run shows Succeeded status.
- Run duration is approximately 3-7 minutes.
- No errors in the task output logs.
In LakeSentry
Go to Settings > Connectors and check the region connector:
| Indicator | Expected state |
|---|---|
| Status | OK (green) |
| Last ingestion | Recent timestamp matching the job run |
| Tables received | Lists the system tables that were extracted |
If the region stays in Pending status, see Collector Troubleshooting.
Monitoring collector health
LakeSentry monitors collector health automatically:
- Connector health checks run hourly. If no data is received for 30+ hours, a health alert is triggered (sent via email to admins and via webhook if configured).
- Collector runs history is visible on the region connector detail page, showing each run’s status, duration, tables extracted, and rows pushed.
- Extraction checkpoints show the current watermark position for each table, so you can see how far behind the collector is if it’s been paused.
Updating the collector
When a new collector version is available:
- Download the updated `.whl` file from LakeSentry.
- Upload it to the same location in your Databricks workspace (overwriting the previous version).
- The next scheduled run uses the updated collector automatically.
No reconfiguration is needed — the collector version is independent of the connection string and configuration.
Schedule recommendations
| Scenario | Schedule | Trade-off |
|---|---|---|
| Standard (default) | Every 15 minutes | Good balance of freshness and cost |
| Near-real-time | Every 5 minutes | Fresher data, higher compute cost |
| Cost-conscious | Every 30-60 minutes | Lower cost, longer data lag |
The collector schedule directly affects data freshness. With a 15-minute schedule, cost data is at most ~30 minutes old (15 minutes for extraction plus processing time).
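For reference, these schedules correspond to Quartz cron expressions along these lines when you configure the job trigger (a sketch; pick the timezone that suits you):

```python
# Quartz cron expressions matching the schedule recommendations above.
schedules = {
    "near-real-time (5 min)":  "0 0/5 * * * ?",
    "standard (15 min)":       "0 0/15 * * * ?",
    "cost-conscious (30 min)": "0 0/30 * * * ?",
    "cost-conscious (60 min)": "0 0 * * * ?",   # top of every hour
}
```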
Multiple collectors per region
In most cases, one collector per region is sufficient. However, if you have a very high volume of system table data (thousands of active jobs, heavy query history), you can deploy multiple collectors in the same region with different table assignments. Contact LakeSentry support for guidance on partitioned collection.
Uninstalling the collector
To stop data collection for a region:
- Disable or delete the Databricks job.
- Remove the collector files from the workspace (optional).
- Note that once no data has been received for 30+ hours, a health alert is sent to admins via email (and via webhook if configured).
To fully disconnect, also delete the region connector in LakeSentry. See Region Connectors for details.
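If you manage the job from code, one way to disable it is to pause its schedule through the Jobs API (or delete the job outright); the host, token, and job ID below are placeholders:

```python
# Hedged sketch: pausing the collector job's schedule via the Jobs API.
# To delete the job instead, POST {"job_id": job_id} to /api/2.1/jobs/delete.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<databricks-personal-access-token>"
job_id = 123456789

resp = requests.post(
    f"{host}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": job_id,
          "new_settings": {"schedule": {"quartz_cron_expression": "0 0/15 * * * ?",
                                        "timezone_id": "UTC",
                                        "pause_status": "PAUSED"}}},
    timeout=30,
)
resp.raise_for_status()
```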
Next steps
- Collector Troubleshooting — Diagnosing and fixing collector issues
- Region Connectors — Managing multi-region deployments
- How LakeSentry Works — Understanding the data pipeline from extraction to insights