Collector Deployment

The LakeSentry collector is a lightweight Python package that runs as a Databricks job in your workspace. It reads Databricks system tables and pushes the extracted data to LakeSentry over HTTPS. You need one collector per region where you operate Databricks workspaces.

For background on why collectors are regional, see Region Connectors.

The collector runs on a schedule (every 15 minutes by default) and performs incremental extraction:

  1. Reads system tables — Queries system.billing.*, system.compute.*, system.lakeflow.*, system.query.*, and other configured tables.
  2. Applies checkpoints — Uses watermark columns (e.g., usage_start_time for billing, start_time for queries) to extract only data since the last run. Small reference tables (like price lists and node types) are extracted as full snapshots.
  3. Pushes to LakeSentry — Sends the extracted data over HTTPS with an extraction ID for deduplication.
  4. Updates checkpoints — Saves the new watermark positions so the next run picks up where this one left off.

A typical cycle takes about 3-7 minutes, depending on the volume of data. Re-running from the same checkpoint is safe; LakeSentry deduplicates based on extraction IDs.
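To make the cycle concrete, here is a minimal sketch of one run against the billing usage table. The table name and watermark column come from the steps above; the ingest URL, payload shape, and helper functions are illustrative assumptions, and the real collector also handles batching, retries, and checkpoint storage.

```python
# Illustrative sketch of one incremental extraction cycle (not the collector's
# actual source). The endpoint, payload shape, and helpers are assumptions.
import json
import uuid
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

INGEST_URL = "https://example.lakesentry.io/api/ingest"  # hypothetical endpoint


def extract_incremental(table: str, watermark_col: str, last_watermark: str):
    """Read only rows newer than the stored checkpoint (watermark)."""
    return spark.sql(
        f"SELECT * FROM {table} WHERE {watermark_col} > '{last_watermark}'"
    )


def push(table: str, df) -> str:
    """Push extracted rows with an extraction ID so the backend can deduplicate."""
    extraction_id = str(uuid.uuid4())
    payload = {
        "extraction_id": extraction_id,
        "table": table,
        "rows": [json.loads(r) for r in df.toJSON().collect()],  # small batches only
    }
    requests.post(INGEST_URL, json=payload, timeout=60).raise_for_status()
    return extraction_id


# One cycle for system.billing.usage:
last_checkpoint = "2024-01-01T00:00:00Z"  # loaded from checkpoint storage
df = extract_incremental("system.billing.usage", "usage_start_time", last_checkpoint)
push("system.billing.usage", df)

# Advance the watermark so the next run picks up where this one left off.
new_checkpoint = df.agg({"usage_start_time": "max"}).collect()[0][0]
```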

Before deploying the collector, make sure you have:

  • A region connector configured in LakeSentry with a connection string (see Account & Connector Setup)
  • A Databricks workspace in the target region with Unity Catalog enabled
  • Permissions to create jobs and upload files in the workspace
  • The service principal from your account connector must have access to system tables (see granting permissions)

Download the latest collector wheel from LakeSentry:

  1. In LakeSentry, go to Settings > Connectors.
  2. Click Download Collector to get the .whl file.

Upload it to your Databricks workspace:

  1. In Databricks, navigate to Workspace > your preferred directory (e.g., /Shared/lakesentry/).
  2. Upload the .whl file.

Alternatively, upload to a Unity Catalog volume:

/Volumes/<catalog>/<schema>/lakesentry/lakesentry_collector-<version>-py3-none-any.whl

Run the configuration command to set up the collector environment. You can do this from a Databricks notebook or a one-time job.

First, install the wheel:

%pip install /Workspace/Shared/lakesentry/lakesentry_collector-<version>-py3-none-any.whl

Then configure the collector with your region connector's connection string:

lakesentry-collector configure --connection-string "LAKESENTRY://..."

The configure command:

  • Validates the connection string
  • Verifies LakeSentry API connectivity
  • Stores configuration in a .env file for the collector to use
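If you run the configure step from a notebook, one option is to read the connection string from a Databricks secret scope instead of pasting it into a cell. A minimal sketch, assuming a hypothetical secret scope and key name, and assuming the lakesentry-collector entry point is on the notebook environment's PATH after the %pip install above:

```python
# Notebook cell: run configure with a connection string stored in a secret scope.
# The scope ("lakesentry") and key ("connection-string") are placeholder names.
import subprocess

connection_string = dbutils.secrets.get(scope="lakesentry", key="connection-string")

subprocess.run(
    ["lakesentry-collector", "configure", "--connection-string", connection_string],
    check=True,  # raise if the configure command exits with a non-zero status
)
```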

Create a scheduled job that runs the collector:

  1. In Databricks, go to Workflows > Jobs.
  2. Click Create Job.
  3. Configure the job:
| Setting | Value |
| --- | --- |
| Task name | lakesentry-collector |
| Type | Python wheel |
| Package name | lakesentry_collector |
| Entry point | lakesentry-collector |
| Cluster | Use an existing shared cluster, or create a single-node job cluster |
| Schedule | Every 15 minutes |

The collector is lightweight. A single-node cluster with minimal resources is sufficient:

| Setting | Recommendation |
| --- | --- |
| Node type | Smallest available (e.g., Standard_DS3_v2 on Azure, m5.large on AWS) |
| Workers | 0 (single-node / driver-only) |
| Autoscaling | Off |
| Auto-termination | 10 minutes (cluster terminates between runs) |
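If you manage jobs as code, the same job and cluster settings can be expressed as a Databricks Jobs API 2.1 payload. This is a sketch rather than the only valid configuration: the spark_version, node_type_id, and wheel path are example values, so substitute whatever matches your workspace.

```python
# Example Jobs API 2.1 payload (POST /api/2.1/jobs/create) mirroring the tables above.
job_spec = {
    "name": "lakesentry-collector",
    "schedule": {
        "quartz_cron_expression": "0 0/15 * * * ?",  # every 15 minutes
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
    "tasks": [
        {
            "task_key": "lakesentry-collector",
            "python_wheel_task": {
                "package_name": "lakesentry_collector",
                "entry_point": "lakesentry-collector",
            },
            "libraries": [
                {"whl": "/Workspace/Shared/lakesentry/lakesentry_collector-<version>-py3-none-any.whl"}
            ],
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",  # any current LTS runtime
                "node_type_id": "Standard_DS3_v2",    # smallest available node type
                "num_workers": 0,                     # single-node / driver-only
                "spark_conf": {
                    "spark.databricks.cluster.profile": "singleNode",
                    "spark.master": "local[*]",
                },
                "custom_tags": {"ResourceClass": "SingleNode"},
            },
        }
    ],
}
```

Submit the payload with a workspace token (for example via the Databricks CLI, SDK, or a plain HTTP POST). A job cluster created this way terminates on its own after each run, so no separate auto-termination setting is needed.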

If you prefer a notebook-based approach:

  1. Create a notebook with:
    %pip install /Workspace/Shared/lakesentry/lakesentry_collector-<version>-py3-none-any.whl
    from lakesentry_collector.cli import main
    main()
  2. Schedule the notebook as a job (every 15 minutes).
Once the job is created (with either approach), enable it:

  1. On the job page, ensure the schedule is enabled.
  2. Optionally, click Run Now to trigger the first collection immediately.
  3. Monitor the first run to confirm it completes without errors.

After the first successful run, verify on both sides. In Databricks, check that:

  • The job run shows Succeeded status.
  • Run duration is approximately 3-7 minutes.
  • No errors in the task output logs.

In LakeSentry, go to Settings > Connectors and check the region connector:

| Indicator | Expected state |
| --- | --- |
| Status | OK (green) |
| Last ingestion | Recent timestamp matching the job run |
| Tables received | Lists the system tables that were extracted |

If the region stays in Pending status, see Collector Troubleshooting.

LakeSentry monitors collector health automatically:

  • Connector health checks run hourly. If no data is received for 30+ hours, a health alert is triggered (sent via email to admins and via webhook if configured).
  • Collector runs history is visible on the region connector detail page, showing each run’s status, duration, tables extracted, and rows pushed.
  • Extraction checkpoints show the current watermark position for each table, so you can see how far behind the collector is if it’s been paused.

When a new collector version is available:

  1. Download the updated .whl file from LakeSentry.
  2. Upload it to the same location in your Databricks workspace (overwriting the previous version).
  3. The next scheduled run uses the updated collector automatically.

No reconfiguration is needed — the collector version is independent of the connection string and configuration.

Choose a collection schedule based on how fresh you need the data to be:

| Scenario | Schedule | Trade-off |
| --- | --- | --- |
| Standard (default) | Every 15 minutes | Good balance of freshness and cost |
| Near-real-time | Every 5 minutes | Fresher data, higher compute cost |
| Cost-conscious | Every 30-60 minutes | Lower cost, longer data lag |

The collector schedule directly affects data freshness. With a 15-minute schedule, cost data is at most ~30 minutes old (15 minutes for extraction plus processing time).
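If you change the schedule, the corresponding Quartz cron expressions are straightforward. A small reference, assuming you set them in the job's schedule field or in the Jobs API payload sketched earlier:

```python
# Quartz cron format: seconds minutes hours day-of-month month day-of-week
SCHEDULES = {
    "near_real_time": "0 0/5 * * * ?",   # every 5 minutes
    "standard":       "0 0/15 * * * ?",  # every 15 minutes (default)
    "cost_conscious": "0 0/30 * * * ?",  # every 30 minutes
    "hourly":         "0 0 * * * ?",     # every 60 minutes
}
```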

In most cases, one collector per region is sufficient. However, if you have a very high volume of system table data (thousands of active jobs, heavy query history), you can deploy multiple collectors in the same region with different table assignments. Contact LakeSentry support for guidance on partitioned collection.

To stop data collection for a region:

  1. Disable or delete the Databricks job.
  2. Remove the collector files from the workspace (optional).

Once no data has been received for 30+ hours, a health alert is sent to admins via email (and via webhook if configured).

To fully disconnect, also delete the region connector in LakeSentry. See Region Connectors for details.