Table: databricks_pipeline_update - Query Databricks Pipeline Updates using SQL
Databricks Pipelines (Delta Live Tables) is a service within Databricks for building, testing, and managing data processing pipelines. You declare the transformations, and Databricks Pipelines manages task orchestration, cluster management, monitoring, and error handling. Each run of a pipeline is recorded as an update, which captures what triggered the run, the cluster it ran on, and its resulting state.
Table Usage Guide
The databricks_pipeline_update table provides insights into pipeline updates within Databricks. As a Data Engineer or Data Scientist, explore update-specific details through this table, including the update's state, cause, and associated metadata. Use it to uncover information about updates, such as those created recently, those that failed, and those triggered by API calls.
Examples
Basic info
Review your Databricks pipeline updates to understand what triggered them and when they were created. This can be particularly useful for identifying patterns or issues related to pipeline updates in your Databricks account.
select
  update_id,
  pipeline_id,
  cause,
  cluster_id,
  creation_time,
  account_id
from
  databricks_pipeline_update;
List updates created in the last 7 days
Explore recent changes by identifying updates made within the last week. This can be beneficial in tracking modifications, understanding their causes, and assessing their impact on different accounts and clusters.
PostgreSQL:

select
  update_id,
  pipeline_id,
  cause,
  cluster_id,
  creation_time,
  account_id
from
  databricks_pipeline_update
where
  creation_time >= now() - interval '7' day;

SQLite:

select
  update_id,
  pipeline_id,
  cause,
  cluster_id,
  creation_time,
  account_id
from
  databricks_pipeline_update
where
  creation_time >= datetime('now', '-7 day');
List updates caused by an API call
Discover the updates triggered by an API call. This is particularly useful for tracking changes and auditing purposes, as it allows you to see which updates were not user-initiated but were instead triggered by an API call.
select
  update_id,
  pipeline_id,
  cause,
  cluster_id,
  creation_time,
  account_id
from
  databricks_pipeline_update
where
  cause = 'API_CALL';
List all failed updates
Explore which updates failed in your Databricks pipeline. This is useful for identifying problematic updates and understanding the cause of their failure.
select
  update_id,
  pipeline_id,
  cause,
  cluster_id,
  creation_time,
  account_id
from
  databricks_pipeline_update
where
  state = 'FAILED';
List all updates that ran with a full refresh
Identify updates that reset all tables before running. Full refreshes recompute tables from scratch, so they are typically the most resource-intensive runs; tracking them helps with resource allocation and efficient pipeline management, particularly where pipeline performance is crucial and resources are limited.
PostgreSQL:

select
  update_id,
  pipeline_id,
  cause,
  cluster_id,
  creation_time,
  full_refresh_selection,
  account_id
from
  databricks_pipeline_update
where
  full_refresh;

SQLite:

select
  update_id,
  pipeline_id,
  cause,
  cluster_id,
  creation_time,
  full_refresh_selection,
  account_id
from
  databricks_pipeline_update
where
  full_refresh = 1;
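Since full_refresh_selection is a jsonb column holding a list of table names, you can unnest it to see exactly which tables were fully refreshed in each update. A minimal PostgreSQL-only sketch, assuming the column contains a flat JSON array of table names:

select
  update_id,
  pipeline_id,
  t.table_name
from
  databricks_pipeline_update,
  -- implicit lateral join; updates with a null selection are dropped
  jsonb_array_elements_text(full_refresh_selection) as t(table_name);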
Find the account with the most pipeline updates
Identify the account with the highest frequency of pipeline updates. This can be useful in understanding which account is most actively managing and modifying their pipelines.
select
  account_id,
  count(*) as update_count
from
  databricks_pipeline_update
group by
  account_id
order by
  update_count desc
limit 1;
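Find pipelines with the most failed updates
As an additional sketch built only on columns documented in the schema below, group failed updates by pipeline to surface pipelines that fail repeatedly:

select
  pipeline_id,
  count(*) as failed_updates
from
  databricks_pipeline_update
where
  state = 'FAILED'
group by
  pipeline_id
order by
  failed_updates desc;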
Schema for databricks_pipeline_update
| Name | Type | Operators | Description |
|---|---|---|---|
| _ctx | jsonb | | Steampipe context in JSON form, e.g. connection_name. |
| account_id | text | | The Databricks Account ID in which the resource is located. |
| cause | text | | What triggered this update. |
| cluster_id | text | | The ID of the cluster that the update is running on. |
| config | jsonb | | The pipeline configuration with system defaults applied where unspecified by the user. |
| creation_time | timestamp with time zone | | The time when this update was created. |
| full_refresh | boolean | | Whether to reset all tables before running the pipeline. |
| full_refresh_selection | jsonb | | A list of tables to update with full refresh. |
| pipeline_id | text | = | The unique identifier of the pipeline. |
| refresh_selection | jsonb | | A list of tables to update without full refresh. |
| state | text | | The current state of the update. |
| title | text | | The title of the resource. |
| update_id | text | = | The unique identifier of the update. |
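The = operator listed for pipeline_id and update_id marks key columns whose qualifiers are pushed down to the Databricks API rather than filtered client-side. A minimal sketch, using a hypothetical pipeline ID, to pull the update history of a single pipeline:

select
  update_id,
  state,
  cause,
  creation_time
from
  databricks_pipeline_update
where
  -- hypothetical ID for illustration; replace with your own pipeline's ID
  pipeline_id = '1234-abcd-5678'
order by
  creation_time desc;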
Export
This table is available as a standalone Exporter CLI. Steampipe exporters are standalone binaries that allow you to extract data using Steampipe plugins without a database.
You can download the tarball for your platform from the Releases page, but it is simplest to install them with the steampipe_export_installer.sh script:
/bin/sh -c "$(curl -fsSL https://steampipe.io/install/export.sh)" -- databricks
You can pass the configuration to the command with the --config argument:
steampipe_export_databricks --config '<your_config>' databricks_pipeline_update
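Exporter binaries built from the Steampipe export framework typically also accept flags such as --select, --where, and --limit to shape the output. These flag names are an assumption based on the common exporter interface rather than documented here, so confirm them with steampipe_export_databricks --help. A hedged sketch exporting only failed updates:

# Flags other than --config are assumed from the common Steampipe exporter
# interface; verify with: steampipe_export_databricks --help
steampipe_export_databricks --config '<your_config>' \
  --select "update_id,pipeline_id,state,creation_time" \
  --where "state='FAILED'" \
  databricks_pipeline_update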