Table: databricks_pipeline - Query Databricks Pipelines using SQL
Databricks Pipelines are a set of tools within the Databricks platform that allow you to build, test, and deploy machine learning workflows. They provide a unified way to automate the end-to-end machine learning lifecycle, including feature extraction, model training and testing, and model deployment. Databricks Pipelines help you manage and streamline your machine learning workflows, ensuring reproducibility and facilitating collaboration.
Table Usage Guide
The databricks_pipeline table provides insights into pipelines within Databricks. As a Data Scientist or Machine Learning Engineer, explore pipeline-specific details through this table, including configuration, status, and associated metadata. Utilize it to manage and streamline your machine learning workflows, ensuring reproducibility and facilitating collaboration.
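For a quick first look, you can aggregate pipelines by state. This is a minimal sketch that uses only columns listed in the schema below and runs unchanged in both PostgreSQL and SQLite:

select
  state,
  count(*) as pipeline_count
from
  databricks_pipeline
group by
  state;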
Examples
Basic info
Explore which pipelines are active in your Databricks environment, identifying their associated cluster and creator. This can help in managing resources and understanding user activity.
select
  pipeline_id,
  name,
  cluster_id,
  creator_user_name,
  state,
  edition,
  account_id
from
  databricks_pipeline;
List pipelines that failed to start
Identify instances where certain pipelines have failed to initiate. This provides insights into potential issues within your system, allowing you to troubleshoot and resolve these problems effectively.
select
  pipeline_id,
  name,
  cluster_id,
  creator_user_name,
  state,
  account_id
from
  databricks_pipeline
where
  state = 'FAILED';
Get cluster details associated with each pipeline
Analyze the settings to understand the relationship between different pipelines and their associated clusters in Databricks. This can help in resource allocation and optimization by providing insights into the computational resources each pipeline utilizes.
select
  p.pipeline_id,
  p.name,
  p.cluster_id,
  c.cluster_name,
  c.cluster_source,
  c.cluster_cores,
  c.cluster_memory_mb,
  c.runtime_engine,
  c.account_id
from
  databricks_pipeline p,
  databricks_compute_cluster c
where
  p.cluster_id = c.cluster_id
  and p.account_id = c.account_id;
Get the last completed pipeline update for each pipeline
This query allows you to gain insights into the most recent updates completed for each pipeline in your Databricks environment. It can be useful for monitoring pipeline performance, identifying potential issues, and understanding the timeline of your data processing tasks.
PostgreSQL:

select
  p.pipeline_id,
  p.name,
  p.state,
  u ->> 'creation_time' as update_creation_time,
  u ->> 'state' as update_state,
  u ->> 'update_id' as update_id,
  p.account_id
from
  databricks_pipeline p,
  jsonb_array_elements(p.latest_updates) as u
where
  u ->> 'state' = 'COMPLETED'
order by
  update_creation_time desc
limit 1;
SQLite:

select
  p.pipeline_id,
  p.name,
  p.state,
  json_extract(u.value, '$.creation_time') as update_creation_time,
  json_extract(u.value, '$.state') as update_state,
  json_extract(u.value, '$.update_id') as update_id,
  p.account_id
from
  databricks_pipeline p,
  json_each(p.latest_updates) as u
where
  json_extract(u.value, '$.state') = 'COMPLETED'
order by
  update_creation_time desc
limit 1;
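Note that with limit 1 the queries above return only the single most recent completed update across all pipelines. If you want one row per pipeline, a distinct on variant works in PostgreSQL (a minimal sketch; SQLite would need a window function instead):

select distinct on (p.pipeline_id)
  p.pipeline_id,
  p.name,
  p.state,
  u ->> 'creation_time' as update_creation_time,
  u ->> 'update_id' as update_id
from
  databricks_pipeline p,
  jsonb_array_elements(p.latest_updates) as u
where
  u ->> 'state' = 'COMPLETED'
order by
  p.pipeline_id,
  update_creation_time desc;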
Get the last failed pipeline update for each pipeline
This example demonstrates how to pinpoint the most recent pipeline update that failed. This can help in troubleshooting the issues that led to the failure, thus improving the reliability and efficiency of your pipelines.
PostgreSQL:

select
  p.pipeline_id,
  p.name,
  p.state,
  u ->> 'creation_time' as update_creation_time,
  u ->> 'state' as update_state,
  u ->> 'update_id' as update_id,
  p.account_id
from
  databricks_pipeline p,
  jsonb_array_elements(p.latest_updates) as u
where
  u ->> 'state' = 'FAILED'
order by
  update_creation_time desc
limit 1;
SQLite:

select
  p.pipeline_id,
  p.name,
  p.state,
  json_extract(u.value, '$.creation_time') as update_creation_time,
  json_extract(u.value, '$.state') as update_state,
  json_extract(u.value, '$.update_id') as update_id,
  p.account_id
from
  databricks_pipeline p,
  json_each(p.latest_updates) as u
where
  json_extract(u.value, '$.state') = 'FAILED'
order by
  update_creation_time desc
limit 1;
Get pipelines publishing data in a catalog table
Discover the pipelines that publish data to a catalog in Unity Catalog. This is useful for identifying and managing data flow across the different pipelines in your account.
select
  pipeline_id,
  name,
  cluster_id,
  catalog,
  target,
  state,
  account_id
from
  databricks_pipeline
where
  catalog is not null;
List pipelines that are manually triggered
Explore which pipelines in your Databricks environment are manually triggered, allowing you to understand which processes require user initiation and potentially optimize for automation.
PostgreSQL:

select
  pipeline_id,
  name,
  cluster_id,
  creator_user_name,
  state,
  account_id
from
  databricks_pipeline
where
  not continuous;
SQLite:

select
  pipeline_id,
  name,
  cluster_id,
  creator_user_name,
  state,
  account_id
from
  databricks_pipeline
where
  continuous = 0;
List unhealthy pipelines
Identify unhealthy pipelines in your Databricks environment so you can quickly troubleshoot problematic pipelines, enhancing system performance and reliability.
select
  pipeline_id,
  name,
  cluster_id,
  creator_user_name,
  state,
  account_id
from
  databricks_pipeline
where
  health = 'UNHEALTHY';
List pipelines in development mode
Explore which pipelines are currently in development mode. This can be useful to identify the pipelines that are under development and not yet in production, allowing you to manage and track your development resources effectively.
PostgreSQL:

select
  pipeline_id,
  name,
  cluster_id,
  creator_user_name,
  state,
  account_id
from
  databricks_pipeline
where
  development;
SQLite:

select
  pipeline_id,
  name,
  cluster_id,
  creator_user_name,
  state,
  account_id
from
  databricks_pipeline
where
  development = 1;
Get the permissions associated to each pipeline
Explore the different permissions assigned to each data pipeline to understand who has access and at what level. This can help manage security and access control within your Databricks environment.
PostgreSQL:

select
  pipeline_id,
  name,
  acl ->> 'user_name' as principal_user_name,
  acl ->> 'group_name' as principal_group_name,
  acl ->> 'all_permissions' as permission_level
from
  databricks_pipeline,
  jsonb_array_elements(pipeline_permissions -> 'access_control_list') as acl;
SQLite:

select
  pipeline_id,
  name,
  json_extract(acl.value, '$.user_name') as principal_user_name,
  json_extract(acl.value, '$.group_name') as principal_group_name,
  json_extract(acl.value, '$.all_permissions') as permission_level
from
  databricks_pipeline,
  json_each(databricks_pipeline.pipeline_permissions, '$.access_control_list') as acl;
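The all_permissions value returned above is itself a JSON array. Assuming each element carries a permission_level field, as in the Databricks permissions API, you can unnest it a second time to get one row per grant (a PostgreSQL sketch):

select
  pipeline_id,
  name,
  acl ->> 'user_name' as principal_user_name,
  acl ->> 'group_name' as principal_group_name,
  perm ->> 'permission_level' as permission_level
from
  databricks_pipeline,
  jsonb_array_elements(pipeline_permissions -> 'access_control_list') as acl,
  jsonb_array_elements(acl -> 'all_permissions') as perm;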
List libraries installed on each pipeline
Explore the various libraries installed on each pipeline in your Databricks environment. This can help you manage dependencies and understand the resources each pipeline is utilizing.
PostgreSQL:

select
  pipeline_id,
  name,
  l -> 'notebook' ->> 'path' as notebook_path,
  l ->> 'maven' as maven,
  l ->> 'whl' as whl,
  l ->> 'jar' as jar,
  l -> 'file' ->> 'path' as file_path,
  account_id
from
  databricks_pipeline,
  jsonb_array_elements(libraries) as l;
SQLite:

select
  pipeline_id,
  name,
  json_extract(l.value, '$.notebook.path') as notebook_path,
  json_extract(l.value, '$.maven') as maven,
  json_extract(l.value, '$.whl') as whl,
  json_extract(l.value, '$.jar') as jar,
  json_extract(l.value, '$.file.path') as file_path,
  account_id
from
  databricks_pipeline,
  json_each(libraries) as l;
Get trigger settings for each pipeline
Analyze the settings to understand the trigger configurations for each data processing pipeline. This can help in assessing the frequency and type of triggers, which is crucial for managing and optimizing data workflows.
PostgreSQL:

select
  pipeline_id,
  name,
  trigger ->> 'cron' as cron,
  trigger ->> 'manual' as is_manual,
  account_id
from
  databricks_pipeline
where
  trigger is not null;
SQLite:

select
  pipeline_id,
  name,
  json_extract(trigger, '$.cron') as cron,
  json_extract(trigger, '$.manual') as is_manual,
  account_id
from
  databricks_pipeline
where
  trigger is not null;
Get cluster settings for each pipeline
Explore the configuration of each pipeline to understand its associated settings, such as the type of node it uses, whether it has autoscale enabled, and the number of workers it employs. This information can be useful in optimizing pipeline performance and resource usage.
PostgreSQL:

select
  pipeline_id,
  name,
  c ->> 'instance_pool_id' as instance_pool_id,
  c ->> 'node_type_id' as node_type_id,
  c ->> 'autoscale' as autoscale,
  c ->> 'num_workers' as num_workers,
  c ->> 'policy_id' as policy_id,
  account_id
from
  databricks_pipeline,
  jsonb_array_elements(clusters) as c;
SQLite:

select
  pipeline_id,
  name,
  json_extract(c.value, '$.instance_pool_id') as instance_pool_id,
  json_extract(c.value, '$.node_type_id') as node_type_id,
  json_extract(c.value, '$.autoscale') as autoscale,
  json_extract(c.value, '$.num_workers') as num_workers,
  json_extract(c.value, '$.policy_id') as policy_id,
  account_id
from
  databricks_pipeline,
  json_each(clusters) as c;
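If a cluster entry uses autoscaling, its autoscale object typically contains min_workers and max_workers fields. Assuming that structure, you can surface the scaling bounds directly (a PostgreSQL sketch):

select
  pipeline_id,
  name,
  c ->> 'node_type_id' as node_type_id,
  c -> 'autoscale' ->> 'min_workers' as min_workers,
  c -> 'autoscale' ->> 'max_workers' as max_workers
from
  databricks_pipeline,
  jsonb_array_elements(clusters) as c
where
  c -> 'autoscale' is not null;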
Schema for databricks_pipeline
Name | Type | Operators | Description |
---|---|---|---|
_ctx | jsonb | | Steampipe context in JSON form. |
account_id | text | =, !=, ~~, ~~*, !~~, !~~* | The Databricks Account ID in which the resource is located. |
catalog | text | | A catalog in Unity Catalog to publish data from this pipeline to. |
cause | text | | An optional message detailing the cause of the pipeline state. |
channel | text | | DLT Release Channel that specifies which version to use. |
cluster_id | text | | The unique identifier of the cluster running the pipeline. |
clusters | jsonb | | Cluster settings for this pipeline deployment. |
configuration | jsonb | | String-String configuration for this pipeline execution. |
continuous | boolean | | Whether the pipeline is continuous or triggered. |
creator_user_name | text | | The user who created the pipeline. |
development | boolean | | Whether the pipeline is in Development mode. |
edition | text | | Pipeline product edition. |
filters | jsonb | | Filters on which Pipeline packages to include in the deployed graph. |
health | text | | The health of the pipeline. |
last_modified | text | | The last time the pipeline settings were modified or created. |
latest_updates | jsonb | | Status of the latest updates for the pipeline. |
libraries | jsonb | | Libraries or code needed by this deployment. |
name | text | | The user-friendly name of the pipeline. |
photon | boolean | | Whether photon is enabled for this pipeline. |
pipeline_id | text | = | Unique identifier of pipeline. |
pipeline_permissions | jsonb | | Permissions for this pipeline. |
run_as_user_name | text | | The username that the pipeline runs as. |
serverless | boolean | | Whether serverless compute is enabled for this pipeline. |
sp_connection_name | text | =, !=, ~~, ~~*, !~~, !~~* | Steampipe connection name. |
sp_ctx | jsonb | | Steampipe context in JSON form. |
state | text | | The current state of the pipeline. |
storage | text | | DBFS root directory for storing checkpoints and tables. |
target | text | | Target schema (database) to add tables in this pipeline to. |
title | text | | The title of the resource. |
trigger | jsonb | | Which pipeline trigger to use. |
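Since pipeline_id supports the = operator, you can also look up a single pipeline directly by its ID. A minimal sketch follows; the ID below is a hypothetical placeholder:

select
  pipeline_id,
  name,
  state,
  creator_user_name
from
  databricks_pipeline
where
  pipeline_id = 'a1b2c3d4-0000-1111-2222-333344445555';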
Export
This table is available as a standalone Exporter CLI. Steampipe exporters are stand-alone binaries that allow you to extract data using Steampipe plugins without a database.
You can download the tarball for your platform from the Releases page, but it is simplest to install them with the steampipe_export_installer.sh script:
/bin/sh -c "$(curl -fsSL https://steampipe.io/install/export.sh)" -- databricks
You can pass the configuration to the command with the --config argument:
steampipe_export_databricks --config '<your_config>' databricks_pipeline