turbot/databricks
steampipe plugin install databricks

Table: databricks_pipeline - Query Databricks Pipelines using SQL

Databricks Pipelines are a set of tools within the Databricks platform for building, testing, and deploying machine learning workflows. They provide a unified way to automate the end-to-end machine learning lifecycle, including feature extraction, model training and testing, and model deployment. Databricks Pipelines help you manage and streamline your machine learning workflows, ensuring reproducibility and facilitating collaboration.

Table Usage Guide

The databricks_pipeline table provides insights into Pipelines within Databricks. As a Data Scientist or Machine Learning Engineer, explore pipeline-specific details through this table, including configuration, status, and associated metadata. Utilize it to manage and streamline your machine learning workflows, ensuring reproducibility and facilitating collaboration.

Examples

Basic info

Explore which pipelines are active in your Databricks environment, identifying their associated cluster and creator. This can help in managing resources and understanding user activity.

select
  pipeline_id,
  name,
  cluster_id,
  creator_user_name,
  state,
  edition,
  account_id
from
  databricks_pipeline;

List pipelines that failed to start

Identify instances where certain pipelines have failed to initiate. This provides insights into potential issues within your system, allowing you to troubleshoot and resolve these problems effectively.

select
  pipeline_id,
  name,
  cluster_id,
  creator_user_name,
  state,
  account_id
from
  databricks_pipeline
where
  state = 'FAILED';

Get cluster details associated with each pipeline

Analyze the settings to understand the relationship between different pipelines and their associated clusters in Databricks. This can help in resource allocation and optimization by providing insights into the computational resources each pipeline utilizes.

select
  p.pipeline_id,
  p.name,
  p.cluster_id,
  c.cluster_name,
  c.cluster_source,
  c.cluster_cores,
  c.cluster_memory_mb,
  c.runtime_engine,
  c.account_id
from
  databricks_pipeline p,
  databricks_compute_cluster c
where
  p.cluster_id = c.cluster_id
  and p.account_id = c.account_id;

Get the last completed pipeline update for each pipeline

This query allows you to gain insights into the most recent updates completed for each pipeline in your Databricks environment. It can be useful for monitoring pipeline performance, identifying potential issues, and understanding the timeline of your data processing tasks.

PostgreSQL:

select
  p.pipeline_id,
  p.name,
  p.state,
  u ->> 'creation_time' as update_creation_time,
  u ->> 'state' as update_state,
  u ->> 'update_id' as update_id,
  account_id
from
  databricks_pipeline p,
  jsonb_array_elements(p.latest_updates) as u
where
  u ->> 'state' = 'COMPLETED'
order by
  update_creation_time desc
limit 1;

SQLite:

select
  p.pipeline_id,
  p.name,
  p.state,
  json_extract(u.value, '$.creation_time') as update_creation_time,
  json_extract(u.value, '$.state') as update_state,
  json_extract(u.value, '$.update_id') as update_id,
  p.account_id
from
  databricks_pipeline p,
  json_each(p.latest_updates) as u
where
  json_extract(u.value, '$.state') = 'COMPLETED'
order by
  update_creation_time desc
limit 1;
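
Note that the two queries above return only the single most recent completed update across all pipelines. To get the latest completed update for every pipeline, as the heading suggests, a PostgreSQL variant using distinct on can be used; this is a sketch, and the same pattern applies to the failed-update example below:

select distinct on (p.pipeline_id)
  p.pipeline_id,
  p.name,
  p.state,
  u ->> 'creation_time' as update_creation_time,
  u ->> 'state' as update_state,
  u ->> 'update_id' as update_id,
  p.account_id
from
  databricks_pipeline p,
  jsonb_array_elements(p.latest_updates) as u
where
  u ->> 'state' = 'COMPLETED'
order by
  -- distinct on keeps the first row per pipeline_id, so sort the newest update first
  p.pipeline_id,
  update_creation_time desc;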

Get the last failed pipeline update for each pipeline

This example demonstrates how to pinpoint the most recent pipeline update that failed. This can help in troubleshooting the issues that led to the failure, thus improving the reliability and efficiency of your pipelines.

PostgreSQL:

select
  p.pipeline_id,
  p.name,
  p.state,
  u ->> 'creation_time' as update_creation_time,
  u ->> 'state' as update_state,
  u ->> 'update_id' as update_id,
  account_id
from
  databricks_pipeline p,
  jsonb_array_elements(p.latest_updates) as u
where
  u ->> 'state' = 'FAILED'
order by
  update_creation_time desc
limit 1;

SQLite:

select
  p.pipeline_id,
  p.name,
  p.state,
  json_extract(u.value, '$.creation_time') as update_creation_time,
  json_extract(u.value, '$.state') as update_state,
  json_extract(u.value, '$.update_id') as update_id,
  account_id
from
  databricks_pipeline p,
  json_each(p.latest_updates) as u
where
  json_extract(u.value, '$.state') = 'FAILED'
order by
  update_creation_time desc
limit 1;

Get pipelines publishing data in a catalog table

Find pipelines that publish data to a catalog in Databricks. This is useful for understanding and managing data flow across the pipelines in your account.

select
  pipeline_id,
  name,
  cluster_id,
  catalog,
  target,
  state,
  account_id
from
  databricks_pipeline
where
  catalog is not null;

List pipelines that are manually triggered

Explore which pipelines in your Databricks environment are manually triggered, allowing you to understand which processes require user initiation and potentially optimize for automation.

PostgreSQL:

select
  pipeline_id,
  name,
  cluster_id,
  creator_user_name,
  state,
  account_id
from
  databricks_pipeline
where
  not continuous;

SQLite:

select
  pipeline_id,
  name,
  cluster_id,
  creator_user_name,
  state,
  account_id
from
  databricks_pipeline
where
  continuous = 0;

List unhealthy pipelines

Identify unhealthy pipelines in your Databricks environment. This allows for quick detection and troubleshooting of problematic pipelines, improving performance and reliability.

select
  pipeline_id,
  name,
  cluster_id,
  creator_user_name,
  state,
  account_id
from
  databricks_pipeline
where
  health = 'UNHEALTHY';

List pipelines in development mode

Explore which pipelines are currently in development mode. This can be useful to identify the pipelines that are under development and not yet in production, allowing you to manage and track your development resources effectively.

PostgreSQL:

select
  pipeline_id,
  name,
  cluster_id,
  creator_user_name,
  state,
  account_id
from
  databricks_pipeline
where
  development;

SQLite:

select
  pipeline_id,
  name,
  cluster_id,
  creator_user_name,
  state,
  account_id
from
  databricks_pipeline
where
  development = 1;

Get the permissions associated with each pipeline

Explore the different permissions assigned to each data pipeline to understand who has access and at what level. This can help manage security and access control within your Databricks environment.

PostgreSQL:

select
  pipeline_id,
  name,
  acl ->> 'user_name' as principal_user_name,
  acl ->> 'group_name' as principal_group_name,
  acl ->> 'all_permissions' as permission_level
from
  databricks_pipeline,
  jsonb_array_elements(pipeline_permissions -> 'access_control_list') as acl;

SQLite:

select
  pipeline_id,
  name,
  json_extract(acl.value, '$.user_name') as principal_user_name,
  json_extract(acl.value, '$.group_name') as principal_group_name,
  json_extract(acl.value, '$.all_permissions') as permission_level
from
  databricks_pipeline,
  json_each(databricks_pipeline.pipeline_permissions, '$.access_control_list') as acl;
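
In these queries, all_permissions is itself a JSON array of permission objects, so the column aliased as permission_level above contains the whole array. Assuming each entry exposes a permission_level key (as in the Databricks permissions API), the following PostgreSQL sketch flattens it one level further:

select
  pipeline_id,
  name,
  acl ->> 'user_name' as principal_user_name,
  acl ->> 'group_name' as principal_group_name,
  perm ->> 'permission_level' as permission_level
from
  databricks_pipeline,
  -- one row per principal, then one row per permission entry for that principal
  jsonb_array_elements(pipeline_permissions -> 'access_control_list') as acl,
  jsonb_array_elements(acl -> 'all_permissions') as perm;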

List libraries installed on each pipeline

Explore the various libraries installed on each pipeline in your Databricks environment. This can help you manage dependencies and understand the resources each pipeline is utilizing.

PostgreSQL:

select
  pipeline_id,
  name,
  l -> 'notebook' ->> 'path' as notebook_path,
  l ->> 'maven' as maven,
  l ->> 'whl' as whl,
  l ->> 'jar' as jar,
  l -> 'file' ->> 'path' as file_path,
  account_id
from
  databricks_pipeline,
  jsonb_array_elements(libraries) as l;

SQLite:

select
  pipeline_id,
  name,
  json_extract(l.value, '$.notebook.path') as notebook_path,
  json_extract(l.value, '$.maven') as maven,
  json_extract(l.value, '$.whl') as whl,
  json_extract(l.value, '$.jar') as jar,
  json_extract(l.value, '$.file.path') as file_path,
  account_id
from
  databricks_pipeline,
  json_each(libraries) as l;

Get trigger settings for each pipeline

Analyze the settings to understand the trigger configurations for each data processing pipeline. This can help in assessing the frequency and type of triggers, which is crucial for managing and optimizing data workflows.

PostgreSQL:

select
  pipeline_id,
  name,
  trigger ->> 'cron' as cron,
  trigger ->> 'manual' as is_manual,
  account_id
from
  databricks_pipeline
where
  trigger is not null;

SQLite:

select
  pipeline_id,
  name,
  json_extract(trigger, '$.cron') as cron,
  json_extract(trigger, '$.manual') as is_manual,
  account_id
from
  databricks_pipeline
where
  trigger is not null;

Get cluster settings for each pipeline

Explore the configuration of each pipeline to understand its associated settings, such as the type of node it uses, whether it has autoscale enabled, and the number of workers it employs. This information can be useful in optimizing pipeline performance and resource usage.

PostgreSQL:

select
  pipeline_id,
  name,
  c ->> 'instance_pool_id' as instance_pool_id,
  c ->> 'node_type_id' as node_type_id,
  c ->> 'autoscale' as autoscale,
  c ->> 'num_workers' as num_workers,
  c ->> 'policy_id' as policy_id,
  account_id
from
  databricks_pipeline,
  jsonb_array_elements(clusters) as c;

SQLite:

select
  pipeline_id,
  name,
  json_extract(c.value, '$.instance_pool_id') as instance_pool_id,
  json_extract(c.value, '$.node_type_id') as node_type_id,
  json_extract(c.value, '$.autoscale') as autoscale,
  json_extract(c.value, '$.num_workers') as num_workers,
  json_extract(c.value, '$.policy_id') as policy_id,
  account_id
from
  databricks_pipeline,
  json_each(clusters) as c;
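
The autoscale column above is returned as a raw JSON object. Assuming it follows the Databricks cluster autoscale shape with min_workers and max_workers fields, this PostgreSQL sketch extracts the scaling bounds for clusters that have autoscaling configured:

select
  pipeline_id,
  name,
  c ->> 'node_type_id' as node_type_id,
  -- autoscale is a nested object; pull out its bounds as text
  c -> 'autoscale' ->> 'min_workers' as min_workers,
  c -> 'autoscale' ->> 'max_workers' as max_workers,
  account_id
from
  databricks_pipeline,
  jsonb_array_elements(clusters) as c
where
  c -> 'autoscale' is not null;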

Schema for databricks_pipeline

Name | Type | Operators | Description
---- | ---- | --------- | -----------
_ctx | jsonb | | Steampipe context in JSON form, e.g. connection_name.
account_id | text | | The Databricks Account ID in which the resource is located.
catalog | text | | A catalog in Unity Catalog to publish data from this pipeline to.
cause | text | | An optional message detailing the cause of the pipeline state.
channel | text | | DLT Release Channel that specifies which version to use.
cluster_id | text | | The unique identifier of the cluster running the pipeline.
clusters | jsonb | | Cluster settings for this pipeline deployment.
configuration | jsonb | | String-String configuration for this pipeline execution.
continuous | boolean | | Whether the pipeline is continuous or triggered.
creator_user_name | text | | The user who created the pipeline.
development | boolean | | Whether the pipeline is in Development mode.
edition | text | | Pipeline product edition.
filters | jsonb | | Filters on which Pipeline packages to include in the deployed graph.
health | text | | The health of the pipeline.
last_modified | text | | The last time the pipeline settings were modified or created.
latest_updates | jsonb | | Status of the latest updates for the pipeline.
libraries | jsonb | | Libraries or code needed by this deployment.
name | text | | The user-friendly name of the pipeline.
photon | boolean | | Whether photon is enabled for this pipeline.
pipeline_id | text | = | Unique identifier of pipeline.
pipeline_permissions | jsonb | | Permissions for this pipeline.
run_as_user_name | text | | The username that the pipeline runs as.
serverless | boolean | | Whether serverless compute is enabled for this pipeline.
state | text | | The current state of the pipeline.
storage | text | | DBFS root directory for storing checkpoints and tables.
target | text | | Target schema (database) to add tables in this pipeline to.
title | text | | The title of the resource.
trigger | jsonb | | Which pipeline trigger to use.
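
The = operator listed for pipeline_id indicates that it is a key column, so a single pipeline can be fetched efficiently by filtering on it directly. A minimal sketch, using a placeholder ID:

select
  pipeline_id,
  name,
  state,
  creator_user_name,
  account_id
from
  databricks_pipeline
where
  pipeline_id = 'your-pipeline-id'; -- replace with an actual pipeline ID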

Export

This table is available as a standalone Exporter CLI. Steampipe exporters are stand-alone binaries that allow you to extract data using Steampipe plugins without a database.

You can download the tarball for your platform from the Releases page, but it is simplest to install the exporter with the steampipe_export_installer.sh script:

/bin/sh -c "$(curl -fsSL https://steampipe.io/install/export.sh)" -- databricks

You can pass the configuration to the command with the --config argument:

steampipe_export_databricks --config '<your_config>' databricks_pipeline
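
For example, assuming token-based account-level authentication, the config string carries the same arguments you would set in the plugin's connection config (the key names and values below are illustrative; check the databricks plugin documentation for the exact options your setup needs):

# hypothetical values shown; substitute your own account ID, token, and host
steampipe_export_databricks --config 'account_id = "abcd1234" account_token = "dsapi..." account_host = "https://accounts.cloud.databricks.com/"' databricks_pipeline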