# Table: databricks_compute_cluster - Query Databricks Compute Clusters using SQL
Databricks Compute Clusters are a resource within the Databricks service that allows users to run data analytics workloads on scalable, optimized hardware. They provide a flexible and powerful environment for running a wide range of analytics tasks, from data processing to machine learning. Databricks Compute Clusters can be easily scaled up or down, depending on the computational requirements of the workload.
## Table Usage Guide

The `databricks_compute_cluster` table provides insights into Compute Clusters within Databricks. As a data engineer or data scientist, explore cluster-specific details through this table, including configuration, status, and associated metadata. Utilize it to uncover information about clusters, such as their current state, the hardware configuration, and the version of the Databricks Runtime they are running.
## Examples

### Basic info
Identify and analyze the state and start time of compute clusters in Databricks, along with the user who created them. This can be beneficial in understanding usage patterns and managing resources effectively.
```sql+postgres
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster;
```

```sql+sqlite
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster;
```
### List clusters having local disk encryption disabled
Explore which clusters lack local disk encryption, a crucial security feature. This query is useful for identifying potential security vulnerabilities within your system.
```sql+postgres
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster
where
  not enable_local_disk_encryption;
```

```sql+sqlite
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster
where
  enable_local_disk_encryption = 0;
```
### List clusters that are running
Identify the compute clusters that are currently in the running state. This can assist in managing resources and identifying potential areas for optimization or troubleshooting.
```sql+postgres
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster
where
  state = 'RUNNING';
```

```sql+sqlite
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster
where
  state = 'RUNNING';
```
### Get autoscaling configuration for each cluster
Explore the autoscaling configuration of each computing cluster to understand the minimum and maximum number of workers assigned. This helps in assessing the scalability of your compute resources.
```sql+postgres
select
  cluster_id,
  cluster_name,
  creator_user_name,
  autoscale ->> 'min_workers' as min_workers,
  autoscale ->> 'max_workers' as max_workers
from
  databricks_compute_cluster;
```

```sql+sqlite
select
  cluster_id,
  cluster_name,
  creator_user_name,
  json_extract(autoscale, '$.min_workers') as min_workers,
  json_extract(autoscale, '$.max_workers') as max_workers
from
  databricks_compute_cluster;
```
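Clusters that are not configured for autoscaling report a fixed worker count instead. As a supplementary sketch (not part of the original examples), assuming fixed-size clusters leave the `autoscale` column unset, you could list them together with their `num_workers` value (Postgres syntax):

```sql+postgres
-- assumes fixed-size clusters have a null autoscale column and report size via num_workers
select
  cluster_id,
  cluster_name,
  num_workers
from
  databricks_compute_cluster
where
  autoscale is null;
```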
### Get AWS attributes associated with each cluster
Explore which AWS attributes are associated with each Databricks compute cluster to optimize resource allocation and improve system availability. This can provide insights into how your clusters are configured and help identify potential areas for cost savings or performance improvements.
```sql+postgres
select
  cluster_id,
  cluster_name,
  creator_user_name,
  aws_attributes ->> 'availability' as availability,
  aws_attributes ->> 'ebs_volume_type' as ebs_volume_type,
  aws_attributes ->> 'ebs_volume_count' as ebs_volume_count,
  aws_attributes ->> 'ebs_volume_iops' as ebs_volume_iops,
  aws_attributes ->> 'ebs_volume_size' as ebs_volume_size,
  aws_attributes ->> 'ebs_volume_throughput' as ebs_volume_throughput,
  aws_attributes ->> 'first_on_demand' as first_on_demand,
  aws_attributes ->> 'instance_profile_arn' as instance_profile_arn,
  aws_attributes ->> 'spot_bid_price_percent' as spot_bid_price_percent,
  aws_attributes ->> 'zone_id' as zone_id
from
  databricks_compute_cluster
where
  aws_attributes is not null;
```

```sql+sqlite
select
  cluster_id,
  cluster_name,
  creator_user_name,
  json_extract(aws_attributes, '$.availability') as availability,
  json_extract(aws_attributes, '$.ebs_volume_type') as ebs_volume_type,
  json_extract(aws_attributes, '$.ebs_volume_count') as ebs_volume_count,
  json_extract(aws_attributes, '$.ebs_volume_iops') as ebs_volume_iops,
  json_extract(aws_attributes, '$.ebs_volume_size') as ebs_volume_size,
  json_extract(aws_attributes, '$.ebs_volume_throughput') as ebs_volume_throughput,
  json_extract(aws_attributes, '$.first_on_demand') as first_on_demand,
  json_extract(aws_attributes, '$.instance_profile_arn') as instance_profile_arn,
  json_extract(aws_attributes, '$.spot_bid_price_percent') as spot_bid_price_percent,
  json_extract(aws_attributes, '$.zone_id') as zone_id
from
  databricks_compute_cluster
where
  aws_attributes is not null;
```
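Building on the AWS attributes above, you could also filter for clusters that rely on spot capacity. This is a hedged follow-on sketch, not part of the original examples; it assumes the `availability` field uses the Databricks API values `SPOT`, `SPOT_WITH_FALLBACK`, and `ON_DEMAND` (Postgres syntax):

```sql+postgres
-- assumption: availability is reported as SPOT, SPOT_WITH_FALLBACK, or ON_DEMAND
select
  cluster_id,
  cluster_name,
  aws_attributes ->> 'availability' as availability,
  aws_attributes ->> 'spot_bid_price_percent' as spot_bid_price_percent
from
  databricks_compute_cluster
where
  aws_attributes ->> 'availability' in ('SPOT', 'SPOT_WITH_FALLBACK');
```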
### Get Azure attributes associated with each cluster
This query is useful for gaining insights into the Azure attributes linked with each cluster in your Databricks compute environment. It helps in understanding the availability, on-demand status, log analytics information, and maximum bid price for spot instances, assisting in better resource management and cost optimization.
```sql+postgres
select
  cluster_id,
  cluster_name,
  creator_user_name,
  azure_attributes ->> 'availability' as availability,
  azure_attributes ->> 'first_on_demand' as first_on_demand,
  azure_attributes -> 'log_analytics_info' ->> 'log_analytics_primary_key' as log_analytics_primary_key,
  azure_attributes -> 'log_analytics_info' ->> 'log_analytics_workspace_id' as log_analytics_workspace_id,
  azure_attributes ->> 'spot_bid_max_price' as spot_bid_max_price
from
  databricks_compute_cluster
where
  azure_attributes is not null;
```

```sql+sqlite
select
  cluster_id,
  cluster_name,
  creator_user_name,
  json_extract(azure_attributes, '$.availability') as availability,
  json_extract(azure_attributes, '$.first_on_demand') as first_on_demand,
  json_extract(azure_attributes, '$.log_analytics_info.log_analytics_primary_key') as log_analytics_primary_key,
  json_extract(azure_attributes, '$.log_analytics_info.log_analytics_workspace_id') as log_analytics_workspace_id,
  json_extract(azure_attributes, '$.spot_bid_max_price') as spot_bid_max_price
from
  databricks_compute_cluster
where
  azure_attributes is not null;
```
### Get GCP attributes associated with each cluster
Explore the Google Cloud Platform (GCP) specific attributes of each cluster, including availability, boot disk size, the associated Google service account, and local SSD count. This is useful for managing resources and ensuring optimal configuration for your cloud operations.
```sql+postgres
select
  cluster_id,
  cluster_name,
  creator_user_name,
  gcp_attributes ->> 'availability' as availability,
  gcp_attributes ->> 'boot_disk_size' as boot_disk_size,
  gcp_attributes -> 'google_service_account' as google_service_account,
  gcp_attributes -> 'local_ssd_count' as local_ssd_count
from
  databricks_compute_cluster
where
  gcp_attributes is not null;
```

```sql+sqlite
select
  cluster_id,
  cluster_name,
  creator_user_name,
  json_extract(gcp_attributes, '$.availability') as availability,
  json_extract(gcp_attributes, '$.boot_disk_size') as boot_disk_size,
  json_extract(gcp_attributes, '$.google_service_account') as google_service_account,
  json_extract(gcp_attributes, '$.local_ssd_count') as local_ssd_count
from
  databricks_compute_cluster
where
  gcp_attributes is not null;
```
### List clusters terminated due to inactivity
Identify clusters that were terminated due to inactivity. This is useful to manage resources efficiently and avoid unnecessary costs associated with idle clusters.
```sql+postgres
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster
where
  state = 'TERMINATED'
  and termination_reason ->> 'code' = 'INACTIVITY';
```

```sql+sqlite
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster
where
  state = 'TERMINATED'
  and json_extract(termination_reason, '$.code') = 'INACTIVITY';
```
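Inactivity is only one of several termination codes reported by Databricks. As a follow-on sketch (not part of the original examples), you could group terminated clusters by the `code` field of `termination_reason` to see why clusters are being shut down (Postgres syntax):

```sql+postgres
-- counts terminated clusters per termination reason code
select
  termination_reason ->> 'code' as termination_code,
  count(*) as clusters
from
  databricks_compute_cluster
where
  state = 'TERMINATED'
group by
  termination_code
order by
  clusters desc;
```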
### Get the permissions associated with each cluster
Explore the access levels of different users and groups across each computing cluster. This can be useful for auditing security measures and ensuring appropriate access rights are in place.
```sql+postgres
select
  cluster_id,
  cluster_name,
  acl ->> 'user_name' as principal_user_name,
  acl ->> 'group_name' as principal_group_name,
  acl ->> 'all_permissions' as permission_level
from
  databricks_compute_cluster,
  jsonb_array_elements(cluster_permissions -> 'access_control_list') as acl;
```

```sql+sqlite
select
  cluster_id,
  cluster_name,
  json_extract(acl.value, '$.user_name') as principal_user_name,
  json_extract(acl.value, '$.group_name') as principal_group_name,
  json_extract(acl.value, '$.all_permissions') as permission_level
from
  databricks_compute_cluster,
  json_each(cluster_permissions, '$.access_control_list') as acl;
```
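The `all_permissions` value above is itself a JSON array. If you want one row per permission entry, you can expand it a second time; this hedged sketch (not part of the original examples) assumes each entry exposes a `permission_level` field, as in the Databricks permissions API (Postgres syntax):

```sql+postgres
-- assumption: each all_permissions entry contains a permission_level field
select
  cluster_id,
  cluster_name,
  acl ->> 'user_name' as principal_user_name,
  acl ->> 'group_name' as principal_group_name,
  perm ->> 'permission_level' as permission_level
from
  databricks_compute_cluster,
  jsonb_array_elements(cluster_permissions -> 'access_control_list') as acl,
  jsonb_array_elements(acl -> 'all_permissions') as perm;
```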
### List clusters having elastic disks enabled
Identify clusters with elastic disks (autoscaling local storage) enabled. This can help optimize storage usage and enhance the performance of your data operations.
```sql+postgres
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster
where
  enable_elastic_disk;
```

```sql+sqlite
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster
where
  enable_elastic_disk = 1;
```
### List clusters that will automatically terminate within 1 hour of inactivity
Explore clusters set to automatically terminate after an hour of inactivity, which can be useful for managing resources and preventing unnecessary costs. This query helps in identifying such clusters and can guide decisions on resource allocation and usage.
```sql+postgres
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster
where
  autotermination_minutes > 0
  and autotermination_minutes <= 60;
```

```sql+sqlite
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster
where
  autotermination_minutes > 0
  and autotermination_minutes <= 60;
```
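The inverse check is also useful. In the Databricks clusters API, an `autotermination_minutes` value of 0 means automatic termination is disabled, so the sketch below (not part of the original examples) lists clusters that will never shut down on their own (Postgres syntax):

```sql+postgres
-- autotermination_minutes = 0 means automatic termination is disabled
select
  cluster_id,
  cluster_name,
  creator_user_name,
  state
from
  databricks_compute_cluster
where
  autotermination_minutes = 0;
```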
### List clusters created by the Databricks Jobs Scheduler
Explore which clusters have been created by the Databricks Jobs Scheduler. This can help in understanding the distribution of resources and determining if there are any idle or underutilized clusters, enabling more efficient resource management.
```sql+postgres
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster
where
  cluster_source = 'JOB';
```

```sql+sqlite
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster
where
  cluster_source = 'JOB';
```
### List clusters that can be accessed by a single user
Identify clusters configured in single user access mode, where only one designated user can attach to the cluster. This aids in understanding the distribution of resources and permissions across different users in a Databricks environment.
```sql+postgres
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster
where
  data_security_mode = 'SINGLE_USER';
```

```sql+sqlite
select
  cluster_id,
  cluster_name,
  creator_user_name,
  start_time,
  state,
  account_id
from
  databricks_compute_cluster
where
  data_security_mode = 'SINGLE_USER';
```
## Schema for databricks_compute_cluster
| Name | Type | Operators | Description |
| --- | --- | --- | --- |
| _ctx | jsonb | | Steampipe context in JSON form, e.g. connection_name. |
| account_id | text | | The Databricks Account ID in which the resource is located. |
| autoscale | jsonb | | Parameters needed in order to automatically scale clusters up and down based on load. |
| autotermination_minutes | bigint | | The number of minutes of inactivity after which Databricks automatically terminates this cluster. |
| aws_attributes | jsonb | | Attributes related to clusters running on Amazon Web Services. |
| azure_attributes | jsonb | | Attributes related to clusters running on Microsoft Azure. |
| cluster_cores | double precision | | Number of CPU cores available for this cluster. |
| cluster_id | text | = | Canonical identifier for the cluster. |
| cluster_log_conf | jsonb | | The configuration for delivering spark logs to a long-term storage destination. |
| cluster_log_status | jsonb | | The status of the cluster log delivery. |
| cluster_memory_mb | double precision | | Total amount of cluster memory, in megabytes. |
| cluster_name | text | | Cluster name requested by the user. |
| cluster_permissions | jsonb | | The permissions that the cluster has on the workspace. |
| cluster_source | text | | Determines whether the cluster was created by a user through the UI, created by the Databricks Jobs Scheduler, or through an API request. |
| creator_user_name | text | | Creator user name. The field won't be included if the user has already been deleted. |
| custom_tags | jsonb | | Additional tags for cluster resources. |
| data_security_mode | text | | The data security level of the cluster. |
| default_tags | jsonb | | Tags that are added by Databricks regardless of any `custom_tags`. |
| docker_image | jsonb | | The Docker image to use for every container in the cluster. |
| driver | jsonb | | Node on which the Spark driver resides. |
| driver_instance_pool_id | text | | The optional ID of the instance pool to which the driver of the cluster belongs. |
| driver_node_type_id | text | | The node type of the Spark driver. |
| enable_elastic_disk | boolean | | Autoscaling Local Storage: when enabled, this cluster will dynamically acquire additional disk space when its Spark workers are running low on disk space. |
| enable_local_disk_encryption | boolean | | Whether to enable LUKS on cluster VMs' local disks. |
| executors | jsonb | | Nodes on which the Spark executors reside. |
| gcp_attributes | jsonb | | Attributes related to clusters running on Google Cloud Platform. |
| init_scripts | jsonb | | The configuration for storing init scripts. |
| instance_pool_id | text | | The optional ID of the instance pool to which the cluster belongs. |
| jdbc_port | bigint | | Port on which Spark JDBC server is listening in the driver node. |
| last_restarted_time | timestamp with time zone | | The time when the cluster was started/restarted. |
| last_state_loss_time | timestamp with time zone | | Time when the cluster driver last lost its state (due to a restart or driver failure). |
| node_type_id | text | | This field encodes, through a single value, the resources available to each of the Spark nodes in this cluster. |
| num_workers | bigint | | Number of worker nodes that this cluster should have. |
| policy_id | text | | The ID of the cluster policy used to create the cluster, if applicable. |
| runtime_engine | text | | Decides which runtime engine to use. |
| single_user_name | text | | Single user name if data_security_mode is `SINGLE_USER`. |
| spark_conf | jsonb | | An object containing a set of optional, user-specified Spark configuration key-value pairs. |
| spark_context_id | text | | A canonical SparkContext identifier. |
| spark_env_vars | jsonb | | An object containing a set of optional, user-specified Spark environment variables. |
| spark_version | text | | The Spark version of the cluster. |
| ssh_public_keys | jsonb | | SSH public key contents that will be added to each Spark node in this cluster. |
| start_time | timestamp with time zone | | The time when the cluster creation request was received. |
| state | text | | The current state of the cluster. |
| state_message | text | | The message associated with the most recent state transition. |
| terminated_time | timestamp with time zone | | The time when the cluster was terminated. |
| termination_reason | jsonb | | The reason why the cluster was terminated. |
| title | text | | The title of the resource. |
| workload_type_client | jsonb | | Defines what type of clients can use the cluster. |
## Export
This table is available as a standalone Exporter CLI. Steampipe exporters are stand-alone binaries that allow you to extract data using Steampipe plugins without a database.
You can download the tarball for your platform from the Releases page, but it is simplest to install them with the `steampipe_export_installer.sh` script:
```sh
/bin/sh -c "$(curl -fsSL https://steampipe.io/install/export.sh)" -- databricks
```
You can pass the configuration to the command with the `--config` argument:
```sh
steampipe_export_databricks --config '<your_config>' databricks_compute_cluster
```