Table: databricks_job_run - Query Databricks Job Runs using SQL
Databricks Job Runs are a resource within the Databricks service that represent individual executions of a job, triggered on a schedule or on demand. They provide detailed information about each execution, including the job ID, run ID, start time, and state, which helps users track the status and progress of their Databricks jobs.
Table Usage Guide
The databricks_job_run table provides insights into job runs within Databricks. As a data engineer, you can explore specific details about each job run through this table, including the job ID, run ID, start time, and state. Use it to monitor the status and progress of your Databricks jobs, helping you to manage and optimize your data processing tasks.
Examples
Basic info
Explore which Databricks job runs have taken place, identifying details such as who initiated them and how long they lasted. This can be particularly beneficial for auditing purposes or for understanding patterns in job run durations and frequencies.
PostgreSQL:

```sql
select
  run_id,
  run_name,
  job_id,
  original_attempt_run_id,
  attempt_number,
  creator_user_name,
  run_duration as run_duration_ms,
  account_id
from
  databricks_job_run;
```

SQLite:

```sql
select
  run_id,
  run_name,
  job_id,
  original_attempt_run_id,
  attempt_number,
  creator_user_name,
  run_duration as run_duration_ms,
  account_id
from
  databricks_job_run;
```
Get total runs per job
Discover the frequency of each job execution to better understand the workload distribution. This could be useful in identifying heavily used jobs that may need optimization or resource allocation adjustments.
PostgreSQL:

```sql
select
  job_id,
  count(*) as total_runs
from
  databricks_job_run
group by
  job_id;
```

SQLite:

```sql
select
  job_id,
  count(*) as total_runs
from
  databricks_job_run
group by
  job_id;
```
Get total runs per job per day
Explore the frequency of job runs on a daily basis to understand the workload distribution and identify any potential bottlenecks or high-activity periods. This can help in optimizing job scheduling and resource allocation.
PostgreSQL:

```sql
select
  job_id,
  date_trunc('day', start_time) as day,
  count(*) as total_runs
from
  databricks_job_run
group by
  job_id,
  day
order by
  day;
```

SQLite:

```sql
select
  job_id,
  date(start_time) as day,
  count(*) as total_runs
from
  databricks_job_run
group by
  job_id,
  day
order by
  day;
```
Get the state of the last run for each job
Determine the status of the most recent run for each job. This can help you understand the success or failure of your latest executions, enabling you to troubleshoot issues or optimize performance.
PostgreSQL:

```sql
select distinct on (job_id)
  job_id,
  run_id,
  run_name,
  attempt_number,
  state ->> 'state_message' as state_message,
  state ->> 'life_cycle_state' as life_cycle_state,
  state ->> 'result_state' as result_state,
  account_id
from
  databricks_job_run
order by
  job_id,
  start_time desc;
```

SQLite:

```sql
select
  job_id,
  run_id,
  run_name,
  attempt_number,
  json_extract(state, '$.state_message') as state_message,
  json_extract(state, '$.life_cycle_state') as life_cycle_state,
  json_extract(state, '$.result_state') as result_state,
  account_id,
  max(start_time) as start_time
from
  databricks_job_run
group by
  job_id;
```
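Building on this, you can narrow the result to jobs whose most recent run failed. This is a sketch for the PostgreSQL dialect; it assumes 'FAILED' as the result state of interest, one of the result states reported by the Databricks Jobs API.

```sql
-- Sketch (PostgreSQL): jobs whose most recent run ended with a FAILED result state.
select *
from (
  select distinct on (job_id)
    job_id,
    run_id,
    run_name,
    state ->> 'result_state' as result_state
  from
    databricks_job_run
  order by
    job_id,
    start_time desc
) last_runs
where
  result_state = 'FAILED';
```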
Get task details for each job run
This query is used to gain insights into the details of each task within job runs, including the duration, start and end times, and state. It's useful for understanding the performance and efficiency of tasks within job runs in a Databricks environment.
PostgreSQL:

```sql
select
  job_id,
  run_id,
  run_name,
  t ->> 'task_key' as task_key,
  t ->> 'cleanup_duration' as cleanup_duration,
  t ->> 'cluster_instance' as cluster_instance,
  t ->> 'start_time' as start_time,
  t ->> 'end_time' as end_time,
  t ->> 'existing_cluster_id' as existing_cluster_id,
  t ->> 'notebook_task' as notebook_task,
  t ->> 'state' as state,
  account_id
from
  databricks_job_run,
  jsonb_array_elements(tasks) as t
where
  tasks is not null;
```

SQLite:

```sql
select
  job_id,
  run_id,
  run_name,
  json_extract(t.value, '$.task_key') as task_key,
  json_extract(t.value, '$.cleanup_duration') as cleanup_duration,
  json_extract(t.value, '$.cluster_instance') as cluster_instance,
  json_extract(t.value, '$.start_time') as start_time,
  json_extract(t.value, '$.end_time') as end_time,
  json_extract(t.value, '$.existing_cluster_id') as existing_cluster_id,
  json_extract(t.value, '$.notebook_task') as notebook_task,
  json_extract(t.value, '$.state') as state,
  account_id
from
  databricks_job_run,
  json_each(tasks) as t
where
  tasks is not null;
```
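As a variation on the query above, you can derive each task's wall-clock duration. This sketch assumes the task-level start_time and end_time values are epoch milliseconds, as returned by the Databricks Jobs API:

```sql
-- Sketch (PostgreSQL): per-task duration, assuming start_time/end_time are epoch milliseconds.
select
  job_id,
  run_id,
  t ->> 'task_key' as task_key,
  (t ->> 'end_time')::bigint - (t ->> 'start_time')::bigint as task_duration_ms
from
  databricks_job_run,
  jsonb_array_elements(tasks) as t
where
  tasks is not null
  and t ->> 'end_time' is not null; -- skip tasks that have not finished
```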
List jobs that are waiting for retry
Explore which jobs are currently in a waiting state for a retry attempt. This is useful for identifying potential issues with certain jobs and understanding where intervention may be necessary.
PostgreSQL:

```sql
select
  run_id,
  run_name,
  job_id,
  original_attempt_run_id,
  attempt_number,
  creator_user_name,
  run_duration as run_duration_ms,
  account_id
from
  databricks_job_run
where
  state ->> 'life_cycle_state' = 'WAITING_FOR_RETRY';
```

SQLite:

```sql
select
  run_id,
  run_name,
  job_id,
  original_attempt_run_id,
  attempt_number,
  creator_user_name,
  run_duration as run_duration_ms,
  account_id
from
  databricks_job_run
where
  json_extract(state, '$.life_cycle_state') = 'WAITING_FOR_RETRY';
```
List retry job runs for a particular job
Analyze retry runs to understand the frequency of reruns for a specific Databricks job. This can help to pinpoint areas of instability and inform decisions on system optimization.
PostgreSQL:

```sql
select
  run_id,
  run_name,
  job_id,
  original_attempt_run_id,
  attempt_number,
  creator_user_name,
  run_duration as run_duration_ms,
  account_id
from
  databricks_job_run
where
  job_id = 572473586420586
  and original_attempt_run_id <> run_id;
```

SQLite:

```sql
select
  run_id,
  run_name,
  job_id,
  original_attempt_run_id,
  attempt_number,
  creator_user_name,
  run_duration as run_duration_ms,
  account_id
from
  databricks_job_run
where
  job_id = 572473586420586
  and original_attempt_run_id <> run_id;
```
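A small extension of the query above counts retry runs across all jobs, which can help rank jobs by instability:

```sql
-- Count retry runs per job (runs whose original attempt differs from the run itself).
select
  job_id,
  count(*) as retry_count
from
  databricks_job_run
where
  original_attempt_run_id <> run_id
group by
  job_id
order by
  retry_count desc;
```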
Schema for databricks_job_run
Name | Type | Operators | Description |
---|---|---|---|
_ctx | jsonb | | Steampipe context in JSON form, e.g. connection_name. |
account_id | text | | The Databricks Account ID in which the resource is located. |
attempt_number | bigint | | The sequence number of this run attempt for a triggered job run. |
cleanup_duration | bigint | | The time in milliseconds it took to terminate the cluster and clean up any associated artifacts. |
cluster_instance | jsonb | | The cluster instance that was used to run this job. |
cluster_spec | jsonb | | A snapshot of the job's cluster specification when this run was created. |
continuous | jsonb | | The continuous trigger that triggered this run. |
creator_user_name | text | | The user who created this job run. |
end_time | timestamp with time zone | | The time at which this run ended. |
execution_duration | bigint | | The time in milliseconds it took to execute the commands in the JAR or notebook. |
git_source | jsonb | | An optional specification for a remote repository containing the notebooks used by this job's notebook tasks. |
job_clusters | jsonb | | A list of job cluster specifications that can be shared and reused by tasks of this job. |
job_id | bigint | = | The canonical identifier of the job that contains this run. |
job_parameters | jsonb | | Job-level parameters used in the run. |
number_in_job | bigint | | A unique identifier for this job run. This is set to the same value as `run_id`. |
original_attempt_run_id | bigint | | If this run is a retry of a prior run attempt, this field contains the run_id of the original attempt. |
overriding_parameters | jsonb | | The parameters used for this run. |
repair_history | jsonb | | The repair history of this job run. |
run_duration | bigint | | The time in milliseconds it took the job run and all of its repairs to finish. |
run_id | bigint | = | The canonical identifier of the run. |
run_name | text | | An optional name for the run. |
run_page_url | text | | The URL to the detail page of the run. |
run_type | text | = | The type of run. |
schedule | jsonb | | The schedule that triggered this run. |
setup_duration | bigint | | The time in milliseconds it took to set up the cluster. |
start_time | timestamp with time zone | | The time at which this run started. |
state | jsonb | | The current state of the run. |
tasks | jsonb | | The list of tasks performed by the run. |
title | text | | The title of the resource. |
trigger | text | | The trigger that triggered this run. |
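The Operators column marks key columns that the plugin can use to narrow its calls to the Databricks API. For example, job_id and run_id support the = operator, so an exact-match lookup such as the sketch below (the run ID shown is hypothetical) typically fetches a single run rather than listing all runs:

```sql
-- Sketch: exact-match lookup on a key column; 12345 is a hypothetical run ID.
select
  run_id,
  run_name,
  job_id,
  state ->> 'result_state' as result_state
from
  databricks_job_run
where
  run_id = 12345;
```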
Export
This table is available as a standalone Exporter CLI. Steampipe exporters are stand-alone binaries that allow you to extract data using Steampipe plugins without a database.
You can download the tarball for your platform from the Releases page, but it is simplest to install them with the steampipe_export_installer.sh script:

```sh
/bin/sh -c "$(curl -fsSL https://steampipe.io/install/export.sh)" -- databricks
```
You can pass the configuration to the command with the --config argument:

```sh
steampipe_export_databricks --config '<your_config>' databricks_job_run
```
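As an illustration only, assuming the exporter supports the --select and --limit flags common to Steampipe exporters, and using a hypothetical connection profile, an invocation might look like:

```sh
# Hypothetical invocation; the profile name and flag usage are assumptions, not confirmed output.
steampipe_export_databricks \
  --config 'profile = "my_databricks_profile"' \
  --select "run_id,run_name,job_id" \
  --limit 10 \
  databricks_job_run
```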