Table: databricks_files_dbfs - Query Databricks DBFS Files using SQL
Databricks DBFS (Databricks File System) is a distributed file system mounted on Databricks clusters. It allows users to interact with object storage via standard file system operations and is commonly used to store data such as ETL outputs and machine learning models. DBFS provides an interface to access and manage data across all Databricks workspaces and to persist objects across cluster lifetimes.
Table Usage Guide
The `databricks_files_dbfs` table provides insights into DBFS files within Databricks. As a data scientist or data engineer, explore file-specific details through this table, including file paths, sizes, and types. Utilize it to manage and organize your data in Databricks, ensuring efficient data processing and analytics.
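Note that every example below scopes the listing with a qualifier: either `path_prefix` (to list the entries under a directory) or `path` (to fetch a single file or directory). Whether the plugin strictly requires one of these qualifiers is an assumption here, but unqualified scans of a large DBFS tree are best avoided either way. A minimal sketch:

```sql
-- Scope the listing to a directory with the path_prefix qualifier
select
  path,
  is_dir
from
  databricks_files_dbfs
where
  path_prefix = '/tmp/';
```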
Examples
Basic info
Explore the basic information of files stored in Databricks, including file size, modification time, and content. This can be particularly beneficial for understanding the file structure, tracking changes, and managing storage effectively.
```sql+postgres
select
  path,
  file_size,
  is_dir,
  modification_time,
  content
from
  databricks_files_dbfs
where
  path_prefix = '/';
```

```sql+sqlite
select
  path,
  file_size,
  is_dir,
  modification_time,
  content
from
  databricks_files_dbfs
where
  path_prefix = '/';
```
List all the directories in DBFS
Explore all directories in DBFS to gain insights into their modification times, which can be useful for understanding file system changes and data modifications.
```sql+postgres
select
  path,
  modification_time
from
  databricks_files_dbfs
where
  path_prefix = '/'
  and is_dir;
```

```sql+sqlite
select
  path,
  modification_time
from
  databricks_files_dbfs
where
  path_prefix = '/'
  and is_dir = 1;
```
List all the files in DBFS
Explore which files are stored in your DBFS by assessing their paths, sizes, and modification times. This could be useful in instances where you need to manage your storage space or track changes to files over time.
```sql+postgres
select
  path,
  file_size,
  modification_time
from
  databricks_files_dbfs
where
  path_prefix = '/'
  and not is_dir;
```

```sql+sqlite
select
  path,
  file_size,
  modification_time
from
  databricks_files_dbfs
where
  path_prefix = '/'
  and not is_dir;
```
List all the files in DBFS that are larger than 1MB
Explore which files in your Databricks File System (DBFS) are larger than 1MB. This can be useful for managing your storage and identifying files that might be taking up more space than necessary.
```sql+postgres
select
  path,
  file_size,
  modification_time
from
  databricks_files_dbfs
where
  path_prefix = '/'
  and not is_dir
  and file_size > 1000000;
```

```sql+sqlite
select
  path,
  file_size,
  modification_time
from
  databricks_files_dbfs
where
  path_prefix = '/'
  and not is_dir
  and file_size > 1000000;
```
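To go one step further and surface the biggest space consumers first, the same filter can be combined with ordering. This variation (a sketch in the PostgreSQL dialect, not part of the original examples) returns the ten largest files under the DBFS root:

```sql
-- Ten largest files under the DBFS root, largest first
select
  path,
  file_size
from
  databricks_files_dbfs
where
  path_prefix = '/'
  and not is_dir
order by
  file_size desc
limit 10;
```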
List all the files in DBFS that were modified in the past 7 days
Discover the segments that have seen recent changes by pinpointing the specific locations where files have been modified in the past week. This allows you to keep track of updates and changes, ensuring you're always working with the most recent data.
```sql+postgres
select
  path,
  file_size,
  is_dir,
  modification_time
from
  databricks_files_dbfs
where
  path_prefix = '/'
  and modification_time > now() - interval '7' day;
```

```sql+sqlite
select
  path,
  file_size,
  is_dir,
  modification_time
from
  databricks_files_dbfs
where
  path_prefix = '/'
  and modification_time > datetime('now', '-7 day');
```
Get contents of a particular file/directory
Explore the contents of a specific file or directory to understand its size, modification time, and data. This can be useful for auditing file changes, monitoring data usage, or troubleshooting issues related to file content.
```sql+postgres
select
  path,
  file_size,
  modification_time,
  content ->> 'bytes_read' as bytes_read,
  content ->> 'data' as data
from
  databricks_files_dbfs
where
  path = '/path/to/file/directory';
```

```sql+sqlite
select
  path,
  file_size,
  modification_time,
  json_extract(content, '$.bytes_read') as bytes_read,
  json_extract(content, '$.data') as data
from
  databricks_files_dbfs
where
  path = '/path/to/file/directory';
```
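The underlying DBFS read API returns file contents base64-encoded, so the `data` field above is base64 text rather than raw bytes (this decoding step is an assumption based on the Databricks API, not shown in the original examples). In the PostgreSQL dialect it can be decoded inline:

```sql
-- Decode the base64-encoded file body; convert_from assumes the file is UTF-8 text
select
  path,
  convert_from(decode(content ->> 'data', 'base64'), 'UTF8') as text_content
from
  databricks_files_dbfs
where
  path = '/path/to/file';
```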
Schema for databricks_files_dbfs
| Name | Type | Operators | Description |
|---|---|---|---|
| _ctx | jsonb | | Steampipe context in JSON form, e.g. connection_name. |
| content | jsonb | | The content of the file. |
| file_size | bigint | | The length of the file in bytes, or zero if the path is a directory. |
| is_dir | boolean | | True if the path is a directory. |
| modification_time | timestamp with time zone | | Last modification time of the given file/directory, in milliseconds since epoch. |
| path | text | = | The path of the file or directory. |
| path_prefix | text | = | The path prefix of the file or directory. |
| title | text | | The title of the resource. |
Export
This table is available as a standalone Exporter CLI. Steampipe exporters are stand-alone binaries that allow you to extract data using Steampipe plugins without a database.

You can download the tarball for your platform from the Releases page, but it is simplest to install them with the `steampipe_export_installer.sh` script:

```shell
/bin/sh -c "$(curl -fsSL https://steampipe.io/install/export.sh)" -- databricks
```
You can pass the configuration to the command with the `--config` argument:

```shell
steampipe_export_databricks --config '<your_config>' databricks_files_dbfs
```
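The exporter also accepts column and filter options. The flags below (`--select`, `--where`, `--output`) are assumptions based on the common steampipe_export CLI surface and may differ by version; verify with `steampipe_export_databricks --help`:

```shell
# Export selected columns as CSV, filtered to the DBFS root
# (flag names are assumptions; check --help for your build)
steampipe_export_databricks \
  --config '<your_config>' \
  --select "path,file_size,modification_time" \
  --where "path_prefix='/'" \
  --output csv \
  databricks_files_dbfs
```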