Table: aws_glue_crawler - Query AWS Glue Crawlers using SQL
The AWS Glue Crawler is a component of the AWS Glue service that automates the extraction, transformation, and loading (ETL) process. It traverses your data stores, identifies data formats, and suggests schemas and transformations. This enables you to categorize, search, and query metadata across your AWS environment.
Table Usage Guide
The aws_glue_crawler table in Steampipe provides you with information about crawlers within AWS Glue. This table allows you, as a DevOps engineer, to query crawler-specific details, including role, database, schedule, classifiers, and associated metadata. You can use this table to gather insights on crawlers, such as their run frequency, the database they are associated with, their status, and more. The schema outlines the various attributes of the Glue crawler for you, including the crawler ARN, creation date, last run time, and associated tags.
Examples
Basic info
Determine the status and creation details of your AWS Glue crawlers to better understand their function and manage them effectively. This can be particularly useful for identifying any crawlers that may require attention or modification.
select
  name,
  state,
  database_name,
  creation_time,
  description,
  recrawl_behavior
from
  aws_glue_crawler;
List running crawlers
Discover the segments that are currently operational within your AWS Glue Crawlers to understand which tasks are active and could be consuming resources. This could be useful for resource management and troubleshooting ongoing tasks.
select
  name,
  state,
  database_name,
  creation_time,
  description,
  recrawl_behavior
from
  aws_glue_crawler
where
  state = 'RUNNING';
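List crawlers by recrawl behavior
Identify crawlers that rescan the entire dataset on every run rather than only newly added folders or S3 event changes. This is an illustrative query against the recrawl_behavior values documented in the schema below (CRAWL_EVERYTHING, CRAWL_NEW_FOLDERS_ONLY, CRAWL_EVENT_MODE).
select
  name,
  state,
  database_name,
  recrawl_behavior
from
  aws_glue_crawler
where
  recrawl_behavior = 'CRAWL_EVERYTHING';
Check the status of the last crawl
Review the outcome of each crawler's most recent run. The last_crawl column is a JSONB document; the key names used here (Status, ErrorMessage) are an assumption based on the AWS Glue LastCrawlInfo structure, so adjust them to match the keys returned in your environment.
select
  name,
  -- Key names below assume the AWS Glue LastCrawlInfo structure.
  last_crawl ->> 'Status' as last_crawl_status,
  last_crawl ->> 'ErrorMessage' as last_crawl_error
from
  aws_glue_crawler;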
Schema for aws_glue_crawler
Name | Type | Operators | Description |
---|---|---|---|
_ctx | jsonb | | Steampipe context in JSON form. |
account_id | text | =, !=, ~~, ~~*, !~~, !~~* | The AWS Account ID in which the resource is located. |
akas | jsonb | | Array of globally unique identifier strings (also known as) for the resource. |
arn | text | | The ARN of the crawler. |
classifiers | jsonb | | A list of UTF-8 strings that specify the custom classifiers that are associated with the crawler. |
configuration | jsonb | | Crawler configuration information. |
crawl_elapsed_time | bigint | | If the crawler is running, contains the total time elapsed since the last crawl began. |
crawler_lineage_settings | text | | Specifies whether data lineage is enabled for the crawler. |
crawler_security_configuration | text | | The name of the SecurityConfiguration structure to be used by this crawler. |
creation_time | timestamp with time zone | | The time that the crawler was created. |
database_name | text | | The name of the database in which the crawler's output is stored. |
description | text | | A description of the crawler. |
lake_formation_configuration | jsonb | | Specifies whether the crawler should use Lake Formation credentials for the crawler instead of the IAM role credentials. |
last_crawl | jsonb | | The status of the last crawl, and potentially error information if an error occurred. |
last_updated | timestamp with time zone | | The time that the crawler was last updated. |
name | text | = | The name of the crawler. |
partition | text | | The AWS partition in which the resource is located (aws, aws-cn, or aws-us-gov). |
recrawl_behavior | text | | Specifies whether to crawl the entire dataset again or to crawl only folders that were added since the last crawler run. A value of CRAWL_EVERYTHING specifies crawling the entire dataset again. A value of CRAWL_NEW_FOLDERS_ONLY specifies crawling only folders that were added since the last crawler run. A value of CRAWL_EVENT_MODE specifies crawling only the changes identified by Amazon S3 events. |
region | text | | The AWS Region in which the resource is located. |
role | text | | The Amazon Resource Name (ARN) of an IAM role that's used to access customer resources, such as Amazon Simple Storage Service (Amazon S3) data. |
schedule | jsonb | | For scheduled crawlers, the schedule when the crawler runs. |
schema_change_policy | jsonb | | The policy that specifies update and delete behaviors for the crawler. |
sp_connection_name | text | =, !=, ~~, ~~*, !~~, !~~* | Steampipe connection name. |
sp_ctx | jsonb | | Steampipe context in JSON form. |
state | text | | Indicates whether the crawler is running or pending. |
table_prefix | text | | The prefix added to the names of tables that are created. |
targets | jsonb | | A collection of targets to crawl. |
title | text | | Title of the resource. |
version | bigint | | The version of the crawler. |
Export
This table is also available through the standalone Exporter CLI. Steampipe exporters are stand-alone binaries that allow you to extract data using Steampipe plugins without a database.
You can download the tarball for your platform from the Releases page, but it is simplest to install it with the steampipe_export_installer.sh script:
/bin/sh -c "$(curl -fsSL https://steampipe.io/install/export.sh)" -- aws
You can pass the configuration to the command with the --config argument:
steampipe_export_aws --config '<your_config>' aws_glue_crawler
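For example, a minimal sketch that restricts the export to a single region, assuming the AWS plugin's regions config argument and HCL-style --config syntax apply to your version (run steampipe_export_aws --help to confirm the supported flags):
# Assumes the AWS plugin's 'regions' config argument; adjust to your environment.
steampipe_export_aws --config 'regions = ["us-east-1"]' aws_glue_crawler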