Table: aws_emr_cluster - Query AWS Elastic MapReduce Cluster using SQL
AWS Elastic MapReduce (EMR) is a web service that makes it easy to process large amounts of data efficiently. EMR combines Hadoop processing with several AWS products to handle tasks such as web indexing, data mining, log file analysis, machine learning, scientific simulation, and data warehousing. Users can analyze their data interactively to reach insights faster.
Table Usage Guide
The aws_emr_cluster table in Steampipe provides information about clusters within AWS Elastic MapReduce (EMR). As a data engineer, you can query cluster-specific details, including cluster status, hardware and software configurations, VPC settings, and associated metadata, for example to review cluster states or verify VPC settings. The schema outlines the attributes of each EMR cluster, including the cluster ID, name, status, normalized instance hours, and associated tags.
Examples
Basic info
Explore the status and termination settings of your AWS EMR clusters to manage resources effectively. This helps in identifying clusters that are in use and those that can be terminated to save costs.
Postgres:
select id, cluster_arn, name, auto_terminate, status ->> 'State' as state, tags
from aws_emr_cluster;
SQLite:
select id, cluster_arn, name, auto_terminate, json_extract(status, '$.State') as state, tags
from aws_emr_cluster;
List clusters with auto-termination disabled
Find clusters that are running with auto-termination disabled, which can lead to unnecessary resource usage and increased costs.
Postgres:
select name, cluster_arn, auto_terminate
from aws_emr_cluster
where not auto_terminate;
SQLite:
select name, cluster_arn, auto_terminate
from aws_emr_cluster
where auto_terminate = 0;
List clusters which have terminated with errors
Identify clusters that terminated with errors, along with the reason reported for the state change, to speed up troubleshooting and problem resolution.
Postgres:
select id, name, status ->> 'State' as state, status -> 'StateChangeReason' ->> 'Message' as state_change_reason
from aws_emr_cluster
where status ->> 'State' = 'TERMINATED_WITH_ERRORS';
SQLite:
select id, name, json_extract(status, '$.State') as state, json_extract(json_extract(status, '$.StateChangeReason'), '$.Message') as state_change_reason
from aws_emr_cluster
where json_extract(status, '$.State') = 'TERMINATED_WITH_ERRORS';
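To see how common failed clusters are relative to other states, a grouped count can help. This is a sketch in Postgres syntax rather than one of the documented examples; it assumes only the status column and its State key, as used in the queries above.

```sql
-- Sketch: number of clusters in each state (Postgres syntax).
select
  status ->> 'State' as state,
  count(*) as cluster_count
from
  aws_emr_cluster
group by
  status ->> 'State'
order by
  cluster_count desc;
```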
Get application names and versions installed for each cluster
Determine the applications and their respective versions installed across different clusters. This is useful for tracking software versions and ensuring consistency across your cluster environment.
Postgres:
select name, cluster_arn, a ->> 'Name' as application_name, a ->> 'Version' as application_version
from aws_emr_cluster, jsonb_array_elements(applications) as a;
SQLite:
select name, cluster_arn, json_extract(a.value, '$.Name') as application_name, json_extract(a.value, '$.Version') as application_version
from aws_emr_cluster, json_each(applications) as a;
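To check version consistency across clusters, the per-cluster rows can be aggregated. The following is a sketch in Postgres syntax, not one of the documented examples; it assumes each element of applications carries Name and Version keys, as the queries above do.

```sql
-- Sketch: how many clusters run each application version (Postgres syntax).
select
  a ->> 'Name' as application_name,
  a ->> 'Version' as application_version,
  count(*) as cluster_count
from
  aws_emr_cluster,
  jsonb_array_elements(applications) as a
group by
  a ->> 'Name',
  a ->> 'Version'
order by
  application_name,
  application_version;
```

An application that appears with more than one version in the result is a candidate for alignment.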
List clusters with logging disabled
Find clusters that have no log URI configured, meaning logging is disabled. This is useful for identifying gaps in your data tracking and ensuring comprehensive monitoring across all clusters.
Postgres and SQLite:
select name, cluster_arn, log_uri
from aws_emr_cluster
where log_uri is null;
List clusters with logging enabled but log encryption is disabled
Explore clusters where logging is activated but without the added security layer of log encryption. This can help identify potential vulnerabilities in your data security practices.
Postgres and SQLite:
select name, cluster_arn, log_uri, log_encryption_kms_key_id
from aws_emr_cluster
where log_uri is not null and log_encryption_kms_key_id is null;
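The same filtering pattern extends to other columns in the schema. As a sketch (not one of the documented examples), the query below lists clusters that lack termination protection or have no security configuration attached. It assumes that an unset security configuration is reported as null in the security_configuration column, which this page does not confirm.

```sql
-- Sketch: clusters without termination protection or a security configuration.
-- Assumption: an unset security configuration appears as null.
select
  name,
  cluster_arn,
  termination_protected,
  security_configuration
from
  aws_emr_cluster
where
  not termination_protected
  or security_configuration is null;
```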
Query examples
- ec2_amis_for_emr_cluster
- emr_cluster_applications
- emr_cluster_auto_termination
- emr_cluster_by_account
- emr_cluster_by_region
- emr_cluster_by_state
- emr_cluster_count
- emr_cluster_ec2_instance_attributes
- emr_cluster_input
- emr_cluster_instance
- emr_cluster_log_encryption
- emr_cluster_logging
- emr_cluster_logging_disbaled_count
- emr_cluster_logging_encryption_disabled_count
- emr_cluster_overview
- emr_cluster_state
- emr_cluster_tags
- emr_cluster_termination_protection_disabled_count
- emr_clusters_for_iam_role
- iam_roles_for_emr_cluster
- s3_buckets_for_emr_cluster
Control examples
- All Controls > EMR > EMR cluster local disks should be encrypted with CMK
- All Controls > EMR > EMR clusters client side encryption (CSE CMK) enabled with CMK
- All Controls > EMR > EMR clusters encryption at rest should be enabled
- All Controls > EMR > EMR clusters encryption in transit should be enabled
- All Controls > EMR > EMR clusters local disk encryption should be enabled
- All Controls > EMR > EMR clusters server side encryption (SSE KMS) enabled with KMS
- All Controls > EMR > EMR clusters should have security configuration enabled
- AWS Foundational Security Best Practices > EMR > 1 Amazon EMR cluster primary nodes should not have public IP addresses
- EMR cluster Kerberos should be enabled
- EMR cluster master nodes should not have public IP addresses
Schema for aws_emr_cluster
Name | Type | Operators | Description |
---|---|---|---|
_ctx | jsonb | | Steampipe context in JSON form. |
account_id | text | =, !=, ~~, ~~*, !~~, !~~* | The AWS Account ID in which the resource is located. |
akas | jsonb | | Array of globally unique identifier strings (also known as) for the resource. |
applications | jsonb | | The applications installed on this cluster. |
auto_scaling_role | text | | An IAM role for automatic scaling policies. |
auto_terminate | boolean | | Specifies whether the cluster should terminate after completing all steps. |
cluster_arn | text | | The Amazon Resource Name of the cluster. |
configurations | jsonb | | Applies only to Amazon EMR releases 4.x and later. The list of Configurations supplied to the EMR cluster. |
custom_ami_id | text | | Available only in Amazon EMR version 5.7.0 and later. The ID of a custom Amazon EBS-backed Linux AMI if the cluster uses a custom AMI. |
ebs_root_volume_iops | bigint | | The IOPS of the Amazon EBS root device volume of the Linux AMI that is used for each Amazon EC2 instance. |
ebs_root_volume_size | text | | The size of the Amazon EBS root device volume of the Linux AMI that is used for each EC2 instance, in GiB. Available in Amazon EMR version 4.x and later. |
ebs_root_volume_throughput | bigint | | The throughput, in MiB/s, of the Amazon EBS root device volume of the Linux AMI that is used for each Amazon EC2 instance. |
ec2_instance_attributes | jsonb | | Provides information about the EC2 instances in a cluster grouped by category. |
id | text | = | The unique identifier for the cluster. |
instance_collection_type | text | | The instance group configuration of the cluster. |
kerberos_attributes | jsonb | | Attributes for Kerberos configuration when Kerberos authentication is enabled using a security configuration. |
log_encryption_kms_key_id | text | | The AWS KMS customer master key (CMK) used for encrypting log files. This attribute is only available with EMR version 5.30.0 and later, excluding EMR 6.0.0. |
log_uri | text | | The path to the Amazon S3 location where logs for this cluster are stored. |
master_public_dns_name | text | | The DNS name of the master node. |
name | text | | The name of the cluster. |
normalized_instance_hours | bigint | | An approximation of the cost of the cluster, represented in m1.small/hours. |
os_release_label | text | | The Amazon Linux release specified in a cluster launch RunJobFlow request. |
outpost_arn | text | | The Amazon Resource Name (ARN) of the Outpost where the cluster is launched. |
partition | text | | The AWS partition in which the resource is located (aws, aws-cn, or aws-us-gov). |
placement_groups | jsonb | | Placement group configured for an Amazon EMR cluster. |
region | text | | The AWS Region in which the resource is located. |
release_label | text | | The Amazon EMR release label, which determines the version of open-source application packages installed on the cluster. |
repo_upgrade_on_boot | text | | Applies only when CustomAmiID is used. Specifies the type of updates that are applied from the Amazon Linux AMI package repositories when an instance boots using the AMI. |
requested_ami_version | text | | The AMI version requested for this cluster. |
running_ami_version | text | | The AMI version running on this cluster. |
scale_down_behavior | text | | The way that individual Amazon EC2 instances terminate when an automatic scale-in activity occurs or an instance group is resized. |
security_configuration | text | | The name of the security configuration applied to the cluster. |
service_role | text | | The IAM role that will be assumed by the Amazon EMR service to access AWS resources on your behalf. |
sp_connection_name | text | =, !=, ~~, ~~*, !~~, !~~* | Steampipe connection name. |
sp_ctx | jsonb | | Steampipe context in JSON form. |
state | text | = | The current state of the cluster. |
status | jsonb | | The current status details about the cluster. |
step_concurrency_level | bigint | | Specifies the number of steps that can be executed concurrently. |
tags | jsonb | | A map of tags for the resource. |
tags_src | jsonb | | A list of tags associated with a cluster. |
termination_protected | boolean | | Indicates whether Amazon EMR will lock the cluster to prevent the EC2 instances from being terminated by an API call or user intervention, or in the event of a cluster error. |
title | text | | Title of the resource. |
unhealthy_node_replacement | boolean | | Indicates whether Amazon EMR should gracefully replace Amazon EC2 core instances that have degraded within the cluster. |
visible_to_all_users | boolean | | Indicates whether the cluster is visible to all IAM users of the AWS account associated with the cluster. |
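A minimal sketch of filtering on a column that lists an operator. It assumes, following Steampipe's usual convention rather than anything stated on this page, that columns with entries under Operators accept those comparisons as filters the plugin can apply while fetching data. The query uses =, the only operator listed for state; 'RUNNING' is one of the standard EMR cluster states.

```sql
-- Sketch: restrict the listing to running clusters via the state column.
select
  id,
  name,
  state,
  normalized_instance_hours
from
  aws_emr_cluster
where
  state = 'RUNNING';
```

The id column lists the same operator, so a single cluster can be looked up in the same way with where id = '<cluster_id>'.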
Export
This table is available as a standalone Exporter CLI. Steampipe exporters are standalone binaries that let you extract data using Steampipe plugins without a database.
You can download the tarball for your platform from the Releases page, but it is simplest to install exporters with the steampipe_export_installer.sh script:
/bin/sh -c "$(curl -fsSL https://steampipe.io/install/export.sh)" -- aws
You can pass the configuration to the command with the --config argument:
steampipe_export_aws --config '<your_config>' aws_emr_cluster