Practice questions for the AWS DEA-C01 (Data Engineer Associate) exam, Chapter 1.
-
Q1. A data engineer is configuring an AWS Glue job to read data from an Amazon S3 bucket. The data engineer has set up the necessary AWS Glue connection details and an associated IAM role. However, when the data engineer attempts to run the AWS Glue job, the data engineer receives an error message that indicates that there are problems with the Amazon S3 VPC gateway endpoint.
The data engineer must resolve the error and connect the AWS Glue job to the S3 bucket.
Which solution will meet this requirement?
- A. Update the AWS Glue security group to allow inbound traffic from the Amazon S3 VPC gateway endpoint.
- B. Configure an S3 bucket policy to explicitly grant the AWS Glue job permissions to access the S3 bucket.
- C. Review the AWS Glue job code to ensure that the AWS Glue connection details include a fully qualified domain name.
- D. Verify that the VPC's route table includes inbound and outbound routes for the Amazon S3 VPC gateway endpoint.
-
Q2. A media company wants to improve a system that recommends media content to customers based on user behavior and preferences. To improve the recommendation system, the company needs to incorporate insights from third-party datasets into the company's existing analytics platform.
The company wants to minimize the effort and time required to incorporate third-party datasets.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Use API calls to access and integrate third-party datasets from AWS Data Exchange.
- B. Use API calls to access and integrate third-party datasets from AWS DataSync.
- C. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from AWS CodeCommit repositories.
- D. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from Amazon Elastic Container Registry (Amazon ECR).
-
Q3. A data engineer maintains custom Python scripts that perform a data formatting process that many AWS Lambda functions use. When the data engineer needs to modify the Python scripts, the data engineer must manually update all the Lambda functions.
The data engineer requires a less manual way to update the Lambda functions.
Which solution will meet this requirement?
- A. Store a pointer to the custom Python scripts in the execution context object in a shared Amazon S3 bucket.
- B. Package the custom Python scripts into Lambda layers. Apply the Lambda layers to the Lambda functions.
- C. Store a pointer to the custom Python scripts in environment variables in a shared Amazon S3 bucket.
- D. Assign the same alias to each Lambda function. Call each Lambda function by specifying the function's alias.
-
Q4. A company created an extract, transform, and load (ETL) data pipeline in AWS Glue. A data engineer must crawl a table that is in Microsoft SQL Server. The data engineer needs to extract, transform, and load the output of the crawl to an Amazon S3 bucket. The data engineer also must orchestrate the data pipeline.
Which AWS service or feature will meet these requirements MOST cost-effectively?
- A. AWS Step Functions
- B. AWS Glue workflows
- C. AWS Glue Studio
- D. Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
-
Q5. A financial services company stores financial data in Amazon Redshift. A data engineer wants to run real-time queries on the financial data to support a web-based trading application. The data engineer wants to run the queries from within the trading application.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Establish WebSocket connections to Amazon Redshift.
- B. Use the Amazon Redshift Data API.
- C. Set up Java Database Connectivity (JDBC) connections to Amazon Redshift.
- D. Store frequently accessed data in Amazon S3. Use Amazon S3 Select to run the queries.
-
Q6. A company uses Amazon Athena for one-time queries against data that is in Amazon S3. The company has several use cases. The company must implement permission controls to separate query processes and access to query history among users, teams, and applications that are in the same AWS account.
Which solution will meet these requirements?
- A. Create an S3 bucket for each use case. Create an S3 bucket policy that grants permissions to appropriate individual IAM users. Apply the S3 bucket policy to the S3 bucket.
- B. Create an Athena workgroup for each use case. Apply tags to the workgroup. Create an IAM policy that uses the tags to apply appropriate permissions to the workgroup.
- C. Create an IAM role for each use case. Assign appropriate permissions to the role for each use case. Associate the role with Athena.
- D. Create an AWS Glue Data Catalog resource policy that grants permissions to appropriate individual IAM users for each use case. Apply the resource policy to the specific tables that Athena uses.
-
Q7. A data engineer needs to schedule a workflow that runs a set of AWS Glue jobs every day. The data engineer does not require the Glue jobs to run or finish at a specific time.
Which solution will run the Glue jobs in the MOST cost-effective way?
- A. Choose the FLEX execution class in the Glue job properties.
- B. Use the Spot Instance type in Glue job properties.
- C. Choose the STANDARD execution class in the Glue job properties.
- D. Choose the latest version in the GlueVersion field in the Glue job properties.
-
Q8. A data engineer needs to create an AWS Lambda function that converts the format of data from .csv to Apache Parquet. The Lambda function must run only if a user uploads a .csv file to an Amazon S3 bucket.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Create an S3 event notification that has an event type of s3:ObjectCreated:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
- B. Create an S3 event notification that has an event type of s3:ObjectTagging:* for objects that have a tag set to .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
- C. Create an S3 event notification that has an event type of s3:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
- D. Create an S3 event notification that has an event type of s3:ObjectCreated:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set an Amazon Simple Notification Service (Amazon SNS) topic as the destination for the event notification. Subscribe the Lambda function to the SNS topic.
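For reference, the options above all describe an S3 event notification with an event type, an optional key filter, and a destination. A minimal sketch of that configuration shape, as a plain Python dictionary in the form accepted by boto3's `put_bucket_notification_configuration`, might look like the following. The function name `build_csv_notification` and the Lambda ARN are hypothetical placeholders, and nothing here calls AWS.

```python
def build_csv_notification(lambda_arn: str, suffix: str = ".csv") -> dict:
    """Build a notification configuration that targets a Lambda function.

    The dictionary follows the NotificationConfiguration shape used by
    boto3's S3 put_bucket_notification_configuration call.
    """
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                # Fire on any object-creation event (PUT, POST, COPY, multipart).
                "Events": ["s3:ObjectCreated:*"],
                # Only keys that end with the given suffix generate a notification.
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "suffix", "Value": suffix}]
                    }
                },
            }
        ]
    }


# Hypothetical ARN, used only to show the resulting structure.
config = build_csv_notification(
    "arn:aws:lambda:us-east-1:123456789012:function:csv-to-parquet"
)
```

To apply such a configuration, the dictionary would be passed as the `NotificationConfiguration` argument for a specific bucket, and the Lambda function would also need a resource-based permission that lets Amazon S3 invoke it.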
-
Q9. A data engineer needs Amazon Athena queries to finish faster. The data engineer notices that all the files the Athena queries use are currently stored in uncompressed .csv format. The data engineer also notices that users perform most queries by selecting a specific column.
Which solution will MOST speed up the Athena query performance?
- A. Change the data format from .csv to JSON format. Apply Snappy compression.
- B. Compress the .csv files by using Snappy compression.
- C. Change the data format from .csv to Apache Parquet. Apply Snappy compression.
- D. Compress the .csv files by using gzip compression.
-
Q10. A manufacturing company collects sensor data from its factory floor to monitor and enhance operational efficiency. The company uses Amazon Kinesis Data Streams to publish the data that the sensors collect to a data stream. Then Amazon Kinesis Data Firehose writes the data to an Amazon S3 bucket.
The company needs to display a real-time view of operational efficiency on a large screen in the manufacturing facility.
Which solution will meet these requirements with the LOWEST latency?
- A. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Use a connector for Apache Flink to write data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.
- B. Configure the S3 bucket to send a notification to an AWS Lambda function when any new object is created. Use the Lambda function to publish the data to Amazon Aurora. Use Aurora as a source to create an Amazon QuickSight dashboard.
- C. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Create a new Data Firehose delivery stream to publish data directly to an Amazon Timestream database. Use the Timestream database as a source to create an Amazon QuickSight dashboard.
- D. Use AWS Glue bookmarks to read sensor data from the S3 bucket in real time. Publish the data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.
-
Q11. A company stores daily records of the financial performance of investment portfolios in .csv format in an Amazon S3 bucket. A data engineer uses AWS Glue crawlers to crawl the S3 data.
The data engineer must make the S3 data accessible daily in the AWS Glue Data Catalog.
Which solution will meet these requirements?
- A. Create an IAM role that includes the AmazonS3FullAccess policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Create a daily schedule to run the crawler. Configure the output destination to a new path in the existing S3 bucket.
- B. Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Create a daily schedule to run the crawler. Specify a database name for the output.
- C. Create an IAM role that includes the AmazonS3FullAccess policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Allocate data processing units (DPUs) to run the crawler every day. Specify a database name for the output.
- D. Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Allocate data processing units (DPUs) to run the crawler every day. Configure the output destination to a new path in the existing S3 bucket.
-
Q12. A company loads transaction data for each day into Amazon Redshift tables at the end of each day. The company wants to have the ability to track which tables have been loaded and which tables still need to be loaded.
A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table. The data engineer creates an AWS Lambda function to publish the details of the load statuses to DynamoDB.
How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB table?
- A. Use a second Lambda function to invoke the first Lambda function based on Amazon CloudWatch events.
- B. Use the Amazon Redshift Data API to publish an event to Amazon EventBridge. Configure an EventBridge rule to invoke the Lambda function.
- C. Use the Amazon Redshift Data API to publish a message to an Amazon Simple Queue Service (Amazon SQS) queue. Configure the SQS queue to invoke the Lambda function.
- D. Use a second Lambda function to invoke the first Lambda function based on AWS CloudTrail events.
-
Q13. A data engineer needs to securely transfer 5 TB of data from an on-premises data center to an Amazon S3 bucket. Approximately 5% of the data changes every day. Updates to the data need to be regularly propagated to the S3 bucket. The data includes files that are in multiple formats. The data engineer needs to automate the transfer process and must schedule the process to run periodically.
Which AWS service should the data engineer use to transfer the data in the MOST operationally efficient way?
- A. AWS DataSync
- B. AWS Glue
- C. AWS Direct Connect
- D. Amazon S3 Transfer Acceleration
-
Q14. A company uses an on-premises Microsoft SQL Server database to store financial transaction data. The company migrates the transaction data from the on-premises database to AWS at the end of each month. The company has noticed that the cost to migrate data from the on-premises database to an Amazon RDS for SQL Server database has increased recently.
The company requires a cost-effective solution to migrate the data to AWS. The solution must cause minimal downtime for the applications that access the database.
Which AWS service should the company use to meet these requirements?
- A. AWS Lambda
- B. AWS Database Migration Service (AWS DMS)
- C. AWS Direct Connect
- D. AWS DataSync
-
Q15. A company uses an Amazon Redshift cluster that runs on RA3 nodes. The company wants to scale read and write capacity to meet demand. A data engineer needs to identify a solution that will turn on concurrency scaling.
Which solution will meet this requirement?
- A. Turn on concurrency scaling in workload management (WLM) for Redshift Serverless workgroups.
- B. Turn on concurrency scaling at the workload management (WLM) queue level in the Redshift cluster.
- C. Turn on concurrency scaling in the settings during the creation of any new Redshift cluster.
- D. Turn on concurrency scaling for the daily usage quota for the Redshift cluster.
-
Q16. A company is migrating on-premises workloads to AWS. The company wants to reduce overall operational overhead. The company also wants to explore serverless options.
The company's current workloads use Apache Pig, Apache Oozie, Apache Spark, Apache HBase, and Apache Flink. The on-premises workloads process petabytes of data in seconds. The company must maintain similar or better performance after the migration to AWS.
Which extract, transform, and load (ETL) service will meet these requirements?
- A. AWS Glue
- B. Amazon EMR
- C. AWS Lambda
- D. Amazon Redshift
-
Q17. A data engineer must use AWS services to ingest a dataset into an Amazon S3 data lake. The data engineer profiles the dataset and discovers that the dataset contains personally identifiable information (PII). The data engineer must implement a solution to profile the dataset and obfuscate the PII.
Which solution will meet this requirement with the LEAST operational effort?
- A. Use an Amazon Kinesis Data Firehose delivery stream to process the dataset. Create an AWS Lambda transform function to identify the PII. Use an AWS SDK to obfuscate the PII. Set the S3 data lake as the target for the delivery stream.
- B. Use the Detect PII transform in AWS Glue Studio to identify the PII. Obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.
- C. Use the Detect PII transform in AWS Glue Studio to identify the PII. Create a rule in AWS Glue Data Quality to obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.
- D. Ingest the dataset into Amazon DynamoDB. Create an AWS Lambda function to identify and obfuscate the PII in the DynamoDB table and to transform the data. Use the same Lambda function to ingest the data into the S3 data lake.
-
Q18. A company maintains multiple extract, transform, and load (ETL) workflows that ingest data from the company's operational databases into an Amazon S3 based data lake. The ETL workflows use AWS Glue and Amazon EMR to process data.
The company wants to improve the existing architecture to provide automated orchestration and to require minimal manual effort.
Which solution will meet these requirements with the LEAST operational overhead?
- A. AWS Glue workflows
- B. AWS Step Functions tasks
- C. AWS Lambda functions
- D. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) workflows
-
Q19. A company currently stores all of its data in Amazon S3 by using the S3 Standard storage class.
A data engineer examined data access patterns to identify trends. During the first 6 months, most data files are accessed several times each day. Between 6 months and 2 years, most data files are accessed once or twice each month. After 2 years, data files are accessed only once or twice each year.
The data engineer needs to use an S3 Lifecycle policy to develop new data storage rules. The new storage solution must continue to provide high availability.
Which solution will meet these requirements in the MOST cost-effective way?
- A. Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.
- B. Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.
- C. Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.
- D. Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.
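Each option above maps onto an S3 Lifecycle rule with two transitions. As a rough sketch, the four candidate configurations can be built as plain dictionaries in the shape boto3's `put_bucket_lifecycle_configuration` expects. The day counts (180 for 6 months, 730 for 2 years), the rule ID, and the helper names are assumptions made for illustration; the storage class strings are the S3 API values, where `GLACIER` denotes S3 Glacier Flexible Retrieval.

```python
# Assumed day equivalents for the periods named in the question.
SIX_MONTHS_DAYS = 180
TWO_YEARS_DAYS = 730

# S3 API storage class values for the first and second transition in each option.
OPTION_CLASSES = {
    "A": ("ONEZONE_IA", "GLACIER"),
    "B": ("STANDARD_IA", "GLACIER"),
    "C": ("STANDARD_IA", "DEEP_ARCHIVE"),
    "D": ("ONEZONE_IA", "DEEP_ARCHIVE"),
}


def build_lifecycle_configuration(first_class: str, second_class: str) -> dict:
    """Return a lifecycle configuration with two age-based transitions."""
    return {
        "Rules": [
            {
                "ID": "tiering-by-age",      # hypothetical rule name
                "Filter": {"Prefix": ""},    # empty prefix: apply to every object
                "Status": "Enabled",
                "Transitions": [
                    {"Days": SIX_MONTHS_DAYS, "StorageClass": first_class},
                    {"Days": TWO_YEARS_DAYS, "StorageClass": second_class},
                ],
            }
        ]
    }


# One configuration per answer option, keyed by option letter.
configs = {
    letter: build_lifecycle_configuration(*classes)
    for letter, classes in OPTION_CLASSES.items()
}
```

The sketch only shows how the options differ structurally; choosing among them still depends on the availability and retrieval requirements stated in the question.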
-
Q20. A company maintains an Amazon Redshift provisioned cluster that the company uses for extract, transform, and load (ETL) operations to support critical analysis tasks. A sales team within the company maintains a Redshift cluster that the sales team uses for business intelligence (BI) tasks.
The sales team recently requested access to the data that is in the ETL Redshift cluster so the team can perform weekly summary analysis tasks. The sales team needs to join data from the ETL cluster with data that is in the sales team's BI cluster.
The company needs a solution that will share the ETL cluster data with the sales team without interrupting the critical analysis tasks. The solution must minimize usage of the computing resources of the ETL cluster.
Which solution will meet these requirements?
- A. Set up the sales team BI cluster as a consumer of the ETL cluster by using Redshift data sharing.
- B. Create materialized views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
- C. Create database views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
- D. Unload a copy of the data from the ETL cluster to an Amazon S3 bucket every week. Create an Amazon Redshift Spectrum table based on the content of the ETL cluster.
-
Q21. A data engineer needs to join data from multiple sources to perform a one-time analysis job. The data is stored in Amazon DynamoDB, Amazon RDS, Amazon Redshift, and Amazon S3.
Which solution will meet this requirement MOST cost-effectively?
- A. Use an Amazon EMR provisioned cluster to read from all sources. Use Apache Spark to join the data and perform the analysis.
- B. Copy the data from DynamoDB, Amazon RDS, and Amazon Redshift into Amazon S3. Run Amazon Athena queries directly on the S3 files.
- C. Use Amazon Athena Federated Query to join the data from all data sources.
- D. Use Redshift Spectrum to query data from DynamoDB, Amazon RDS, and Amazon S3 directly from Redshift.
-
Q22. A company wants to implement real-time analytics capabilities. The company wants to use Amazon Kinesis Data Streams and Amazon Redshift to ingest and process streaming data at the rate of several gigabytes per second. The company wants to derive near real-time insights by using existing business intelligence (BI) and analytics tools.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Use Kinesis Data Streams to stage data in Amazon S3. Use the COPY command to load data from Amazon S3 directly into Amazon Redshift to make the data immediately available for real-time analysis.
- B. Access the data from Kinesis Data Streams by using SQL queries. Create materialized views directly on top of the stream. Refresh the materialized views regularly to query the most recent stream data.
- C. Create an external schema in Amazon Redshift to map the data from Kinesis Data Streams to an Amazon Redshift object. Create a materialized view to read data from the stream. Set the materialized view to auto refresh.
- D. Connect Kinesis Data Streams to Amazon Kinesis Data Firehose. Use Kinesis Data Firehose to stage the data in Amazon S3. Use the COPY command to load the data from Amazon S3 to a table in Amazon Redshift.
-
Q23. A data engineer needs to use AWS Step Functions to design an orchestration workflow. The workflow must parallel process a large collection of data files and apply a specific transformation to each file.
Which Step Functions state should the data engineer use to meet these requirements?
- A. Parallel state
- B. Choice state
- C. Map state
- D. Wait state
-
Q24. A company is migrating a legacy application to an Amazon S3 based data lake. A data engineer reviewed data that is associated with the legacy application. The data engineer found that the legacy data contained some duplicate information.
The data engineer must identify and remove duplicate information from the legacy application data.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Write a custom extract, transform, and load (ETL) job in Python. Use the DataFrame.drop_duplicates() function by importing the Pandas library to perform data deduplication.
- B. Write an AWS Glue extract, transform, and load (ETL) job. Use the FindMatches machine learning (ML) transform to transform the data to perform data deduplication.
- C. Write a custom extract, transform, and load (ETL) job in Python. Import the Python dedupe library. Use the dedupe library to perform data deduplication.
- D. Write an AWS Glue extract, transform, and load (ETL) job. Import the Python dedupe library. Use the dedupe library to perform data deduplication.
-
Q25. A company has a frontend ReactJS website that uses Amazon API Gateway to invoke REST APIs. The APIs perform the functionality of the website. A data engineer needs to write a Python script that can be occasionally invoked through API Gateway. The code must return results to API Gateway.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Deploy a custom Python script on an Amazon Elastic Container Service (Amazon ECS) cluster.
- B. Create an AWS Lambda Python function with provisioned concurrency.
- C. Deploy a custom Python script that can integrate with API Gateway on Amazon Elastic Kubernetes Service (Amazon EKS).
- D. Create an AWS Lambda function. Ensure that the function is warm by scheduling an Amazon EventBridge rule to invoke the Lambda function every 5 minutes by using mock events.
-
Q26. A company has a production AWS account that runs company workloads. The company's security team created a security AWS account to store and analyze security logs from the production AWS account. The security logs in the production AWS account are stored in Amazon CloudWatch Logs.
The company needs to use Amazon Kinesis Data Streams to deliver the security logs to the security AWS account.
Which solution will meet these requirements?
- A. Create a destination data stream in the production AWS account. In the security AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the production AWS account.
- B. Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the security AWS account.
- C. Create a destination data stream in the production AWS account. In the production AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the security AWS account.
-
Q27. A company uses Amazon S3 to store semi-structured data in a transactional data lake. Some of the data files are small, but other data files are tens of terabytes.
A data engineer must perform a change data capture (CDC) operation to identify changed data from the data source. The data source sends a full snapshot as a JSON file every day and ingests the changed data into the data lake.
Which solution will capture the changed data MOST cost-effectively?
- A. Create an AWS Lambda function to identify the changes between the previous data and the current data. Configure the Lambda function to ingest the changes into the data lake.
- B. Ingest the data into Amazon RDS for MySQL. Use AWS Database Migration Service (AWS DMS) to write the changed data to the data lake.
- C. Use an open source data lake format to merge the data source with the S3 data lake to insert the new data and update the existing data.
- D. Ingest the data into an Amazon Aurora MySQL DB instance that runs Aurora Serverless. Use AWS Database Migration Service (AWS DMS) to write the changed data to the data lake.
-
Q28. A data engineer must manage the ingestion of real-time streaming data into AWS. The data engineer wants to perform real-time analytics on the incoming streaming data by using time-based aggregations over a window of up to 30 minutes. The data engineer needs a solution that is highly fault tolerant.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Use an AWS Lambda function that includes both the business and the analytics logic to perform time-based aggregations over a window of up to 30 minutes for the data in Amazon Kinesis Data Streams.
- B. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to analyze the data that might occasionally contain duplicates by using multiple types of aggregations.
- C. Use an AWS Lambda function that includes both the business and the analytics logic to perform aggregations for a tumbling window of up to 30 minutes, based on the event timestamp.
- D. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to analyze the data by using multiple types of aggregations to perform time-based analytics over a window of up to 30 minutes.
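To make the term concrete, a tumbling window assigns every event to exactly one fixed-size, non-overlapping time bucket based on its timestamp. The sketch below shows only that bucketing arithmetic in plain Python for a 30-minute window. It is not a streaming implementation: it keeps no durable state and does nothing about fault tolerance, late data, or duplicates. The function name and sample events are made up for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 30 * 60  # 30-minute tumbling window


def tumbling_window_sums(events, window_seconds=WINDOW_SECONDS):
    """Sum values per tumbling window.

    events: iterable of (event_time_epoch_seconds, value) pairs.
    Returns a dict that maps each window's start time to the sum of its values.
    """
    totals = defaultdict(float)
    for event_time, value in events:
        # Round the timestamp down to the start of its window.
        window_start = event_time - (event_time % window_seconds)
        totals[window_start] += value
    return dict(totals)


# Sample events: two in the first window, two in the second, one in the third.
events = [(0, 1.0), (1799, 2.0), (1800, 5.0), (3599, 1.0), (3600, 4.0)]
result = tumbling_window_sums(events)
print(result)  # {0: 3.0, 1800: 6.0, 3600: 4.0}
```

In a real pipeline the window state must survive restarts and scale across shards, which is the operational concern the question asks about.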
-
Q29. A company is planning to upgrade its Amazon Elastic Block Store (Amazon EBS) General Purpose SSD storage from gp2 to gp3. The company wants to prevent any interruptions in its Amazon EC2 instances that will cause data loss during the migration to the upgraded storage.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Create snapshots of the gp2 volumes. Create new gp3 volumes from the snapshots. Attach the new gp3 volumes to the EC2 instances.
- B. Create new gp3 volumes. Gradually transfer the data to the new gp3 volumes. When the transfer is complete, mount the new gp3 volumes to the EC2 instances to replace the gp2 volumes.
- C. Change the volume type of the existing gp2 volumes to gp3. Enter new values for volume size, IOPS, and throughput.
- D. Use AWS DataSync to create new gp3 volumes. Transfer the data from the original gp2 volumes to the new gp3 volumes.
-
Q30. A company is migrating its database servers from Amazon EC2 instances that run Microsoft SQL Server to Amazon RDS for Microsoft SQL Server DB instances. The company's analytics team must export large data elements every day until the migration is complete. The data elements are the result of SQL joins across multiple tables. The data must be in Apache Parquet format. The analytics team must store the data in Amazon S3.
Which solution will meet these requirements in the MOST operationally efficient way?
- A. Create a view in the EC2 instance-based SQL Server databases that contains the required data elements. Create an AWS Glue job that selects the data directly from the view and transfers the data in Parquet format to an S3 bucket. Schedule the AWS Glue job to run every day.
- B. Schedule SQL Server Agent to run a daily SQL query that selects the desired data elements from the EC2 instance-based SQL Server databases. Configure the query to direct the output .csv objects to an S3 bucket. Create an S3 event that invokes an AWS Lambda function to transform the output format from .csv to Parquet.
- C. Use a SQL query to create a view in the EC2 instance-based SQL Server databases that contains the required data elements. Create and run an AWS Glue crawler to read the view. Create an AWS Glue job that retrieves the data and transfers the data in Parquet format to an S3 bucket. Schedule the AWS Glue job to run every day.
- D. Create an AWS Lambda function that queries the EC2 instance-based databases by using Java Database Connectivity (JDBC). Configure the Lambda function to retrieve the required data, transform the data into Parquet format, and transfer the data into an S3 bucket. Use Amazon EventBridge to schedule the Lambda function to run every day.
-
Q31. A data engineering team is using an Amazon Redshift data warehouse for operational reporting. The team wants to prevent performance issues that might result from long-running queries. A data engineer must choose a system table in Amazon Redshift to record anomalies when a query optimizer identifies conditions that might indicate performance issues.
Which table views should the data engineer use to meet this requirement?
- A. STL_USAGE_CONTROL
- B. STL_ALERT_EVENT_LOG
- C. STL_QUERY_METRICS
- D. STL_PLAN_INFO
-
Q32. A data engineer must ingest a source of structured data that is in .csv format into an Amazon S3 data lake. The .csv files contain 15 columns. Data analysts need to run Amazon Athena queries on one or two columns of the dataset. The data analysts rarely query the entire file.
Which solution will meet these requirements MOST cost-effectively?
- A. Use an AWS Glue PySpark job to ingest the source data into the data lake in .csv format.
- B. Create an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source. Configure the job to ingest the data into the data lake in JSON format.
- C. Use an AWS Glue PySpark job to ingest the source data into the data lake in Apache Avro format.
- D. Create an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source. Configure the job to write the data into the data lake in Apache Parquet format.
-
Q33. A company is developing an application that runs on Amazon EC2 instances. Currently, the data that the application generates is temporary. However, the company needs to persist the data, even if the EC2 instances are terminated.
A data engineer must launch new EC2 instances from an Amazon Machine Image (AMI) and configure the instances to preserve the data.
Which solution will meet this requirement?
- A. Launch new EC2 instances by using an AMI that is backed by an EC2 instance store volume that contains the application data. Apply the default settings to the EC2 instances.
- B. Launch new EC2 instances by using an AMI that is backed by a root Amazon Elastic Block Store (Amazon EBS) volume that contains the application data. Apply the default settings to the EC2 instances.
- C. Launch new EC2 instances by using an AMI that is backed by an EC2 instance store volume. Attach an Amazon Elastic Block Store (Amazon EBS) volume to contain the application data. Apply the default settings to the EC2 instances.
- D. Launch new EC2 instances by using an AMI that is backed by an Amazon Elastic Block Store (Amazon EBS) volume. Attach an additional EC2 instance store volume to contain the application data. Apply the default settings to the EC2 instances.
View question →
-
Q34. A company uses Amazon Athena to run SQL queries for extract, transform, and load (ETL) tasks by using Create Table As Select (CTAS). The company must use Apache Spark instead of SQL to generate analytics.
Which solution will give the company the ability to use Spark to access Athena?
- A. Athena query settings
- B. Athena workgroup
- C. Athena data source
- D. Athena query editor
View question →
-
Q35. A company needs to partition the Amazon S3 storage that the company uses for a data lake. The partitioning will use a path of the S3 object keys in the following format: s3://bucket/prefix/year=2023/month=01/day=01.
A data engineer must ensure that the AWS Glue Data Catalog synchronizes with the S3 storage when the company adds new partitions to the bucket.
Which solution will meet these requirements with the LEAST latency?
- A. Schedule an AWS Glue crawler to run every morning.
- B. Manually run the AWS Glue CreatePartition API twice each day.
- C. Use code that writes data to Amazon S3 to invoke the Boto3 AWS Glue create_partition API call.
- D. Run the MSCK REPAIR TABLE command from the AWS Glue console.
View question →
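The partition layout in Q35 follows the Hive-style `key=value` convention. As a minimal sketch (the helper names `partition_prefix` and `partition_values` are hypothetical and are not part of any AWS SDK), the prefix and the ordered partition values for a given date could be derived like this:

```python
from datetime import date


def partition_prefix(base: str, day: date) -> str:
    """Build a Hive-style prefix such as <base>/year=2023/month=01/day=01."""
    return (
        f"{base.rstrip('/')}"
        f"/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"
    )


def partition_values(day: date) -> list:
    """Return the partition values in key order (year, month, day) as strings."""
    return [f"{day.year:04d}", f"{day.month:02d}", f"{day.day:02d}"]


if __name__ == "__main__":
    # Reproduces the example path from the question.
    print(partition_prefix("s3://bucket/prefix", date(2023, 1, 1)))
    print(partition_values(date(2023, 1, 1)))
```

Zero-padding the month and day keeps the object keys consistent with the `month=01/day=01` form shown in the question.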
-
Q36. A media company uses software as a service (SaaS) applications to gather data by using third-party tools. The company needs to store the data in an Amazon S3 bucket. The company will use Amazon Redshift to perform analytics based on the data.
Which AWS service or feature will meet these requirements with the LEAST operational overhead?
- A. Amazon Managed Streaming for Apache Kafka (Amazon MSK)
- B. Amazon AppFlow
- C. AWS Glue Data Catalog
- D. Amazon Kinesis
View question →
-
Q37. A data engineer is using Amazon Athena to analyze sales data that is in Amazon S3. The data engineer writes a query to retrieve sales amounts for 2023 for several products from a table named sales_data. However, the query does not return results for all of the products that are in the sales_data table. The data engineer needs to troubleshoot the query to resolve the issue.
The data engineer's original query is as follows:
SELECT product_name, sum(sales_amount)
FROM sales_data
WHERE year = 2023
GROUP BY product_name
How should the data engineer modify the Athena query to meet these requirements?
- A. Replace sum(sales_amount) with count(*) for the aggregation.
- B. Change WHERE year = 2023 to WHERE extract(year FROM sales_data) = 2023.
- C. Add HAVING sum(sales_amount) > 0 after the GROUP BY clause.
- D. Remove the GROUP BY clause.
View question →
-
Q38. A data engineer has a one-time task to read data from objects that are in Apache Parquet format in an Amazon S3 bucket. The data engineer needs to query only one column of the data.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Configure an AWS Lambda function to load data from the S3 bucket into a pandas DataFrame. Write a SQL SELECT statement on the DataFrame to query the required column.
- B. Use S3 Select to write a SQL SELECT statement to retrieve the required column from the S3 objects.
- C. Prepare an AWS Glue DataBrew project to consume the S3 objects and to query the required column.
- D. Run an AWS Glue crawler on the S3 objects. Use a SQL SELECT statement in Amazon Athena to query the required column.
View question →
-
Q39. A company uses Amazon Redshift for its data warehouse. The company must automate refresh schedules for Amazon Redshift materialized views.
Which solution will meet this requirement with the LEAST effort?
- A. Use Apache Airflow to refresh the materialized views.
- B. Use an AWS Lambda user-defined function (UDF) within Amazon Redshift to refresh the materialized views.
- C. Use the query editor v2 in Amazon Redshift to refresh the materialized views.
- D. Use an AWS Glue workflow to refresh the materialized views.
View question →
-
Q40. A data engineer must orchestrate a data pipeline that consists of one AWS Lambda function and one AWS Glue job. The solution must integrate with AWS services.
Which solution will meet these requirements with the LEAST management overhead?
- A. Use an AWS Step Functions workflow that includes a state machine. Configure the state machine to run the Lambda function and then the AWS Glue job.
- B. Use an Apache Airflow workflow that is deployed on an Amazon EC2 instance. Define a directed acyclic graph (DAG) in which the first task is to call the Lambda function and the second task is to call the AWS Glue job.
- C. Use an AWS Glue workflow to run the Lambda function and then the AWS Glue job.
- D. Use an Apache Airflow workflow that is deployed on Amazon Elastic Kubernetes Service (Amazon EKS). Define a directed acyclic graph (DAG) in which the first task is to call the Lambda function and the second task is to call the AWS Glue job.
View question →
-
Q41. A company needs to set up a data catalog and metadata management for data sources that run in the AWS Cloud. The company will use the data catalog to maintain the metadata of all the objects that are in a set of data stores. The data stores include structured sources such as Amazon RDS and Amazon Redshift. The data stores also include semistructured sources such as JSON files and .xml files that are stored in Amazon S3.
The company needs a solution that will update the data catalog on a regular basis. The solution also must detect changes to the source metadata.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Use Amazon Aurora as the data catalog. Create AWS Lambda functions that will connect to the data catalog. Configure the Lambda functions to gather the metadata information from multiple sources and to update the Aurora data catalog. Schedule the Lambda functions to run periodically.
- B. Use the AWS Glue Data Catalog as the central metadata repository. Use AWS Glue crawlers to connect to multiple data stores and to update the Data Catalog with metadata changes. Schedule the crawlers to run periodically to update the metadata catalog.
- C. Use Amazon DynamoDB as the data catalog. Create AWS Lambda functions that will connect to the data catalog. Configure the Lambda functions to gather the metadata information from multiple sources and to update the DynamoDB data catalog. Schedule the Lambda functions to run periodically.
- D. Use the AWS Glue Data Catalog as the central metadata repository. Extract the schema for Amazon RDS and Amazon Redshift sources, and build the Data Catalog. Use AWS Glue crawlers for data that is in Amazon S3 to infer the schema and to automatically update the Data Catalog.
View question →
-
Q42. A company stores data from an application in an Amazon DynamoDB table that operates in provisioned capacity mode. The workloads of the application have predictable throughput load on a regular schedule. Every Monday, there is an immediate increase in activity early in the morning. The application has very low usage during weekends.
The company must ensure that the application performs consistently during peak usage times.
Which solution will meet these requirements in the MOST cost-effective way?
- A. Increase the provisioned capacity to the maximum capacity that is currently present during peak load times.
- B. Divide the table into two tables. Provision each table with half of the provisioned capacity of the original table. Spread queries evenly across both tables.
- C. Use AWS Application Auto Scaling to schedule higher provisioned capacity for peak usage times. Schedule lower capacity during off-peak times.
- D. Change the capacity mode from provisioned to on-demand. Configure the table to scale up and scale down based on the load on the table.
View question →
-
Q43. A company is planning to migrate on-premises Apache Hadoop clusters to Amazon EMR. The company also needs to migrate a data catalog into a persistent storage solution.
The company currently stores the data catalog in an on-premises Apache Hive metastore on the Hadoop clusters. The company requires a serverless solution to migrate the data catalog.
Which solution will meet these requirements MOST cost-effectively?
- A. Use AWS Database Migration Service (AWS DMS) to migrate the Hive metastore into Amazon S3. Configure AWS Glue Data Catalog to scan Amazon S3 to produce the data catalog.
- B. Configure a Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use AWS Glue Data Catalog to store the company's data catalog as an external data catalog.
- C. Configure an external Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use Amazon Aurora MySQL to store the company's data catalog.
- D. Configure a new Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use the new metastore as the company's data catalog.
View question →
-
Q44. A company uses an Amazon Redshift provisioned cluster as its database. The Redshift cluster has five reserved ra3.4xlarge nodes and uses key distribution.
A data engineer notices that one of the nodes frequently has a CPU load over 90%. SQL queries that run on the node are queued. The other four nodes usually have a CPU load under 15% during daily operations.
The data engineer wants to maintain the current number of compute nodes. The data engineer also wants to balance the load more evenly across all five compute nodes.
Which solution will meet these requirements?
- A. Change the sort key to be the data column that is most often used in a WHERE clause of the SQL SELECT statement.
- B. Change the distribution key to the table column that has the largest dimension.
- C. Upgrade the reserved node from ra3.4xlarge to ra3.16xlarge.
- D. Change the primary key to be the data column that is most often used in a WHERE clause of the SQL SELECT statement.
View question →
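The symptom in Q44 (one busy node, four mostly idle nodes) is typical of data skew under key distribution, where every row with the same key value lands on the same node. The sketch below is only an illustration: it uses `zlib.crc32` as a stand-in hash, which is an assumption and not the hash function that Amazon Redshift uses, and the key names are made up.

```python
import zlib
from collections import Counter

NUM_NODES = 5  # matches the five compute nodes in the question


def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Map a distribution-key value to a node with a stand-in hash (crc32)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def rows_per_node(keys, num_nodes: int = NUM_NODES) -> list:
    """Count how many rows each node receives for the given key values."""
    counts = Counter(node_for(k, num_nodes) for k in keys)
    return [counts.get(n, 0) for n in range(num_nodes)]


# Hypothetical data: 90% of rows share one key value (low cardinality, skewed).
skewed_keys = ["customer-1"] * 9000 + [f"customer-{i}" for i in range(2, 1002)]
# Hypothetical data: every row has its own key value (high cardinality).
spread_keys = [f"order-{i}" for i in range(10000)]

if __name__ == "__main__":
    print("skewed key:", rows_per_node(skewed_keys))
    print("spread key:", rows_per_node(spread_keys))
```

With the skewed key, at least 9,000 of the 10,000 rows end up on a single node regardless of the hash, because identical key values always hash to the same place.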
-
Q45. A security company stores IoT data that is in JSON format in an Amazon S3 bucket. The data structure can change when the company upgrades the IoT devices. The company wants to create a data catalog that includes the IoT data. The company's analytics department will use the data catalog to index the data.
Which solution will meet these requirements MOST cost-effectively?
- A. Create an AWS Glue Data Catalog. Configure an AWS Glue Schema Registry. Create a new AWS Glue workload to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless.
- B. Create an Amazon Redshift provisioned cluster. Create an Amazon Redshift Spectrum database for the analytics department to explore the data that is in Amazon S3. Create Redshift stored procedures to load the data into Amazon Redshift.
- C. Create an Amazon Athena workgroup. Explore the data that is in Amazon S3 by using Apache Spark through Athena. Provide the Athena workgroup schema and tables to the analytics department.
- D. Create an AWS Glue Data Catalog. Configure an AWS Glue Schema Registry. Create AWS Lambda user defined functions (UDFs) by using the Amazon Redshift Data API. Create an AWS Step Functions job to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless.
View question →
-
Q46. A company stores details about transactions in an Amazon S3 bucket. The company wants to log all writes to the S3 bucket into another S3 bucket that is in the same AWS Region.
Which solution will meet this requirement with the LEAST operational effort?
- A. Configure an S3 Event Notifications rule for all activities on the transactions S3 bucket to invoke an AWS Lambda function. Program the Lambda function to write the event to Amazon Kinesis Data Firehose. Configure Kinesis Data Firehose to write the event to the logs S3 bucket.
- B. Create a trail of management events in AWS CloudTrail. Configure the trail to receive data from the transactions S3 bucket. Specify an empty prefix and write-only events. Specify the logs S3 bucket as the destination bucket.
- C. Configure an S3 Event Notifications rule for all activities on the transactions S3 bucket to invoke an AWS Lambda function. Program the Lambda function to write the events to the logs S3 bucket.
- D. Create a trail of data events in AWS CloudTrail. Configure the trail to receive data from the transactions S3 bucket. Specify an empty prefix and write-only events. Specify the logs S3 bucket as the destination bucket.
View question →
-
Q47. A data engineer needs to maintain a central metadata repository that users access through Amazon EMR and Amazon Athena queries. The repository needs to provide the schema and properties of many tables. Some of the metadata is stored in Apache Hive. The data engineer needs to import the metadata from Hive into the central metadata repository.
Which solution will meet these requirements with the LEAST development effort?
- A. Use Amazon EMR and Apache Ranger.
- B. Use a Hive metastore on an EMR cluster.
- C. Use the AWS Glue Data Catalog.
- D. Use a metastore on an Amazon RDS for MySQL DB instance.
View question →
-
Q48. A company needs to build a data lake in AWS. The company must provide row-level data access and column-level data access to specific teams. The teams will access the data by using Amazon Athena, Amazon Redshift Spectrum, and Apache Hive from Amazon EMR.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Use Amazon S3 for data lake storage. Use S3 access policies to restrict data access by rows and columns. Provide data access through Amazon S3.
- B. Use Amazon S3 for data lake storage. Use Apache Ranger through Amazon EMR to restrict data access by rows and columns. Provide data access by using Apache Pig.
- C. Use Amazon Redshift for data lake storage. Use Redshift security policies to restrict data access by rows and columns. Provide data access by using Apache Spark and Amazon Athena federated queries.
- D. Use Amazon S3 for data lake storage. Use AWS Lake Formation to restrict data access by rows and columns. Provide data access through AWS Lake Formation.
View question →
-
Q49. A company has used an Amazon Redshift table that is named Orders for 6 months. The company performs weekly updates and deletes on the table. The table has an interleaved sort key on a column that contains AWS Regions.
The company wants to reclaim disk space so that the company will not run out of storage space. The company also wants to analyze the sort key column.
Which Amazon Redshift command will meet these requirements?
- A. VACUUM FULL Orders
- B. VACUUM DELETE ONLY Orders
- C. VACUUM REINDEX Orders
- D. VACUUM SORT ONLY Orders
View question →
-
Q50. A manufacturing company wants to collect data from sensors. A data engineer needs to implement a solution that ingests sensor data in near real time.
The solution must store the data to a persistent data store. The solution must store the data in nested JSON format. The company must have the ability to query from the data store with a latency of less than 10 milliseconds.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Use a self-hosted Apache Kafka cluster to capture the sensor data. Store the data in Amazon S3 for querying.
- B. Use AWS Lambda to process the sensor data. Store the data in Amazon S3 for querying.
- C. Use Amazon Kinesis Data Streams to capture the sensor data. Store the data in Amazon DynamoDB for querying.
- D. Use Amazon Simple Queue Service (Amazon SQS) to buffer incoming sensor data. Use AWS Glue to store the data in Amazon RDS for querying.
View question →
-
Q51. A company stores data in a data lake that is in Amazon S3. Some data that the company stores in the data lake contains personally identifiable information (PII). Multiple user groups need to access the raw data. The company must ensure that user groups can access only the PII that they require.
Which solution will meet these requirements with the LEAST effort?
- A. Use Amazon Athena to query the data. Set up AWS Lake Formation and create data filters to establish levels of access for the company's IAM roles. Assign each user to the IAM role that matches the user's PII access requirements.
- B. Use Amazon QuickSight to access the data. Use column-level security features in QuickSight to limit the PII that users can retrieve from Amazon S3 by using Amazon Athena. Define QuickSight access levels based on the PII access requirements of the users.
- C. Build a custom query builder UI that will run Athena queries in the background to access the data. Create user groups in Amazon Cognito. Assign access levels to the user groups based on the PII access requirements of the users.
- D. Create IAM roles that have different levels of granular access. Assign the IAM roles to IAM user groups. Use an identity-based policy to assign access levels to user groups at the column level.
View question →
-
Q52. A financial company wants to use Amazon Athena to run on-demand SQL queries on a petabyte-scale dataset to support a business intelligence (BI) application. An AWS Glue job that runs during non-business hours updates the dataset once every day. The BI application has a standard data refresh frequency of 1 hour to comply with company policies.
A data engineer wants to cost optimize the company's use of Amazon Athena without adding any additional infrastructure costs.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Configure an Amazon S3 Lifecycle policy to move data to the S3 Glacier Deep Archive storage class after 1 day.
- B. Use the query result reuse feature of Amazon Athena for the SQL queries.
- C. Add an Amazon ElastiCache cluster between the BI application and Athena.
- D. Change the format of the files that are in the dataset to Apache Parquet.
View question →
-
Q53. A company's data engineer needs to optimize the performance of table SQL queries. The company stores data in an Amazon Redshift cluster. The data engineer cannot increase the size of the cluster because of budget constraints.
The company stores the data in multiple tables and loads the data by using the EVEN distribution style. Some tables are hundreds of gigabytes in size. Other tables are less than 10 MB in size.
Which solution will meet these requirements?
- A. Keep using the EVEN distribution style for all tables. Specify primary and foreign keys for all tables.
- B. Use the ALL distribution style for large tables. Specify primary and foreign keys for all tables.
- C. Use the ALL distribution style for rarely updated small tables. Specify primary and foreign keys for all tables.
- D. Specify a combination of distribution, sort, and partition keys for all tables.
View question →
-
Q54. A company receives .csv files that contain physical address data. The data is in columns that have the following names: Door_No, Street_Name, City, and Zip_Code. The company wants to create a single column to store these values in the following format:
Which solution will meet this requirement with the LEAST coding effort?
- A. Use AWS Glue DataBrew to read the files. Use the NEST_TO_ARRAY transformation to create the new column.
- B. Use AWS Glue DataBrew to read the files. Use the NEST_TO_MAP transformation to create the new column.
- C. Use AWS Glue DataBrew to read the files. Use the PIVOT transformation to create the new column.
- D. Write a Lambda function in Python to read the files. Use the Python data dictionary type to create the new column.
View question →
-
Q55. A company receives call logs as Amazon S3 objects that contain sensitive customer information. The company must protect the S3 objects by using encryption. The company must also use encryption keys that only specific employees can access.
Which solution will meet these requirements with the LEAST effort?
- A. Use an AWS CloudHSM cluster to store the encryption keys. Configure the process that writes to Amazon S3 to make calls to CloudHSM to encrypt and decrypt the objects. Deploy an IAM policy that restricts access to the CloudHSM cluster.
- B. Use server-side encryption with customer-provided keys (SSE-C) to encrypt the objects that contain customer information. Restrict access to the keys that encrypt the objects.
- C. Use server-side encryption with AWS KMS keys (SSE-KMS) to encrypt the objects that contain customer information. Configure an IAM policy that restricts access to the KMS keys that encrypt the objects.
- D. Use server-side encryption with Amazon S3 managed keys (SSE-S3) to encrypt the objects that contain customer information. Configure an IAM policy that restricts access to the Amazon S3 managed keys that encrypt the objects.
View question →
-
Q56. A company stores petabytes of data in thousands of Amazon S3 buckets in the S3 Standard storage class. The data supports analytics workloads that have unpredictable and variable data access patterns.
The company does not access some data for months. However, the company must be able to retrieve all data within milliseconds. The company needs to optimize S3 storage costs.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Use S3 Storage Lens standard metrics to determine when to move objects to more cost-optimized storage classes. Create S3 Lifecycle policies for the S3 buckets to move objects to cost-optimized storage classes. Continue to refine the S3 Lifecycle policies in the future to optimize storage costs.
- B. Use S3 Storage Lens activity metrics to identify S3 buckets that the company accesses infrequently. Configure S3 Lifecycle rules to move objects from S3 Standard to the S3 Standard-Infrequent Access (S3 Standard-IA) and S3 Glacier storage classes based on the age of the data.
- C. Use S3 Intelligent-Tiering. Activate the Deep Archive Access tier.
- D. Use S3 Intelligent-Tiering. Use the default access tier.
View question →
-
Q57. A data engineer uses Amazon Redshift to run resource-intensive analytics processes once every month. Every month, the data engineer creates a new Redshift provisioned cluster. The data engineer deletes the Redshift provisioned cluster after the analytics processes are complete every month. Before the data engineer deletes the cluster each month, the data engineer unloads backup data from the cluster to an Amazon S3 bucket.
The data engineer needs a solution to run the monthly analytics processes that does not require the data engineer to manage the infrastructure manually.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Use AWS Step Functions to pause the Redshift cluster when the analytics processes are complete and to resume the cluster to run new processes every month.
- B. Use Amazon Redshift Serverless to automatically process the analytics workload.
- C. Use the AWS CLI to automatically process the analytics workload.
- D. Use AWS CloudFormation templates to automatically process the analytics workload.
View question →
-
Q58. A company receives a daily file that contains customer data in .xls format. The company stores the file in Amazon S3. The daily file is approximately 2 GB in size.
A data engineer concatenates the column in the file that contains customer first names and the column that contains customer last names. The data engineer needs to determine the number of distinct customers in the file.
Which solution will meet this requirement with the LEAST operational effort?
- A. Create and run an Apache Spark job in an AWS Glue notebook. Configure the job to read the S3 file and calculate the number of distinct customers.
- B. Create an AWS Glue crawler to create an AWS Glue Data Catalog of the S3 file. Run SQL queries from Amazon Athena to calculate the number of distinct customers.
- C. Create and run an Apache Spark job in Amazon EMR Serverless to calculate the number of distinct customers.
- D. Use AWS Glue DataBrew to create a recipe that uses the COUNT_DISTINCT aggregate function to calculate the number of distinct customers.
View question →
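The computation that Q58 describes (concatenate first and last name, then count distinct values) can be sketched in plain Python. This is only an illustration of the logic on in-memory rows; the function name `count_distinct_customers` and the sample rows are hypothetical, and the sketch does not read the .xls file or call any AWS service.

```python
def count_distinct_customers(rows) -> int:
    """Count distinct full names built from (first_name, last_name) pairs."""
    # Concatenate the two columns, then let a set drop the duplicates.
    full_names = {f"{first.strip()} {last.strip()}" for first, last in rows}
    return len(full_names)


if __name__ == "__main__":
    sample = [
        ("Ana", "Silva"),
        ("Ana", "Silva"),  # duplicate customer
        ("Bo", "Chen"),
        ("Ana", "Chen"),
    ]
    print(count_distinct_customers(sample))  # 3
```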
-
Q59. A healthcare company uses Amazon Kinesis Data Streams to stream real-time health data from wearable devices, hospital equipment, and patient records.
A data engineer needs to find a solution to process the streaming data. The data engineer needs to store the data in an Amazon Redshift Serverless warehouse. The solution must support near real-time analytics of the streaming data and the previous day's data.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Load data into Amazon Kinesis Data Firehose. Load the data into Amazon Redshift.
- B. Use the streaming ingestion feature of Amazon Redshift.
- C. Load the data into Amazon S3. Use the COPY command to load the data into Amazon Redshift.
- D. Use the Amazon Aurora zero-ETL integration with Amazon Redshift.
View question →
-
Q60. A company stores datasets in JSON format and .csv format in an Amazon S3 bucket. The company has Amazon RDS for Microsoft SQL Server databases, Amazon DynamoDB tables that are in provisioned capacity mode, and an Amazon Redshift cluster. A data engineering team must develop a solution that will give data scientists the ability to query all data sources by using syntax similar to SQL.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Use AWS Glue to crawl the data sources. Store metadata in the AWS Glue Data Catalog. Use Amazon Athena to query the data. Use SQL for structured data sources. Use PartiQL for data that is stored in JSON format.
- B. Use AWS Glue to crawl the data sources. Store metadata in the AWS Glue Data Catalog. Use Redshift Spectrum to query the data. Use SQL for structured data sources. Use PartiQL for data that is stored in JSON format.
- C. Use AWS Glue to crawl the data sources. Store metadata in the AWS Glue Data Catalog. Use AWS Glue jobs to transform data that is in JSON format to Apache Parquet or .csv format. Store the transformed data in an S3 bucket. Use Amazon Athena to query the original and transformed data from the S3 bucket.
- D. Use AWS Lake Formation to create a data lake. Use Lake Formation jobs to transform the data from all data sources to Apache Parquet format. Store the transformed data in an S3 bucket. Use Amazon Athena or Redshift Spectrum to query the data.
View question →
-
Q61. A data engineer is configuring Amazon SageMaker Studio to use AWS Glue interactive sessions to prepare data for machine learning (ML) models.
The data engineer receives an access denied error when the data engineer tries to prepare the data by using SageMaker Studio.
Which change should the engineer make to gain access to SageMaker Studio?
- A. Add the AWSGlueServiceRole managed policy to the data engineer's IAM user.
- B. Add a policy to the data engineer's IAM user that includes the sts:AssumeRole action for the AWS Glue and SageMaker service principals in the trust policy.
- C. Add the AmazonSageMakerFullAccess managed policy to the data engineer's IAM user.
- D. Add a policy to the data engineer's IAM user that allows the sts:AddAssociation action for the AWS Glue and SageMaker service principals in the trust policy.
View question →
-
Q62. A company extracts approximately 1 TB of data every day from data sources such as SAP HANA, Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. Some of the data sources have undefined data schemas or data schemas that change.
A data engineer must implement a solution that can detect the schema for these data sources. The solution must extract, transform, and load the data to an Amazon S3 bucket. The company has a service level agreement (SLA) to load the data into the S3 bucket within 15 minutes of data creation.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Use Amazon EMR to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark.
- B. Use AWS Glue to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark.
- C. Create a PySpark program in AWS Lambda to extract, transform, and load the data into the S3 bucket.
- D. Create a stored procedure in Amazon Redshift to detect the schema and to extract, transform, and load the data into a Redshift Spectrum table. Access the table from Amazon S3.
View question →
-
Q63. A company has multiple applications that use datasets that are stored in an Amazon S3 bucket. The company has an ecommerce application that generates a dataset that contains personally identifiable information (PII). The company has an internal analytics application that does not require access to the PII.
To comply with regulations, the company must not share PII unnecessarily. A data engineer needs to implement a solution that will redact PII dynamically, based on the needs of each application that accesses the dataset.
Which solution will meet the requirements with the LEAST operational overhead?
- A. Create an S3 bucket policy to limit the access each application has. Create multiple copies of the dataset. Give each dataset copy the appropriate level of redaction for the needs of the application that accesses the copy.
- B. Create an S3 Object Lambda endpoint. Use the S3 Object Lambda endpoint to read data from the S3 bucket. Implement redaction logic within an S3 Object Lambda function to dynamically redact PII based on the needs of each application that accesses the data.
- C. Use AWS Glue to transform the data for each application. Create multiple copies of the dataset. Give each dataset copy the appropriate level of redaction for the needs of the application that accesses the copy.
- D. Create an API Gateway endpoint that has custom authorizers. Use the API Gateway endpoint to read data from the S3 bucket. Initiate a REST API call to dynamically redact PII based on the needs of each application that accesses the data.
View question →
-
Q64. A data engineer needs to build an extract, transform, and load (ETL) job. The ETL job will process daily incoming .csv files that users upload to an Amazon S3 bucket. The size of each S3 object is less than 100 MB.
Which solution will meet these requirements MOST cost-effectively?
- A. Write a custom Python application. Host the application on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.
- B. Write a PySpark ETL script. Host the script on an Amazon EMR cluster.
- C. Write an AWS Glue PySpark job. Use Apache Spark to transform the data.
- D. Write an AWS Glue Python shell job. Use pandas to transform the data.
View question →
-
Q65. A data engineer creates an AWS Glue Data Catalog table by using an AWS Glue crawler that is named Orders. The data engineer wants to add the following new partitions:
s3://transactions/orders/order_date=2023-01-01
s3://transactions/orders/order_date=2023-01-02
The data engineer must edit the metadata to include the new partitions in the table without scanning all the folders and files in the location of the table.
Which data definition language (DDL) statement should the data engineer use in Amazon Athena?
- A. ALTER TABLE Orders ADD PARTITION(order_date='2023-01-01') LOCATION 's3://transactions/orders/order_date=2023-01-01'; ALTER TABLE Orders ADD PARTITION(order_date='2023-01-02') LOCATION 's3://transactions/orders/order_date=2023-01-02';
- B. MSCK REPAIR TABLE Orders;
- C. REPAIR TABLE Orders;
- D. ALTER TABLE Orders MODIFY PARTITION(order_date='2023-01-01') LOCATION 's3://transactions/orders/2023-01-01'; ALTER TABLE Orders MODIFY PARTITION(order_date='2023-01-02') LOCATION 's3://transactions/orders/2023-01-02';
View question →
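The two S3 locations in Q65 encode the partition column and its value directly in the path as `order_date=<value>`. A minimal sketch of reading that information back out of a location is shown below; `parse_partitions` is a hypothetical helper written for illustration and is not an Athena or AWS Glue API.

```python
from urllib.parse import urlparse


def parse_partitions(s3_uri: str) -> dict:
    """Extract Hive-style key=value partition segments from an S3 URI."""
    # urlparse puts the bucket in netloc and the object key in path.
    path = urlparse(s3_uri).path
    partitions = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


if __name__ == "__main__":
    for uri in (
        "s3://transactions/orders/order_date=2023-01-01",
        "s3://transactions/orders/order_date=2023-01-02",
    ):
        print(uri, "->", parse_partitions(uri))
```

A location without `key=value` segments, such as `s3://transactions/orders/2023-01-01`, yields an empty dictionary because the path no longer names the partition column.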
-
Q66. A company stores 10 to 15 TB of uncompressed .csv files in Amazon S3. The company is evaluating Amazon Athena as a one-time query engine.
The company wants to transform the data to optimize query runtime and storage costs.
Which file format and compression solution will meet these requirements for Athena queries?
- A. .csv format compressed with zip
- B. JSON format compressed with bzip2
- C. Apache Parquet format compressed with Snappy
- D. Apache Avro format compressed with LZO
View question →
-
Q67. A company uses Apache Airflow to orchestrate the company's current on-premises data pipelines. The company runs SQL data quality check tasks as part of the pipelines. The company wants to migrate the pipelines to AWS and to use AWS managed services.
Which solution will meet these requirements with the LEAST amount of refactoring?
- A. Set up AWS Outposts in the AWS Region that is nearest to the location where the company uses Airflow. Migrate the servers into Outposts hosted Amazon EC2 instances. Update the pipelines to interact with the Outposts hosted EC2 instances instead of the on-premises pipelines.
- B. Create a custom Amazon Machine Image (AMI) that contains the Airflow application and the code that the company needs to migrate. Use the custom AMI to deploy Amazon EC2 instances. Update the network connections to interact with the newly deployed EC2 instances.
- C. Migrate the existing Airflow orchestration configuration into Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Create the data quality checks during the ingestion to validate the data quality by using SQL tasks in Airflow.
- D. Convert the pipelines to AWS Step Functions workflows. Recreate the data quality checks in SQL as Python based AWS Lambda functions.
View question →
-
Q68. A company uses Amazon EMR as an extract, transform, and load (ETL) pipeline to transform data that comes from multiple sources. A data engineer must orchestrate the pipeline to maximize performance.
Which AWS service will meet this requirement MOST cost effectively?
- A. Amazon EventBridge
- B. Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
- C. AWS Step Functions
- D. AWS Glue Workflows
View question →
-
Q69. An online retail company stores Application Load Balancer (ALB) access logs in an Amazon S3 bucket. The company wants to use Amazon Athena to query the logs to analyze traffic patterns.
A data engineer creates an unpartitioned table in Athena. As the amount of the data gradually increases, the response time for queries also increases. The data engineer wants to improve the query performance in Athena.
Which solution will meet these requirements with the LEAST operational effort?
- A. Create an AWS Glue job that determines the schema of all ALB access logs and writes the partition metadata to AWS Glue Data Catalog.
- B. Create an AWS Glue crawler that includes a classifier that determines the schema of all ALB access logs and writes the partition metadata to AWS Glue Data Catalog.
- C. Create an AWS Lambda function to transform all ALB access logs. Save the results to Amazon S3 in Apache Parquet format. Partition the metadata. Use Athena to query the transformed data.
- D. Use Apache Hive to create bucketed tables. Use an AWS Lambda function to transform all ALB access logs.
View question →
-
Q70. A company has a business intelligence platform on AWS. The company uses an AWS Storage Gateway Amazon S3 File Gateway to transfer files from the company's on-premises environment to an Amazon S3 bucket.
A data engineer needs to set up a process that will automatically launch an AWS Glue workflow to run a series of AWS Glue jobs when each file transfer finishes successfully.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Determine when the file transfers usually finish based on previous successful file transfers. Set up an Amazon EventBridge scheduled event to initiate the AWS Glue jobs at that time of day.
- B. Set up an Amazon EventBridge event that initiates the AWS Glue workflow after every successful S3 File Gateway file transfer event.
- C. Set up an on-demand AWS Glue workflow so that the data engineer can start the AWS Glue workflow when each file transfer is complete.
- D. Set up an AWS Lambda function that will invoke the AWS Glue Workflow. Set up an event for the creation of an S3 object as a trigger for the Lambda function.
View question →
-
Q71. A manufacturing company has many IoT devices in facilities around the world. The company uses Amazon Kinesis Data Streams to collect data from the devices. The data includes device ID, capture date, measurement type, measurement value, and facility ID. The company uses facility ID as the partition key.
The company's operations team recently observed many WriteThroughputExceeded exceptions. The operations team found that some shards were heavily used but other shards were generally idle.
How should the company resolve the issues that the operations team observed?
- A. Change the partition key from facility ID to a randomly generated key.
- B. Increase the number of shards.
- C. Archive the data on the producer's side.
- D. Change the partition key from facility ID to capture date.
View question →
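Illustration for Q71: Kinesis Data Streams hashes each partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch below simulates that mapping locally to show how key cardinality affects shard load. The shard count, the three facility IDs, the random key format, and the equal-width hash ranges are assumptions made for the illustration, not details from the question.

```python
import hashlib
import random
from collections import Counter

NUM_SHARDS = 4             # hypothetical shard count
HASH_KEY_SPACE = 2 ** 128  # MD5 digests are 128 bits wide


def shard_for(partition_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Return the index of the shard whose hash key range contains the key.

    Assumes the shards split the 128-bit hash key space into equal ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * num_shards // HASH_KEY_SPACE


def shard_load(partition_keys) -> Counter:
    """Count how many records land on each shard."""
    return Counter(shard_for(key) for key in partition_keys)


RECORDS = 10_000

# Low-cardinality key: every record from one facility goes to the same shard.
facility_keys = [f"facility-{i % 3}" for i in range(RECORDS)]

# High-cardinality key: a random hex string per record (seeded for repeatability).
rng = random.Random(42)
random_keys = [f"{rng.getrandbits(64):016x}" for _ in range(RECORDS)]

by_facility = shard_load(facility_keys)
by_random = shard_load(random_keys)

print("facility ID key:", dict(by_facility))
print("random key:     ", dict(by_random))
```

With only three distinct keys, at most three shards can ever receive data, however many shards exist. The random keys spread across all four shards in roughly equal shares.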
-
Q72. A data engineer wants to improve the performance of SQL queries in Amazon Athena that run against a sales data table.
The data engineer wants to understand the execution plan of a specific SQL statement. The data engineer also wants to see the computational cost of each operation in a SQL query.
Which statement does the data engineer need to run to meet these requirements?
- A. EXPLAIN SELECT * FROM sales;
- B. EXPLAIN ANALYZE FROM sales;
- C. EXPLAIN ANALYZE SELECT * FROM sales;
- D. EXPLAIN FROM sales;
View question →
-
Q73. A company plans to provision a log delivery stream within a VPC. The company configured the VPC flow logs to publish to Amazon CloudWatch Logs. The company needs to send the flow logs to Splunk in near real time for further analysis.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Configure an Amazon Kinesis Data Streams data stream to use Splunk as the destination. Create a CloudWatch Logs subscription filter to send log events to the data stream.
- B. Create an Amazon Kinesis Data Firehose delivery stream to use Splunk as the destination. Create a CloudWatch Logs subscription filter to send log events to the delivery stream.
- C. Create an Amazon Kinesis Data Firehose delivery stream to use Splunk as the destination. Create an AWS Lambda function to send the flow logs from CloudWatch Logs to the delivery stream.
- D. Configure an Amazon Kinesis Data Streams data stream to use Splunk as the destination. Create an AWS Lambda function to send the flow logs from CloudWatch Logs to the data stream.
View question →
-
Q74. A company has a data lake on AWS. The data lake ingests sources of data from business units. The company uses Amazon Athena for queries. The storage layer is Amazon S3 with an AWS Glue Data Catalog as a metadata repository.
The company wants to make the data available to data scientists and business analysts. However, the company first needs to manage fine-grained, column-level data access for Athena based on the user roles and responsibilities.
Which solution will meet these requirements?
- A. Set up AWS Lake Formation. Define security policy-based rules for the users and applications by IAM role in Lake Formation.
- B. Define an IAM resource-based policy for AWS Glue tables. Attach the same policy to IAM user groups.
- C. Define an IAM identity-based policy for AWS Glue tables. Attach the same policy to IAM roles. Associate the IAM roles with IAM groups that contain the users.
- D. Create a resource share in AWS Resource Access Manager (AWS RAM) to grant access to IAM users.
View question →
-
Q75. A company has developed several AWS Glue extract, transform, and load (ETL) jobs to validate and transform data from Amazon S3. The ETL jobs load the data into Amazon RDS for MySQL in batches once every day. The ETL jobs use a DynamicFrame to read the S3 data.
The ETL jobs currently process all the data that is in the S3 bucket. However, the company wants the jobs to process only the daily incremental data.
Which solution will meet this requirement with the LEAST coding effort?
- A. Create an ETL job that reads the S3 file status and logs the status in Amazon DynamoDB.
- B. Enable job bookmarks for the ETL jobs to update the state after a run to keep track of previously processed data.
- C. Enable job metrics for the ETL jobs to help keep track of processed objects in Amazon CloudWatch.
- D. Configure the ETL jobs to delete processed objects from Amazon S3 after each run.
View question →
-
Q76. An online retail company has an application that runs on Amazon EC2 instances that are in a VPC. The company wants to collect flow logs for the VPC and analyze network traffic.
Which solution will meet these requirements MOST cost-effectively?
- A. Publish flow logs to Amazon CloudWatch Logs. Use Amazon Athena for analytics.
- B. Publish flow logs to Amazon CloudWatch Logs. Use an Amazon OpenSearch Service cluster for analytics.
- C. Publish flow logs to Amazon S3 in text format. Use Amazon Athena for analytics.
- D. Publish flow logs to Amazon S3 in Apache Parquet format. Use Amazon Athena for analytics.
View question →
-
Q77. A retail company stores transactions, store locations, and customer information tables in four reserved ra3.4xlarge Amazon Redshift cluster nodes. All three tables use even table distribution.
The company updates the store location table only once or twice every few years.
A data engineer notices that Redshift queues are slowing down because the whole store location table is constantly being broadcast to all four compute nodes for most queries. The data engineer wants to speed up the query performance by minimizing the broadcasting of the store location table.
Which solution will meet these requirements in the MOST cost-effective way?
- A. Change the distribution style of the store location table from EVEN distribution to ALL distribution.
- B. Change the distribution style of the store location table to KEY distribution based on the column that has the highest dimension.
- C. Add a join column named store_id into the sort key for all the tables.
- D. Upgrade the Redshift reserved node to a larger instance size in the same instance family.
View question →
-
Q78. A company has a data warehouse that contains a table that is named Sales. The company stores the table in Amazon Redshift. The table includes a column that is named city_name. The company wants to query the table to find all rows that have a city_name that starts with "San" or "El".
Which SQL query will meet this requirement?
- A. Select * from Sales where city_name ~ '$(San|El)*';
- B. Select * from Sales where city_name ~ '^(San|El)*';
- C. Select * from Sales where city_name ~ '$(San&El)*';
- D. Select * from Sales where city_name ~ '^(San&El)*';
View question →
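Illustration for Q78: the options differ only in the anchor (^ or $) and in the character between the two prefixes (| or &). The sketch below uses Python's re module to show what each piece means. Python's engine is not the POSIX engine behind the Redshift ~ operator, so treat this only as an illustration of the anchor and alternation semantics that both share. The city names are made up.

```python
import re

# Hypothetical sample values for the city_name column.
CITIES = ["San Diego", "El Paso", "Santa Fe", "Elko", "Austin", "New San Juan"]


def matching(pattern: str) -> list:
    """Return the cities for which the pattern matches somewhere in the string."""
    return [city for city in CITIES if re.search(pattern, city)]


# ^ anchors the match to the start of the string; | means "either alternative".
print(matching(r"^(San|El)"))   # cities that start with "San" or "El"

# & has no special meaning, so this looks for the literal text "San&El".
print(matching(r"^(San&El)"))

# $ anchors to the end of the string, so no text can follow it.
print(matching(r"$(San|El)"))

# A trailing * allows the group to repeat zero times, so the empty match
# at the start of the string succeeds for every value.
print(matching(r"^(San|El)*"))
```

In this sketch only the first pattern selects exactly the values that start with "San" or "El". The last pattern shows why the quantifier matters: with * the prefix becomes optional.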
-
Q79. A company needs to send customer call data from its on-premises PostgreSQL database to AWS to generate near real-time insights. The solution must capture and load updates from operational data stores that run in the PostgreSQL database. The data changes continuously.
A data engineer configures an AWS Database Migration Service (AWS DMS) ongoing replication task. The task reads changes in near real time from the PostgreSQL source database transaction logs for each table. The task then sends the data to an Amazon Redshift cluster for processing.
The data engineer discovers latency issues during the change data capture (CDC) of the task. The data engineer thinks that the PostgreSQL source database is causing the high latency.
Which solution will confirm that the PostgreSQL database is the source of the high latency?
- A. Use Amazon CloudWatch to monitor the DMS task. Examine the CDCIncomingChanges metric to identify delays in the CDC from the source database.
- B. Verify that logical replication of the source database is configured in the postgresql.conf configuration file.
- C. Enable Amazon CloudWatch Logs for the DMS endpoint of the source database. Check for error messages.
- D. Use Amazon CloudWatch to monitor the DMS task. Examine the CDCLatencySource metric to identify delays in the CDC from the source database.
View question →
-
Q80. A lab uses IoT sensors to monitor humidity, temperature, and pressure for a project. The sensors send 100 KB of data every 10 seconds. A downstream process will read the data from an Amazon S3 bucket every 30 seconds.
Which solution will deliver the data to the S3 bucket with the LEAST latency?
- A. Use Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose to deliver the data to the S3 bucket. Use the default buffer interval for Kinesis Data Firehose.
- B. Use Amazon Kinesis Data Streams to deliver the data to the S3 bucket. Configure the stream to use 5 provisioned shards.
- C. Use Amazon Kinesis Data Streams and call the Kinesis Client Library to deliver the data to the S3 bucket. Use a 5 second buffer interval from an application.
- D. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) and Amazon Kinesis Data Firehose to deliver the data to the S3 bucket. Use a 5 second buffer interval for Kinesis Data Firehose.
View question →
-
Q81. A retail company uses AWS Glue for extract, transform, and load (ETL) operations on a dataset that contains information about customer orders. The company wants to implement specific validation rules to ensure data accuracy and consistency.
Which solution will meet these requirements?
- A. Use AWS Glue job bookmarks to track the data for accuracy and consistency.
- B. Create custom AWS Glue Data Quality rulesets to define specific data quality checks.
- C. Use the built-in AWS Glue Data Quality transforms for standard data quality validations.
- D. Use AWS Glue Data Catalog to maintain a centralized data schema and metadata repository.
View question →
-
Q82. An insurance company stores transaction data that the company compressed with gzip.
The company needs to query the transaction data for occasional audits.
Which solution will meet this requirement in the MOST cost-effective way?
- A. Store the data in Amazon Glacier Flexible Retrieval. Use Amazon S3 Glacier Select to query the data.
- B. Store the data in Amazon S3. Use Amazon S3 Select to query the data.
- C. Store the data in Amazon S3. Use Amazon Athena to query the data.
- D. Store the data in Amazon Glacier Instant Retrieval. Use Amazon Athena to query the data.
View question →
-
Q83. A data engineer finished testing an Amazon Redshift stored procedure that processes and inserts data into a table that is not mission critical. The engineer wants to automatically run the stored procedure on a daily basis.
Which solution will meet this requirement in the MOST cost-effective way?
- A. Create an AWS Lambda function to schedule a cron job to run the stored procedure.
- B. Schedule and run the stored procedure by using the Amazon Redshift Data API in an Amazon EC2 Spot Instance.
- C. Use query editor v2 to run the stored procedure on a schedule.
- D. Schedule an AWS Glue Python shell job to run the stored procedure.
View question →
-
Q84. A data engineer is building a data orchestration workflow. The data engineer plans to use a hybrid model that includes some on-premises resources and some resources that are in the cloud. The data engineer wants to prioritize portability and open source resources.
Which service should the data engineer use in both the on-premises environment and the cloud-based environment?
- A. AWS Data Exchange
- B. Amazon Simple Workflow Service (Amazon SWF)
- C. Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
- D. AWS Glue
View question →
-
Q85. A gaming company uses a NoSQL database to store customer information. The company is planning to migrate to AWS.
The company needs a fully managed AWS solution that will handle high online transaction processing (OLTP) workload, provide single-digit millisecond performance, and provide high availability around the world.
Which solution will meet these requirements with the LEAST operational overhead?
- A. Amazon Keyspaces (for Apache Cassandra)
- B. Amazon DocumentDB (with MongoDB compatibility)
- C. Amazon DynamoDB
- D. Amazon Timestream
View question →
-
Q86. A data engineer creates an AWS Lambda function that an Amazon EventBridge event will invoke. When the data engineer tries to invoke the Lambda function by using an EventBridge event, an AccessDeniedException message appears.
How should the data engineer resolve the exception?
- A. Ensure that the trust policy of the Lambda function execution role allows EventBridge to assume the execution role.
- B. Ensure that both the IAM role that EventBridge uses and the Lambda function's resource-based policy have the necessary permissions.
- C. Ensure that the subnet where the Lambda function is deployed is configured to be a private subnet.
- D. Ensure that EventBridge schemas are valid and that the event mapping configuration is correct.
View question →
-
Q87. A company uses a data lake that is based on an Amazon S3 bucket. To comply with regulations, the company must apply two layers of server-side encryption to files that are uploaded to the S3 bucket. The company wants to use an AWS Lambda function to apply the necessary encryption.
Which solution will meet these requirements?
- A. Use both server-side encryption with AWS KMS keys (SSE-KMS) and the Amazon S3 Encryption Client.
- B. Use dual-layer server-side encryption with AWS KMS keys (DSSE-KMS).
- C. Use server-side encryption with customer-provided keys (SSE-C) before files are uploaded.
- D. Use server-side encryption with AWS KMS keys (SSE-KMS).
View question →
-
Q88. A data engineer notices that Amazon Athena queries are held in a queue before the queries run.
How can the data engineer prevent the queries from queueing?
- A. Increase the query result limit.
- B. Configure provisioned capacity for an existing workgroup.
- C. Use federated queries.
- D. Allow users who run the Athena queries to an existing workgroup.
View question →
-
Q89. A data engineer needs to debug an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The data engineer enabled the bookmark feature for the AWS Glue job.
The data engineer has set the maximum concurrency for the AWS Glue job to 1.
The AWS Glue job is successfully writing the output to Amazon Redshift. However, the Amazon S3 files that were loaded during previous runs of the AWS Glue job are being reprocessed by subsequent runs.
What is the likely reason the AWS Glue job is reprocessing the files?
- A. The AWS Glue job does not have the s3:GetObjectAcl permission that is required for bookmarks to work correctly.
- B. The maximum concurrency for the AWS Glue job is set to 1.
- C. The data engineer incorrectly specified an older version of AWS Glue for the Glue job.
- D. The AWS Glue job does not have a required commit statement.
View question →
-
Q90. An ecommerce company wants to use AWS to migrate data pipelines from an on-premises environment into the AWS Cloud. The company currently uses a third-party tool in the on-premises environment to orchestrate data ingestion processes.
The company wants a migration solution that does not require the company to manage servers. The solution must be able to orchestrate Python and Bash scripts. The solution must not require the company to refactor any code.
Which solution will meet these requirements with the LEAST operational overhead?
- A. AWS Lambda
- B. Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
- C. AWS Step Functions
- D. AWS Glue
View question →
-
Q91. A retail company stores data from a product lifecycle management (PLM) application in an on-premises MySQL database. The PLM application frequently updates the database when transactions occur.
The company wants to gather insights from the PLM application in near real time. The company wants to integrate the insights with other business datasets and to analyze the combined dataset by using an Amazon Redshift data warehouse.
The company has already established an AWS Direct Connect connection between the on-premises infrastructure and AWS.
Which solution will meet these requirements with the LEAST development effort?
- A. Run a scheduled AWS Glue extract, transform, and load (ETL) job to get the MySQL database updates by using a Java Database Connectivity (JDBC) connection. Set Amazon Redshift as the destination for the ETL job.
- B. Run a full load plus CDC task in AWS Database Migration Service (AWS DMS) to continuously replicate the MySQL database changes. Set Amazon Redshift as the destination for the task.
- C. Use the Amazon AppFlow SDK to build a custom connector for the MySQL database to continuously replicate the database changes. Set Amazon Redshift as the destination for the connector.
- D. Run scheduled AWS DataSync tasks to synchronize data from the MySQL database. Set Amazon Redshift as the destination for the tasks.
View question →
-
Q92. A marketing company uses Amazon S3 to store clickstream data. The company queries the data at the end of each day by using a SQL JOIN clause on S3 objects that are stored in separate buckets.
The company creates key performance indicators (KPIs) based on the objects. The company needs a serverless solution that will give users the ability to query data by partitioning the data. The solution must maintain the atomicity, consistency, isolation, and durability (ACID) properties of the data.
Which solution will meet these requirements MOST cost-effectively?
- A. Amazon S3 Select
- B. Amazon Redshift Spectrum
- C. Amazon Athena
- D. Amazon EMR
View question →
-
Q93. A company wants to migrate data from an Amazon RDS for PostgreSQL DB instance in the eu-east-1 Region of an AWS account named Account_A. The company will migrate the data to an Amazon Redshift cluster in the eu-west-1 Region of an AWS account named Account_B.
Which solution will give AWS Database Migration Service (AWS DMS) the ability to replicate data between two data stores?
- A. Set up an AWS DMS replication instance in Account_B in eu-west-1.
- B. Set up an AWS DMS replication instance in Account_B in eu-east-1.
- C. Set up an AWS DMS replication instance in a new AWS account in eu-west-1.
- D. Set up an AWS DMS replication instance in Account_A in eu-east-1.
View question →
-
Q94. A company uses Amazon S3 as a data lake. The company sets up a data warehouse by using a multi-node Amazon Redshift cluster. The company organizes the data files in the data lake based on the data source of each data file.
The company loads all the data files into one table in the Redshift cluster by using a separate COPY command for each data file location. This approach takes a long time to load all the data files into the table. The company must increase the speed of the data ingestion. The company does not want to increase the cost of the process.
Which solution will meet these requirements?
- A. Use a provisioned Amazon EMR cluster to copy all the data files into one folder. Use a COPY command to load the data into Amazon Redshift.
- B. Load all the data files in parallel into Amazon Aurora. Run an AWS Glue job to load the data into Amazon Redshift.
- C. Use an AWS Glue job to copy all the data files into one folder. Use a COPY command to load the data into Amazon Redshift.
- D. Create a manifest file that contains the data file locations. Use a COPY command to load the data into Amazon Redshift.
View question →
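Illustration for Q94: a Redshift COPY manifest is a JSON document with an "entries" list, where each entry names one S3 object by "url" and may set "mandatory". The sketch below builds such a manifest and a matching COPY statement as plain strings. The bucket, file paths, table name, and IAM role ARN are hypothetical placeholders, and the sketch does not upload anything or connect to a cluster.

```python
import json


def build_copy_manifest(s3_urls):
    """Build a COPY manifest that lists every data file explicitly."""
    return {
        "entries": [
            # mandatory=True asks COPY to fail if a listed file is missing.
            {"url": url, "mandatory": True}
            for url in s3_urls
        ]
    }


# Hypothetical data files that live under different source prefixes.
data_files = [
    "s3://example-data-lake/source_a/part-0000.csv",
    "s3://example-data-lake/source_b/part-0000.csv",
    "s3://example-data-lake/source_c/part-0000.csv",
]

manifest_json = json.dumps(build_copy_manifest(data_files), indent=2)
print(manifest_json)

# One COPY statement that points at the manifest instead of a single prefix.
# Table name, manifest location, and role ARN are placeholders.
copy_sql = (
    "COPY sales_staging "
    "FROM 's3://example-data-lake/manifests/load.manifest' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole' "
    "FORMAT AS CSV "
    "MANIFEST;"
)
print(copy_sql)
```

Because a single COPY command sees all the files at once, the cluster can divide them across its slices rather than working through one location per command.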
-
Q95. A company plans to use Amazon Kinesis Data Firehose to store data in Amazon S3. The source data consists of 2 MB .csv files. The company must convert the .csv files to JSON format. The company must store the files in Apache Parquet format.
Which solution will meet these requirements with the LEAST development effort?
- A. Use Kinesis Data Firehose to convert the .csv files to JSON. Use an AWS Lambda function to store the files in Parquet format.
- B. Use Kinesis Data Firehose to convert the .csv files to JSON and to store the files in Parquet format.
- C. Use Kinesis Data Firehose to invoke an AWS Lambda function that transforms the .csv files to JSON and stores the files in Parquet format.
- D. Use Kinesis Data Firehose to invoke an AWS Lambda function that transforms the .csv files to JSON. Use Kinesis Data Firehose to store the files in Parquet format.
View question →
-
Q96. A company is using an AWS Transfer Family server to migrate data from an on-premises environment to AWS. Company policy mandates the use of TLS 1.2 or above to encrypt the data in transit.
Which solution will meet these requirements?
- A. Generate new SSH keys for the Transfer Family server. Make the old keys and the new keys available for use.
- B. Update the security group rules for the on-premises network to allow only connections that use TLS 1.2 or above.
- C. Update the security policy of the Transfer Family server to specify a minimum protocol version of TLS 1.2.
- D. Install an SSL certificate on the Transfer Family server to encrypt data transfers by using TLS 1.2.
View question →
-
Q97. A company wants to migrate an application and an on-premises Apache Kafka server to AWS. The application processes incremental updates that an on-premises Oracle database sends to the Kafka server. The company wants to use the replatform migration strategy instead of the refactor strategy.
Which solution will meet these requirements with the LEAST management overhead?
- A. Amazon Kinesis Data Streams
- B. Amazon Managed Streaming for Apache Kafka (Amazon MSK) provisioned cluster
- C. Amazon Kinesis Data Firehose
- D. Amazon Managed Streaming for Apache Kafka (Amazon MSK) Serverless
View question →
-
Q98. A data engineer is building an automated extract, transform, and load (ETL) ingestion pipeline by using AWS Glue. The pipeline ingests compressed files that are in an Amazon S3 bucket. The ingestion pipeline must support incremental data processing.
Which AWS Glue feature should the data engineer use to meet this requirement?
- A. Workflows
- B. Triggers
- C. Job bookmarks
- D. Classifiers
View question →
-
Q99. A banking company uses an application to collect large volumes of transactional data. The company uses Amazon Kinesis Data Streams for real-time analytics. The company’s application uses the PutRecord action to send data to Kinesis Data Streams.
A data engineer has observed network outages during certain times of day. The data engineer wants to configure exactly-once delivery for the entire processing pipeline.
Which solution will meet this requirement?
- A. Design the application so it can remove duplicates during processing by embedding a unique ID in each record at the source.
- B. Update the checkpoint configuration of the Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) data collection application to avoid duplicate processing of events.
- C. Design the data source so events are not ingested into Kinesis Data Streams multiple times.
- D. Stop using Kinesis Data Streams. Use Amazon EMR instead. Use Apache Flink and Apache Spark Streaming in Amazon EMR.
View question →
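Illustration for Q99: when a producer retries PutRecord after a network timeout, the same record can reach the stream twice. The sketch below shows one common way to cope with that, which is to attach a unique ID at the source and skip IDs that were already handled. The record layout, the class name, and the in-memory set are assumptions for the illustration. A real consumer would need a durable store for the seen IDs.

```python
import uuid


def make_record(payload: dict) -> dict:
    """Attach a unique ID at the source so duplicates can be recognised later."""
    return {"record_id": str(uuid.uuid4()), "payload": payload}


class DedupingProcessor:
    """Hypothetical consumer that applies each record at most once."""

    def __init__(self):
        self.seen_ids = set()  # in-memory only; a real system needs durable storage
        self.total = 0

    def process(self, record: dict) -> bool:
        """Apply the record and return True, or return False for a duplicate."""
        if record["record_id"] in self.seen_ids:
            return False
        self.seen_ids.add(record["record_id"])
        self.total += record["payload"]["amount"]
        return True


records = [make_record({"amount": amount}) for amount in (10, 25, 40)]

# Simulate a retry after a network outage: the second record arrives twice.
delivered = [records[0], records[1], records[1], records[2]]

processor = DedupingProcessor()
results = [processor.process(record) for record in delivered]

print(results)          # the duplicate delivery is reported as False
print(processor.total)  # the duplicate amount is not counted twice
```

The ID is generated once, before the first send, so a retry carries the same ID as the original attempt.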
-
Q100. A company stores logs in an Amazon S3 bucket. When a data engineer attempts to access several log files, the data engineer discovers that some files have been unintentionally deleted.
The data engineer needs a solution that will prevent unintentional file deletion in the future.
Which solution will meet this requirement with the LEAST operational overhead?
- A. Manually back up the S3 bucket on a regular basis.
- B. Enable S3 Versioning for the S3 bucket.
- C. Configure replication for the S3 bucket.
- D. Use an Amazon S3 Glacier storage class to archive the data that is in the S3 bucket.
View question →