Q62 — AWS DEA-C01 Ch.1
A company extracts approximately 1 TB of data every day from data sources such as SAP HANA, Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. Some of the data sources have undefined data schemas or data schemas that change. A data engineer must implement a solution that can detect the schema for these data sources. The solution must extract, transform, and load the data to an Amazon S3 bucket. The company has a service level agreement (SLA) to load the data into the S3 bucket within 15 minutes of data creation. Which solution will meet these requirements with the LEAST operational overhead?
- A. Use Amazon EMR to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark.
- B. Use AWS Glue to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark. ✓
- C. Create a PySpark program in AWS Lambda to extract, transform, and load the data into the S3 bucket.
- D. Create a stored procedure in Amazon Redshift to detect the schema and to extract, transform, and load the data into a Redshift Spectrum table. Access the table from Amazon S3.
Correct Answer: B. Use AWS Glue to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark.
Explanation
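AWS Glue is a serverless service built specifically for extract, transform, and load (ETL) work across many kinds of data sources. It can automatically detect schemas and schema changes in those sources, and it can load the data into the S3 bucket within the 15-minute SLA while keeping operational overhead comparatively low. Amazon EMR can provide the same functionality, but its clusters are more complex to configure and manage, so the operational overhead is higher. AWS Lambda is a poor fit for this data volume and for complex ETL flows, because it is constrained in processing speed and resource management. Amazon Redshift is primarily a data warehouse and is not the best choice for near-real-time processing of data from multiple sources and fast loading into an S3 bucket. Option B therefore meets the requirements with the least operational overhead.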
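The idea of detecting a schema for sources whose fields are undefined or keep changing can be pictured with a toy example. The following is a plain-Python sketch, not AWS Glue's implementation or API: the function names `infer_schema` and `conflicting_fields` and the sample records are hypothetical and exist only for illustration. It merges the fields seen across a batch of records and flags any field whose type varies between records, which is the kind of drift a schema-detecting ETL service has to tolerate.

```python
from collections import defaultdict


def infer_schema(records):
    """Map each field name to the sorted list of Python type names seen for it."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


def conflicting_fields(schema):
    """Return the fields that appeared with more than one type, sorted by name."""
    return sorted(field for field, types in schema.items() if len(types) > 1)


# Hypothetical records from a source whose schema drifts over time:
# a new "coupon" field appears, and two fields change type.
records = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": 2, "amount": "24.50", "coupon": "SPRING"},
    {"order_id": "3", "amount": 5.0},
]

schema = infer_schema(records)
drifted = conflicting_fields(schema)

print(schema)
print(drifted)
```

In Glue itself, none of this logic has to be written or operated by the data engineer, which is the point of the "least operational overhead" requirement. Fields that arrive with inconsistent types are typically handled there with DynamicFrame choice types and the `resolveChoice` transform rather than with hand-written inference like the sketch above.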