Q69 — AWS DEA-C01 Ch.1
An online retail company stores Application Load Balancer (ALB) access logs in an Amazon S3 bucket. The company wants to use Amazon Athena to query the logs to analyze traffic patterns. A data engineer creates an unpartitioned table in Athena. As the amount of the data gradually increases, the response time for queries also increases. The data engineer wants to improve the query performance in Athena. Which solution will meet these requirements with the LEAST operational effort?
- A. Create an AWS Glue job that determines the schema of all ALB access logs and writes the partition metadata to AWS Glue Data Catalog.
- B. Create an AWS Glue crawler that includes a classifier that determines the schema of all ALB access logs and writes the partition metadata to AWS Glue Data Catalog. ✓
- C. Create an AWS Lambda function to transform all ALB access logs. Save the results to Amazon S3 in Apache Parquet format. Partition the metadata. Use Athena to query the transformed data.
- D. Use Apache Hive to create bucketed tables. Use an AWS Lambda function to transform all ALB access logs.
Correct Answer: B. Create an AWS Glue crawler that includes a classifier that determines the schema of all ALB access logs and writes the partition metadata to AWS Glue Data Catalog.
Explanation
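To improve Athena query performance with the least operational effort, the data engineer should choose the most automated solution. Option B uses an AWS Glue crawler, which automatically discovers the schema and writes partition metadata to the AWS Glue Data Catalog. Athena can then use that partition information to skip irrelevant partitions and scan less data per query. The other options require more manual setup and coding: creating an AWS Glue job (option A), writing an AWS Lambda function and handling the data format and partitioning (option C), or using Apache Hive together with a Lambda function (option D). Option B therefore requires the least operational effort while still meeting the requirement to improve query performance.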
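As a rough illustration of option B, the sketch below builds the arguments for the boto3 Glue `create_crawler` call and, in a separate function, creates and starts the crawler. It is a minimal sketch under assumptions, not a reference setup: the crawler name, IAM role ARN, database name, S3 path, and the custom classifier name `alb-access-log-classifier` are all hypothetical placeholders, and the classifier is assumed to already exist in the account.

```python
from typing import Any, Dict, List, Optional


def build_crawler_request(
    name: str,
    role_arn: str,
    database: str,
    s3_path: str,
    classifiers: Optional[List[str]] = None,
    schedule: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the boto3 Glue `create_crawler` call."""
    request: Dict[str, Any] = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        # The crawler walks this prefix and registers the folders it finds
        # as partitions of the catalog table.
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }
    if classifiers:
        # Names of custom classifiers that must already exist in Glue.
        request["Classifiers"] = list(classifiers)
    if schedule:
        # Glue cron syntax; a schedule keeps new partitions registered.
        request["Schedule"] = schedule
    return request


def create_and_start_crawler(request: Dict[str, Any]) -> None:
    """Create the crawler and trigger a first run (needs AWS credentials)."""
    import boto3  # imported here so building the request needs no AWS SDK

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# All values below are hypothetical placeholders, not taken from the question.
EXAMPLE_REQUEST = build_crawler_request(
    name="alb-access-logs-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="alb_logs_db",
    s3_path="s3://example-alb-logs/AWSLogs/",
    classifiers=["alb-access-log-classifier"],
    schedule="cron(0 1 * * ? *)",  # once a day at 01:00 UTC
)
```

Calling `create_and_start_crawler(EXAMPLE_REQUEST)` would issue the actual API calls. Keeping the request builder separate from the AWS calls makes the configuration easy to inspect and test without credentials.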