Q58 — AWS DEA-C01 Ch.1
Question 58 of 100
A company receives a daily file that contains customer data in .xls format. The company stores the file in Amazon S3. The daily file is approximately 2 GB in size. A data engineer concatenates the column in the file that contains customer first names and the column that contains customer last names. The data engineer needs to determine the number of distinct customers in the file. Which solution will meet this requirement with the LEAST operational effort?
- A. Create and run an Apache Spark job in an AWS Glue notebook. Configure the job to read the S3 file and calculate the number of distinct customers.
- B. Create an AWS Glue crawler to create an AWS Glue Data Catalog of the S3 file. Run SQL queries from Amazon Athena to calculate the number of distinct customers.
- C. Create and run an Apache Spark job in Amazon EMR Serverless to calculate the number of distinct customers.
- D. Use AWS Glue DataBrew to create a recipe that uses the COUNT_DISTINCT aggregate function to calculate the number of distinct customers. ✓
Correct Answer: D. Use AWS Glue DataBrew to create a recipe that uses the COUNT_DISTINCT aggregate function to calculate the number of distinct customers.
Explanation
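In terms of operational simplicity and efficiency, AWS Glue DataBrew provides a user-friendly visual interface that lets a data engineer clean and prepare data through simple operations. In this scenario, the data engineer needs to combine two columns and count the number of distinct customers. With AWS Glue DataBrew, the engineer can create a recipe that uses the COUNT_DISTINCT aggregate function to compute the number of distinct customers directly, without writing complex code or configuring additional services. Option D therefore requires the least operational effort and is the best solution for this requirement.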
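To make the computation concrete, here is a minimal Python sketch of the logic the recipe expresses: concatenate first and last name, then count the distinct results. It is only an illustration and is not DataBrew code or a DataBrew API. The function name, the space separator, the whitespace trimming, and the sample rows are all assumptions made for the example.

```python
def count_distinct_customers(rows):
    """Count distinct full names from (first_name, last_name) pairs.

    Hypothetical helper: joins the two name columns with a single space
    (an assumed separator) and counts the unique resulting strings.
    """
    full_names = {f"{first.strip()} {last.strip()}" for first, last in rows}
    return len(full_names)


# Made-up sample rows standing in for the two name columns of the daily file.
rows = [
    ("Ana", "Silva"),
    ("Ben", "Okoro"),
    ("Ana", "Silva"),   # duplicate customer
    ("Ana", "Okoro"),
]

distinct = count_distinct_customers(rows)
print(distinct)  # prints 3
```

In DataBrew the same two steps are configured in the recipe rather than coded, which is where the reduction in operational effort comes from.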