Q24 — AWS DEA-C01 Ch.1

A company is migrating a legacy application to an Amazon S3-based data lake. A data engineer reviewed the data associated with the legacy application and found that it contains some duplicate information. The data engineer must identify and remove the duplicate information from the legacy application data. Which solution will meet these requirements with the LEAST operational overhead?

Correct Answer: B. Write an AWS Glue extract, transform, and load (ETL) job. Use the FindMatches machine learning (ML) transform to transform the data to perform data deduplication.

Explanation

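Option B is correct because AWS Glue is an ETL service built for the AWS environment, and it provides a built-in FindMatches ML transform that can be used directly for deduplication. This reduces the need to write and maintain custom scripts. FindMatches uses machine learning to identify duplicate records, including records that do not match exactly, so the operational overhead stays low. By contrast, the Pandas library or the dedupe library can also deduplicate data, but either approach is likely to require more development and maintenance work, especially in an AWS environment. Option B is therefore the solution that meets the requirements with the least operational overhead.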
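To make the explanation above more concrete, here is a minimal sketch of how a FindMatches transform might be defined through the Glue CreateMLTransform API. The helper `build_find_matches_request` and every value passed to it (transform name, database, table, role ARN, primary key column) are hypothetical placeholders, not values from the question. The code only builds the request; sending it requires boto3 and AWS credentials, as noted in the comments.

```python
def build_find_matches_request(name, database, table, role_arn, primary_key,
                               precision_recall=0.9):
    """Build keyword arguments for a Glue CreateMLTransform call.

    All argument values are supplied by the caller; nothing here
    contacts AWS.
    """
    if not 0.0 <= precision_recall <= 1.0:
        raise ValueError("precision_recall must be between 0.0 and 1.0")
    return {
        "Name": name,
        "Role": role_arn,
        # Data Catalog table that holds the legacy records.
        "InputRecordTables": [
            {"DatabaseName": database, "TableName": table}
        ],
        "Parameters": {
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                # Column that uniquely identifies each input row.
                "PrimaryKeyColumnName": primary_key,
                # Closer to 1.0 favors precision (fewer false matches).
                "PrecisionRecallTradeoff": precision_recall,
            },
        },
    }


# Hypothetical example values, for illustration only.
request = build_find_matches_request(
    name="legacy-dedup",
    database="legacy_db",
    table="customers",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    primary_key="record_id",
)

# With boto3 installed and credentials configured, the request
# could be sent like this (assumption: not executed here):
#   import boto3
#   boto3.client("glue").create_ml_transform(**request)
```

The transform still has to be taught with labeled example matches before an ETL job can apply it, but that work is done through Glue itself instead of through custom matching code that the team would have to maintain.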