Q27 — AWS DEA-C01 Ch.1


A company uses Amazon S3 to store semi-structured data in a transactional data lake. Some of the data files are small, but other data files are tens of terabytes. A data engineer must perform a change data capture (CDC) operation to identify changed data from the data source. The data source sends a full snapshot as a JSON file every day and ingests the changed data into the data lake. Which solution will capture the changed data MOST cost-effectively?

Correct Answer: C. Use an open source data lake format to merge the data source with the S3 data lake to insert the new data and update the existing data

Explanation

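Because some of the data files are small while others reach tens of terabytes, cost-effectiveness is the key factor in choosing a solution. Option A proposes using an AWS Lambda function to identify the data changes; although Lambda runs on demand, it can become expensive when processing large volumes of data. Options B and D both suggest first ingesting the data into a relational database (RDS or Aurora) and then using AWS DMS to capture the changed data. That approach can work for structured data, but for semi-structured data it is unlikely to be the most cost-effective choice. Option C suggests using an open source data lake format to merge the data source with the S3 data lake. This approach operates directly on the data lake and avoids the extra ingestion and transformation costs, so it is the most cost-effective choice. The answer is therefore C.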
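To make the "insert the new data and update the existing data" wording of option C concrete, here is a minimal pure-Python sketch of merge (upsert) semantics. It is an illustration only, not the API of any particular table format: open source formats such as Apache Hudi, Apache Iceberg, and Delta Lake provide their own merge operations that run at data lake scale. The function name `merge_snapshot`, the `id` key, and the sample records are all hypothetical.

```python
import json


def merge_snapshot(table, snapshot_records, key="id"):
    """Upsert snapshot records into `table` (a dict keyed by `key`).

    Returns (inserted_keys, updated_keys), which together are the
    changed data found by comparing the snapshot with the table.
    """
    inserted, updated = [], []
    for record in snapshot_records:
        k = record[key]
        if k not in table:
            # New key: insert the record.
            table[k] = record
            inserted.append(k)
        elif table[k] != record:
            # Existing key with different content: update in place.
            table[k] = record
            updated.append(k)
        # Unchanged records are skipped.
    return inserted, updated


# Hypothetical current state of the data lake table.
lake_table = {
    1: {"id": 1, "status": "open"},
    2: {"id": 2, "status": "open"},
}

# Hypothetical daily full snapshot delivered as JSON.
snapshot_json = (
    '[{"id": 1, "status": "open"},'
    ' {"id": 2, "status": "closed"},'
    ' {"id": 3, "status": "open"}]'
)

inserted, updated = merge_snapshot(lake_table, json.loads(snapshot_json))
print("inserted:", inserted)  # inserted: [3]
print("updated:", updated)    # updated: [2]
```

Only record 2 (changed) and record 3 (new) are written; record 1 is left untouched. That is the behavior the merge in option C relies on, without first loading the snapshot into a relational database.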