Q27 — AWS DEA-C01 Ch.1
A company uses Amazon S3 to store semi-structured data in a transactional data lake. Some of the data files are small, but other data files are tens of terabytes. A data engineer must perform a change data capture (CDC) operation to identify changed data from the data source. The data source sends a full snapshot as a JSON file every day and ingests the changed data into the data lake. Which solution will capture the changed data MOST cost-effectively?
- A. Create an AWS Lambda function to identify the changes between the previous data and the current data. Configure the Lambda function to ingest the changes into the data lake.
- B. Ingest the data into Amazon RDS for MySQL. Use AWS Database Migration Service (AWS DMS) to write the changed data to the data lake.
- C. Use an open source data lake format to merge the data source with the S3 data lake to insert the new data and update the existing data. ✓
- D. Ingest the data into an Amazon Aurora MySQL DB instance that runs Aurora Serverless. Use AWS Database Migration Service (AWS DMS) to write the changed data to the data lake.
Correct Answer: C. Use an open source data lake format to merge the data source with the S3 data lake to insert the new data and update the existing data.
Explanation
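Because some data files are small while others reach tens of terabytes, cost-effectiveness is the deciding factor in choosing a solution. Option A uses an AWS Lambda function to identify the changes. Lambda runs on demand, but it becomes expensive at large data volumes, and its 15-minute execution limit makes it a poor fit for files of tens of terabytes. Options B and D both ingest the data into a relational database first (RDS or Aurora) and then use AWS DMS to capture the changed data. That approach can work for structured data, but for semi-structured data it is unlikely to be the most cost-effective choice, because it adds a database layer plus extra ingestion and transformation steps. Option C uses an open source data lake format to merge the data source with the S3 data lake. The merge operates directly on the data lake and avoids the additional ingestion and transformation cost, so it is the most cost-effective choice. The answer is therefore C.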
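The merge in option C can be illustrated with a minimal sketch in plain Python. It only mimics the insert-or-update (upsert) semantics that open source table formats such as Apache Hudi, Apache Iceberg, and Delta Lake expose as a merge operation; it is not how any of those formats is implemented. The `id` key, the `status` field, and the `merge_snapshot` helper are hypothetical names chosen for illustration.

```python
import json


def merge_snapshot(lake_rows, snapshot_rows, key="id"):
    """Upsert snapshot rows into the lake rows.

    Returns (merged_rows, inserted_keys, updated_keys).
    The `key` column name is an assumption made for this sketch.
    """
    # Index the existing lake rows by their key for constant-time lookup.
    merged = {row[key]: row for row in lake_rows}
    inserted, updated = [], []
    for row in snapshot_rows:
        k = row[key]
        if k not in merged:
            # Key not present in the lake: insert the new row.
            inserted.append(k)
            merged[k] = row
        elif merged[k] != row:
            # Key present but content differs: update the existing row.
            updated.append(k)
            merged[k] = row
        # Identical rows are left untouched, so nothing is rewritten.
    return list(merged.values()), inserted, updated


# Hypothetical current state of the data lake table.
lake = [{"id": 1, "status": "new"}, {"id": 2, "status": "shipped"}]

# Hypothetical daily full snapshot, delivered as JSON.
snapshot = json.loads(
    '[{"id": 1, "status": "new"},'
    ' {"id": 2, "status": "delivered"},'
    ' {"id": 3, "status": "new"}]'
)

merged, inserted, updated = merge_snapshot(lake, snapshot)
print("inserted:", inserted)  # inserted: [3]
print("updated:", updated)    # updated: [2]
```

Only the rows that differ from the lake's current state are written, which is why no separate database or custom comparison service is needed to capture the changes. A real table format would run the same merge as a distributed job over the files in S3 rather than in memory.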