Q99 — AWS DEA-C01 Ch.1
Question 99 of 100
A banking company uses an application to collect large volumes of transactional data. The company uses Amazon Kinesis Data Streams for real-time analytics. The company’s application uses the PutRecord action to send data to Kinesis Data Streams. A data engineer has observed network outages during certain times of day. The data engineer wants to configure exactly-once delivery for the entire processing pipeline. Which solution will meet this requirement?
- A. Design the application so it can remove duplicates during processing by embedding a unique ID in each record at the source. ✓
- B. Update the checkpoint configuration of the Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) data collection application to avoid duplicate processing of events.
- C. Design the data source so events are not ingested into Kinesis Data Streams multiple times.
- D. Stop using Kinesis Data Streams. Use Amazon EMR instead. Use Apache Flink and Apache Spark Streaming in Amazon EMR.
Correct Answer: A. Design the application so it can remove duplicates during processing by embedding a unique ID in each record at the source.
Explanation
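To achieve exactly-once delivery across the entire processing pipeline, duplicates must be handled during data processing. Option A embeds a unique ID in each record at the source and removes duplicates during processing. When a network outage causes the application to retry PutRecord and the same record is sent more than once, the consumer can recognize the repeated ID and discard the extra copies, so each record takes effect only once. Option B, updating the checkpoint configuration, may help with state recovery but does not directly address duplicate records. Option C proposes preventing events from being ingested into Kinesis Data Streams multiple times, but this is not feasible because the duplicates arise during transmission and processing, not at the data source. Option D switches to a different technology stack without directly addressing exactly-once delivery. Option A is therefore the most suitable solution.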
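The deduplication idea in option A can be illustrated with a minimal Python sketch. It is not tied to the Kinesis API: `make_record`, `deduplicate`, and the `record_id` field are hypothetical names chosen for illustration, and the in-memory `set` stands in for whatever durable store a real consumer would use to remember IDs.

```python
import uuid
from typing import Any, Dict, Iterable, List, Optional, Set


def make_record(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Producer side: assign the unique ID once, before the first send attempt.

    A retry must resend this same dict so the ID stays identical.
    """
    return {"record_id": str(uuid.uuid4()), "payload": payload}


def deduplicate(
    records: Iterable[Dict[str, Any]],
    seen: Optional[Set[str]] = None,
) -> List[Dict[str, Any]]:
    """Consumer side: keep only the first record observed for each ID."""
    if seen is None:
        seen = set()  # assumption: in-memory only, lost on restart
    unique = []
    for record in records:
        record_id = record["record_id"]
        if record_id in seen:
            continue  # duplicate caused by a producer retry, skip it
        seen.add(record_id)
        unique.append(record)
    return unique


if __name__ == "__main__":
    deposit = make_record({"account": "A-1", "amount": 100})
    withdrawal = make_record({"account": "A-1", "amount": -40})
    # Simulate a retry after a network outage: the deposit arrives twice.
    received = [deposit, deposit, withdrawal]
    for record in deduplicate(received):
        print(record["payload"])
```

Passing the same `seen` set across batches lets the consumer reject a duplicate that arrives in a later batch. A production consumer would likely need to persist the seen IDs, since an in-memory set does not survive a restart.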