Q64 — AWS AIF-C01 Ch.2
Question 64 of 100
A marketing company uses a large language model (LLM). The company wants to evaluate how the LLM’s response quality changes when minor perturbations are applied to the input in a question-answering task. Which metric should the company use?
- A. Root Mean Square Error (RMSE)
- B. Area Under the ROC Curve (AUC)
- C. F1 Score
- D. Semantic Robustness ✓
Correct Answer: D. Semantic Robustness
Explanation
Semantic Robustness (D) is the metric that fits. It measures how much an LLM's output quality changes when the input receives small, meaning-preserving perturbations such as typos, case changes, or extra whitespace.
- A. RMSE measures the error of numeric predictions in regression tasks and does not apply to generated text answers.
- B. AUC summarizes how well a binary classifier separates two classes across decision thresholds and does not apply to generative QA.
- C. F1 Score combines precision and recall for classification. A token-level variant can score a single answer against a reference, but by itself it does not capture how quality changes when the input is perturbed.
- D. Semantic Robustness compares the model's output on the original input with its output on perturbed versions of that input. A small change indicates that meaning and correctness are preserved under input variation.
Because the question asks specifically about response quality under minor input perturbations, D is the correct choice.
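To make the idea concrete, here is a minimal sketch of a robustness check for a QA task. It is an illustration under stated assumptions, not the scoring method of any particular AWS service: the perturbations, the token-level F1 helper, and the two toy models (`brittle_model`, `robust_model`) are all hypothetical stand-ins for a real LLM call and a real evaluation set.

```python
from collections import Counter
from typing import Callable, List

# Toy evaluation data (hypothetical, for illustration only).
QUESTION = "What is the capital of France?"
REFERENCE_ANSWER = "Paris"


def perturb(question: str) -> List[str]:
    """Return a few meaning-preserving variants (assumes len(question) >= 2)."""
    return [
        question.upper(),                          # case change
        question.replace(" ", "  "),               # extra whitespace
        question[1] + question[0] + question[2:],  # typo: swap first two chars
    ]


def token_f1(prediction: str, reference: str) -> float:
    """Word-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def robustness_delta(model: Callable[[str], str], question: str, reference: str) -> float:
    """Mean absolute change in answer quality across perturbed inputs.

    0.0 means quality did not move; larger values mean a less robust model.
    """
    base = token_f1(model(question), reference)
    deltas = [abs(base - token_f1(model(p), reference)) for p in perturb(question)]
    return sum(deltas) / len(deltas)


def brittle_model(question: str) -> str:
    """Hypothetical model that only answers the exact original wording."""
    return REFERENCE_ANSWER if question == QUESTION else "I do not know"


def robust_model(question: str) -> str:
    """Hypothetical model that normalizes case and whitespace before answering."""
    normalized = " ".join(question.lower().split())
    return REFERENCE_ANSWER if "france" in normalized else "I do not know"


if __name__ == "__main__":
    print("brittle:", robustness_delta(brittle_model, QUESTION, REFERENCE_ANSWER))
    print("robust: ", robustness_delta(robust_model, QUESTION, REFERENCE_ANSWER))
```

Both toy models answer the unperturbed question correctly, so an accuracy-style metric alone would rate them the same. Only the delta across perturbed inputs separates them, which is the property the question is asking about.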