Q82 — AWS AIF-C01 Ch.3
Question 82 of 100
A company is developing an ML model to generate natural-language responses for a customer service chatbot. It needs to evaluate how similar the model’s generated responses are to subject-matter expert (SME) responses. The company has a dataset of SME-validated question-answer pairs. Which metric should the company use to evaluate model performance?
- A. BERTScore ✓
- B. Mean Squared Error (MSE)
- C. Perplexity
- D. F1 Score
Correct Answer: A. BERTScore
Explanation
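Evaluating how semantically similar generated responses are to SME-authored responses requires a metric designed for comparing natural-language text. The other options measure something else: MSE measures error in numeric regression, perplexity measures how well a language model predicts text (a proxy for fluency) without comparing against a reference answer, and F1 score measures classification performance. BERTScore matches each token in the generated response to the most similar token in the reference answer, using cosine similarity between contextual embeddings from BERT, and aggregates the matches into precision, recall, and F1 values. Because it compares meaning rather than exact wording, it correlates well with human judgments of response quality. It is widely used to evaluate generated text against reference answers, which makes it a good fit for this SME-grounded evaluation scenario.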
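The token-matching idea behind BERTScore can be illustrated with a minimal sketch. This is not the reference implementation: the two-dimensional vectors below are made-up stand-ins for contextual embeddings, the function name `greedy_match_score` is hypothetical, and refinements such as IDF weighting and baseline rescaling are left out. In practice the vectors would come from a pretrained BERT-style model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_score(candidate, reference):
    """Return (precision, recall, f1) from greedy token matching.

    candidate, reference: lists of token embedding vectors
    (hypothetical inputs; assumed non-empty and non-zero).
    """
    # Precision: each generated token is matched to its closest reference token.
    precision = sum(
        max(cosine(c, r) for r in reference) for c in candidate
    ) / len(candidate)
    # Recall: each reference token is matched to its closest generated token.
    recall = sum(
        max(cosine(r, c) for c in candidate) for r in reference
    ) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy embeddings (assumed values, for illustration only).
reference = [[1.0, 0.0], [0.0, 1.0]]   # SME answer: two distinct tokens
candidate = [[1.0, 0.0], [1.0, 0.0]]   # model answer: covers only the first token

precision, recall, f1 = greedy_match_score(candidate, reference)
print(precision, recall, round(f1, 3))  # precision 1.0, recall 0.5, F1 about 0.667
```

In this toy case every generated token has a perfect match in the reference, so precision is 1.0, but the second reference token is never covered, so recall drops to 0.5. That is the behavior the question relies on: the score reflects how much of the SME answer's meaning the generated response captures.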