Kallirroi Georgila, Alan W. Black, Kenji Sagae, David Traum: “Practical Evaluation of Human and Synthesized Speech for Virtual Human Dialogue Systems”

May 25, 2012 | Istanbul, Turkey

Speakers: Kallirroi Georgila, Alan W. Black, Kenji Sagae, David Traum
Host: 8th International Conference on Language Resources and Evaluation (LREC 2012)

The current practice in virtual human dialogue systems is to use professional human recordings or limited-domain speech synthesis. Both approaches achieve good performance, but at high cost. To determine the best trade-off between performance and cost, we perform a systematic evaluation of human and synthesized voices with respect to naturalness, conversational quality, and likability. We also vary the type (in-domain vs. out-of-domain), length, and content of utterances, and take into account the age and native language of raters as well as their familiarity with speech synthesis. We present detailed results from two studies: a pilot study and a larger study run on Amazon’s Mechanical Turk. Our results suggest that a professional human voice can outperform both an amateur human voice and synthesized voices, and that a high-quality general-purpose synthetic voice or a good limited-domain voice can outperform amateur human recordings. We find no significant differences between the performance of a high-quality general-purpose voice and a limited-domain voice, both trained on speech recorded by actors. As expected, in most cases the high-quality general-purpose voice is rated higher than the limited-domain voice on out-of-domain sentences and lower on in-domain sentences. There is also a trend, though not statistically significant, for long or negative-content utterances to receive lower ratings.