Testing Sentence-in-Noise Recognition With Synthetic Speech and Automatic Speech Recognition.

Bibliographic Details
Title:	Testing Sentence-in-Noise Recognition With Synthetic Speech and Automatic Speech Recognition.
Authors:	Calandruccio, Lauren¹ lauren.calandruccio@case.edu, Weidman, Dani², Leatherwood, Aja¹, Buss, Emily³
Source:	Journal of Speech, Language & Hearing Research. Dec 2025, Vol. 68, Issue 12, p6114-6128. 15p.
Subject Terms:	Auditory perception testing, Data analysis, Artificial intelligence, Speech perception, Auditory perception, Comparative studies, Automatic speech recognition, Noise, Research funding, Statistical sampling, Descriptive statistics, Statistics, Acoustic stimulation
Abstract:	Purpose: Characterizing speech-in-noise recognition is fundamental to both clinical audiology and hearing research. Current methods rely on human speech recordings and human testers. However, modern artificial intelligence tools could automate both stimulus generation and scoring. This report evaluated masked-sentence recognition with synthetic and human speech productions and human and machine scoring methods. Methods: Participants were young adults with normal hearing who were native speakers of the test language (English). Participants completed a speech-in-noise recognition task for open-set sentences at -6 dB signal-to-noise ratio for 10 different target talkers (five human and five synthetic). Automatic speech recognition was used in addition to human scoring to determine listener performance. Participants also provided perceptual ratings using a Likert rating scale to determine if they could identify which talkers were human and which were synthetic. Results: Speech recognition scores varied across the 10 talkers, with a trend for greater intelligibility for synthetic than human talkers and greater variability across human than synthetic talkers. However, the pattern of individual differences in recognition across participants was similar for human and synthetic speech. Agreement between scores produced by human testers and automatic speech recognition was high (~98% agreement). Perceptual ratings indicate that some synthetic talkers sounded more human than others, but ratings did not predict recognition accuracy. Conclusions: Speech-in-noise recognition varied for different human and synthetic talkers, with some indication of greater consistency in intelligibility for synthetic speech. This variability did not seem to be related to perceived human likeness. Human scoring was more accurate than automatic machine scoring for open-set sentences, but results were in close agreement for both methods. These results provide tentative support for the use of synthetic speech and machine scoring when evaluating masked-sentence recognition. [ABSTRACT FROM AUTHOR]
Database:	Education Research Complete

ISSN:	10924388
DOI:	10.1044/2025_JSLHR-24-00893