Speech recognition in adverse conditions: synthetic transformations and real environmental noise.
| Title: | Speech recognition in adverse conditions: synthetic transformations and real environmental noise. |
|---|---|
| Authors: | Katkov, Sergei (sergei.katkov@student.unibz.it); Liotta, Antonio (antonio.liotta@unibz.it); Vietti, Alessandro (alessandro.vietti@unibz.it) |
| Source: | EURASIP Journal on Audio Speech & Music Processing. 4/10/2026, Vol. 2026 Issue 1, p1-26. 26p. |
| Subjects: | Automatic speech recognition, Noise pollution, Data augmentation, Speech perception, Robust statistics |
| Abstract: | While automatic speech recognition (ASR) systems have achieved impressive performance under clean conditions, their reliability in acoustically challenging environments remains an open concern, particularly in multilingual settings. This work presents a comparative evaluation of several modern ASR models, including Whisper variants, QuartzNet, and Conformer-based architectures, under a range of controlled synthetic transformations and real-world environmental noises. Using the Common Voice 17.0 dataset in English, Italian, and German, we assess recognition robustness under additive white noise, pitch shifts, time-stretching, and ecologically valid background recordings (office, cafe, traffic) from the DEMAND dataset. Word error rate (WER) is computed across a spectrum of signal-to-noise ratios, with confidence intervals derived via bootstrap resampling to estimate variability. Unlike many studies that evaluate complete speech pipelines (enhancement front-ends followed by ASR or task-specific fine-tuning), we deliberately focus on off-the-shelf pretrained models without additional front-end processing or adaptation. This design isolates the intrinsic robustness of the ASR architectures themselves and provides a clean baseline against which future enhancement or fine-tuning strategies can be compared. To support the interpretation of extreme-noise regimes, we additionally incorporate a perceptually motivated glimpse proportion analysis, which quantifies the amount of locally audible speech under different noise types and signal-to-noise ratios. This auxiliary analysis is used to contextualize recognition failures in terms of acoustic masking rather than model performance alone. Finally, we include a limited supervised fine-tuning study on English speech for a subset of models, not as a primary contribution, but to illustrate how standard adaptation shifts robustness trends relative to the inference-only baseline. Our analysis highlights model- and language-specific response patterns to distortion: larger models are generally more robust, yet they remain susceptible to temporal and spectral changes. Notably, models showed higher stability on Italian and German, which we hypothesize may be due to more regular phoneme-to-grapheme mappings in these languages. The findings provide actionable insights into failure modes under distortion, informing the design of more robust ASR systems for deployment in diverse auditory scenarios. [ABSTRACT FROM AUTHOR] |
| Database: | Engineering Source |
| ISSN: | 1687-4714 |
| DOI: | 10.1186/s13636-026-00458-1 |
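
The abstract mentions two measurement steps that a short example may make more concrete: mixing noise into speech at a chosen signal-to-noise ratio, and reporting WER with bootstrap confidence intervals. The sketch below is illustrative only and is not the authors' code. The function names (`mix_at_snr`, `word_errors`, `corpus_wer`, `bootstrap_wer_ci`) are hypothetical, and the choice of a percentile bootstrap over utterances with corpus-level WER is an assumption, since this record does not specify either.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then add it."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # The gain g solves 10*log10(p_speech / (g^2 * p_noise)) = snr_db.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs):
    """Corpus-level WER: total word errors divided by total reference words."""
    counts = [word_errors(r, h) for r, h in pairs]
    return sum(e for e, _ in counts) / sum(n for _, n in counts)


def bootstrap_wer_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for corpus WER (assumed scheme).

    Utterances are resampled with replacement; references must be non-empty.
    """
    rng = random.Random(seed)
    counts = [word_errors(r, h) for r, h in pairs]
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(counts, k=len(counts))
        stats.append(sum(e for e, _ in sample) / sum(n for _, n in sample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

In use, one would call `mix_at_snr` on each utterance for every SNR level of interest, transcribe the result with the model under test, and pass the (reference, hypothesis) pairs to `corpus_wer` and `bootstrap_wer_ci`. Resampling whole utterances rather than individual words keeps the errors within an utterance together, which is one common choice but not the only one.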