P55
Session 1 (Thursday 9 January 2025, 15:25-17:30)
Performance limitations of English-trained speech enhancement models on French phoneme categories
Background: Understanding speech in noisy environments is a challenge for many, especially individuals with hearing loss who rely on clear signals to comprehend spoken language. Speech enhancement (SE) technologies have become a key solution in hearing aids, aiming to reduce background noise and improve the intelligibility of speech. However, SE models trained on one language may not generalize effectively to others, as language-specific phonetic characteristics influence the model's performance.
Rationale: This study explores how SE models trained on English handle interference, artifacts, and distortions in both English and French, shedding light on potential limitations when applying language-specific SE models across different languages.
Method: We applied three SE models to speech under varied noise conditions, using 1000 audio samples from LibriSpeech for English and FHarvard for French. FHarvard is a balanced dataset of phrases recorded by one male and one female native speaker, chosen for its phonetic diversity. We also included the French LibriSpeech corpus, selected for its rich recordings, diverse speakers, and substantial recording hours.
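The abstract does not specify the noise types or signal-to-noise ratios used. As a minimal sketch of how a "varied noise conditions" test set is commonly built, the following Python mixes a clean utterance with a noise recording at a chosen SNR. The function name `mix_at_snr` and its parameters are hypothetical and are not taken from the study.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix speech with noise at a target SNR in dB (illustrative, not the study's pipeline).

    Returns the noisy mixture and the scaled noise that was added.
    """
    speech = np.asarray(speech, dtype=float)
    # Loop or trim the noise so it has the same length as the speech.
    noise = np.resize(np.asarray(noise, dtype=float), speech.shape)

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Choose the gain so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise
```

Keeping the scaled noise alongside the mixture is a deliberate choice here: reference-based measures of interference need the noise component as well as the clean speech.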
Results: At the utterance level, English showed slightly higher initial interference but improved more through enhancement, while the two languages performed similarly in artifact control and overall distortion reduction. At the phoneme level, interference in the original signal was similar for English and French across most vowels. However, English phonemes, particularly nasals, plosives, and fricatives, were more susceptible to noise interference, while French phonemes showed more interference in specific consonant categories, such as laterals and affricates. After enhancement, French and English phonemes showed similar levels of interference for the sibilant, open-mid, and open categories. French phonemes were notably more affected in the plosive, approximant, and close-mid categories, with a particularly pronounced difference for affricate, lateral, and close sounds, whereas English phonemes exhibited higher residual interference in nasal and fricative sounds. Overall, the English-trained SE models reduced interference more effectively in English than in French across most phoneme categories, with minimal differences for sibilants and fricatives. The models introduced similar levels of artifacts in both languages, except for increased artifacts on French laterals and English affricates. Distortion was also comparable between languages, except for higher levels in French laterals.
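The abstract does not define how interference, artifacts, and distortion were measured. One widely used family of measures is the BSS Eval decomposition, which yields a signal-to-interference ratio (SIR), a signal-to-artifacts ratio (SAR), and a signal-to-distortion ratio (SDR). The sketch below is a simplified, gain-only version of that decomposition, assuming the clean speech and the added noise are both available as references. Whether the study used these exact measures is an assumption, and all function names are hypothetical.

```python
import numpy as np

def decompose(estimate, clean, noise):
    """Split an enhanced signal into target, interference, and artifact parts.

    Simplified BSS Eval-style decomposition with scalar gains only
    (no time-invariant filters); illustrative, not the study's method.
    """
    # Projection of the estimate onto the clean speech alone.
    s_target = (estimate @ clean) / (clean @ clean) * clean

    # Least-squares projection onto the span of clean speech and noise.
    basis = np.stack([clean, noise], axis=1)
    coeffs, *_ = np.linalg.lstsq(basis, estimate, rcond=None)
    proj_both = basis @ coeffs

    e_interf = proj_both - s_target   # part explained by the noise
    e_artif = estimate - proj_both    # part explained by neither reference
    return s_target, e_interf, e_artif

def ratio_db(num, den, eps=1e-12):
    """Energy ratio in dB; eps guards against division by zero."""
    return 10 * np.log10((np.sum(num ** 2) + eps) / (np.sum(den ** 2) + eps))

def sdr_sir_sar(estimate, clean, noise):
    """Return (SDR, SIR, SAR) in dB; higher values mean less degradation."""
    s_target, e_interf, e_artif = decompose(estimate, clean, noise)
    sdr = ratio_db(s_target, e_interf + e_artif)
    sir = ratio_db(s_target, e_interf)
    sar = ratio_db(s_target + e_interf, e_artif)
    return sdr, sir, sar
```

Under the same assumptions, a phoneme-level score could be obtained by applying `sdr_sir_sar` to the sample range of each phoneme segment and averaging within each phoneme category.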
Conclusion: This study reveals language-specific limitations in English-trained SE models, as they reduce interference more effectively for English than French, with notable differences in phoneme category performance, highlighting the need for tailored SE solutions across languages.