P05
Session 1 (Thursday 9 January 2025, 15:25-17:30)
Are you interested in making a monaural speech intelligibility model binaural?
When a target speech source is spatially separated from competing sound sources, binaural hearing improves target intelligibility relative to situations with co-located sources or monaural listening. This spatial release from masking (SRM) cannot be predicted by monaural speech intelligibility models. We present a binaural front-end that can be combined with monaural models to account for SRM. The front-end is based on the MBSTOI metric and is implemented in the time domain: from the noisy speech signals at the two ears, it produces a binaurally enhanced monaural signal that can then be evaluated by monaural intelligibility models.
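To make the quantity being predicted concrete, SRM is commonly expressed as the difference, in dB, between the speech reception threshold (SRT) measured with co-located sources and the SRT measured with separated sources. The sketch below is only an illustration of that definition: the 50% threshold criterion, the linear interpolation, the function name srt_from_scores, and all scores are assumptions made up for the example, not data or methods from this study.

```python
def srt_from_scores(snrs_db, percent_correct, target=50.0):
    """Linearly interpolate the SNR (dB) at which the score crosses `target` percent.

    Assumes snrs_db is sorted in increasing order and scores rise with SNR.
    """
    points = list(zip(snrs_db, percent_correct))
    for (snr_lo, pc_lo), (snr_hi, pc_hi) in zip(points, points[1:]):
        if pc_lo <= target <= pc_hi and pc_hi != pc_lo:
            frac = (target - pc_lo) / (pc_hi - pc_lo)
            return snr_lo + frac * (snr_hi - snr_lo)
    raise ValueError("scores never cross the target level")


# Made-up psychometric data (illustrative only, not from the datasets).
snrs = [-15, -12, -9, -6, -3, 0]
colocated_scores = [5, 15, 40, 70, 90, 98]      # masker at the target azimuth
separated_scores = [30, 60, 85, 95, 99, 100]    # masker moved away from the target

srt_colocated = srt_from_scores(snrs, colocated_scores)
srt_separated = srt_from_scores(snrs, separated_scores)

# A positive SRM means separation lowered the SNR needed for 50% correct.
srm_db = srt_colocated - srt_separated
print(f"SRT co-located: {srt_colocated:.1f} dB, "
      f"SRT separated: {srt_separated:.1f} dB, SRM: {srm_db:.1f} dB")
```

A purely monaural model fed the same signal in both conditions would yield the same SRT for each, and therefore an SRM of zero; this is the gap the front-end is meant to close.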
The front-end was tested here in combination with the monaural model HASPI (Hearing Aid Speech Perception Index), which predicts intelligibility for speech degraded by additive noise, reverberation, spectral changes, and nonlinear distortion. HASPI compares the degraded noisy speech signal to a clean speech reference, and accounts for hearing impairment by incorporating a model of the auditory periphery that can represent both impaired and normal hearing.
The model predictions, with and without the binaural front-end, were compared to intelligibility scores from three datasets, all involving headphone reproduction, normal-hearing listeners, anechoic conditions, and a frontal (Danish) speech source. In dataset 1, a single stationary speech-shaped noise (SSN) was tested at ten azimuths around the listener and at six signal-to-noise ratios (SNRs). In dataset 2, three azimuths and six SNRs were tested for an SSN or a non-stationary noise, with or without ideal binary mask (IBM) processing, which simulates noise reduction in hearing aids. In dataset 3, the competing sound was obtained by mixing, in seven different proportions, the ear signals from three sources: an SSN simulated at 0-degree azimuth (co-located with the target speech), a diffuse noise coming from all directions, and an SSN simulated at 115-degree azimuth. This noise mixture was tested at eight SNRs.
Because HASPI was originally developed to predict percent correct for English sentences at positive SNRs, the model back-end was re-fitted to predict percent correct for Danish words measured at negative SNRs. This new fitting used only the co-located conditions of the three datasets, in which no SRM or other binaural effects are involved.
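The re-fitting step can be pictured as calibrating a mapping from a model index to percent correct on the co-located conditions only. The abstract does not state the form of the re-fitted back-end, so the sketch below assumes a simple logistic mapping fitted by least squares on log-odds; the function names, the index values, and the scores are all hypothetical and synthetic.

```python
import math


def logit(p, eps=1e-3):
    """Log-odds of a proportion, clipped away from 0 and 1 to stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def fit_logistic(index_values, proportions_correct):
    """Fit logit(proportion) = slope * index + intercept by ordinary least squares."""
    n = len(index_values)
    ys = [logit(p) for p in proportions_correct]
    mean_x = sum(index_values) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in index_values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(index_values, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_percent_correct(index_value, slope, intercept):
    """Map a model index to predicted percent correct with the fitted logistic."""
    return 100.0 / (1.0 + math.exp(-(slope * index_value + intercept)))


# Synthetic calibration set standing in for the co-located conditions:
# scores are generated from a known logistic so the fit can be checked.
true_slope, true_intercept = 8.0, -3.0
indices = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
scores = [1.0 / (1.0 + math.exp(-(true_slope * x + true_intercept))) for x in indices]

slope, intercept = fit_logistic(indices, scores)
midpoint_prediction = predict_percent_correct(0.375, slope, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, "
      f"prediction at index 0.375: {midpoint_prediction:.1f}%")
```

Restricting the calibration data to co-located conditions, as described above, keeps the mapping free of any binaural benefit, so that predictions for separated conditions test the front-end rather than the fit.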