P56  Session 2 (Friday 10 January 2025, 09:30-11:30)
What you see is what you hear: Exposure to congruent social cues improves speech intelligibility under adverse listening conditions
Background: Social information can affect speech processing. Studies have shown that manipulating listeners' expectations about a speaker's regional accent, or simply exposing them to culturally iconic objects (e.g., stuffed toys [1]), can significantly bias phoneme categorization. These findings support exemplar-based models of speech perception, in which non-acoustic information plays a crucial role in how speech is understood and evaluated. However, prior research has primarily examined social priming effects at the phoneme level, and the findings have yet to be replicated in British populations. In this study, we took a different approach and investigated how visual geographical cues depicting London or Glasgow influence the intelligibility of natural sentences produced in two accents: Standard Southern British English (SSBE) and Glaswegian English (GE).
Method: Fifty-three listeners aged 18-50 were recruited via Prolific and completed the experiment online. They transcribed 108 IEEE sentences at three signal-to-noise ratios (+3 dB, 0 dB, -3 dB), produced in SSBE and GE accents (two male speakers per accent). In each accent and noise condition, one-third of the sentences were presented with visual cues congruent with the accent region and one-third with incongruent visual cues. The remaining sentences were paired with blank silhouettes that provided no social information, serving as a baseline condition. Of the participants, 24 were SSBE speakers from southern England with minimal exposure to Scottish English, while 29 were Scottish residents familiar with both accents.
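The abstract does not describe how the stimuli were mixed; the sketch below is only a minimal illustration of setting a stimulus to one of the three target SNRs (+3, 0, -3 dB) by power-scaling the noise. The function name and signal arrays are hypothetical, not the authors' code.

    import numpy as np

    def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        # Hypothetical helper: scale the noise so that the ratio of speech
        # power to noise power equals the target SNR, then mix the two.
        noise = noise[:len(speech)]            # trim noise to the speech length
        p_speech = np.mean(speech ** 2)        # mean speech power
        p_noise = np.mean(noise ** 2)          # mean noise power
        # 10 * log10(p_speech / (gain**2 * p_noise)) = snr_db, solved for gain:
        gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
        return speech + gain * noise

    # The three noise conditions used in the study:
    # for snr in (3.0, 0.0, -3.0):
    #     stimulus = mix_at_snr(speech, noise, snr)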
Results: Transcription accuracy was operationalized as Token Sort Ratio and modeled using a linear mixed model in R. Unsurprisingly, performance decreased as the noise level increased (χ²(2) = 39.26, p < 0.001). GE listeners achieved higher transcription scores than SSBE listeners overall (χ²(1) = 17.38, p < 0.001; z = 4.17, p < 0.001): GE listeners performed equally well with SSBE and GE sentences at each noise level (z = -0.61, p = 0.54), while SSBE listeners performed more poorly with GE sentences (z = -5.12, p < 0.001). Notably, listeners from both accent backgrounds showed higher transcription accuracy when the visual cues matched the accent region of the auditory stimuli than when they mismatched (χ²(2) = 3.94, p = 0.02). Overall, the results suggest that socially meaningful cues can influence speech-in-noise recognition through a largely automatic process: exposure to the visual cues activated corresponding regional concepts, which interacted with the processing of phonetic variation.
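Token Sort Ratio scores a transcription against the target sentence after alphabetically sorting the words, so word-order differences are not penalized. As a minimal sketch, assuming the rapidfuzz Python implementation of the metric (the authors' scoring pipeline is not given in the abstract, and the IEEE sentence is shown only as an example):

    from rapidfuzz import fuzz

    # Token Sort Ratio: tokenize both strings, sort the tokens alphabetically,
    # rejoin them, and return a normalized edit-distance similarity in 0-100.
    target = "the birch canoe slid on the smooth planks"    # IEEE sentence
    response = "the birch canoe slid on smooth planks"      # listener's transcript
    score = fuzz.token_sort_ratio(target.lower(), response.lower())
    print(score)  # close to 100 despite the omitted word

The χ² statistics with degrees of freedom reported above would be consistent with likelihood-ratio tests comparing nested mixed models (a common approach with lme4 in R), though the abstract does not specify the comparison procedure.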
Reference: [1] Hay, J., & Drager, K. (2010). Stuffed toys and speech perception. Linguistics, 48(4), 865–892.