Manual McGurk effect

Do you hear VOORnaam or voorNAAM?

In the video below, you’ll see a (randomly selected…) talker say a Dutch word. Is he saying VOORnaam (Eng. “first name”, with stress on the first syllable VOOR-) or voorNAAM (Eng. “respectable”, with stress on the second syllable -NAAM)?

In other words: where do you hear the stress?

Spoiler: video specs

  • Audio: ambiguous; midway between VOORnaam and voorNAAM
  • Lips: head taken from a recording of VOORnaam
  • Hands: beat gesture aligned to the first syllable

Now play the video below. Where do you hear the stress now?

Spoiler: video specs

  • Audio: ambiguous; midway between VOORnaam and voorNAAM
  • Lips: head taken from a recording of VOORnaam
  • Hands: beat gesture aligned to the second syllable

Explanation

The audio in these two videos is identical: it has been manipulated to be ambiguous, falling roughly midway between VOORnaam and voorNAAM. The talker’s head is also the same in both: it has been copy-pasted from a video recording of the talker saying VOORnaam. The only difference between the two videos is the timing of the hand gesture. In the first video, the talker produces a beat gesture on the first syllable, while in the second video the gesture falls on the second syllable. Our experiments show that this slight change in timing has major consequences for perception. When we ask a group of Dutch participants to indicate which word they hear the talker say, the majority reports hearing VOORnaam in the first video, but voorNAAM in the second.

Really? Convince me…

This is Figure 1 from Bosker & Peeters (2021). In the bottom left panel, you see the proportion of ‘I hear stress on the first syllable’ responses for when the beat gesture falls on the first syllable (blue line) or on the second syllable (red line). The blue line lies above the red line, indicating an overall bias to report more ‘stress on first syllable’ responses when the gesture falls on the first vs. second syllable. The difference between the lines is sizable, averaging around 20%.

Figure 1 in Bosker & Peeters (2021)
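For readers who want to see how such a measure is computed, here is a minimal sketch in Python. It tallies the proportion of ‘stress on first syllable’ responses per gesture condition and takes the difference between the two conditions. The responses are invented toy values for illustration only, not data from Bosker & Peeters (2021), and the names `toy_responses` and `proportion_first_stress` are hypothetical.

```python
from collections import defaultdict

# Invented toy responses for illustration only; NOT data from the study.
# Each tuple is (syllable the beat gesture fell on, syllable reported as stressed).
toy_responses = (
    [("first", "first")] * 6 + [("first", "second")] * 2
    + [("second", "first")] * 4 + [("second", "second")] * 4
)


def proportion_first_stress(responses):
    """Return, per gesture condition, the share of 'stress on first syllable' responses."""
    counts = defaultdict(lambda: [0, 0])  # condition -> [n_first_stress, n_total]
    for gesture, reported in responses:
        counts[gesture][0] += reported == "first"
        counts[gesture][1] += 1
    return {g: n_first / n_total for g, (n_first, n_total) in counts.items()}


props = proportion_first_stress(toy_responses)

# A positive value means more 'stress on first syllable' responses
# when the gesture falls on the first syllable than on the second.
gesture_bias = props["first"] - props["second"]
print(props, gesture_bias)
```

In the figure, the gap between the blue and red lines corresponds to this kind of difference between the two gesture conditions.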

How hands help us hear

When we have a face-to-face conversation, we don’t only exchange sounds. We also move our head, hands, and body to the rhythm of the speech. Beat gestures are relatively ‘simple’ up-and-down hand gestures that are closely aligned to the rhythm of speech. They tend to fall on the stressed syllable in free-stress languages, such as English and Dutch. These videos demonstrate that people are sensitive to the timing of beat gestures, and that this timing influences lexical stress perception. In Bosker & Peeters (2021), this effect was termed the manual McGurk effect. That is, just like seeing a talker close their lips can make you hear the sound /b/ in the classic McGurk effect (McGurk & MacDonald, 1976), so can the timing of hand gestures influence speech perception in the manual McGurk effect.

Why is this important?

The manual McGurk effect is the first demonstration of how the timing of hand gestures influences low-level speech perception. Even the simplest flicks of the hand that do not convey any particular meaning in themselves can shape which words you hear. This suggests that these seemingly unimportant hand gestures contribute meaningfully to audiovisual speech comprehension. Perhaps ‘enriching’ our speech with carefully timed gestures can help our audience understand our spoken message, particularly in challenging listening conditions, such as in noise.

Relevant papers

Hans Rutger Bosker and David Peeters (2021). Beat gestures influence which speech sounds you hear. Proceedings of the Royal Society B: Biological Sciences, 288, 20202419, doi:10.1098/rspb.2020.2419.


Ronny Bujok, Antje S. Meyer, and Hans Rutger Bosker (2022). Visible lexical stress cues on the face do not influence audiovisual speech perception. In Proceedings of Speech Prosody 2022 (ed. S. Frota, M. Cruz, and M. Vigário), 259-263, doi:10.21437/SpeechProsody.2022-53.


Ronny Bujok, Antje S. Meyer, and Hans Rutger Bosker (2022). Audiovisual perception of lexical stress: Beat gestures are stronger visual cues for lexical stress than visible articulatory cues on the face. PsyArXiv Preprints, doi:10.31234/osf.io/y9jck, data: https://osf.io/4d9w5/
