Detecting local audio-visual synchrony in monologues utilizing vocal pitch and facial landmark trajectories

Steven Cadavid, Mohamed Abdel-Mottaleb, Daniel S. Messinger, Mohammad H. Mahoor, Lorraine E. Bahrick

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

We describe a novel approach for determining the audio-visual synchrony of a monologue video sequence, utilizing vocal pitch and facial landmark trajectories as descriptors of the audio and visual modalities, respectively. The visual component is represented by the horizontal and vertical displacements of corresponding facial landmarks between subsequent frames. These facial landmarks are acquired using a statistical modeling technique known as the Active Shape Model (ASM). The audio component is represented by the fundamental frequency, or pitch, obtained using the subharmonic-to-harmonic ratio (SHR). The synchrony between the audio and visual feature vectors is computed using Gaussian mutual information. The raw synchrony estimates obtained with this method may contain spurious values due to over-sensitivity, so a filtering method is employed to discard synchrony values that occur during non-associated audio/visual events. The human visual system is capable of distinguishing rigid and non-rigid motion of an articulator during speech; in an attempt to emulate this process, we separate rigid and non-rigid motion and compute the synchrony attributed to each. Experiments are conducted on a dataset of monologue video clip pairs, each composed of a synchronous and an asynchronous version of the same clip; in the asynchronous clips, the audio signal is displaced with respect to the visual signal. Experimental results indicate that the proposed approach is successful in detecting facial regions that demonstrate synchrony, and in distinguishing between synchronous and asynchronous sequences.
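As a rough illustration of the synchrony measure the abstract describes, the following is a minimal NumPy sketch of Gaussian mutual information between windows of audio and visual features. The function name, the eps ridge regularizer, and the per-frame array layout are assumptions made here for illustration; the paper's windowing scheme, spurious-value filtering, and rigid/non-rigid motion decomposition are omitted.

    import numpy as np

    def gaussian_mutual_information(audio, visual, eps=1e-10):
        # audio:  (T, d_a) array, e.g. pitch per frame (d_a = 1)
        # visual: (T, d_v) array, e.g. x/y landmark displacements between frames
        # Under a joint Gaussian assumption,
        #   I(A; V) = 0.5 * log( det(C_A) * det(C_V) / det(C_AV) )
        joint = np.hstack([audio, visual])       # (T, d_a + d_v)
        c_joint = np.cov(joint, rowvar=False)    # joint covariance C_AV
        d_a = audio.shape[1]
        c_a = c_joint[:d_a, :d_a]                # marginal covariance C_A
        c_v = c_joint[d_a:, d_a:]                # marginal covariance C_V
        # slogdet with a small ridge guards against near-singular covariances
        _, ld_a = np.linalg.slogdet(c_a + eps * np.eye(d_a))
        _, ld_v = np.linalg.slogdet(c_v + eps * np.eye(c_v.shape[0]))
        _, ld_j = np.linalg.slogdet(c_joint + eps * np.eye(c_joint.shape[0]))
        return 0.5 * (ld_a + ld_v - ld_j)

Applied per window (e.g., pitch of shape (T, 1) against displacements of shape (T, 2K) for K landmarks), higher values would be expected for synchronous clips and for mouth-region landmarks than for audio-shifted clips, consistent with the experiments described above.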

Original language: English (US)
Title of host publication: British Machine Vision Conference, BMVC 2009 - Proceedings
Publisher: British Machine Vision Association, BMVA
ISBN (Print): 1901725391, 9781901725391
DOIs: https://doi.org/10.5244/C.23.10
State: Published - Jan 1 2009
Event: 2009 20th British Machine Vision Conference, BMVC 2009 - London, United Kingdom
Duration: Sep 7 2009 - Sep 10 2009

Publication series

Name: British Machine Vision Conference, BMVC 2009 - Proceedings

Other

Other: 2009 20th British Machine Vision Conference, BMVC 2009
Country: United Kingdom
City: London
Period: 9/7/09 - 9/10/09

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition

  • Cite this

Cadavid, S., Abdel-Mottaleb, M., Messinger, D. S., Mahoor, M. H., & Bahrick, L. E. (2009). Detecting local audio-visual synchrony in monologues utilizing vocal pitch and facial landmark trajectories. In British Machine Vision Conference, BMVC 2009 - Proceedings. British Machine Vision Association, BMVA. https://doi.org/10.5244/C.23.10