Detecting local audio-visual synchrony in monologues utilizing vocal pitch and facial landmark trajectories

Steven Cadavid, Mohamed Abdel-Mottaleb, Daniel S Messinger, Mohammad H. Mahoor, Lorraine E. Bahrick

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

We describe a novel approach for determining the audio-visual synchrony of a monologue video sequence utilizing vocal pitch and facial landmark trajectories as descriptors of the audio and visual modalities, respectively. The visual component is represented by the horizontal and vertical displacement of corresponding facial landmarks between subsequent frames. These facial landmarks are acquired using the statistical modeling technique, known as the Active Shape Model (ASM). The audio component is represented by the fundamental frequency, or pitch, obtained using the subharmonic-to-harmonic ratio (SHR). The synchrony between the audio and visual feature vectors is computed using Gaussian mutual information. The raw synchrony estimates obtained using this method may contain spurious synchrony values due to over-sensitivity. A filtering method is employed for discarding synchrony values that occur during non-associated audio/visual events. The human visual system is capable of distinguishing rigid and non-rigid motion of an articulator during speech. In an attempt to emulate this process, we separate rigid and non-rigid motion and compute the synchrony attributed to each. Experiments are conducted on a dataset of monologue video clip pairs. Each pair is composed of an asynchronous and synchronous version of the video clip. For the asynchronous video clips, the audio signal is displaced with respect to the visual signal. Experimental results indicate that the proposed approach is successful in detecting facial regions that demonstrate synchrony, and in distinguishing between synchronous and asynchronous sequences.
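
Although this is a bibliographic record, the abstract outlines a concrete pipeline: per-frame vocal pitch (via SHR) on the audio side, per-frame ASM landmark displacements on the visual side, and Gaussian mutual information between the two. The sketch below illustrates that synchrony measure only. It is not the authors' implementation: the closed-form Gaussian MI formula I(X;Y) = 0.5 log(det(Cov[X]) det(Cov[Y]) / det(Cov[X,Y])) is standard, but the window length, array shapes, and function names are illustrative assumptions, and the ASM fitting, SHR pitch extraction, rigid/non-rigid separation, and spurious-value filtering described in the abstract are not reproduced.

import numpy as np

def gaussian_mi(x, y, eps=1e-10):
    # Closed-form mutual information under a joint-Gaussian assumption:
    # I(X;Y) = 0.5 * log( det(Cov[X]) * det(Cov[Y]) / det(Cov[X,Y]) )
    x = x.reshape(len(x), -1)                  # (frames, d_audio)
    y = y.reshape(len(y), -1)                  # (frames, d_visual)
    cx = np.atleast_2d(np.cov(x, rowvar=False))
    cy = np.atleast_2d(np.cov(y, rowvar=False))
    cxy = np.atleast_2d(np.cov(np.hstack([x, y]), rowvar=False))
    det = lambda c: max(np.linalg.det(c), eps)  # guard near-singular windows
    return 0.5 * np.log(det(cx) * det(cy) / det(cxy))

def landmark_synchrony(pitch, landmarks, win=30):
    # pitch:     (frames,) per-frame F0 estimates (e.g., from an SHR tracker)
    # landmarks: (frames, points, 2) per-frame (x, y) landmark positions
    # Returns a (windows, points) array of raw synchrony scores; filtering of
    # spurious values during non-associated events would be a separate step.
    disp = np.diff(landmarks, axis=0)          # frame-to-frame (dx, dy) motion
    f0 = pitch[1:]                             # align pitch with displacements
    scores = []
    for start in range(0, len(f0) - win + 1, win):
        sl = slice(start, start + win)
        scores.append([gaussian_mi(f0[sl], disp[sl, p, :])
                       for p in range(disp.shape[1])])
    return np.asarray(scores)

In this sketch, higher scores mark landmark/window pairs whose motion is statistically dependent on pitch; the paper additionally discards values arising from non-associated audio/visual events and reports rigid and non-rigid synchrony separately.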

Original language: English
Title of host publication: British Machine Vision Conference, BMVC 2009 - Proceedings
Publisher: British Machine Vision Association, BMVA
ISBN (Print): 1901725391, 9781901725391
DOIs: https://doi.org/10.5244/C.23.10
State: Published - Jan 1 2009
Event: 2009 20th British Machine Vision Conference, BMVC 2009 - London, United Kingdom
Duration: Sep 7 2009 - Sep 10 2009

Other

Other: 2009 20th British Machine Vision Conference, BMVC 2009
Country: United Kingdom
City: London
Period: 9/7/09 - 9/10/09

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition

Cite this

APA

Cadavid, S., Abdel-Mottaleb, M., Messinger, D. S., Mahoor, M. H., & Bahrick, L. E. (2009). Detecting local audio-visual synchrony in monologues utilizing vocal pitch and facial landmark trajectories. In British Machine Vision Conference, BMVC 2009 - Proceedings. British Machine Vision Association, BMVA. https://doi.org/10.5244/C.23.10

Standard

Detecting local audio-visual synchrony in monologues utilizing vocal pitch and facial landmark trajectories. / Cadavid, Steven; Abdel-Mottaleb, Mohamed; Messinger, Daniel S; Mahoor, Mohammad H.; Bahrick, Lorraine E.

British Machine Vision Conference, BMVC 2009 - Proceedings. British Machine Vision Association, BMVA, 2009.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Harvard

Cadavid, S, Abdel-Mottaleb, M, Messinger, DS, Mahoor, MH & Bahrick, LE 2009, Detecting local audio-visual synchrony in monologues utilizing vocal pitch and facial landmark trajectories. in British Machine Vision Conference, BMVC 2009 - Proceedings. British Machine Vision Association, BMVA, 2009 20th British Machine Vision Conference, BMVC 2009, London, United Kingdom, 9/7/09. https://doi.org/10.5244/C.23.10

Vancouver

Cadavid S, Abdel-Mottaleb M, Messinger DS, Mahoor MH, Bahrick LE. Detecting local audio-visual synchrony in monologues utilizing vocal pitch and facial landmark trajectories. In British Machine Vision Conference, BMVC 2009 - Proceedings. British Machine Vision Association, BMVA. 2009. https://doi.org/10.5244/C.23.10

Author

Cadavid, Steven ; Abdel-Mottaleb, Mohamed ; Messinger, Daniel S ; Mahoor, Mohammad H. ; Bahrick, Lorraine E. / Detecting local audio-visual synchrony in monologues utilizing vocal pitch and facial landmark trajectories. British Machine Vision Conference, BMVC 2009 - Proceedings. British Machine Vision Association, BMVA, 2009.

BIBTEX

@inproceedings{786cef6066c14bf9b17a66485053c598,
title = "Detecting local audio-visual synchrony in monologues utilizing vocal pitch and facial landmark trajectories",
abstract = "We describe a novel approach for determining the audio-visual synchrony of a monologue video sequence utilizing vocal pitch and facial landmark trajectories as descriptors of the audio and visual modalities, respectively. The visual component is represented by the horizontal and vertical displacement of corresponding facial landmarks between subsequent frames. These facial landmarks are acquired using the statistical modeling technique, known as the Active Shape Model (ASM). The audio component is represented by the fundamental frequency, or pitch, obtained using the subharmonic-to-harmonic ratio (SHR). The synchrony between the audio and visual feature vectors is computed using Gaussian mutual information. The raw synchrony estimates obtained using this method may contain spurious synchrony values due to over-sensitivity. A filtering method is employed for discarding synchrony values that occur during non-associated audio/visual events. The human visual system is capable of distinguishing rigid and non-rigid motion of an articulator during speech. In an attempt to emulate this process, we separate rigid and non-rigid motion and compute the synchrony attributed to each. Experiments are conducted on a dataset of monologue video clip pairs. Each pair is composed of an asynchronous and synchronous version of the video clip. For the asynchronous video clips, the audio signal is displaced with respect to the visual signal. Experimental results indicate that the proposed approach is successful in detecting facial regions that demonstrate synchrony, and in distinguishing between synchronous and asynchronous sequences.",
author = "Steven Cadavid and Mohamed Abdel-Mottaleb and Messinger, {Daniel S} and Mahoor, {Mohammad H.} and Bahrick, {Lorraine E.}",
year = "2009",
month = "1",
day = "1",
doi = "10.5244/C.23.10",
language = "English",
isbn = "1901725391",
booktitle = "British Machine Vision Conference, BMVC 2009 - Proceedings",
publisher = "British Machine Vision Association, BMVA",

}

RIS

TY - GEN

T1 - Detecting local audio-visual synchrony in monologues utilizing vocal pitch and facial landmark trajectories

AU - Cadavid, Steven

AU - Abdel-Mottaleb, Mohamed

AU - Messinger, Daniel S

AU - Mahoor, Mohammad H.

AU - Bahrick, Lorraine E.

PY - 2009/1/1

Y1 - 2009/1/1

AB - We describe a novel approach for determining the audio-visual synchrony of a monologue video sequence utilizing vocal pitch and facial landmark trajectories as descriptors of the audio and visual modalities, respectively. The visual component is represented by the horizontal and vertical displacement of corresponding facial landmarks between subsequent frames. These facial landmarks are acquired using the statistical modeling technique, known as the Active Shape Model (ASM). The audio component is represented by the fundamental frequency, or pitch, obtained using the subharmonic-to-harmonic ratio (SHR). The synchrony between the audio and visual feature vectors is computed using Gaussian mutual information. The raw synchrony estimates obtained using this method may contain spurious synchrony values due to over-sensitivity. A filtering method is employed for discarding synchrony values that occur during non-associated audio/visual events. The human visual system is capable of distinguishing rigid and non-rigid motion of an articulator during speech. In an attempt to emulate this process, we separate rigid and non-rigid motion and compute the synchrony attributed to each. Experiments are conducted on a dataset of monologue video clip pairs. Each pair is composed of an asynchronous and synchronous version of the video clip. For the asynchronous video clips, the audio signal is displaced with respect to the visual signal. Experimental results indicate that the proposed approach is successful in detecting facial regions that demonstrate synchrony, and in distinguishing between synchronous and asynchronous sequences.

UR - http://www.scopus.com/inward/record.url?scp=84898841607&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84898841607&partnerID=8YFLogxK

U2 - 10.5244/C.23.10

DO - 10.5244/C.23.10

M3 - Conference contribution

SN - 1901725391

SN - 9781901725391

BT - British Machine Vision Conference, BMVC 2009 - Proceedings

PB - British Machine Vision Association, BMVA

ER -