Dhaval Shah, Kyu J. Han, Shrikanth Narayanan: “A Low-Complexity Dynamic Face-Voice Feature Fusion Approach to Multimodal Person Recognition”

December 14, 2009 | San Diego, CA

Speaker: Dhaval Shah, Kyu J. Han, Shrikanth Narayanan
Host: IEEE International Symposium on Multimedia (ISM2009)

In this paper, we show the importance of face-voice correlation for audio-visual person recognition. We evaluate a system that exploits the correlation between audio and visual features during speech against three kinds of baseline: audio-only systems, video-only systems, and audio-visual systems that use the audio and visual features independently, neglecting the interdependency between a person’s spoken utterance and the associated facial movements. Experiments performed on the Vid-TIMIT dataset show that the proposed multimodal scheme has a lower error rate than all comparison conditions and is more robust against replay attacks. The simplicity of the fusion technique also allows the use of a single classifier, which greatly simplifies system design and allows for a simple real-time DSP implementation.