Combining Two Perspectives on Classifying Multimodal Data for Recognizing Speaker Traits (bibtex)
by Chatterjee, Moitreya, Park, Sunghyun, Morency, Louis-Philippe and Scherer, Stefan
Abstract:
Human communication involves conveying messages through both verbal and non-verbal channels (facial expressions, gestures, prosody, etc.). Nonetheless, learning these patterns computationally by combining cues from multiple modalities is challenging, because it requires an effective representation of the signals while also taking into account the complex interactions between them. From the machine learning perspective this presents a two-fold challenge: a) modeling the inter-modal variations and dependencies; b) representing the data with an apt number of features, such that the necessary patterns are captured while allaying concerns such as over-fitting. In this work we attempt to address these aspects of multimodal recognition in the context of recognizing two essential speaker traits, namely the passion and credibility of online movie reviewers. We propose a novel ensemble classification approach that combines two different perspectives on classifying multimodal data, each of which independently addresses the two-fold challenge. In the first, we combine the features from multiple modalities but assume inter-modality conditional independence. In the other, we explicitly capture the correlation between the modalities, but in a low-dimensional space, and explore a novel clustering-based kernel-similarity approach for recognition. Additionally, this work investigates a recent technique for encoding text data that captures the semantic similarity of verbal content and preserves word ordering. Experimental results on a recent public dataset show a significant improvement of our approach over multiple baselines. Finally, we also analyze the most discriminative elements of a speaker's non-verbal behavior that contribute to his/her perceived credibility and passion.
Reference:
Combining Two Perspectives on Classifying Multimodal Data for Recognizing Speaker Traits (Chatterjee, Moitreya, Park, Sunghyun, Morency, Louis-Philippe and Scherer, Stefan), In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ACM Press, 2015.
Bibtex Entry:
@inproceedings{chatterjee_combining_2015,
	address = {Seattle, Washington},
	title = {Combining {Two} {Perspectives} on {Classifying} {Multimodal} {Data} for {Recognizing} {Speaker} {Traits}},
	isbn = {978-1-4503-3912-4},
	url = {http://dl.acm.org/citation.cfm?doid=2818346.2820747},
	doi = {10.1145/2818346.2820747},
	abstract = {Human communication involves conveying messages through both verbal and non-verbal channels (facial expressions, gestures, prosody, etc.). Nonetheless, learning these patterns computationally by combining cues from multiple modalities is challenging, because it requires an effective representation of the signals while also taking into account the complex interactions between them. From the machine learning perspective this presents a two-fold challenge: a) modeling the inter-modal variations and dependencies; b) representing the data with an apt number of features, such that the necessary patterns are captured while allaying concerns such as over-fitting. In this work we attempt to address these aspects of multimodal recognition in the context of recognizing two essential speaker traits, namely the passion and credibility of online movie reviewers. We propose a novel ensemble classification approach that combines two different perspectives on classifying multimodal data, each of which independently addresses the two-fold challenge. In the first, we combine the features from multiple modalities but assume inter-modality conditional independence. In the other, we explicitly capture the correlation between the modalities, but in a low-dimensional space, and explore a novel clustering-based kernel-similarity approach for recognition. Additionally, this work investigates a recent technique for encoding text data that captures the semantic similarity of verbal content and preserves word ordering. Experimental results on a recent public dataset show a significant improvement of our approach over multiple baselines. Finally, we also analyze the most discriminative elements of a speaker's non-verbal behavior that contribute to his/her perceived credibility and passion.},
	booktitle = {Proceedings of the 2015 {ACM} on {International} {Conference} on {Multimodal} {Interaction}},
	publisher = {ACM Press},
	author = {Chatterjee, Moitreya and Park, Sunghyun and Morency, Louis-Philippe and Scherer, Stefan},
	month = nov,
	year = {2015},
	keywords = {Virtual Humans},
	pages = {7--14}
}