Building AI to Connect—Not Just Compute

Published: August 18, 2025
Category: Essays | News
Dr. Minh Tran receives doctorate - this essay describes his research

ICT is proud to announce that Minh Tran has successfully defended his doctoral thesis, “Self-supervised Learning for Social Behavior Understanding and Generation,” under the supervision of Dr. Mohammad Soleymani, Director of the Intelligent Human Perception Lab, within the Affective Computing Group led by Dr. Jonathan Gratch. While completing his doctorate, Tran also held applied scientist internships at Amazon and Meta. In this essay, Dr. Minh Tran discusses his research on emotion recognition, nonverbal behavior generation, and human reactions to robots.

By Dr. Minh Tran, AI Researcher, Intelligent Human Perception Lab, ICT 

Imagine an AI assistant that doesn’t just understand your words, but also notices when your voice sounds stressed, when you’re avoiding eye contact, or when you’re genuinely excited versus just being polite. This isn’t science fiction—it’s the cutting-edge research that has driven my doctoral work and could transform how machines interact with humans.

As AI systems move from our computers into our homes, schools, hospitals, and workplaces, there’s a growing recognition that machines need social intelligence: the ability to read between the lines of human behavior and respond appropriately. My doctoral research, supervised by Dr. Mohammad Soleymani, Director of the Intelligent Human Perception Lab, tackles this challenge head-on, using innovative techniques to teach machines the subtle art of human social interaction.

The Challenge: Decoding Human Complexity

Human social behavior is extraordinarily complex. A slight change in vocal tone, a momentary pause, a raised eyebrow—these tiny signals carry enormous meaning in human interaction. Traditional AI approaches struggle here because teaching a computer to recognize these nuances would require humans to manually label millions of examples, an impossibly expensive and subjective task.

Our solution? Let the machines teach themselves. Using a technique called self-supervised learning (SSL), our systems learn to understand social cues by studying patterns in raw video and audio data—no human annotations required. It’s like giving AI the ability to people-watch and learn from observation, much like humans do.
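
For readers who want a concrete picture of what “learning by people-watching” looks like in code, here is a minimal sketch of one common self-supervised objective, masked prediction: hide a fraction of the frames in an unlabeled recording and train the model to reconstruct them. It is an illustration of the general idea written in PyTorch, not our actual training code, and every name and dimension in it is a placeholder.

```python
# Minimal sketch of masked-prediction self-supervised learning on unlabeled
# audio features (illustrative only; dimensions and names are placeholders).
import torch
import torch.nn as nn

class MaskedFeaturePredictor(nn.Module):
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, dim_feedforward=hidden, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, feat_dim)

    def forward(self, feats, mask):
        # feats: (batch, time, feat_dim), e.g. spectrogram frames from raw audio
        # mask:  (batch, time) boolean, True where frames are hidden from the model
        corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)
        return self.head(self.encoder(corrupted))

model = MaskedFeaturePredictor()
feats = torch.randn(4, 100, 80)              # a batch of unlabeled recordings
mask = torch.rand(4, 100) < 0.15             # hide roughly 15% of the frames
pred = model(feats, mask)
loss = (pred - feats)[mask].pow(2).mean()    # reconstruct only the hidden frames
loss.backward()                              # no human labels anywhere in the loop
```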

Four Breakthrough Innovations

Through collaborative work with my supervisor and colleagues, we’ve made four key advances that form the foundation of next-generation social AI:

1. Pushing State-of-the-Art in Human Behavior Understanding

Our first major contribution advances the ability of machines to interpret the nuanced and ever-changing landscape of human emotions. Rather than simply categorizing feelings as “happy” or “sad,” our SAAML framework, presented at ACM Multimedia, enables AI systems to recognize a broad spectrum of emotional states as they evolve, even when training data is scarce, and to generalize across diverse individuals and contexts.
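
The “metric learning” in SAAML’s title refers to a family of techniques that learn an embedding space in which clips expressing the same emotion sit close together and clips expressing different emotions sit far apart, so that new individuals or domains with few labels can be mapped into that space. The sketch below shows that generic idea with a triplet loss; it is not the SAAML method itself, and all sizes are placeholders.

```python
# Generic metric-learning sketch (triplet loss): pull same-emotion clips together
# in an embedding space, push different-emotion clips apart. Illustrative only;
# this is not the SAAML implementation.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
triplet = nn.TripletMarginLoss(margin=1.0)

anchor   = embed(torch.randn(16, 128))   # features of clips with a known emotion
positive = embed(torch.randn(16, 128))   # other clips sharing the same emotion
negative = embed(torch.randn(16, 128))   # clips with a different emotion
loss = triplet(anchor, positive, negative)
loss.backward()                          # embeddings now encode emotional similarity
```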

We’ve also extended these capabilities to wearable cameras, allowing AI to learn about social roles from a first-person perspective. This innovation has the potential to transform assistive technologies; for example, smart glasses could help individuals with autism navigate social interactions by providing real-time insights into their role within a conversation. 

2. Generating Realistic Human Behavior

Understanding social cues is only half the battle; generating appropriate responses is equally crucial. Working closely with my collaborators, including Chang, we developed a system that can create realistic nonverbal behaviors: the kind of subtle head nods, eye blinks, and facial expressions that make interactions feel natural.

Our Dyadic Interaction Modeling (DIM) system learns from watching real conversations, capturing the intricate interaction between speakers and listeners. When someone talks, a good listener doesn’t just stand there—they nod at the right moments, show understanding through micro-expressions, and maintain appropriate eye contact. Our AI can now generate these behaviors automatically, opening doors for more natural virtual avatars, therapeutic training tools, and conversational agents.
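
At its core, this kind of behavior generation is a sequence-to-sequence problem: given the speaker’s signal over time, predict a plausible stream of listener motion. The sketch below shows that framing in its simplest form, a recurrent network mapping speaker audio features to listener motion coefficients; it is only an illustration of the problem setup, not the DIM architecture, and the feature and motion dimensions are placeholders.

```python
# Simplest possible framing of listener-behavior generation: speaker audio in,
# listener facial/head motion out, trained against motion recorded from real
# conversations. Illustrative only; not the DIM architecture.
import torch
import torch.nn as nn

class ListenerGenerator(nn.Module):
    def __init__(self, audio_dim=80, motion_dim=15, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, motion_dim)    # e.g. head pose + expression coefficients

    def forward(self, speaker_audio):
        # speaker_audio: (batch, time, audio_dim)
        h, _ = self.rnn(speaker_audio)
        return self.out(h)                          # (batch, time, motion_dim)

gen = ListenerGenerator()
speaker_audio = torch.randn(2, 200, 80)             # 200 frames of speaker features
target_motion = torch.randn(2, 200, 15)             # listener motion from real dyads
predicted = gen(speaker_audio)                      # generated nods, blinks, expressions
loss = nn.functional.mse_loss(predicted, target_motion)
loss.backward()
```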

3. Personalizing to Individual Styles

One size doesn’t fit all when it comes to human expression. Some people are naturally more animated; others are subtle in their emotional displays. Working with Yin, we developed SetPeER, a system that tackles this by learning to adapt to individual communication styles from just a handful of examples, sometimes as few as eight samples per person.
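
The basic recipe behind this kind of set-based personalization is to pool a small set of a person’s clips into a “style” embedding and condition the emotion classifier on it, so predictions are interpreted relative to that person’s baseline. The sketch below shows the idea in miniature; it is not the SetPeER model, and the dimensions, pooling choice, and emotion count are placeholders.

```python
# Minimal sketch of set-based personalization: summarize a few clips from one
# person into a style vector, then classify new clips conditioned on that vector.
# Illustrative only; not the SetPeER model.
import torch
import torch.nn as nn

class PersonalizedClassifier(nn.Module):
    def __init__(self, feat_dim=128, style_dim=32, n_emotions=7):
        super().__init__()
        self.style_encoder = nn.Linear(feat_dim, style_dim)
        self.classifier = nn.Linear(feat_dim + style_dim, n_emotions)

    def forward(self, clip_feats, support_set):
        # support_set: (k, feat_dim) features of a few clips from the same person
        style = self.style_encoder(support_set).mean(dim=0)   # order-invariant pooling
        style = style.expand(clip_feats.size(0), -1)
        return self.classifier(torch.cat([clip_feats, style], dim=-1))

model = PersonalizedClassifier()
support = torch.randn(8, 128)      # eight reference clips from one speaker
query = torch.randn(4, 128)        # new clips from the same speaker
logits = model(query, support)     # emotion predictions adapted to this speaker's style
```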

To test this, our team created EmoCeleb, a dataset containing over 150 hours of audiovisual content from approximately 1,500 different speakers. The results were striking: personalized systems consistently outperformed one-size-fits-all approaches, suggesting a future where AI adapts to your unique communication style rather than forcing you to adapt to it.

4. Protecting Privacy While Learning

As AI becomes better at reading social signals, privacy concerns naturally arise. Our research team pioneered techniques that allow systems to understand emotions and social cues while obscuring identifying information about individuals.

Our privacy-preserving methods can, for example, analyze the emotional content of speech while making it impossible to identify who’s speaking. 
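
One standard recipe for this kind of privacy-preserving representation learning is adversarial training: the encoder is rewarded when an emotion classifier succeeds on its output and penalized when a speaker-identification classifier succeeds, which pushes identity cues out of the representation. The sketch below illustrates that general principle with gradient reversal; it is not the specific anonymization methods in the papers cited at the end of this essay, and the feature sizes and class counts are placeholders.

```python
# Generic adversarial sketch: learn a speech representation that keeps emotion
# information but sheds speaker identity. Illustrative only; not the cited methods.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()
    @staticmethod
    def backward(ctx, grad):
        return -grad                      # flip gradients flowing back into the encoder

encoder = nn.Linear(80, 64)               # speech features -> shared representation
emotion_head = nn.Linear(64, 7)           # predicts emotion from the representation
speaker_head = nn.Linear(64, 100)         # adversary: tries to identify the speaker

feats = torch.randn(32, 80)
emotion_labels = torch.randint(0, 7, (32,))
speaker_labels = torch.randint(0, 100, (32,))

z = encoder(feats)
loss_emotion = nn.functional.cross_entropy(emotion_head(z), emotion_labels)
loss_speaker = nn.functional.cross_entropy(speaker_head(GradReverse.apply(z)), speaker_labels)
(loss_emotion + loss_speaker).backward()  # encoder keeps emotion cues, unlearns identity cues
```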

This isn’t just a technical achievement—it’s a principled stance that social AI must be built with privacy as a core design principle, not an afterthought.

Real-World Impact

The implications of this research extend far beyond academic curiosity. We’re looking at AI systems that could:

  • Improve human-robot collaboration: Robots that can work alongside humans by understanding not just what we say, but how we feel and what we need
  • Transform healthcare: Imagine AI that can detect early signs of depression or anxiety through subtle changes in speech patterns and facial expressions
  • Revolutionize education: Virtual tutors that adapt their teaching style based on a student’s emotional state and engagement level
  • Enhance accessibility: Assistive technologies that help people with social communication challenges navigate complex interpersonal situations

The Path Forward

Looking ahead, I envision that future researchers will build on our work to achieve comprehensive full-body behavior analysis, integrating hand gestures and posture with facial expressions and vocal cues. This will enable us to explore how these social intelligence capabilities might merge with large language models, the technology behind systems like ChatGPT, to create truly holistic AI interaction.

A New Era of Human-AI Interaction

Instead of systems that simply process explicit commands, we’re moving toward AI that understands the full richness of human communication: spoken words, unspoken feelings, and everything in between.

As AI becomes increasingly woven into the fabric of daily life, our collaborative research offers a roadmap for creating machines that don’t just compute—they connect. 


About the Research: This doctoral work has been published in leading venues including ECCV, ACM Multimedia, IEEE Transactions on Affective Computing, and INTERSPEECH, representing collaborative contributions that are establishing new foundations in the field of socially intelligent AI systems.

  • Tran* and Yin* et al. “SetPeER: Set-based Personalized Emotion Recognition with Weak Supervision” IEEE Transactions on Affective Computing 25.
  • Tran et al. “A framework for adaptation of exocentric Video MAE for egocentric social role understanding” ECCV 24.
  • Tran* and Chang* et al. “Dyadic Interaction Modeling for Social Behavior Generation” ECCV 24.
  • Tran et al. “SAAML: A framework for semi-supervised affective adaptation via metric learning” ACM Multimedia 23.
  • Tran* and Yin* et al. “Personalized Adaptation with pre-trained speech encoders for continuous emotion recognition” INTERSPEECH 23.
  • Tran et al. “Privacy-preserving Representation Learning for Speech Understanding” INTERSPEECH 23.
  • Tran et al. “A speech representation anonymization framework via selective noise perturbation” ICASSP 23.
  • Tran et al. “A Pre-trained audio-visual transformer for emotion recognition” ICASSP 22.
  • Kontogiorgos* and Tran* et al. “A systematic cross-corpus analysis of human reactions to robot conversational failures” ICMI 21.
  • Tran et al. “Modeling dynamics of facial behavior for mental health assessment” FG 21.
