On Understanding Humans Through Machines: A Journey into Multimodal Affective Computing

Published: June 17, 2025
Category: Essays | News

Kevin Hyekang Joo is an AI researcher at the USC Institute for Creative Technologies (USC ICT) and a PhD student in the Thomas Lord Department of Computer Science at the University of Southern California, under the supervision of Prof. Mohammad Soleymani. His research focuses on analyzing and modeling nuanced human communicative behaviors and affective states using multimodal machine learning. Joo is supported by the USC Annenberg Fellowship and recently presented “Signals of Closeness: A Multimodal Analysis of Self-Disclosure and Engagement” at the USC Annenberg Graduate Fellowship Research and Creative Project Symposium 2025.


BYLINE: By Kevin Hyekang Joo, PhD student, Computer Science, Viterbi School of Engineering, USC; AI Researcher, Intelligent Human Perception Lab / Affective Computing Group, USC Institute for Creative Technologies

From an early age, I’ve been drawn to what most consider a paradox: understanding the messiness of human behavior through the precision of machines. While my peers were modifying games or optimizing code to maximize efficiency, I found myself asking whether a machine could ever recognize something as subtle as distracted thought—or more ambitiously, unspoken emotion.

This question, seemingly simple at first, set my research trajectory in motion. What began as an undergraduate project to detect distracted driving behavior evolved into a focused scientific pursuit: modeling nuanced human communicative behaviors through multimodal machine learning. Today, as a PhD student in Computer Science at the University of Southern California, working under Professor Mohammad Soleymani at the USC Institute for Creative Technologies (ICT), I am developing multimodal foundation models designed to robustly and dynamically infer nuanced affective and cognitive states from subtle human communicative behaviors. My work emphasizes interpretability and context-awareness, incorporating psychological insights to enhance the effectiveness and depth of computational representations.

From Curiosity to Commitment

During my undergraduate and Master’s studies at the University of Maryland, I immersed myself in machine learning and computer vision. In that time, I had the privilege of working closely with five professors – across campus labs and an NSF REU – and their graduate students; each of them kindly offered me unique opportunities to engage in hands-on projects across diverse areas of computer science, all of which I am grateful to have explored.

These experiences helped me narrow my focus to artificial intelligence and, more specifically, to computer vision; they also led to several papers published as a first or second author [1, 2, 3, 4]. Yet my intellectual compass consistently pointed toward the intersection where computation meets cognition. Given the University of Maryland’s outstanding expertise in computer vision, I was often mentored by researchers in vision whose labs gave me the rigor and tools to explore visual perception computationally. But over time, I began to realize that pixels alone, while powerful, cannot fully capture the subtlety and intent behind human communication. A smile might express joy in one culture and discomfort in another; eye contact might suggest connection or signal a threat.

This realization guided my transition into affective computing, a domain that combines computer science, psychology, and cognitive science to create systems that can recognize, interpret, and respond to both people’s internal states and their outward expressions. It was here that I also found the intellectual and ethical urgency I had been seeking. Affective computing is not simply about teaching machines to sort expressions into seven basic emotion categories; it is about enabling machines to engage meaningfully with the human experience, interpreting human behaviors within rich social contexts, and ultimately fostering technology that respects and responds to the complexity of what it means to be human.

Why Multimodality?

Human communication is inherently multimodal.  Yet, in isolation, visual, audio, and language-based cues offer weak and incomplete signals. Consider a person who nods but expresses hesitation through vocal inflection. Or someone whose verbal affirmation is contradicted by an averted gaze. These contradictions are not anomalies – they are the norm in human communication.

Hence, my research adopts a multimodal approach: integrating visual, acoustic, and linguistic signals to build more robust models of human communicative behavior. Throughout my journey, I have had the opportunity to push the boundaries of this methodology through several concurrent projects.

One project [5, 6] I worked on explores how large language models can be used to interpret human interaction when grounded in behavioral context. By constructing structured multimodal transcripts – fusing smartglass-derived signals such as gaze, facial expressions, and spoken language – the research examines how prompting language models with these enriched inputs enables more socially attuned reasoning about engagement. The approach reflects an emerging interest in treating language not as a standalone modality, but as one deeply interwoven with the temporal and embodied nature of real-world communication.
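
To give a concrete, if deliberately simplified, flavor of this idea, the Python sketch below interleaves hypothetical time-stamped gaze, facial-expression, and speech events into a single textual transcript and wraps it in an engagement-rating prompt. The field names and prompt wording are illustrative only, not the actual pipeline from [5, 6].

```python
from dataclasses import dataclass

@dataclass
class Event:
    """One time-stamped behavioral observation (fields are illustrative)."""
    t: float        # seconds from session start
    modality: str   # "speech", "gaze", or "face"
    value: str      # transcribed words, gaze target, or expression label

def build_transcript(events: list[Event]) -> str:
    """Interleave per-modality streams into one time-ordered transcript."""
    lines = [f"[{e.t:6.1f}s][{e.modality.upper():6}] {e.value}"
             for e in sorted(events, key=lambda e: e.t)]
    return "\n".join(lines)

def engagement_prompt(transcript: str) -> str:
    """Wrap the behavioral transcript in an instruction for a language model."""
    return (
        "Below is a time-aligned record of one person's words, gaze, and "
        "facial expressions during a conversation.\n\n"
        f"{transcript}\n\n"
        "On a scale of 1 (disengaged) to 5 (highly engaged), how engaged does "
        "this person appear, and which behavioral cues support your rating?"
    )

if __name__ == "__main__":
    events = [
        Event(12.4, "speech", "Yeah, that makes sense, I think."),
        Event(12.9, "gaze", "looking away from partner"),
        Event(13.1, "face", "brow furrow, slight frown"),
    ]
    print(engagement_prompt(build_transcript(events)))
    # The prompt would then be sent to a language model of choice; the call
    # is omitted here because it depends on the specific model and API.
```

The point is less the formatting than the framing: behavioral signals are rendered in a form a language model can reason over, so its judgments about engagement are grounded in what actually happened, moment by moment.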

This direction complements another effort presented at the Annenberg Symposium 2025 [7], which investigates how individuals build connections through self-disclosure.

Drawing on Social Penetration Theory, the study examines how the breadth and depth of personal sharing, across vocal, visual, and linguistic cues, influence perceptions of intimacy and mutual engagement. Together, these projects signal a shift toward modeling human communication not as a static exchange of signals, but as a layered, evolving process shaped by subtle shifts in attention, expression, and trust.

Another project currently under submission [8] investigates the behavioral correlates of dyadic alliance in naturalistic multiparty support group settings using multimodal machine learning. The goal is to examine how observable behavioral cues, such as facial affect, head pose, and speech dynamics, are associated with the emergence of alliance between individuals. By modeling interactions at both the speaker and listener level, the study seeks to uncover how subtle, bidirectional behaviors – like attentive listening or reciprocal expressions – contribute to perceived connection.
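
As a rough, self-contained illustration of that dyad-level framing, the sketch below pairs per-window behavioral features from both members of a dyad, adds a simple reciprocity term, and fits a basic classifier for perceived alliance. The data are synthetic and the features hypothetical; this is not the model or dataset from [8].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical per-window features for each member of a dyad, e.g. mean
# facial-affect valence, head-pose variability, and share of speaking time.
n_windows, n_feats = 200, 3
speaker_feats = rng.normal(size=(n_windows, n_feats))
listener_feats = rng.normal(size=(n_windows, n_feats))

# Represent each window by both individual streams plus an elementwise
# interaction term as a crude stand-in for behavioral reciprocity.
X = np.hstack([speaker_feats, listener_feats, speaker_feats * listener_feats])

# Synthetic binary labels standing in for "high" vs. "low" perceived alliance.
y = (X[:, -n_feats:].sum(axis=1) + rng.normal(scale=0.5, size=n_windows) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", round(clf.score(X, y), 3))
```

The interesting questions, of course, live in what the real features are and how the alliance labels are obtained, but even this toy setup makes the speaker-listener symmetry of the problem explicit.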

Context Is Key

While multimodality is essential, it alone is insufficient. Human affect and intent are inherently contextual. Sarcasm, for example, is not a property of tone alone, nor text alone, but their interplay in a specific socio-linguistic moment. Building systems that can decode such nuance requires architectures that are both generative and discriminative – models that can reason about behavior in time, across modalities, and within social contexts.
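
As one toy illustration of what reasoning across modalities over time can look like architecturally, the PyTorch sketch below (a simplification of the general idea, not any specific published model) lets textual tokens attend over time-aligned prosodic frames, so that how something was said can modulate the interpretation of what was said, for instance when deciding whether an utterance reads as sarcastic or literal.

```python
import torch
import torch.nn as nn

class TextAudioFusion(nn.Module):
    """Toy cross-modal block: text tokens attend over acoustic frames."""

    def __init__(self, text_dim=256, audio_dim=128, hidden=256, heads=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)     # align feature sizes
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.classifier = nn.Linear(hidden, 2)           # e.g., sarcastic vs. literal

    def forward(self, text_feats, audio_feats):
        # text_feats:  (batch, n_words,  text_dim)  from a text encoder
        # audio_feats: (batch, n_frames, audio_dim) prosodic features over time
        q = self.text_proj(text_feats)
        kv = self.audio_proj(audio_feats)
        fused, _ = self.cross_attn(q, kv, kv)            # each word queries the prosody
        pooled = fused.mean(dim=1)                       # summarize the utterance
        return self.classifier(pooled)

model = TextAudioFusion()
logits = model(torch.randn(2, 12, 256), torch.randn(2, 80, 128))
print(logits.shape)  # torch.Size([2, 2])
```

A discriminative block like this is only one ingredient; richer context, such as who is speaking to whom and what came before, still has to enter the model somewhere, and that is exactly where the harder research questions begin.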

I believe context-aware modeling will be pivotal in domains like digital therapy, education, and even driver safety – domains where understanding why someone acts is as important as knowing what they do.

Reflections on the Path

As I conclude my first year in the PhD program, I find myself reflecting deeply on the nature of meaningful scientific inquiry. The pace of research can often be disorienting: new benchmarks, emerging datasets, shifting paradigms. Yet, I have come to appreciate that intellectual maturity is not just about acceleration, but about precision. What does it mean to ask the right question? To propose a hypothesis that is ambitious, testable, and grounded in theory?

This summer, I’m focusing on strengthening those meta-skills: critical reasoning, systematic experimentation, and scientific storytelling. The ability to communicate complex ideas clearly isn’t just a bonus – it’s central to doing impactful science.

I’m privileged to be part of an inclusive and vibrant community at USC ICT.  Working with my advisor, Prof. Mohammad Soleymani, has pushed me to engage deeply with how different modalities – visual, acoustic, linguistic – can be integrated meaningfully, rather than merely fused.  Through our work together, I’ve come to appreciate the methodological rigor that comes from computer vision and signal processing, and the analytical depth that emerges when these are anchored in behavioral context. The challenge isn’t only in detecting patterns across modalities – it’s in interpreting those patterns with sensitivity to the human experiences they reflect.  

I am also grateful for Prof. Jonathan Gratch’s mentorship; occasional engagement with his lab’s work has deepened my appreciation for the psychological richness of human communication.

Beyond the Lab

Outside research, I remain fascinated by pattern discovery in other forms – nonograms, in particular, offer a meditative engagement with constraint-based logic. What draws me in is their visual texture: each solution is composed of large, square pixels, rendered on a human scale. The habit may seem orthogonal to my research, but it provides the equilibrium I need to pursue long-term research with clarity and intent.

Closing Thought

Through my research, I aspire to create systems that not only recognize behaviors but interpret them meaningfully, embracing the layered complexities of human expression. This vision requires more than technical advancement; it calls for a commitment to nuance, to context, and to a human-centered understanding of intelligence.

As I move forward, I carry with me the same question that first drew me to this field: How can machines make sense of people – not just what we do or say, but why? My hope is that by anchoring this question in both rigorous science and thoughtful design, I can contribute to technologies that don’t replace human understanding, but extend it – with empathy, with integrity, and with purpose.

Citations

[1] Khoa Vo, Hyekang Kevin Joo, Kashu Yamazaki, Sang Truong, Kris Kitani, Minh-Triet Tran, Ngan Le. “AEI: Actors-Environment Interaction with Adaptive Attention for Temporal Action Proposals Generation.” The British Machine Vision Conference (BMVC), 2021 (Oral Presentation).

[2] Junran Yang, Hyekang Kevin Joo, Sai S. Yerramreddy, Siyao Li, Dominik Moritz, Leilani Battle. “Demonstration of VegaPlus: Optimizing Declarative Visualization Languages.” ACM Special Interest Group on Management of Data (SIGMOD) Conference, 2022.

[3] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan Le. “CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection.” IEEE International Conference on Image Processing (ICIP), 2023 (Oral Presentation).

[4] Junran Yang, Hyekang Kevin Joo, Sai S. Yerramreddy, Dominik Moritz, Leilani Battle. “Optimizing Dataflow Systems for Scalable Interactive Visualization.” ACM Special Interest Group on Management of Data (SIGMOD) Conference, 2024.

[5] Kevin Hyekang Joo*, Cheng Ma*, Alexandria Vail*, Sunreeta Bhattacharya, Alvaro Garcia, Kailana Baker-Matsuoka, Sheryl Mathew, Lori Holt, Fernando De la Torre. “Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation.” Under Review, 2025.

[6] Sunreeta Bhattacharya, Hyekang K. Joo, Alvaro Garcia, Cheng Ma, Yiming Fang, Fernando De la Torre, Lori Holt. “Tracking Engagement in Real-World Human Communication from Wearable Sensors.” The 21st Symposium of Advances and Perspectives in Auditory Neuroscience (APAN), 2023.

[7] Natalie Kim, Kevin Hyekang Joo*, Eugene Lee*. “Signals of Closeness: A Multimodal Analysis of Self-Disclosure and Engagement.” USC Annenberg Symposium, 2025.

[8] Kevin Hyekang Joo et al. xxxxxx. Under Review, 2025.