BYLINE: Di Chang, PhD student in Computer Science, USC Viterbi School of Engineering; Graduate Research Assistant, Intelligent Human Perception Lab, ICT
I began my PhD at USC’s Institute for Creative Technologies in 2022, working with Professor Mohammad Soleymani at the Intelligent Human Perception Lab. My decision to join ICT was straightforward: the research focus on computer vision for digital and virtual humans aligned with my interests, and Professor Soleymani’s approach to the field was compelling.
I began research in computer vision and image processing at the start of my second year as an undergraduate. Before arriving at USC, I also gained research experience at EPFL with Prof. Sabine Süsstrunk and Prof. Tong Zhang, at TUM with Prof. Matthias Nießner, and at HKUST with Prof. Dan Xu, working on multi-view stereo, neural rendering, and 3D vision. These experiences not only gave me the opportunity to publish at top-tier conferences, e.g., GBi-Net (CVPR 2022) and RC-MVSNet (ECCV 2022), but also sharpened my academic communication skills and connected me with talented researchers from around the world.
Currently, my research spans three areas: Human Video Animation, Human Conversation and Interaction Generation, and Image and Video Understanding and Editing. These domains address different aspects of the same technical problem: developing systems that can understand and generate realistic human behavior.
Human Video Animation
In human video animation, I have developed several systems for pose and motion generation. MagicPose, presented at ICML 2024, addresses realistic human pose and facial expression retargeting using identity-aware diffusion models. The system maintains individual characteristics while generating natural movement patterns.
X-Dyna, selected as a highlight paper at CVPR 2025, extends dynamic human image animation. The system provides control over human motion in video, generating animations from static images with higher fidelity than previous methods. X-Dancer, a highlight paper at ICCV 2025, focuses on dance animation, capturing the distinctive movement patterns and timing requirements of dance.
These projects require attention to the technical details that produce realistic motion. Each system must handle the complexities of human movement: joint constraints, temporal consistency, and the preservation of individual characteristics across frames.
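To make these objectives concrete, the following is a minimal PyTorch sketch of two generic losses that target flicker and identity drift. It is an illustration only, not the actual training objectives of MagicPose, X-Dyna, or X-Dancer; the tensor shapes and the choice of identity encoder are assumptions.

```python
# Illustrative only: generic losses for temporal consistency and
# identity preservation, not the objectives from any specific paper.
import torch
import torch.nn.functional as F

def temporal_smoothness_loss(frames: torch.Tensor) -> torch.Tensor:
    """Penalize abrupt changes between consecutive generated frames.

    frames: (T, C, H, W) tensor of generated video frames.
    """
    diff = frames[1:] - frames[:-1]  # first-order temporal difference
    return diff.abs().mean()         # large values indicate flicker

def identity_consistency_loss(frame_embeds: torch.Tensor,
                              ref_embed: torch.Tensor) -> torch.Tensor:
    """Keep per-frame identity features close to the reference subject.

    frame_embeds: (T, D) identity embeddings of the generated frames,
                  e.g. from a pretrained face encoder (assumed here).
    ref_embed:    (D,) identity embedding of the reference image.
    """
    sim = F.cosine_similarity(frame_embeds, ref_embed.unsqueeze(0), dim=-1)
    return (1.0 - sim).mean()
```

In a real system, terms like these would typically be weighted and combined with the main diffusion denoising objective rather than used on their own.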
Human Interaction and Conversation Generation
Modeling human interactions presents additional complexity beyond single-person animation. The work on Dyadic Interaction Modeling, published at ECCV 2024, examines realistic social behavior generation between two individuals. This capability is a prerequisite for virtual humans that can engage in natural conversations.
DiTaiListener, presented at ICCV 2025, generates controllable listener videos using diffusion models. The system produces appropriate responses to speakers through head movements, facial expressions, and body language. The technical challenge involves generating contextually appropriate and socially natural responses.
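As a hedged illustration of the general recipe rather than DiTaiListener's actual architecture, the sketch below shows a toy conditional denoiser for listener motion together with one simplified DDPM-style reverse step; the feature dimensions and the concatenation-based conditioning are assumptions.

```python
# Toy conditional diffusion for listener motion; illustrative only.
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Predicts the noise in listener motion given speaker context."""
    def __init__(self, motion_dim: int = 64, cond_dim: int = 128,
                 hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, noisy_motion, speaker_cond, t):
        # noisy_motion: (B, motion_dim); speaker_cond: (B, cond_dim);
        # t: (B, 1) normalized diffusion timestep.
        x = torch.cat([noisy_motion, speaker_cond, t], dim=-1)
        return self.net(x)

def ddpm_reverse_step(model, x_t, cond, t, alpha_t, alpha_bar_t, sigma_t):
    """One heavily simplified DDPM reverse-diffusion update."""
    eps_hat = model(x_t, cond, t)
    mean = (x_t - (1 - alpha_t) / (1 - alpha_bar_t) ** 0.5 * eps_hat) \
           / alpha_t ** 0.5
    return mean + sigma_t * torch.randn_like(x_t)
```

In a full system the speaker condition would come from learned audio and video encoders over the speaker's stream, and the generated motion representation would drive head pose, facial expressions, and body language.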
DiffPortrait3D, a CVPR 2024 highlight, contributes to realistic human representation for interactive applications. These projects reflect the need for AI systems that understand communication beyond verbal content, including gestures, expressions, and behavioral patterns.
Image and Video Understanding and Editing
The third area of my research focuses on image and video understanding and editing. ByteMorph, submitted to NeurIPS 2025, benchmarks instruction-guided image editing with non-rigid motions, where edits must preserve natural physics and plausible human movement.
VLM4D, presented at ICCV 2025, extends vision-language models to temporal understanding, enabling more sophisticated video content manipulation. These projects combine natural language processing and computer vision to create systems that execute complex editing instructions expressed in natural language.
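For intuition, here is a minimal inference-time sketch of instruction-guided editing. Everything in it is hypothetical: `text_encoder` and `edit_model` are placeholder modules standing in for a language encoder and a source-image-conditioned diffusion editor, not actual ByteMorph or VLM4D components.

```python
# Hypothetical inference loop for instruction-guided image editing.
import torch

@torch.no_grad()
def edit_image(image, instruction, text_encoder, edit_model, steps=50):
    """Edit `image` according to a natural-language `instruction`.

    image:       (1, 3, H, W) source image, values in [-1, 1].
    instruction: e.g. "make the person raise their left arm".
    """
    cond = text_encoder(instruction)   # (1, D) instruction embedding
    x = torch.randn_like(image)        # start the edit from pure noise
    for t in reversed(range(steps)):
        # Each denoising step is conditioned on BOTH the source image and
        # the instruction, so the edit stays anchored to the original
        # content while following the requested non-rigid change.
        x = edit_model.denoise(x, t, image=image, text=cond)
    return x
```

The key design point is the dual conditioning: dropping the source-image condition would turn the editor into a plain text-to-image generator, while dropping the instruction would simply reconstruct the input.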
Industry Experience and Applications
My research has been informed by industry experience through internships at TikTok's Intelligent Creation Lab, ByteDance Seed's Image Generation Team, and currently Meta's Superintelligence Labs and Reality Labs Research. These positions have provided insight into how research translates into practical applications and what scalable systems require.
Industry work has emphasized the importance of developing foundation models with broad applicability. Whether for social media content creation or professional tools, the underlying technical challenges remain consistent: creating AI systems that enhance rather than replace human creativity.
Academic Presentations and Future Directions
I have presented work at conferences including ECCV, CVPR, ICML, and ICCV. Each presentation has provided feedback that influences future research directions. I am interested in presenting at SIGGRAPH, SIGGRAPH Asia, or ICLR in the future.
My career objective is to continue AI research in an industry setting. I am seeking research scientist positions at organizations such as OpenAI, Meta Superintelligence Labs, Google DeepMind, xAI, or Anthropic, where I can continue developing foundation models for human behavior understanding and generation.
A significant milestone would be receiving a best paper award at a top-tier computer vision conference. More importantly, I want my research contributions to have a measurable impact on digital content creation and human-computer interaction.
Current and Future Work
I am preparing for a visiting scholar position at Stanford University’s Computational Imaging Lab with Professor Gordon Wetzstein, beginning in spring 2026. This collaboration will explore new directions in world models and reinforcement learning.
The intersection of these fields may provide new approaches to digital human research. The technical challenges in creating AI systems that understand human complexity continue to drive my work, and each project builds incrementally toward more sophisticated understanding and generation capabilities.
Outside of research, I have developed an interest in car racing, visiting tracks including Buttonwillow, Laguna Seca, and Sonoma Raceway. Driving provides a different type of focus that complements research work.
Technical Challenges and Perspectives
The field of AI continues to evolve rapidly, but the core technical challenges in human understanding remain. Creating systems that not only process information but also grasp the complexity of human experience requires continued advancement in computer vision, machine learning, and human-computer interaction.
My work at ICT, under Professor Soleymani’s guidance, continues to address these challenges through systematic research. Each project contributes to the broader goal of developing more sophisticated AI systems for human behavior understanding and generation.
The development of human-AI interaction capabilities requires sustained technical work across multiple domains. My research at USC’s Institute for Creative Technologies contributes to this ongoing effort.
