Language Models with Body Language: Advancing Gesture Selection for Virtual Humans

Published: June 18, 2025
Category: Essays | News

As part of our summer research series, spotlighting early-career scientists advancing the frontiers of human–AI interaction, we’re pleased to share a new essay by Parisa G. Torshizi, PhD student at Northeastern University. Torshizi is currently part of our visiting scholars (intern) program, working between the Integrated Virtual Humans Lab (research lead: Arno Hartholt) and the Virtual Human Therapeutics Lab (VHTL), under its Director, Sharon Mozgai. 

BYLINE: Parisa G. Torshizi, PhD student, Northeastern University


The expressive power of human gesture—its capacity to encode meaning, frame intent, and regulate interaction—remains one of the most essential yet understudied dimensions of communication in artificial agents. As embodied conversational systems become increasingly prevalent in education, healthcare, and virtual collaboration, it is crucial that these agents exhibit not only linguistic competence but also human-like nonverbal behavior. The question is no longer whether gestures matter, but how their selection and execution can be systematically modeled, scaled, and tailored to virtual contexts.

My recent work, Large Language Models for Virtual Human Gesture Selection (Parisa Ghanad Torshizi, Laura B. Hensel, Ari Shapiro, Stacy C. Marsella), addresses this challenge by proposing a novel use of large language models (LLMs): not to generate gesture motion directly, but to serve as semantic selectors—inferring which gestures are meaningful, contextually appropriate, and aligned with a speaker’s role and discourse structure. This research was recognized with the Best Student Paper Award at AAMAS 2025 and reflects an ongoing commitment to bridging natural language processing, embodied cognition, and affective computing in the service of intelligent virtual agents.

Rethinking the Gesture Pipeline: Addressing Challenges in Prior Work

Traditional architectures for gesture generation in virtual humans typically follow one of two paradigms. Data-driven systems learn statistical correspondences between speech and gesture from large corpora, enabling automated synthesis but often resulting in generic, averaged behaviors that lack situational or role-specific nuance. In contrast, rule-based systems support handcrafted annotation of utterances, allowing for greater control and interpretability, but at the cost of scalability and significant authoring effort.

Critically, both paradigms tend to conflate gesture generation (the articulation of movement) with gesture selection (the communicative decision of what to gesture and when). The former is a matter of kinematics and animation; the latter is a question of discourse, meaning, and role expression. The research presented in this paper isolates gesture selection as an independent, semantically tractable problem—and positions LLMs as a computational substrate for reasoning about it.

Core Inquiry: Do LLMs Encode Gestural Information?

At the heart of this study is an inquiry into the latent capabilities of LLMs like GPT-4: specifically, whether such models encode knowledge about human gesture types, their communicative functions, and their deployment across different roles and contexts. Drawing on established classifications—deictic, beat, iconic, metaphoric—we prompted GPT-4 to articulate distinctions among these gesture categories. The model not only responded accurately but exhibited a degree of theoretical fluency, referencing foundational image schemas (e.g., Source–Path–Goal, Container) central to embodied metaphor theory.

This suggests that LLMs, although not embodied themselves, possess a learned abstraction of gesture grounded in their exposure to multimodal discourse representations across text corpora. While they cannot observe motion, they can infer gestural salience from language patterns, narrative framing, and role-specific conventions.

Prompting LLMs: Selecting Meaningful, Role-Aligned Gestures

Building upon this hypothesis, we designed a series of prompting experiments to test whether GPT-4 could suggest co-speech gestures aligned with a given utterance’s meaning and speaker role. We evaluated different prompting paradigms, ranging from open-ended prompts, in which the model generated gesture suggestions without constraints, to constrained prompts, which provided prior information about a predefined set of gestures and restricted the model to selecting from it.

Across all of these approaches, the suggested gestures were consistently appropriate to the semantics of the utterance. The open-ended approach proved highly generative, yielding creative and semantically coherent gestures—particularly useful for interactive design workflows. Constrained prompting, by contrast, enabled role consistency and reproducibility, making it preferable for automated implementations where behavioral coherence across utterances is critical.
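To make the distinction concrete, here is a minimal sketch of how the two prompting styles could be phrased, assuming the OpenAI Python client and GPT-4. The utterance, speaker role, gesture inventory, and prompt wording are illustrative placeholders, not the exact prompts used in our experiments.

```python
# Minimal sketch of the two prompting styles (open-ended vs. constrained).
# The utterance, role, gesture inventory, and prompt wording below are
# illustrative placeholders, not the exact prompts from the paper.
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY in the environment

UTTERANCE = "We have to move this project forward, step by step."
ROLE = "a project manager addressing her team"

# Hypothetical predefined gesture inventory for the constrained condition.
GESTURE_SET = ["beat_emphasis", "metaphoric_forward_sweep",
               "deictic_point_at_plan", "container_both_hands"]

open_ended_prompt = (
    f"The speaker is {ROLE}. Suggest co-speech gestures that fit the "
    f"meaning of the utterance below, and briefly explain each choice.\n"
    f'Utterance: "{UTTERANCE}"'
)

constrained_prompt = (
    f"The speaker is {ROLE}. Choose the single most appropriate gesture "
    f"for the utterance below from this set: {', '.join(GESTURE_SET)}. "
    f"Return only the gesture name.\n"
    f'Utterance: "{UTTERANCE}"'
)

def select_gesture(prompt: str) -> str:
    """Send one gesture-selection prompt to GPT-4 and return its reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(select_gesture(open_ended_prompt))   # free-form suggestions
print(select_gesture(constrained_prompt))  # one name from GESTURE_SET
```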

Prompting LLMs: Timing Gestures with Theme-Focused Precision

In addition to gesture type, temporal alignment is a key component of effective gestural behavior. Building on linguistic research indicating that gestures tend to co-occur with the rheme—the part of an utterance that carries its new or emphasized information—we evaluated GPT-4’s ability to perform a form of discourse analysis: theme–rheme segmentation. The model’s annotations were compared against a corpus of co-speech gestures. Remarkably, most gestures occurred within the rheme segments as identified by GPT-4.

This result not only validates the model’s discourse-level reasoning but also provides a practical mechanism for gesture timing. By aligning gestures with rheme segments, designers and systems can narrow the candidate set to those gestures most relevant to the speaker’s intended meaning.
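As a rough illustration of how such a segmentation step could be scripted, the sketch below asks GPT-4 for a theme–rheme split and uses the returned rheme span as the anchor for gesture timing. The prompt wording and JSON format are assumptions for illustration, not the exact protocol from the study.

```python
# Sketch: obtain a theme-rheme split from GPT-4 and use the rheme span
# as the anchor for gesture timing. Prompt wording and JSON format are
# illustrative assumptions, not the study's exact protocol.
import json
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY in the environment

def theme_rheme_split(utterance: str) -> dict:
    """Ask the model to segment an utterance into theme and rheme."""
    prompt = (
        "Split the following utterance into its theme (what it is about) "
        "and its rheme (the new or emphasized information). Answer only "
        'with JSON of the form {"theme": "...", "rheme": "..."}.\n'
        f'Utterance: "{utterance}"'
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

utterance = "The real obstacle is the lack of a shared timeline."
segments = theme_rheme_split(utterance)

# A gesture chosen for this utterance would be scheduled so that its
# stroke falls within the rheme rather than anywhere in the sentence.
start = utterance.find(segments["rheme"])
print(f"Align gesture stroke with: {segments['rheme']!r} "
      f"(characters {start}-{start + len(segments['rheme'])})")
```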

From Research to Integration: Toward Expressive Virtual Agents

Although this research does not address gesture synthesis directly, it introduces a principled architecture for semantically grounded gesture selection. This selection module has been integrated into SIMA—the Socially Intelligent Multimodal Agent—our virtual human framework. In this pipeline, utterances are analyzed for rheme–theme structure, role identity is specified, and GPT-4 is prompted to generate candidate gestures. These selected gestures are then passed to the animation system for synthesis.

This modular design frames gesture selection as a controllable and explainable process: it gives designers control over gestural performance and flexibility in choosing both which gestures the virtual human uses and how those gestures are conveyed, depending on the application’s needs.
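A schematic sketch of that pipeline is given below: selection is kept separate from synthesis, so the intermediate result can be inspected or overridden before animation. The class and function names (GestureRequest, AnimationBackend, and the stubbed helpers) are placeholders for illustration, not SIMA’s actual interfaces.

```python
# Schematic sketch of the selection-then-synthesis pipeline. All names
# here (GestureRequest, AnimationBackend, the stub helpers) are
# illustrative placeholders, not SIMA's actual interfaces.
from dataclasses import dataclass

@dataclass
class GestureRequest:
    utterance: str
    role: str
    rheme: str         # span the gesture stroke should align with
    gesture_name: str  # label chosen by the LLM-based selector

def analyze_rheme(utterance: str) -> str:
    """Stub: in the full pipeline this comes from a GPT-4 theme-rheme prompt."""
    return utterance  # placeholder: treat the whole utterance as the rheme

def choose_gesture(utterance: str, role: str, rheme: str) -> str:
    """Stub: in the full pipeline GPT-4 picks from the supported gesture set."""
    return "beat_emphasis"  # placeholder choice

def select_gesture(utterance: str, role: str) -> GestureRequest:
    """Gesture selection stage: discourse analysis, then semantic choice."""
    rheme = analyze_rheme(utterance)
    gesture = choose_gesture(utterance, role, rheme)
    return GestureRequest(utterance, role, rheme, gesture)

class AnimationBackend:
    """Placeholder for the animation system that synthesizes the motion."""
    def play(self, request: GestureRequest) -> None:
        print(f"Synthesizing '{request.gesture_name}' aligned with "
              f"'{request.rheme}' for role '{request.role}'")

# Because selection is decoupled from synthesis, a designer can inspect or
# override the GestureRequest before it reaches the animation backend.
request = select_gesture("We will get there together.", "a team coach")
AnimationBackend().play(request)
```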

Next Steps: Multimodal Expansion and Real-Time Constraints

Several avenues for future research emerge from this foundation. First is the extension of this gesture selection framework to incorporate additional nonverbal modalities—gaze, posture, head movements, and facial expression—each of which interacts with speech and gesture in complex, dynamic ways. Modeling these behaviors jointly will require multimodal reasoning and coordination strategies across modalities.

Second, we aim to implement real-time gesture selection through retrieval-augmented generation (RAG), domain-specific fine-tuning of smaller models, or architectural simplification to ensure computational feasibility. Third, while this study examined role influence in gesture selection, there remains substantial opportunity to model individual variation, sociocultural norms, and adaptive gestural strategies. Enabling designer control over these parameters will support use cases across health, education, diplomacy, and entertainment.

Concluding Reflections

This work contributes to a growing body of research that challenges disembodied conceptions of language in AI systems. While text alone can simulate many aspects of human reasoning, embodied communication demands a more holistic, socially situated perspective. Gesture is not an accessory to speech—it is co-expressive, integral to persuasion, trust-building, and cognitive grounding.

That a large language model can meaningfully participate in gesture selection—without ever having seen a moving body—speaks to the richness of linguistic corpora as carriers of embodied knowledge. It also invites a rethinking of where the boundaries lie between semantics, simulation, and social intelligence.

By isolating gesture selection as a semantic task and demonstrating that LLMs can perform it with sensitivity to role, discourse, and timing, this research offers a new methodological path for designing expressive, communicative, and socially aware virtual agents.
