In our latest essay spotlighting ICT’s graduate students, Research Engineer Soham Hans previews work on using LLMs to shape scenario generation tools ahead of his presentation at the International Workshop on Engineering Multi-Agent Systems (EMAS 2025).
BYLINE: Soham Hans, Research Engineer, ICT
After receiving my Master’s in Computer Science from the USC Viterbi School of Engineering, I joined ICT to work under the supervision of Dr. Volkan Ustun in the Human-inspired Adaptive Teaming Systems (HATS) Lab. My research focus today includes multi-agent systems, procedural content generation, multi-modal LLMs, multi-agent LLMs, and multi-step reasoning – essentially, engineering intelligent systems that reason, adapt, and collaborate.
In our latest project, “A Multi-Agent Collaborative Reasoning Framework for Generating Physics Puzzles” (co-authored with Binze Li and Volkan Ustun), which I am presenting at the International Workshop on Engineering Multi-Agent Systems (EMAS 2025), held May 19–23, 2025, we explore how large language models (LLMs) can be marshalled into a multi-agent reasoning framework capable of generating complex 2D physics puzzles. This project emerged as part of a broader line of research into how LLMs can be used for scenario generation. These puzzles are not ends in themselves. Rather, they serve as testbeds for broader questions: How can we endow generative models with the capacity for iterative, spatially grounded reasoning? And how might this advance automated scenario generation in high-stakes environments, such as military training?
Why Puzzles?
The Physics Puzzle environment we work in—CREATE (Chain REAction Tool Environment)—offers an elegant microcosm of the larger challenges inherent in simulation-based training. In CREATE, an agent must place tools (ramps, springs, and more) to guide a red ball to a goal, often with the help of another blue ball. It is a deceptively simple premise that demands a surprisingly sophisticated understanding of spatial relationships, physical causality, and multi-step planning. These are precisely the same ingredients required to build rich, adaptive training simulations for domains like defense, where expert-level decision-making often hinges on the nuances of terrain, timing, and team coordination.
Scenario generation for such domains has traditionally been manual—time-consuming, rigid, and reliant on rare human expertise. Our goal is to change that. We are developing a framework in which generative models can synthesize meaningful, solvable scenarios from natural language prompts, aided by structured content banks of prior training materials. This requires not just linguistic fluency from the models, but a kind of agentic cognition: the ability to deconstruct a goal into constituent tasks, reason through spatial dynamics, and iteratively refine outputs in response to feedback.
The Need for Multiple Minds
Contemporary LLMs are impressive, but they are still, in most deployments, singular agents operating in linear mode: prompt in, response out. This monolithic approach is ill-suited for tasks requiring long-horizon reasoning, continuous context updating, and the synthesis of multiple expert viewpoints. In contrast, we take inspiration from how humans work in teams: through role specialization, feedback loops, and collaborative refinement.
Our framework therefore distributes responsibility across multiple agents. A central ReAct agent orchestrates the process, invoking specialized agents—such as a Solver, who focuses on ensuring that the puzzle is solvable, and a Designer, who tries to make the puzzle interesting and non-trivial. The system iteratively reasons and acts, receiving feedback from the environment (via simulation outputs) after each step. This architecture promotes adaptability, modularity, and robustness—qualities that are often missing from single-agent LLM deployments.
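To make that division of labor concrete, here is a minimal sketch of the reason-act-observe loop in Python. The LLM calls and the physics rollout are replaced by placeholder stubs, and the function names are illustrative stand-ins rather than our framework’s actual interface.

```python
# A minimal, runnable sketch of the orchestration pattern described above.
# The LLM calls and the physics rollout are replaced by placeholder stubs;
# the names (orchestrate, design, solve, simulate) are illustrative only.

def orchestrate(puzzle, history):
    """Stub for the central ReAct agent: decide which specialist acts next."""
    return "design" if len(puzzle["tools"]) < 3 else "solve"

def design(prompt, puzzle):
    """Stub for the Designer agent: propose a new tool placement."""
    puzzle["tools"].append({"type": "ramp", "x": 0.3, "y": 0.6})
    return puzzle

def solve(puzzle, observation):
    """Stub for the Solver agent: adjust placements toward solvability."""
    if observation and not observation["goal_reached"]:
        puzzle["tools"][-1]["x"] += 0.05   # nudge the last tool using feedback
    return puzzle

def simulate(puzzle):
    """Stub for a CREATE-style physics rollout."""
    return {"goal_reached": len(puzzle["tools"]) >= 3, "trajectory": []}

def generate_puzzle(prompt, max_steps=6):
    puzzle, history, observation = {"tools": []}, [], None
    for _ in range(max_steps):
        action = orchestrate(puzzle, history)       # reason: pick next step
        if action == "design":
            puzzle = design(prompt, puzzle)         # act: make it interesting
        else:
            puzzle = solve(puzzle, observation)     # act: keep it solvable
        observation = simulate(puzzle)              # observe simulation feedback
        history.append((action, observation))
        if observation["goal_reached"]:
            break
    return puzzle, history

puzzle, trace = generate_puzzle("guide the red ball to the goal")
```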
Spatiality and the Limits of Language
Early experiments highlighted one of the key limitations of LLMs: an incomplete grasp of spatial relations. Basic agents often misidentified left from right or suggested tool placements that contradicted observed dynamics. Even with powerful models like GPT-4o, we found that visual information alone was insufficient. These models, trained primarily on text, lacked a stable internal representation of spatial geometry. This led us to augment visual inputs with textual feedback describing trajectories, positions, and directional cues in structured formats.
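To give a flavor of what that structured feedback can look like, the sketch below turns a simulated trajectory into explicit positional and directional text. The coordinate convention, field names, and wording are illustrative assumptions, not the exact format we use.

```python
# Illustrative sketch: convert a simulated trajectory into explicit spatial
# text cues for the LLM. Coordinates are normalized (x right, y up); the
# wording and fields are assumptions, not our exact feedback format.

def describe_trajectory(positions, goal, obj="red ball"):
    """Summarize where an object started, how it moved, and where it stopped
    relative to the goal, in plain directional language."""
    (x0, y0), (x1, y1) = positions[0], positions[-1]
    horiz = "right" if x1 > x0 else "left"
    vert = "downward" if y1 < y0 else "upward"
    dx, dy = goal[0] - x1, goal[1] - y1
    side = "left of" if dx > 0 else "right of"
    height = "below" if dy > 0 else "above"
    return (f"The {obj} started at ({x0:.2f}, {y0:.2f}), moved {horiz} and "
            f"{vert}, and stopped at ({x1:.2f}, {y1:.2f}): "
            f"{abs(dx):.2f} units {side} the goal and {abs(dy):.2f} units "
            f"{height} it.")

print(describe_trajectory([(0.20, 0.80), (0.50, 0.40), (0.60, 0.10)],
                          goal=(0.90, 0.05)))
```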
The ReAct prompting strategy was essential here. By embedding reasoning-action-observation loops within the prompt structure, we grounded the model’s generative capacity in environmental feedback. Still, in complex puzzles involving multiple tools and interdependencies, single ReAct agents faltered. They could reason effectively at a high level but struggled with execution. They failed, for example, to follow through on corrective steps even after multiple failed simulations.
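The skeleton below illustrates how such a reasoning-action-observation trace can be embedded in the prompt itself; the action vocabulary and wording are simplified stand-ins, not the prompts used in the paper.

```python
# An illustrative ReAct-style prompt skeleton. The action vocabulary and
# wording are simplified stand-ins, not the prompts used in the paper.

REACT_SYSTEM = """You are placing tools in a 2D physics puzzle.
Respond with one Thought and one Action per turn.
Available actions: place(tool, x, y), move(tool, dx, dy), finish().
After each Action you will receive an Observation from the simulator."""

def build_prompt(task, trace):
    """Assemble the prompt: system instructions, the task, then the full
    thought/action/observation history so far."""
    lines = [REACT_SYSTEM, f"Task: {task}"]
    for step in trace:
        lines += [f"Thought: {step['thought']}",
                  f"Action: {step['action']}",
                  f"Observation: {step['observation']}"]
    lines.append("Thought:")            # cue the model to continue the loop
    return "\n".join(lines)
```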
It was the collaborative agent architecture that finally allowed us to close this gap. By decomposing cognitive responsibilities, we significantly reduced spatial confusion and increased the rate of solvable puzzle generations. However, this came with its own challenges. Even successful puzzles sometimes diverged from the user’s intent. The model might produce a technically correct solution, but one that violated scenario constraints or overlooked thematic goals. This underscored a critical insight: understanding how to solve a problem is not the same as understanding why the problem was posed in the first place.
Building Towards Real-World Scenarios
To bridge the gap between technical correctness and contextual alignment, we are now exploring the integration of an Evaluator agent—essentially a quality control mechanism that reviews generated puzzles for fidelity to user prompts and training objectives. The eventual aim is to support simulation-based training environments that can ingest open-ended instructions (“design a counterinsurgency scenario in mountainous terrain with limited visibility”) and generate nuanced, tactically relevant scenarios tailored to learner competencies.
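Since the Evaluator is still being designed, the sketch below is only a rough picture of the idea: a rubric-driven LLM pass over the finished puzzle. The criteria, scoring scheme, and JSON format are assumptions for illustration, not part of the published framework.

```python
# A rough sketch of what an Evaluator agent could look like: a rubric-driven
# LLM pass over a finished puzzle. The criteria, scoring scheme, and JSON
# format here are assumptions for illustration, not the published framework.

import json

EVAL_TEMPLATE = """User prompt: {prompt}
Generated puzzle (tools and placements): {puzzle}
Simulation summary: {summary}

Score each criterion from 1 to 5 and justify briefly:
1. Solvability: does the simulation reach the goal?
2. Intent fidelity: does the puzzle respect the user's prompt and constraints?
3. Non-triviality: does it require deliberate tool placement, not luck?
Return JSON: {{"solvability": int, "fidelity": int, "nontriviality": int, "notes": str}}"""

def evaluate(llm, prompt, puzzle, summary, threshold=4):
    """Accept the puzzle only if every rubric score clears the threshold.
    `llm` is any callable that maps a prompt string to a JSON string."""
    report = json.loads(llm(EVAL_TEMPLATE.format(prompt=prompt, puzzle=puzzle,
                                                 summary=summary)))
    accepted = all(report[k] >= threshold
                   for k in ("solvability", "fidelity", "nontriviality"))
    return accepted, report
```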
In military training, the ability to adapt scenarios dynamically—to incorporate changes in objectives, geography, or team composition—is crucial. Past approaches to scenario generation, such as those relying on parametric sliders or cognitive task models, lacked the flexibility and contextual depth we now seek. LLMs promise a generative leap forward, provided we can anchor their outputs in structured reasoning frameworks that respect domain constraints.
Procedural Content Generation as Structured Reasoning
What we are developing is more than a puzzle generator. It is a step toward rethinking how generative models—particularly multi-modal large language models—can be organized into distributed reasoning systems rather than treated as isolated text predictors. This addresses the need for structured creativity: the ability to generate novel yet goal-aligned content within bounded domains.
The CREATE environment gives us a tractable, measurable space to test this kind of structured creativity. It reveals the frictions between language and embodiment, between declarative knowledge and procedural execution. It also shows the promise of multi-agent coordination in overcoming these frictions—enabling AI systems to not just talk about solutions, but to build, revise, and reason through them interactively.
What Comes Next
Looking forward, we are interested in fine-tuning models specifically for spatial reasoning tasks, potentially training on synthetic data generated within CREATE. We are also evaluating new model architectures specialized for reasoning, to assess improvements in both reasoning fidelity and spatial precision. Further, we are developing hybrid pipelines where LLM agents output relative instructions (“place ramp to the left of the blue ball”) while underlying code modules convert these into executable coordinates.
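The toy sketch below conveys the flavor of that hybrid pipeline: the LLM emits a relative instruction, and deterministic code resolves it into coordinates. The instruction grammar, offsets, and scene format are illustrative assumptions.

```python
# Toy sketch of the hybrid pipeline: an LLM emits a relative instruction and
# deterministic code resolves it into coordinates. The instruction grammar,
# offsets, and scene format are illustrative assumptions.

import re

OFFSETS = {"left of": (-0.15, 0.0), "right of": (0.15, 0.0),
           "above": (0.0, 0.15), "below": (0.0, -0.15)}

def resolve(instruction: str, scene: dict):
    """Turn 'place <tool> <relation> the <anchor>' into (tool, x, y)."""
    pattern = r"place (\w+) (?:to the )?(left of|right of|above|below) the ([\w ]+)"
    match = re.match(pattern, instruction.strip().lower())
    if not match:
        raise ValueError(f"Unrecognized instruction: {instruction!r}")
    tool, relation, anchor = match.groups()
    ax, ay = scene[anchor]                 # anchor object's known coordinates
    dx, dy = OFFSETS[relation]
    return tool, round(ax + dx, 2), round(ay + dy, 2)

scene = {"blue ball": (0.55, 0.40), "red ball": (0.20, 0.80)}
print(resolve("Place ramp to the left of the blue ball", scene))
# -> ('ramp', 0.4, 0.4)
```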
Each of these steps represents incremental progress toward a more general capability: the ability to generate structured environments from unstructured intent. And though puzzles may seem trivial, they are a proving ground for deeper competencies—pattern recognition, causal inference, constraint satisfaction—that underpin the design of any meaningful simulation.
Ultimately, the promise of AI is not just that it can answer questions, but that it can ask them—pose challenges, design environments, and co-create experiences. In building systems that reason together, we are also rethinking what it means for machines to imagine.
//