By Emily Yue-Ting Jia, PhD Student in Computer Science, USC Viterbi School of Engineering; Graduate Research Assistant, ICT Vision and Graphics Lab (VGL), supervised by Dr. Yajie Zhao
In October, I will present my research at the International Conference on Computer Vision (ICCV) in Honolulu. ICCV is the field’s premier venue, where diverse strands of computer vision come together, and where ideas gain momentum by being tested against a global community of peers. My paper, Learning an Implicit Physical Model for Image-based Fluid Simulation (with Jiageng Mao, Zhiyuan Gao, Yajie Zhao, and Yue Wang), takes up a challenge at the border of perception and imagination: teaching machines to animate fluid motion from a single still image.
Humans do this instinctively. When looking at a photograph of a river, we anticipate how water will curl around a stone, how ripples will disperse, and how currents will shift with the terrain. This ability is so immediate that we hardly notice it. Yet replicating it in machines requires a synthesis of visual learning, 3D reconstruction, and physics—domains that are rarely integrated fully in existing approaches.
Why Fluids, Why Now
In recent years, computer vision has made great strides in generating video from static input. These advances have been powered by neural networks trained on vast datasets of natural imagery. Yet when applied to fluids—water, smoke, fire—the results are often unsatisfactory. Motions look plausible at a glance but quickly unravel: boundaries are ignored, matter flows through obstacles, or dynamics unfold in ways that defy the physics we implicitly expect.
Fluids are a revealing test case. They are ubiquitous in the natural world, highly sensitive to initial conditions, and governed by equations—the Navier–Stokes equations—that remain challenging even for numerical simulation. To animate fluids convincingly from limited input is therefore to confront the limits of both data-driven models and traditional physics-based simulation. My work proposes a middle path: physics-informed neural dynamics.
Method: Embedding Physics in Learning
The framework I will present consists of two main components. First, we represent the input scene using 3D Gaussians, which serve as a flexible particle-like representation that captures geometry and supports novel-view synthesis. Second, we employ a physics-informed neural dynamics module that predicts velocity fields directly from the input image.
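To make the first component a little more concrete, here is a minimal sketch of how a particle-like 3D Gaussian scene might be held in code. The class and field names are my own shorthand for this post, not the paper's implementation; in practice the appearance would typically be stored as spherical-harmonic coefficients rather than plain RGB.

```python
# A minimal sketch of a particle-like 3D Gaussian scene representation.
# Field names and shapes are illustrative, not the paper's actual code.
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianCloud:
    means: np.ndarray       # (N, 3) Gaussian centers in world space
    scales: np.ndarray      # (N, 3) per-axis extents of each Gaussian
    rotations: np.ndarray   # (N, 4) unit quaternions orienting each Gaussian
    opacities: np.ndarray   # (N,)   per-Gaussian opacity
    colors: np.ndarray      # (N, 3) RGB appearance (often SH coefficients in practice)
```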
This module is guided by two forms of supervision: data priors learned from real-world videos and loss terms derived from fluid dynamics. Specifically, we incorporate constraints informed by the Navier–Stokes equations. The combination allows the network to learn plausible motion patterns while avoiding the unphysical artifacts common to purely data-driven methods.
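To give a flavor of what a physics-derived loss term can look like, here is a small PyTorch-style sketch of one such constraint: penalizing the divergence of the predicted velocity field, which the incompressible Navier–Stokes equations require to be zero. The network name `velocity_net`, the coordinate layout, and the weighting are assumptions for illustration, not the paper's exact formulation.

```python
# A hedged sketch of a physics-informed loss: a data term plus an
# incompressibility residual (div u = 0) computed with automatic differentiation.
import torch
import torch.nn.functional as F

def divergence_loss(velocity_net, coords):
    """Penalize the divergence of the predicted velocity at sampled points."""
    coords = coords.clone().requires_grad_(True)   # (N, 4): x, y, z, t
    u = velocity_net(coords)                       # (N, 3): predicted velocity
    div = torch.zeros(coords.shape[0], device=coords.device)
    for i in range(3):                             # sum the spatial partials du_i/dx_i
        grad_i = torch.autograd.grad(u[:, i].sum(), coords, create_graph=True)[0]
        div = div + grad_i[:, i]
    return (div ** 2).mean()

def total_loss(velocity_net, coords, u_target, lam=0.1):
    """Data prior from real-world video supervision plus the physics residual."""
    data_term = F.mse_loss(velocity_net(coords), u_target)
    physics_term = divergence_loss(velocity_net, coords)
    return data_term + lam * physics_term
```

A full system would add further residuals in the same spirit, such as terms tied to the momentum equation or to boundary conditions, but the pattern of combining a data term with differentiable physics residuals stays the same.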
Once velocity fields are predicted, the 3D Gaussians can be displaced over time and rendered from any camera trajectory. The result is a four-dimensional reconstruction: geometry and motion, unfolding from what began as a single still photograph.
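Conceptually, the time-stepping can be as simple as explicit Euler integration of the Gaussian centers through the predicted velocity field. The sketch below assumes the hypothetical `velocity_net` from the previous snippet and omits rendering, which would be handled by whatever splatting renderer is in use.

```python
# A minimal sketch of advecting Gaussian centers with a predicted velocity field.
# Explicit Euler is used here for clarity; the actual integrator may differ.
import torch

def advect_means(means, velocity_net, dt=1.0 / 30.0, num_steps=90):
    """Return a list of (N, 3) center positions, one per output frame."""
    frames = [means]
    for step in range(num_steps):
        t = torch.full((means.shape[0], 1), step * dt)   # per-point time stamp
        coords = torch.cat([frames[-1], t], dim=1)       # (N, 4): x, y, z, t
        with torch.no_grad():
            v = velocity_net(coords)                     # (N, 3) velocities
        frames.append(frames[-1] + dt * v)
    return frames
```

Each frame's centers are then splatted and rendered along the chosen camera path, which is what yields the four-dimensional reconstruction described above.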
Evaluation and Findings
To test the method, we designed two kinds of studies. First, in controlled synthetic environments, where ground-truth velocity fields can be measured, our model reduced error by more than twenty percent in key cases compared to existing baselines. Second, we carried out a perceptual evaluation: in a user study, participants compared videos generated by our method with those produced by prior approaches. Across forty test inputs, human observers preferred our results by a margin of roughly forty percent.
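For readers curious what the quantitative side of such a comparison looks like in practice, the sketch below shows one common way to score a predicted velocity field against synthetic ground truth and to express the relative improvement over a baseline. The metric names are illustrative; the paper's exact evaluation protocol may differ.

```python
# Illustrative metrics only: mean endpoint error against ground-truth velocities
# and the percent error reduction relative to a baseline method.
import numpy as np

def mean_endpoint_error(u_pred, u_true):
    """Average Euclidean distance between predicted and true velocity vectors."""
    return np.linalg.norm(u_pred - u_true, axis=-1).mean()

def percent_error_reduction(err_ours, err_baseline):
    """How much lower our error is, as a percentage of the baseline's error."""
    return 100.0 * (err_baseline - err_ours) / err_baseline
```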
These findings confirm two things: embedding physics improves measurable accuracy, and it produces motion that aligns with human intuition about how fluids should behave. Both are essential if we hope to build systems that not only generate compelling images but also earn trust as reconstructions of reality.
Broader Context
This research belongs to a larger movement in computer vision: the convergence of physics and learning. For decades, the field has oscillated between handcrafted models rooted in physics and data-driven approaches that learn from examples. The future likely lies in integration. By respecting the structure imposed by physics while harnessing the adaptability of neural networks, we can build models that generalize better, adapt to novel scenarios, and remain interpretable.
Fluids are only the beginning. The same principles could extend to other dynamic phenomena: crowds, smoke, deformable objects. In robotics and autonomous systems, the ability to predict how a scene will evolve from a single image would provide a powerful anticipatory capability. In creative industries, physically aware animation opens new possibilities for digital content and visual effects.
Looking Ahead
The ICCV community is remarkably diverse, spanning geometry, learning, perception, and generation. My aim in presenting this work is not only to show progress on a specific problem, but also to invite dialogue across subfields. How can we better combine the rigor of physics with the flexibility of data-driven methods? What representations best capture the evolving structure of dynamic scenes? How can evaluation metrics bridge the gap between numerical accuracy and perceptual realism?
These questions cut across areas of vision research. Our contribution—physics-informed neural dynamics for fluid animation—provides one answer while inviting further exploration.
Conclusion
The task of animating a single still image might sound narrow, but it reveals deep challenges: how to balance data and law, perception and prediction, imagination and constraint. Our paper shows that when we embed physical principles into learning systems, we move closer to the human ability to see motion in stillness and to imagine the world unfolding from a single view.
That is what I will bring to Honolulu this October: a demonstration that integrating physics into neural networks not only improves technical performance but also aligns computation with the way humans expect the world to behave. It is a small but meaningful step toward vision systems that are not only generative but also grounded.