Transformer-based Neuro-Animator for Qualitative Simulation of Soft Body Movement
Somnuk Phon-Amnuaisuk
TL;DR
The paper addresses qualitative prediction of soft-body motion by introducing a visual transformer-based neuro-animator that predicts the next-frame 3D positions $P^{t+1}$ from the past sequence $[P^{t-n},...,P^t]$. It treats 3D particle trajectories from an 11-by-11 flag grid as tokens, embedding them into a trajectory-centric representation processed by a transformer with eight attention heads across eight layers, and trained with a robust Huber loss. The training data are generated from a mass-spring cloth simulation under gravity and three wind strengths, yielding approximately 15,000 sequences of length 64 across 121 trajectories. Results show learned temporal embeddings and plausible flag-waving under wind, though there remains room to improve motion naturalness and realism. This work demonstrates a memory-driven, qualitative visualization approach that can approximate dynamic physics without explicit numerical simulations, with potential applications in rapid qualitative forecasting of soft-body motion.
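The trajectory-as-token idea and the Huber loss mentioned above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names `trajectories_as_tokens` and `huber_loss`, the flattening of each particle's trajectory into a single vector, and the threshold `delta=1.0` are assumptions made only for illustration. The grid size (11 by 11) and sequence length (64) are taken from the summary.

```python
import numpy as np

GRID = 11      # 11-by-11 particle grid (from the summary above)
SEQ_LEN = 64   # frames per sequence (from the summary above)

def trajectories_as_tokens(frames):
    """Turn (T, GRID, GRID, 3) frames into one token per particle.

    Assumed layout: each of the GRID*GRID particles contributes one token
    holding its flattened (x, y, z) trajectory over T frames.
    """
    T = frames.shape[0]
    per_particle = frames.reshape(T, GRID * GRID, 3)   # (T, 121, 3)
    per_particle = per_particle.transpose(1, 0, 2)     # (121, T, 3)
    return per_particle.reshape(GRID * GRID, T * 3)    # (121, T*3)

def huber_loss(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it.

    delta is a hypothetical default; the source does not state its value.
    """
    err = np.abs(pred - target)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quadratic, linear)))

# Example with random stand-in data (not simulation output).
frames = np.random.default_rng(0).normal(size=(SEQ_LEN, GRID, GRID, 3))
tokens = trajectories_as_tokens(frames)   # shape (121, 192)
```

The linear branch of the loss is what makes it less sensitive than squared error to occasional large position errors.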
Abstract
The human mind effortlessly simulates the movements of objects governed by the laws of physics, such as a flag fluttering or waving in the wind, without understanding the underlying physics. This suggests that human cognition can predict how physical events unfold through an intuitive prediction process. This process might result from memory recall, yielding a qualitatively believable mental image that need not match real-world physics exactly. Inspired by this human ability to qualitatively visualize and describe dynamic events from past experience without explicit mathematical computation, this paper investigates the application of recent transformer architectures as a neuro-animator model. The visual transformer model is trained to predict flag motion at time step $t+1$, given the motion at the previous time steps $t-n, \ldots, t$. The results show that the visual transformer-based architecture successfully learns temporal embeddings of flag motion and produces reasonable-quality simulations of flag waving under different wind forces.
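The next-step prediction setup described in the abstract can be sketched as a sliding window over a simulated sequence. This is an illustrative assumption about how training pairs might be built, not the author's pipeline; `make_windows`, the array layout `(T, N, 3)`, and the toy sizes are hypothetical.

```python
import numpy as np

def make_windows(frames, n):
    """Build (input, target) pairs for next-frame prediction.

    frames: array of shape (T, N, 3) holding N particle positions per frame.
    Each input covers frames t-n .. t (n + 1 frames); the target is frame t+1.
    """
    T = frames.shape[0]
    steps = range(n, T - 1)   # every t that has n past frames and a successor
    inputs = np.stack([frames[t - n : t + 1] for t in steps])
    targets = np.stack([frames[t + 1] for t in steps])
    return inputs, targets

# Toy example: 10 frames of 2 particles, with a history of n = 3 past frames.
toy_frames = np.arange(10 * 2 * 3, dtype=float).reshape(10, 2, 3)
x, y = make_windows(toy_frames, n=3)
```

In this toy case `x` has shape `(6, 4, 2, 3)` and `y` has shape `(6, 2, 3)`: six windows of four frames each, paired with the frame that follows.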
