Vector sketch animation generation with differentiable motion trajectories
Xinding Zhu, Xinye Yang, Shuyang Zheng, Zhexin Zhang, Fei Gao, Jing Huang, Jiazhou Chen
TL;DR
This work tackles temporal coherence in automatic vector sketch animation by modeling stroke motion with a Differentiable Motion Trajectory (DMT), a differentiable polynomial-based representation. DMT uses a Bernstein basis to stabilize optimization and lets global semantic gradients flow across frames, producing high-framerate vector animations that scale to long videos. The approach outperforms state-of-the-art methods on DAVIS and LVOS and remains robust in cross-domain settings such as text-to-video and 3D animation. The paper also provides theoretical support for representing motion trajectories with polynomials and empirically compares fitting strategies to ensure stable, scalable initialization. Future directions include efficiency, tracking robustness, and broader multimedia applications.
Abstract
Sketching is a direct and inexpensive means of visual expression. Though image-based sketching has been well studied, video-based sketch animation generation remains very challenging because of the temporal coherence requirement. In this paper, we propose a novel end-to-end automatic generation approach for vector sketch animation. To solve the flickering issue, we introduce a Differentiable Motion Trajectory (DMT) representation that describes the frame-wise movement of stroke control points using differentiable polynomial-based trajectories. DMT enables global semantic gradient propagation across multiple frames, significantly improving semantic consistency and temporal coherence and producing high-framerate output. DMT employs a Bernstein basis to balance the sensitivity of the polynomial parameters, thus achieving more stable optimization. Instead of implicit fields, we introduce sparse track points for explicit spatial modeling, which improves efficiency and supports long-duration video processing. Evaluations on the DAVIS and LVOS datasets demonstrate the superiority of our approach over state-of-the-art methods. Cross-domain validation on 3D models and text-to-video data confirms the robustness and compatibility of our approach.
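To make the idea of a polynomial trajectory in a Bernstein basis concrete, the following is a minimal sketch of how the position of a single stroke control point could be evaluated over time. It is an illustration under stated assumptions, not the paper's implementation: the function names (`bernstein_basis`, `trajectory_point`, `sample_trajectory`), the 2D coefficient layout, and the uniform mapping of frames to t in [0, 1] are all hypothetical. The sketch covers evaluation only; in the approach described above the coefficients would be optimized through gradients, which is omitted here.

```python
from math import comb


def bernstein_basis(n, i, t):
    """i-th Bernstein basis polynomial of degree n, evaluated at t in [0, 1]."""
    return comb(n, i) * (t ** i) * ((1 - t) ** (n - i))


def trajectory_point(coeffs, t):
    """Evaluate a 2D polynomial trajectory at normalized time t.

    coeffs is a list of (x, y) Bernstein coefficients (hypothetical layout);
    the polynomial degree is len(coeffs) - 1.
    """
    n = len(coeffs) - 1
    x = sum(bernstein_basis(n, i, t) * cx for i, (cx, _) in enumerate(coeffs))
    y = sum(bernstein_basis(n, i, t) * cy for i, (_, cy) in enumerate(coeffs))
    return (x, y)


def sample_trajectory(coeffs, num_frames):
    """Sample the trajectory at num_frames uniformly spaced times in [0, 1].

    Assumption: frames map uniformly onto [0, 1]. Because the trajectory is
    continuous in t, any number of frames can be sampled from it.
    """
    if num_frames == 1:
        return [trajectory_point(coeffs, 0.0)]
    return [trajectory_point(coeffs, k / (num_frames - 1))
            for k in range(num_frames)]


# Example: a cubic trajectory for one control point, sampled at 5 frames.
coeffs = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
frames = sample_trajectory(coeffs, 5)
```

Because the Bernstein basis functions are non-negative and sum to one on [0, 1], every sampled position is a convex combination of the coefficients, and the trajectory starts at the first coefficient and ends at the last. Sampling the same coefficients at a denser set of times yields more frames without changing the parameterization.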
