Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance
Li Hu, Guangyuan Wang, Zhen Shen, Xin Gao, Dechao Meng, Lian Zhuo, Peng Zhang, Bang Zhang, Liefeng Bo
TL;DR
The paper addresses the lack of environment-aware fidelity in diffusion-based character animation by introducing Animate Anyone 2, which learns environment affordance from driving videos and populates environment regions with coherent characters. Key innovations include a shape-agnostic boundary mask for robust character-scene integration, an object guider with spatial blending to preserve interactive object dynamics, and a depth-wise pose modulation to support diverse motions. Experiments on diverse datasets demonstrate superior quantitative metrics and qualitative integration of characters with scenes and objects, surpassing prior methods and showing robustness to motion variety. The work enables more realistic character animation in complex environments with practical implications for filmmaking, advertising, and virtual character applications, while noting limitations such as hand-object interaction artifacts and reliance on segmentation tools.
Abstract
Recent character image animation methods based on diffusion models, such as Animate Anyone, have made significant progress in generating consistent and generalizable character animations. However, these approaches fail to produce reasonable associations between characters and their environments. To address this limitation, we introduce Animate Anyone 2, which aims to animate characters with environment affordance. Beyond extracting motion signals from the source video, we additionally capture environmental representations as conditional inputs. The environment is formulated as the region that excludes the characters, and our model generates characters to populate these regions while maintaining coherence with the environmental context. We propose a shape-agnostic mask strategy that more effectively characterizes the relationship between character and environment. Furthermore, to enhance the fidelity of object interactions, we leverage an object guider to extract features of interacting objects and employ spatial blending for feature injection. We also introduce a pose modulation strategy that enables the model to handle more diverse motion patterns. Experimental results demonstrate the superior performance of the proposed method.
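The environment formulation described above (the frame with the character region excluded) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `coarsen_mask` and `extract_environment`, the `cell` parameter, and the grid-based coarsening used here as a stand-in for a shape-agnostic mask are all assumptions made for illustration, since the abstract does not specify how the mask is built.

```python
import numpy as np

def coarsen_mask(mask: np.ndarray, cell: int) -> np.ndarray:
    """Expand a binary character mask (H, W) to whole grid cells.

    Hypothetical stand-in for a shape-agnostic mask: any cell touched by
    the character is masked entirely, so the mask boundary no longer
    traces the exact silhouette.
    """
    h, w = mask.shape
    out = np.zeros_like(mask)
    for y in range(0, h, cell):
        for x in range(0, w, cell):
            if mask[y:y + cell, x:x + cell].any():
                out[y:y + cell, x:x + cell] = 1
    return out

def extract_environment(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out the masked region of an (H, W, C) frame, keeping the rest."""
    keep = (mask == 0)[..., None].astype(frame.dtype)
    return frame * keep

# Toy example: an 8x8 gray frame with a single character pixel.
frame = np.full((8, 8, 3), 200, dtype=np.uint8)
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2, 5] = 1
coarse = coarsen_mask(mask, cell=4)
environment = extract_environment(frame, coarse)
```

In this toy case the single character pixel causes its entire 4x4 cell to be removed from the environment, which hints at why a coarser mask leaks less of the source character's outline into the conditioning signal.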
