Continuous Piecewise-Affine Based Motion Model for Image Animation

Hexiang Wang, Fengqi Liu, Qianyu Zhou, Ran Yi, Xin Tan, Lizhuang Ma

TL;DR

This work addresses motion transfer for image animation when driving motions exhibit large displacements by modeling motion with Continuous Piecewise-Affine (CPAB) velocity fields, offering higher expressiveness than affine or TPS transforms. It learns CPAB parameters from keypoint pairs via gradient-descent inference and reinforces semantic and structural consistency through a SAM-based keypoint semantic loss and a DINO ViT-based structure alignment loss. The approach demonstrates strong quantitative and qualitative performance across four diverse datasets, with ablations confirming the gains from global CPAB integration and the two auxiliary losses. The method advances unsupervised image animation by enabling more faithful motion transfer while preserving source identity, with public code to facilitate adoption and benchmarking.

Abstract

Image animation aims to bring static images to life according to driving videos and create engaging visual content that can be used for various purposes such as animation, entertainment, and education. Recent unsupervised methods use keypoint-based affine and thin-plate spline transformations to transfer the motion in driving frames to the source image. However, limited by the expressive power of these transformations, such methods tend to produce poor results when the gap between the motion in the driving frame and the source image is large. To address this issue, we propose to model the motion from the source image to the driving frame in highly expressive diffeomorphism spaces. First, we introduce the Continuous Piecewise-Affine Based (CPAB) transformation to model the motion and present a well-designed inference algorithm to generate CPAB transformations from control keypoints. Second, we propose a SAM-guided keypoint semantic loss to further constrain the keypoint extraction process and improve the semantic consistency between corresponding keypoints on the source and driving images. Finally, we design a structure alignment loss to align the structure-related features extracted from the driving and generated images, helping the generator produce results that are more consistent with the driving action. Extensive experiments on four datasets demonstrate the effectiveness of our method against state-of-the-art competitors, both quantitatively and qualitatively. Code will be publicly available at https://github.com/DevilPG/AAAI2024-CPABMM.
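To make the idea of inferring a CPAB transformation from control keypoints more concrete, the following is a minimal 1-D sketch, not the authors' implementation. It assumes a simplified parametrization (velocities at the tessellation vertices, with zero velocity at the domain boundary), Euler integration of the velocity field, and finite-difference gradients in place of automatic differentiation. All function names, keypoint values, and hyperparameters are hypothetical.

```python
import numpy as np

def integrate_cpa(x0, vertices, vertex_velocities, n_steps=50):
    """Euler-integrate dx/dt = v(x) over t in [0, 1] for a 1-D CPA velocity field."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        # np.interp is linear between vertices, so v is continuous and
        # affine inside every cell of the tessellation.
        x = x + dt * np.interp(x, vertices, vertex_velocities)
    return x

def keypoint_loss(theta, vertices, src, drv):
    """Mean squared distance between warped source keypoints and driving keypoints."""
    # Assumption: boundary velocities are pinned to zero so [0, 1] maps onto itself.
    velocities = np.concatenate(([0.0], theta, [0.0]))
    warped = integrate_cpa(src, vertices, velocities)
    return float(np.mean((warped - drv) ** 2))

def fit_theta(vertices, src, drv, lr=0.1, n_iters=200, eps=1e-5):
    """Plain gradient descent on the keypoint loss with central finite differences."""
    theta = np.zeros(len(vertices) - 2)
    for _ in range(n_iters):
        grad = np.zeros_like(theta)
        for j in range(theta.size):
            step = np.zeros_like(theta)
            step[j] = eps
            loss_plus = keypoint_loss(theta + step, vertices, src, drv)
            loss_minus = keypoint_loss(theta - step, vertices, src, drv)
            grad[j] = (loss_plus - loss_minus) / (2.0 * eps)
        theta = theta - lr * grad
    return theta

# Hypothetical toy data: four cells on [0, 1] and six keypoint pairs.
vertices = np.linspace(0.0, 1.0, 5)
theta_true = np.array([0.2, -0.1, 0.15])
src = np.array([0.1, 0.3, 0.45, 0.6, 0.8, 0.9])
drv = integrate_cpa(src, vertices, np.concatenate(([0.0], theta_true, [0.0])))

initial_loss = keypoint_loss(np.zeros(3), vertices, src, drv)
theta_fit = fit_theta(vertices, src, drv)
final_loss = keypoint_loss(theta_fit, vertices, src, drv)
print(f"loss before: {initial_loss:.2e}, after: {final_loss:.2e}")
```

The paper works with 2-D image coordinates and several local and global CPA spaces; the sketch only illustrates the general pattern of optimizing velocity-field parameters so that the integrated transformation carries one set of keypoints onto the other.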

Paper Structure

This paper contains 18 sections, 14 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Our framework animates a static image according to a driving video via Continuous Piecewise-Affine velocity fields (inset), which divide the image space into a grid of small cells (tessellation) and express the motion transfer from the driving image to the source image using an independent affine transformation within each cell.
  • Figure 2: Overview of our method. The Keypoint Detector predicts $N$ sets of keypoints for the source and driving images, from which we generate $N$ local CPA spaces with our Keypoint-Based Transformation Inference algorithm. In addition, we combine the $N$ sets of keypoints and generate a global CPA space from all the keypoints. We then integrate the $N+1$ CPA spaces to obtain $N+1$ CPAB transformations. Together with the background transformation, the $N+2$ transformations are combined by the Dense Motion Net to generate several occlusion maps and a dense optical flow. Finally, we feed the source image and the dense motion results into the Inpainting Net to generate the target frame.
  • Figure 3: Visualization examples of type-II tessellation, which divides the image space into squares of the same size.
  • Figure 4: The structure of the Keypoint Semantic Extractor. The image encoder, keypoint encoder, and transformer decoder are taken from the pre-trained SAM (Kirillov et al., 2023).
  • Figure 5: Video reconstruction task: failure cases generated by TPSMM on which our method performs better.
  • ...and 2 more figures
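
The Figure 2 caption states that the $N+2$ transformations are combined by the Dense Motion Net into a dense optical flow, but the captions do not say how. The sketch below assumes per-pixel softmax weights over the candidate flows, a scheme that is common in keypoint-based animation models. It should be read as an illustration of that assumption rather than the paper's exact design. The function name, array shapes, and random inputs are hypothetical.

```python
import numpy as np

def combine_flows(candidate_flows, logits):
    """Blend K candidate flow fields into one dense flow.

    candidate_flows: array of shape (K, H, W, 2), one displacement field per transformation.
    logits:          array of shape (K, H, W), unnormalized per-pixel scores
                     (assumed here to come from a network such as the Dense Motion Net).
    Returns the blended flow (H, W, 2) and the softmax weights (K, H, W).
    """
    # Subtract the per-pixel maximum for a numerically stable softmax over K.
    shifted = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    # Weighted sum of the candidate displacements at every pixel.
    dense_flow = (weights[..., None] * candidate_flows).sum(axis=0)
    return dense_flow, weights

rng = np.random.default_rng(0)
n_local = 3                      # N local CPAB transformations (hypothetical value)
num_candidates = n_local + 2     # N local + 1 global + 1 background
height = width = 4

candidate_flows = rng.normal(size=(num_candidates, height, width, 2))
logits = rng.normal(size=(num_candidates, height, width))
dense_flow, weights = combine_flows(candidate_flows, logits)

# With one strongly dominant score, the blend reduces to that candidate's flow.
dominant_logits = np.full((num_candidates, height, width), -1e9)
dominant_logits[2] = 0.0
selected_flow, _ = combine_flows(candidate_flows, dominant_logits)
```

Under this assumption, each pixel follows whichever transformation scores highest at that location, which lets the local, global, and background motions each cover the image regions they describe best.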