DogMo: A Large-Scale Multi-View RGB-D Dataset for 4D Canine Motion Recovery

Zan Wang, Siyu Chen, Luya Mo, Xinfeng Gao, Yuxin Shen, Lebin Ding, Wei Liang

TL;DR

DogMo introduces a large-scale, multi-view RGB-D dataset for 4D canine motion recovery, featuring 1.2k motion sequences from 10 dogs across 11 actions and 5 synchronized cameras. The authors propose a three-stage instance-specific optimization that fits the D-SMAL model to multi-view and RGB-D inputs, leveraging losses such as Chamfer Mask, CSE Keypoints, Leg Cross, and temporal regularization to progressively refine shape, pose, and motion. Four benchmark settings spanning monocular/multi-view and RGB/RGB-D inputs enable systematic evaluation of dog motion recovery, and experiments show improvements over baselines like SMALify and AnimalAvatar, with notable gains when depth and multiple views are available. The work highlights the potential of combining robust 3D priors, dense correspondences, and temporal consistency to enable accurate, plausible 4D dog motion reconstruction, with implications for animation, VR/AR, and animal behavior modeling.

Abstract

We present DogMo, a large-scale multi-view RGB-D video dataset capturing diverse canine movements for the task of motion recovery from images. DogMo comprises 1.2k motion sequences collected from 10 unique dogs, offering rich variation in both motion and breed. It addresses key limitations of existing dog motion datasets, including the lack of multi-view and real 3D data, as well as limited scale and diversity. Leveraging DogMo, we establish four motion recovery benchmark settings that support systematic evaluation across monocular and multi-view, RGB and RGB-D inputs. To facilitate accurate motion recovery, we further introduce a three-stage, instance-specific optimization pipeline that fits the SMAL model to the motion sequences. Our method progressively refines body shape and pose through coarse alignment, dense correspondence supervision, and temporal regularization. Our dataset and method provide a principled foundation for advancing research in dog motion recovery and open up new directions at the intersection of computer vision, computer graphics, and animal behavior modeling.
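The abstract mentions temporal regularization without specifying its form. One common choice, shown here purely as an assumed example, penalizes the squared change in pose parameters between consecutive frames. The name `temporal_smoothness` and the representation of each frame's pose as a flat list of floats are hypothetical.

```python
def temporal_smoothness(poses):
    """Mean squared frame-to-frame change of a pose-parameter sequence.

    `poses` is a list of equal-length lists of floats, one per frame.
    Illustrative only; not the regularizer defined in the paper.
    """
    if len(poses) < 2:
        # A single frame (or none) has no temporal variation to penalize.
        return 0.0
    total = 0.0
    for prev, curr in zip(poses, poses[1:]):
        # Squared Euclidean distance between consecutive pose vectors.
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(poses) - 1)
```

A static sequence scores zero, while jittery per-frame fits score higher, so minimizing a term like this alongside the data losses favors smooth motion.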

Paper Structure

This paper contains 62 sections, 12 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: We present DogMo, a large-scale multi-view RGB-D video dataset capturing diverse canine movements using $5$ synchronized low-cost cameras. DogMo contains $1.2k$ motion sequences from $10$ unique dog instances, spanning $11$ action types and totaling over $220$ minutes of recordings.
  • Figure 2: Motion Capture. We employ $5$ synchronized low-cost RGB-D cameras to record $10$ individual dogs.
  • Figure 3: Overview of DogMo. DogMo provides five-view RGB-D image sequences for each motion recording, along with corresponding instance masks, keypoints, and semantic (i.e., action label) annotations.
  • Figure 4: Optimization pipeline. Our optimization pipeline includes three stages, progressively refining the dog's body shape and pose through coarse alignment, dense correspondence supervision, and temporal regularization.
  • Figure 5: Qualitative results. Each benchmark setting shows two results, with three sampled frames per result from left to right. The three rows show the input image, the CSE model projection, and the recovered mesh, respectively. In multi-view settings, we show only one input view for clarity and visualize the mesh from two angles.
  • ...and 6 more figures