CodingHomo: Bootstrapping Deep Homography With Video Coding

Yike Liu, Haipeng Li, Shuaicheng Liu, Bing Zeng

TL;DR

CodingHomo addresses unsupervised deep homography estimation under challenging motions by bootstrapping with motion vectors (MVs) derived from video coding. It introduces Mask-Guided Fusion (MGF) and Mask-Guided Homography Estimation (MGHE) to fuse MV priors into a coarse-to-fine warping framework, guided by an Enhanced Motion Mask $M_e$ computed from MVs and features. An unsupervised loss combining $\ell_{align}$, $\ell_{FIL}$, and $\ell_{plane}$ focuses learning on the dominant plane and suppresses outliers via a probabilistic MV-homography model. Empirically, CodingHomo achieves state-of-the-art performance on CA-unsup and strong generalization to GHOF, demonstrating robust, transferable homography estimation in real-world, dynamic scenes. The work highlights the practical value of compressed-domain cues for geometric estimation and provides detailed ablations and a public codebase to facilitate reproducibility.

Abstract

Homography estimation is a fundamental task in computer vision with applications in diverse fields. Recent advances in deep learning have improved homography estimation, particularly with unsupervised learning approaches, offering increased robustness and generalizability. However, accurately predicting homography, especially in complex motions, remains a challenge. In response, this work introduces a novel method leveraging video coding, particularly by harnessing inherent motion vectors (MVs) present in videos. We present CodingHomo, an unsupervised framework for homography estimation. Our framework features a Mask-Guided Fusion (MGF) module that identifies and utilizes beneficial features among the MVs, thereby enhancing the accuracy of homography prediction. Additionally, the Mask-Guided Homography Estimation (MGHE) module is presented for eliminating undesired features in the coarse-to-fine homography refinement process. CodingHomo outperforms existing state-of-the-art unsupervised methods, delivering good robustness and generalizability. The code and dataset are available at: https://github.com/liuyike422/CodingHomo

Paper Structure

This paper contains 24 sections, 15 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: The overview of our work. We extract MVs while decoding frames and use them, together with the reconstructed images, as input for homography estimation. Panels (a) and (b) show the error heatmap between the target and the warped source image; the darker the image, the better the alignment. With the MV prior, our result significantly outperforms the existing method HomoGAN in scenes with dynamic foregrounds.
  • Figure 2: An example of network input. (a) Reconstructed image pair. (b) MVs. The green block indicates the dominant plane area. The red block denotes a dynamic vehicle.
  • Figure 3: The overall pipeline of CodingHomo. Our network architecture consists of three stages: 1) Feature extraction stage: a CNN module projects the input images into feature space, and a multi-scale CNN encoder generates a feature pyramid. 2) Homography estimation stage: cascaded MGF and MGHE blocks predict the homography from coarse to fine. 3) Mask prediction stage: a mask generated from $V_{ab}$, $H_{ab}$, $F_b$ and the warped $F_a$ ($F'_a$) is applied to the loss function to help the network focus on the dominant plane. Red arrows indicate the inference pipeline.
  • Figure 4: Illustration of MV extraction during the decoding process. The input includes a reference frame and the bitstream of the current frame. Normally, intermediate decoding data such as residual data and MVs are discarded after decoding. In our approach, we preserve the MVs and output them along with the reconstructed frame.
  • Figure 5: The structure of the mask-guided fusion (MGF) module. The pre-estimated $H_{ab}^i$ and the scaled $V_{ab}^i$ are used to produce a motion rejection mask $M_m^i$. Subsequently, $M_m^i$, $H_{ab}^i$ and $V_{ab}^i$ are input into a fusion network to compute a residual homography, which is then combined with $H_{ab}^i$ to give the final fused homography $\widetilde{H}_{ab}^i$.
  • ...and 9 more figures
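
The Figure 5 caption describes a motion rejection mask built from a pre-estimated homography and the MV field, but it does not give the formula. The sketch below is one plausible reading, not the authors' implementation: it compares the per-pixel displacement induced by a homography with a dense MV field and keeps pixels where the two agree. The function names, the hard threshold, the dense `(height, width, 2)` MV layout, and the sign convention (MVs point from source to target pixels) are all assumptions made for illustration.

```python
import numpy as np

def homography_flow(H, height, width):
    """Per-pixel displacement (dx, dy) induced by a 3x3 homography H.

    Hypothetical helper: assumes pixel coordinates (x, y) with the origin
    at the top-left corner and a nonzero projective denominator.
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # (height, width, 3)
    warped = pts @ H.T                                     # apply H to every pixel
    warped_xy = warped[..., :2] / warped[..., 2:3]         # back to inhomogeneous coords
    return warped_xy - pts[..., :2]

def motion_rejection_mask(H, mv, threshold=1.0):
    """Return 1.0 where the MV agrees with the homography, else 0.0.

    Assumption: a hard threshold (in pixels) on the residual norm; the
    paper's mask may be soft or learned.
    """
    height, width, _ = mv.shape
    residual = np.linalg.norm(mv - homography_flow(H, height, width), axis=-1)
    return (residual <= threshold).astype(np.float32)

# Toy example: the dominant plane shifts by 2 px in x, while a small block
# (standing in for a moving vehicle) has a different motion.
H = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
mv = np.zeros((4, 6, 2))
mv[..., 0] = 2.0
mv[1:3, 2:4] = [10.0, -5.0]
mask = motion_rejection_mask(H, mv)
```

In this toy case the mask is zero only on the 2x2 block whose motion disagrees with the homography, which matches the idea in Figure 2 of separating the dominant plane from a dynamic object.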