Hierarchical Invariance for Robust and Interpretable Vision Tasks at Larger Scales

Shuren Qi, Yushu Zhang, Chao Wang, Zhihua Xia, Xiaochun Cao, Jian Weng

TL;DR

Hierarchical Invariance introduces a principled framework, HIR, that extends moment invariants into a CNN-like cascade of covariant and invariant modules to achieve continuous translation/rotation/flipping equivariance and scale covariance. By defining layers $\mathbb{C}$, $\mathbb{S}$, $\mathbb{P}$, and $\mathbb{I}$ and organizing them along paths, the method produces globally invariant representations while preserving geometric information across layers. The paper provides both fast FFT-based and high-accuracy numerical implementations, and demonstrates data adaptivity via NAS-like path selection and cascading learning to boost discriminability on texture, digit, and parasite tasks, along with robust forensics applications against adversarial perturbations and AIGC. Across datasets and forensic benchmarks, HIR shows competitive or superior discriminability, robustness, and efficiency compared to hand-crafted, scattering, and CNN baselines, highlighting its potential as an interpretable alternative for robust vision at larger scales.

Abstract

Developing robust and interpretable vision systems is a crucial step towards trustworthy artificial intelligence. In this regard, a promising paradigm considers embedding task-required invariant structures, e.g., geometric invariance, in the fundamental image representation. However, such invariant representations typically exhibit limited discriminability, which restricts their application to larger-scale trustworthy vision tasks. For this open problem, we conduct a systematic investigation of hierarchical invariance, exploring this topic from theoretical, practical, and application perspectives. At the theoretical level, we show how to construct over-complete invariants with a Convolutional Neural Network (CNN)-like hierarchical architecture, yet in a fully interpretable manner. The general blueprint, specific definitions, invariant properties, and numerical implementations are provided. At the practical level, we discuss how to customize this theoretical framework for a given task. With the over-completeness, discriminative features w.r.t. the task can be adaptively formed in a Neural Architecture Search (NAS)-like manner. We demonstrate the above arguments with accuracy, invariance, and efficiency results on texture, digit, and parasite classification experiments. Furthermore, at the application level, our representations are explored in real-world forensics tasks on adversarial perturbations and Artificial Intelligence Generated Content (AIGC). Such applications reveal that the proposed strategy not only realizes the theoretically promised invariance, but also exhibits competitive discriminability even in the era of deep learning. For robust and interpretable vision tasks at larger scales, hierarchical invariant representation can be considered an effective alternative to traditional CNNs and invariants.
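
The idea of cascading a covariant module with an invariant one can be illustrated with a toy sketch. The code below is not the paper's HIR definition: it assumes a plain circular convolution computed via the FFT as the covariant step and a global sum of magnitudes as the invariant step, and it covers only discrete circular translations rather than the continuous translation, rotation, and flipping handled by the paper. The function names `covariant_layer` and `invariant_layer` are hypothetical.

```python
import numpy as np


def covariant_layer(image, kernel):
    """Circular convolution via the FFT (assumed stand-in for a covariant module).

    A circular shift of the input produces the same circular shift of the output.
    """
    spectrum = np.fft.fft2(image) * np.fft.fft2(kernel, s=image.shape)
    return np.real(np.fft.ifft2(spectrum))


def invariant_layer(feature_map):
    """Global pooling of magnitudes (assumed stand-in for an invariant module).

    Summing over all positions discards location, so circular shifts cancel out.
    """
    return np.abs(feature_map).sum()


rng = np.random.default_rng(0)
image = rng.random((32, 32))
kernel = rng.standard_normal((5, 5))

# Apply a circular translation to the input.
shift = (3, -7)
shifted = np.roll(image, shift=shift, axis=(0, 1))

feat = covariant_layer(image, kernel)
feat_shifted = covariant_layer(shifted, kernel)

# Covariance: the feature map moves with the input.
covariance_ok = np.allclose(np.roll(feat, shift=shift, axis=(0, 1)), feat_shifted)

# Invariance: the pooled feature is unchanged by the shift.
invariance_ok = np.isclose(invariant_layer(feat), invariant_layer(feat_shifted))

print(covariance_ok, invariance_ok)
```

In this toy setting the intermediate feature map still carries position information (it shifts with the input), while only the final pooled value discards it, which mirrors the abstract's distinction between geometrically controlled intermediate layers and an invariant last layer.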

Paper Structure (24 sections, 27 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: The blueprint of hierarchical invariance. Image information passes through each intermediate layer in a geometrically controllable manner, and the last layer yields invariant features from a compact design that still retains sufficient information.
  • Figure 2: A single-scale practice of HIR with invariance for $\mathfrak{G}_1$. This tree-like HIR network encodes a set of paths, where blue and black nodes denote representation units (with different parameters) and identity functions, respectively; lines denote cascading relationships between nodes.
  • Figure 3: A multi-scale practice of HIR with invariance for $\mathfrak{G}_0$. This multi-scale HIR network is based on a scale-separation prior, where the scaling covariance is transformed into a linear translation pattern between multi-scale networks. One can derive scale-invariant representations by pooling feature maps from a series of corresponding nodes at multiple scales.
  • Figure 4: Illustration for the datasets from the computer vision and pattern recognition experiments.
  • Figure 5: Illustration for the datasets from the digital forensic and forgery detection experiments.
  • ...and 2 more figures

Theorems & Definitions (10)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • proof
  • proof
  • proof
  • Definition 6
  • Definition 7