Learning Fourier shapes to probe the geometric world of deep neural networks

Jian Wang, Yixing Yong, Haixia Bi, Lijun He, Fan Li

TL;DR

The paper tackles how deep neural networks encode geometry by introducing a differentiable framework that learns shape-only representations via Fourier parameterization and a winding-number mapper to pixels. This enables three core capabilities: generating shapes that carry class-specific semantics, using shapes as high-fidelity, boundary-precise interpretability masks, and deploying shape-based adversarial attacks that generalize across detection and recognition tasks. Key findings show that shape alone can trigger high-confidence classifications across architectures, that shape masks isolate minimal salient regions with sharp boundaries, outperforming Grad-CAM, and that optimized shapes can significantly degrade downstream detectors like YOLOv3, with attack strength scaling with shape complexity $K$ (the highest Fourier harmonic retained) and transferring across models. Collectively, the work opens new avenues for probing, interpreting, and challenging machine perception through geometry, with potential extensions to data augmentation and 3D shapes for robust understanding of vision systems.

Abstract

While both shape and texture are fundamental to visual recognition, research on deep neural networks (DNNs) has predominantly focused on the latter, leaving their geometric understanding poorly probed. Here, we show: first, that optimized shapes can act as potent semantic carriers, generating high-confidence classifications from inputs defined purely by their geometry; second, that they are high-fidelity interpretability tools that precisely isolate a model's salient regions; and third, that they constitute a new, generalizable adversarial paradigm capable of deceiving downstream visual tasks. This is achieved through an end-to-end differentiable framework that unifies a powerful Fourier series to parameterize arbitrary shapes, a winding number-based mapping to translate them into the pixel grid required by DNNs, and signal energy constraints that enhance optimization efficiency while ensuring physically plausible shapes. Our work provides a versatile framework for probing the geometric world of DNNs and opens new frontiers for challenging and understanding machine perception.
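The first two components named in the abstract, a Fourier-series contour and a winding-number mapping to pixels, can be illustrated with a minimal sketch. It assumes the contour is $z(t) = \sum_{k=-K}^{K} c_k e^{ikt}$ and uses a hard inside/outside test. The paper's mapping is differentiable, and its exact form is not given in this section, so the function names, grid size, and thresholding below are illustrative assumptions rather than the authors' implementation.

```python
import cmath
import math


def fourier_contour(coeffs, num_points=256):
    """Sample z(t) = sum_k c_k * exp(i*k*t) at evenly spaced t in [0, 2*pi).

    coeffs: dict mapping an integer harmonic k (from -K to K) to a complex c_k.
    Returns a list of complex points; the curve is closed by construction.
    """
    points = []
    for n in range(num_points):
        t = 2.0 * math.pi * n / num_points
        points.append(sum(c * cmath.exp(1j * k * t) for k, c in coeffs.items()))
    return points


def winding_number(contour, p):
    """Winding number of the closed polygon `contour` around the point p.

    Sums the signed angle that each edge subtends at p, then divides by 2*pi.
    Assumes p does not lie exactly on the contour.
    """
    total = 0.0
    shifted = contour[1:] + contour[:1]  # pairs each vertex with its successor
    for a, b in zip(contour, shifted):
        total += cmath.phase((b - p) / (a - p))
    return total / (2.0 * math.pi)


def rasterize(coeffs, size=16, extent=2.0):
    """Binary mask on a size x size grid covering [-extent, extent]^2.

    Illustrative hard threshold: a pixel is 1 when the contour winds around
    its centre at least once. This version is not differentiable.
    """
    contour = fourier_contour(coeffs)
    mask = []
    for row in range(size):
        y = extent - 2.0 * extent * (row + 0.5) / size
        line = []
        for col in range(size):
            x = -extent + 2.0 * extent * (col + 0.5) / size
            w = winding_number(contour, complex(x, y))
            line.append(1 if abs(round(w)) >= 1 else 0)
        mask.append(line)
    return mask


# Hypothetical example: a unit circle (k = 1) perturbed by a k = -2 harmonic.
example = {1: 1.0 + 0.0j, -2: 0.2 + 0.0j}
for mask_row in rasterize(example):
    print("".join("#" if v else "." for v in mask_row))
```

Replacing the hard threshold with a smooth function of the winding number is one way to let gradients reach the coefficients; how the paper does this is not specified here.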

Paper Structure

This paper contains 13 sections, 11 equations, 4 figures.

Figures (4)

  • Figure 1: Conceptual overview of adversarial shape learning. a, Human and machine visual systems rely on consistent shape and appearance attributes for robust object recognition. When these attributes are mismatched, such as an apple's shape with a banana's texture, perceptual conflict arises, illustrating that shape is an independently salient attribute. b, Prior work on adversarial attacks primarily targets the appearance domain. This involves either adding subtle, global pixel perturbations to misclassify an image (e.g., a panda recognized as a gibbon) or deploying localized adversarial patches to cause detection failures. These methods operate on pixel values without explicitly manipulating underlying geometry. c, Our framework enables end-to-end differentiable optimization of object shapes for adversarial machine learning. It addresses three key challenges: (1) Shape parameterization: Arbitrary closed contours are represented by a compact set of Fourier series coefficients. (2) Differentiable mapping: A module based on the winding number theorem translates these coefficients into a 2D grid image, creating a differentiable bridge to DNNs. (3) Effective optimization: Regularization, inspired by signal energy theory, guides the learning process to ensure physically plausible shapes by constraining high-frequency components. This integrated pipeline allows for the discovery and optimization of effective adversarial shapes. Images in b are from refs. Goodfellow_Explaining_2014 and Thys_CVPRW2019.
  • Figure 2: Overview of the three experimental frameworks enabled by the differentiable shape learning pipeline. a, Experiment 1: Class-specific shape generation. A set of Fourier coefficients, $\mathbf{c} = \{c_k\}_{k=-K}^{K}$, is converted via the differentiable mapping into a gray-scale shape image. This image is fed directly into a classifier. The coefficients are optimized using gradient descent to maximize the classification confidence for a chosen target class, demonstrating the semantic representation capability of shape alone. b, Experiment 2: Shape as an interpretability tool. The Fourier coefficients are mapped to a gray-scale image, which is used as a mask on a given natural image. The masked input is fed into a classifier. The coefficients are optimized using two symmetric objectives: (1) to maximize the confidence for the true class while simultaneously minimizing the shape's area, thereby isolating the model's minimal salient region; or (2) to minimize the true class confidence while maximizing the shape's area, identifying the minimal critical region whose occlusion causes misclassification. c, Experiment 3: Shape as a generalizable adversarial paradigm. The Fourier coefficients are mapped to a gray-scale image, which is then rendered as an occlusion patch onto a target (e.g., a person) in a natural image. The rendered input is fed into an object detector. The coefficients are optimized to minimize the detection confidence scores for the occluded target, causing the model to fail the detection task.
  • Figure 3: Adversarial shapes generated from scratch can embody class-specific semantics. a, Qualitative examples of shapes generated with the ResNet-50 model. Left, a shape generated to be classified as tench using a complexity of $K=10$. Right, a more detailed shape generated for the golden retriever class using $K=25$. The top-5 classification predictions and their confidence scores are listed for each shape, demonstrating high confidence for the target class and semantically logical subsequent predictions. b, The effect of shape complexity on classification confidence for the ice bear class. As $K$ increases from 5 to 25, the shape incorporates more detail, and the target confidence increases monotonically from $1.14\%$ to $98.86\%$. c, Generalization of the learnable Fourier shape across diverse model architectures and all ImageNet classes. The plot displays the top-1 classification success rate as a function of shape complexity. Each curve represents a different model architecture. The success rate for each point is the average across all 1,000 ImageNet classes. For all models tested, the success rate consistently exceeds $90\%$ as $K$ increases beyond 20.
  • Figure 5: Adversarial shapes as a generalizable attack paradigm for object detection. a, Qualitative results of the shape attack against the YOLOv3 detector. In each pair, the left image shows the benign detection (person detected, green box) and the right image shows the attacked version. The optimized white Fourier shape ($K=10$) causes the detector to fail, and the person is no longer detected (detection confidence $\le 0.5$). b, Comparison of the optimized Fourier shape against simple geometric occlusions of similar area. While simple shapes (rectangle, ellipse, triangle, star) have a negligible effect on detection confidence (e.g., $93.2\%$ to $94.9\%$), the optimized shape reduces the confidence to $15.9\%$, successfully evading detection. c, Quantitative ablation on the effect of shape complexity across a set of 140 COCO images. The Attack Success Rate (ASR) vs. Confidence plot (left) shows that ASR (higher is better) increases with higher $K$. The Precision-Recall (PR) curves (right) show that the Average Precision (AP, lower is better) for the person class decreases as $K$ increases. d, Generalization of the shape attack ($K=10$) across diverse detector architectures (YOLOv3, RetinaNet, and FCOS). The ASR vs. Confidence plot (left) shows the attack is effective against all models. The PR curves (right) show a significant performance degradation for all attacked models (solid lines) compared to their benign baselines (dashed lines).
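
The captions mention two quantities that can be written directly in terms of the Fourier coefficients: a penalty on high-frequency components (Figure 1c) and the shape's area (Figure 2b). The sketch below is a hedged illustration, not the paper's formulation, since the exact regularizer and area term are not given in this section. It assumes a contour $z(t) = \sum_k c_k e^{ikt}$, uses a $k^2$ weighting as one plausible energy penalty, and uses the standard Green's-theorem area $\pi \sum_k k\,|c_k|^2$, which equals the enclosed area only for a simple counterclockwise contour. All function names are hypothetical.

```python
import cmath
import math


def frequency_weighted_energy(coeffs):
    """Assumed penalty: sum of k^2 * |c_k|^2, which grows with high harmonics.

    By Parseval's theorem this equals the mean of |dz/dt|^2 along the contour.
    """
    return sum((k * k) * abs(c) ** 2 for k, c in coeffs.items())


def mean_squared_speed(coeffs, num_points=512):
    """Numerical check: average of |dz/dt|^2 over evenly spaced samples of t."""
    total = 0.0
    for n in range(num_points):
        t = 2.0 * math.pi * n / num_points
        dz = sum(1j * k * c * cmath.exp(1j * k * t) for k, c in coeffs.items())
        total += abs(dz) ** 2
    return total / num_points


def enclosed_area(coeffs):
    """Signed area pi * sum_k k * |c_k|^2, from Green's theorem applied to z(t)."""
    return math.pi * sum(k * abs(c) ** 2 for k, c in coeffs.items())


def polygon_area(coeffs, num_points=2048):
    """Numerical check: shoelace area of the sampled contour polygon."""
    pts = []
    for n in range(num_points):
        t = 2.0 * math.pi * n / num_points
        pts.append(sum(c * cmath.exp(1j * k * t) for k, c in coeffs.items()))
    shifted = pts[1:] + pts[:1]  # pairs each vertex with its successor
    # Im(conj(a) * b) = x_a * y_b - y_a * x_b, the shoelace cross term.
    return 0.5 * sum((a.conjugate() * b).imag for a, b in zip(pts, shifted))


# Hypothetical example: a unit circle (k = 1) perturbed by a k = -2 harmonic.
example = {1: 1.0 + 0.0j, -2: 0.2 + 0.0j}
print(frequency_weighted_energy(example), mean_squared_speed(example))
print(enclosed_area(example), polygon_area(example))
```

Because both quantities are polynomials in the coefficients, they are differentiable and could be added to a classification or detection loss as weighted terms; the weights and the precise combination used in the paper are not stated here.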