Revisiting Marr in Face: The Building of 2D--2.5D--3D Representations in Deep Neural Networks

Xiangyu Zhu, Chang Yu, Jiankuo Zhao, Zhaoxiang Zhang, Stan Z. Li, Zhen Lei

TL;DR

This paper introduces a graphics probe, a sub-network crafted to reconstruct the original image from a network's intermediate layers, and uses it to provide empirical support for Marr's theory of vision.

Abstract

David Marr's seminal theory of vision proposes that the human visual system operates through a sequence of three stages, known as the 2D sketch, the 2.5D sketch, and the 3D model. In recent years, Deep Neural Networks (DNNs) have been widely thought to have reached a level comparable to human vision. However, the mechanisms by which DNNs accomplish this, and whether they adhere to Marr's 2D--2.5D--3D construction theory, remain unexplored. In this paper, we delve into the perception task to explore these questions and find evidence supporting Marr's theory. We introduce a graphics probe, a sub-network crafted to reconstruct the original image from the network's intermediate layers. The key to the graphics probe is its flexible architecture, which supports images in both 2D and 3D formats, as well as in a transitional state between them. By injecting graphics probes into neural networks and analyzing their behavior in reconstructing images, we find that DNNs initially encode images as 2D representations in low-level layers and finally construct 3D representations in high-level layers. Intriguingly, in mid-level layers, DNNs exhibit a hybrid state, building a geometric representation that captures surface normals within a narrow depth range, akin to the appearance of a low-relief sculpture. This stage resembles the 2.5D representations, providing a view of how DNNs evolve from 2D to 3D in the perception process. The graphics probe therefore serves as a tool for peering into the mechanisms of DNNs, providing empirical support for Marr's theory.

Paper Structure

This paper contains 16 sections, 5 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: The building of 2D--2.5D--3D representations in DNN.
  • Figure 2: Schematic of Graphics Probe. (a) During the probing process, a probe token interacts with the original tokens and generates multiple graphics probes to reconstruct the input image in a CG manner. (b) The architecture of the probed network. (c) The visualization of probed representations across different levels: 2D at the low level, 2.5D at the middle level, and 3D at the high level.
  • Figure 3: Visualization of intermediate representations. The geometry of the representations is shown at the low, middle, and high levels, whose receptive fields (RF) correspond to $\frac{1}{4}\times$, $\frac{1}{2}\times$, and the full image size, respectively. At the low level, the geometry is flat, lacking any depth or normal variations. At the middle level, variations in the normals begin to appear, yet the depth remains shallow, similar to a low-relief sculpture. At the high level, a fully 3D representation is constructed.
  • Figure 4: Distribution of depth and normal variations. The distributions of variations for individual samples across the testing dataset for (a) depth, (b) x-axis of normal, (c) y-axis of normal, and (d) z-axis of normal. The mean variations for (e) depth and (f) normal throughout the training process.
  • Figure 5: The distribution of yaw angles across different levels and an illustration of viewer-centered and object-centered representations.
  • ...and 3 more figures
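
The depth and normal variations described in the captions of Figures 3 and 4 can be illustrated with a small numerical sketch. This is not the paper's graphics probe. It assumes that "variation" means the standard deviation over pixels and that depth and normals are available as separate maps, neither of which is stated in this excerpt. The helper names (`dome`, `normals_from_depth`, `variation_stats`) and the synthetic surfaces are hypothetical and exist only for illustration.

```python
import math


def std(values):
    """Population standard deviation of a flat list of numbers."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def dome(size, height):
    """Hypothetical test surface: a paraboloid bump with the given peak height."""
    c = (size - 1) / 2.0
    return [
        [height * max(0.0, 1.0 - ((x - c) ** 2 + (y - c) ** 2) / (c * c))
         for x in range(size)]
        for y in range(size)
    ]


def normals_from_depth(depth):
    """Unit normals from central differences, computed on interior pixels only."""
    h, w = len(depth), len(depth[0])
    normals = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dzdx = (depth[y][x + 1] - depth[y][x - 1]) / 2.0
            dzdy = (depth[y + 1][x] - depth[y - 1][x]) / 2.0
            norm = math.sqrt(dzdx * dzdx + dzdy * dzdy + 1.0)
            normals.append((-dzdx / norm, -dzdy / norm, 1.0 / norm))
    return normals


def variation_stats(depth, normals):
    """Assumed variation measure: std of depth and of each normal component."""
    return {
        "depth": std([d for row in depth for d in row]),
        "normal_x": std([n[0] for n in normals]),
        "normal_y": std([n[1] for n in normals]),
        "normal_z": std([n[2] for n in normals]),
    }


flat_surface = dome(17, 0.0)     # no relief at all
shallow_surface = dome(17, 0.5)  # compressed depth range
full_surface = dome(17, 8.0)     # full depth range

# Low level: flat depth and constant normals.
low = variation_stats(flat_surface, normals_from_depth(flat_surface))
# Middle level (low-relief analogy): shallow depth paired with the normals
# of the full surface. Pairing them this way is an assumption of this sketch.
mid = variation_stats(shallow_surface, normals_from_depth(full_surface))
# High level: depth and normals both come from the full surface.
high = variation_stats(full_surface, normals_from_depth(full_surface))

for name, stats in (("low", low), ("mid", mid), ("high", high)):
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

Under these assumptions, the low case has zero variation in both depth and normals, the middle case has normal variation comparable to the high case but a much smaller depth variation, and the high case varies in both, mirroring the qualitative pattern the captions describe.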