Towards Visual Syntactical Understanding

Sayeed Shafayet Chowdhury, Soumyadeep Chandra, Kaushik Roy

TL;DR

This work introduces visual syntax by treating image parts as words and demonstrates that modern DNNs lack inherent visual syntactic understanding. It proposes a three-stage unsupervised framework (semantic part detection, part-based masking with a ViT-based masked autoencoder, and a syntax checker) to determine syntactic correctness and reconstruct a plausible correct configuration. The approach reaches $92.10\%$ accuracy on CelebA and $90.89\%$ on AFHQ, as stated in the abstract (class-wise figures of $92.10\%$, $94.10\%$, and $87.69\%$ are also reported), and generalizes to ImageNet/Caltech-101 without explicit training on them, while providing interpretable reconstructions that indicate which parts violate syntax. The work also analyzes its relation to OoD detection, presents ablations demonstrating the necessity of each module, and casts the method within a neuro-symbolic paradigm. Overall, the paper advances visual syntactic reasoning with a language-model-inspired training objective and highlights a practical path toward more robust, explainable visual reasoning systems.

Abstract

Syntax is usually studied in the realm of linguistics and refers to the arrangement of words in a sentence. Similarly, an image can be considered a visual 'sentence', with the semantic parts of the image acting as 'words'. While visual syntactic understanding comes naturally to humans, it is interesting to explore whether deep neural networks (DNNs) are equipped with such reasoning. To that end, we alter the syntax of natural images (e.g., swapping the eye and nose of a face), producing what we refer to as 'incorrect' images, to investigate the sensitivity of DNNs to such syntactic anomalies. Through our experiments, we discover an intriguing property of DNNs: state-of-the-art convolutional neural networks, as well as vision transformers, fail to discriminate between syntactically correct and incorrect images when trained on only correct ones. To counter this issue and enable visual syntactic understanding with DNNs, we propose a three-stage framework: (i) the 'words' (or the sub-features) in the image are detected, (ii) the detected words are sequentially masked and reconstructed using an autoencoder, and (iii) the original and reconstructed parts are compared at each location to determine syntactic correctness. The reconstruction module is trained with BERT-like masked autoencoding for images, with the motivation of leveraging language-model-inspired training to better capture the syntax. Note that our proposed approach is unsupervised in the sense that the incorrect images are only used during testing and the correct versus incorrect labels are never used for training. We perform experiments on the CelebA and AFHQ datasets and obtain classification accuracies of 92.10% and 90.89%, respectively. Notably, the approach generalizes well to ImageNet samples which share common classes with CelebA and AFHQ without explicitly training on them.
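The three stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Part` type, the box convention, the IoU matching threshold, the choice to feed each reconstruction into the next masking step, and the toy detector and reconstructor are all assumptions made for illustration. A real system would plug in a trained part detector and a masked autoencoder.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); hypothetical convention


@dataclass(frozen=True)
class Part:
    label: str  # the visual "word", e.g. "eye" or "nose"
    box: Box    # location at which it was detected


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def check_syntax(image, detect_parts: Callable, reconstruct_masked: Callable,
                 iou_threshold: float = 0.5):
    """Return (is_correct, violations) for one image.

    violations holds (found_label, expected_label) pairs, one per location
    where the reconstruction disagrees with the input.
    """
    # Stage (i): detect the 'words' in the input image.
    original: List[Part] = detect_parts(image)

    # Stage (ii): mask each detected part in turn and reconstruct it.
    # Assumption: each reconstruction is the input to the next masking step.
    current = image
    for part in original:
        current = reconstruct_masked(current, part.box)

    # Stage (iii): compare original and reconstructed parts per location.
    reconstructed: List[Part] = detect_parts(current)
    violations = []
    for part in original:
        candidates = [r for r in reconstructed
                      if iou(part.box, r.box) >= iou_threshold]
        best: Optional[Part] = max(
            candidates, key=lambda r: iou(part.box, r.box), default=None)
        if best is None or best.label != part.label:
            violations.append((part.label, best.label if best else None))
    return len(violations) == 0, violations


# Toy stand-ins (hypothetical): an "image" is a dict mapping box -> label,
# and the "autoencoder" always restores the canonical label at a location.
CANONICAL = {(0, 0, 10, 10): "eye", (0, 10, 10, 20): "nose"}


def toy_detect(image):
    return [Part(label, box) for box, label in image.items()]


def toy_reconstruct(image, box):
    restored = dict(image)
    restored[box] = CANONICAL[box]
    return restored


correct_image = dict(CANONICAL)
swapped_image = {(0, 0, 10, 10): "nose", (0, 10, 10, 20): "eye"}
print(check_syntax(correct_image, toy_detect, toy_reconstruct))
print(check_syntax(swapped_image, toy_detect, toy_reconstruct))
```

In this toy setting the violation list doubles as the interpretation step: each pair names the part found at a location and the part the reconstruction expected there.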

Paper Structure (27 sections, 15 figures, 2 tables, 1 algorithm)

Figures (15)

  • Figure 1: Predictions on syntactically correct and incorrect images using (a) 5-layer CNN, (b) CLIP_ViT-B/32, (c) ResNet-101, (d) DEIT_Tiny_16_224 (with relative positional encoding). For each pair, the correct image is on the left with the corresponding incorrect one on the right. The prediction probabilities are shown in parentheses with the predicted class.
  • Figure 2: Schematic of the proposed method. First, the input image is passed through a part detector (PD), and each detected word (part) is then sequentially masked and reconstructed. The words present in the reconstructed image are then detected using the same PD. Finally, a syntax checker compares the original and reconstructed parts at each location and evaluates syntactic correctness. Additionally, for incorrect inputs, an interpretation of what is incorrect is provided.
  • Figure 3: ViT-based autoencoder architecture. During training, some parts of the input are masked, and the visible patches are encoded and padded with zero tokens. Then, the decoder is used to reconstruct the masked patches.
  • Figure 4: The reconstruction pipeline where the detected parts from the PD are masked sequentially and reconstructed using the autoencoder. Eventually, the output recovers the correct version of the input.
  • Figure 5: Visual results of the proposed method for correct as well as incorrect inputs from different classes.
  • ...and 10 more figures
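
The token handling described in the Figure 3 caption (encode only the visible patches, then pad the masked positions with zero tokens before decoding) can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: the function names, the random choice of masked patches, the mask ratio, and the identity "encoder" are all hypothetical stand-ins.

```python
import random
from typing import List, Sequence, Tuple

Token = List[float]


def split_visible(n_tokens: int, mask_ratio: float,
                  rng: random.Random) -> Tuple[List[int], List[int]]:
    """Pick which patch indices stay visible and which are masked."""
    n_masked = int(round(n_tokens * mask_ratio))
    masked = sorted(rng.sample(range(n_tokens), n_masked))
    masked_set = set(masked)
    visible = [i for i in range(n_tokens) if i not in masked_set]
    return visible, masked


def pad_with_zero_tokens(encoded: Sequence[Token], visible: Sequence[int],
                         n_tokens: int, dim: int) -> List[Token]:
    """Place encoded visible tokens back in order; masked slots stay zero."""
    full = [[0.0] * dim for _ in range(n_tokens)]
    for index, token in zip(visible, encoded):
        full[index] = list(token)
    return full


# Eight patch tokens of dimension 2 (toy values).
tokens = [[float(i), float(i) + 0.5] for i in range(8)]
rng = random.Random(0)
visible, masked = split_visible(len(tokens), mask_ratio=0.5, rng=rng)

# Hypothetical encoder: identity over the visible tokens only.
encoded = [tokens[i] for i in visible]

# Sequence a decoder would receive: same length as the input, zeros at masks.
decoder_input = pad_with_zero_tokens(encoded, visible, len(tokens), dim=2)
print(visible, masked)
```

At test time, the masked set would presumably come from the detected part's location rather than from random sampling, as in the reconstruction pipeline of Figure 4.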