Towards Visual Syntactical Understanding
Sayeed Shafayet Chowdhury, Soumyadeep Chandra, Kaushik Roy
TL;DR
This work introduces visual syntax by treating the semantic parts of an image as words, and demonstrates that modern DNNs lack inherent visual syntactic understanding. It proposes a three-stage unsupervised framework (semantic part detection, part-based masking and reconstruction with a ViT-based masked autoencoder, and a syntax checker) that determines syntactic correctness and reconstructs a plausible correct configuration. The approach reaches classification accuracies of $92.10\%$ on CelebA and $90.89\%$ on AFHQ (the figures given in the abstract), with class-wise figures of $92.10\%$, $94.10\%$, and $87.69\%$ also reported. It generalizes to ImageNet/Caltech-101 without explicit training on them, and its reconstructions are interpretable, indicating which parts violate syntax. The work also analyzes the relation to OoD detection, presents ablations demonstrating the necessity of each module, and casts the method within a neuro-symbolic paradigm. Overall, the paper advances visual syntactic reasoning with a language-model-inspired training objective and points to a practical path toward more robust, explainable visual reasoning systems.
Abstract
Syntax is usually studied in the realm of linguistics and refers to the arrangement of words in a sentence. Similarly, an image can be considered a visual 'sentence', with the semantic parts of the image acting as 'words'. While visual syntactic understanding comes naturally to humans, it is interesting to explore whether deep neural networks (DNNs) are equipped with such reasoning. To that end, we alter the syntax of natural images (e.g., by swapping the eye and nose of a face), producing what we refer to as 'incorrect' images, to investigate the sensitivity of DNNs to such syntactic anomalies. Through our experiments, we discover an intriguing property of DNNs: state-of-the-art convolutional neural networks, as well as vision transformers, fail to discriminate between syntactically correct and incorrect images when trained on only correct ones. To counter this issue and enable visual syntactic understanding with DNNs, we propose a three-stage framework: (i) the 'words' (or the sub-features) in the image are detected, (ii) the detected words are sequentially masked and reconstructed using an autoencoder, and (iii) the original and reconstructed parts are compared at each location to determine syntactic correctness. The reconstruction module is trained with BERT-like masked autoencoding for images, with the motivation of leveraging language-model-inspired training to better capture the syntax. Note that our proposed approach is unsupervised in the sense that the incorrect images are used only during testing, and the correct versus incorrect labels are never used for training. We perform experiments on the CelebA and AFHQ datasets and obtain classification accuracies of 92.10% and 90.89%, respectively. Notably, the approach generalizes well to ImageNet samples that share common classes with CelebA and AFHQ, without explicitly training on them.
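The mask, reconstruct, and compare loop of stages (ii) and (iii) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the function `check_syntax`, the bounding-box format for parts, the mean-squared-error comparison, the threshold value, and the toy `toy_reconstruct` function that stands in for the trained masked autoencoder.

```python
import numpy as np

def check_syntax(image, part_boxes, reconstruct, threshold):
    """Mask each detected part in turn, reconstruct it, and compare it with the original.

    part_boxes maps a part name to (top, left, bottom, right); this format is hypothetical.
    reconstruct is any callable that fills in a masked image (a stand-in for the autoencoder).
    """
    errors = {}
    for name, (top, left, bottom, right) in part_boxes.items():
        masked = image.copy()
        masked[top:bottom, left:right] = 0.0           # hide one part at a time
        restored = reconstruct(masked)                 # predict the hidden part from context
        diff = restored[top:bottom, left:right] - image[top:bottom, left:right]
        errors[name] = float(np.mean(diff ** 2))       # assumed comparison: per-part MSE
    violating = [name for name, err in errors.items() if err > threshold]
    return len(violating) == 0, violating, errors

# Toy data: an 8x8 "face" with two parts on a uniform background.
template = np.full((8, 8), 0.5)
part_boxes = {"eye": (0, 0, 2, 2), "nose": (4, 4, 6, 6)}
template[0:2, 0:2] = 1.0   # eye
template[4:6, 4:6] = 2.0   # nose

def toy_reconstruct(masked):
    # Stand-in for a model that has learned the correct layout:
    # it fills masked (zero) pixels with the values of the correct template.
    return np.where(masked == 0.0, template, masked)

# Syntactically incorrect image: eye and nose swapped.
swapped = template.copy()
swapped[0:2, 0:2] = 2.0
swapped[4:6, 4:6] = 1.0

ok_correct, bad_correct, _ = check_syntax(template, part_boxes, toy_reconstruct, 0.1)
ok_swapped, bad_swapped, errs_swapped = check_syntax(swapped, part_boxes, toy_reconstruct, 0.1)
print(ok_correct, bad_correct)
print(ok_swapped, bad_swapped)
```

In this toy setting the correct image reconstructs without error, while the swapped image is flagged at both swapped locations, which mirrors how the reconstructions indicate which parts violate syntax.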
