Shape and Texture Recognition in Large Vision-Language Models

Sagi Eppel, Mor Bismut, Alona Faktor-Strugatski

TL;DR

This work introduces LAS&T, a large-scale dataset, built by unsupervised extraction of patterns from natural images, for evaluating how Large Vision-Language Models (LVLMs) recognize and retrieve shapes and textures in 2D and 3D scenes. By systematically disentangling factors such as orientation, texture, background, and semantic versus natural shapes, the study shows that current LVLMs rely heavily on high-level semantic cues and have substantial gaps in low-level shape and texture representation, especially when multiple transformations are applied. Despite strong performance on some 3D material recognition tasks, LVLMs are consistently outperformed by humans on both 2D shapes and 2D textures, while dedicated nets trained from scratch achieve near-perfect results, suggesting that training data and objectives are key bottlenecks. LAS&T, freely available under CC0, is a resource for training and benchmarking perceptual capabilities in vision-language models, and highlights directions for improving low-level visual feature extraction in future models.

Abstract

Shapes and textures are the basic building blocks of visual perception. The ability to identify shapes regardless of orientation, texture, or context, and to recognize textures and materials independently of their associated objects, is essential for a general visual understanding of the world. This work introduces the Large Shapes and Textures dataset (LAS&T), a giant collection of highly diverse shapes and textures, created by unsupervised extraction of patterns from natural images. This dataset is used to benchmark how effectively leading Large Vision-Language Models (LVLM/VLM) recognize and represent shapes, textures, and materials in 2D and 3D scenes. For shape recognition, we test the models' ability to match images of identical shapes that differ in orientation, texture, color, or environment. Our results show that the shape-recognition capabilities of LVLMs remain well below human performance, especially when multiple transformations are applied. LVLMs rely predominantly on high-level and semantic features and struggle with abstract shapes lacking class associations. For texture and material recognition, we test the models' ability to identify images with identical textures and materials across different objects and environments. Interestingly, leading LVLMs approach human-level performance in recognizing materials in 3D scenes, yet substantially underperform humans when identifying simpler, more abstract 2D textures and shapes. These results are consistent across a wide range of leading LVLMs (GPT/Gemini/Llama/Qwen) and foundation vision models (DINO/CLIP), exposing major deficiencies in the ability of VLMs to extract low-level visual features. In contrast, humans and simple nets trained directly for these tasks achieve high accuracy. The LAS&T dataset, featuring over 700,000 images for 2D/3D shape and texture recognition and retrieval, is freely available.

Paper Structure

This paper contains 22 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 2: Samples from the LAS&T dataset for the 2D texture and 3D material recognition and retrieval tests. The model is asked to identify which of the image panels (B, C, or D) contains the same texture/material as the one in panel A. The object/shape on which the material/texture appears, as well as the background and illumination, can change between images.
  • Figure 3: Top: LAS&T dataset, natural 2D shapes that were extracted from natural images using an unsupervised approach (see the shape-extraction section and Figure 4). Bottom: Semantic shapes corresponding to classes of objects from the COCO dataset (manual annotations) and 2D projections/silhouettes of 3D objects from the Objaverse dataset. It can be seen that the diversity and complexity of these semantic shapes are significantly lower compared to the LAS&T shapes.
  • Figure 4: Procedure for automatic shape extraction for the LAS&T dataset: a) Randomly select an image. b) Randomly choose one channel (R, G, B, H, S, or V). c) Apply a random threshold to binarize the selected channel. d) Choose a connected component cluster that exceeds the minimum size and thickness criteria and does not touch image boundaries. This can be viewed as a random sampling of natural patterns from images. Examples of extracted shapes are shown in Figure 3.
  • Figure 5: Samples from the natural images 3D shape retrieval dataset. Each column shows real-world photos of the same 3D shape, coated with different materials and captured under different environments and orientations.
  • Figure 6: Format of the automatic LVLM testing used to generate the results in the 2D shapes, 3D shapes, 2D textures, and 3D materials tables. a) Standard testing method: the model receives the image and a text query asking which of the panels (B, C, D) contains a similar shape to the one in panel A. If the answer is not a single letter, a second model is used to extract the answer as a single letter. b) Testing based on textual description (text to text, 2D shapes table). The model is asked to describe the shape in each panel independently (without referring to other panels). The generated text description is given to a second model with no memory or access to the image. The second model is asked to decide which of the panels (B, C, D) contains a similar shape to the one in panel A. The fact that both methods (a, b) give similar accuracy (text to text, 2D shapes table) implies that the information the model extracts regarding the shapes is mostly contained in the textual description it gives of the shape.
  • ...and 2 more figures
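
The shape-extraction procedure in the Figure 4 caption (random channel, random threshold, connected component that is large and thick enough and does not touch the image boundary) can be illustrated with a minimal NumPy-only sketch. This is not the authors' implementation: the function names, the `min_area`, `erosions`, and `max_tries` parameters, the use of 4-connectivity, and the approximation of the "thickness" criterion by repeated 3x3 erosion are all assumptions made for illustration. Step (a), picking a random image, is left to the caller.

```python
from collections import deque
import numpy as np

def rgb_to_hsv(rgb):
    """Vectorized RGB -> HSV for a float array in [0, 1] of shape (H, W, 3)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    v = rgb.max(axis=-1)
    c = v - rgb.min(axis=-1)                       # chroma
    safe_c = np.maximum(c, 1e-12)                  # avoid division by zero
    s = np.where(v > 0, c / np.maximum(v, 1e-12), 0.0)
    h = np.where(v == r, ((g - b) / safe_c) % 6,
        np.where(v == g, (b - r) / safe_c + 2, (r - g) / safe_c + 4))
    h = np.where(c > 0, h / 6.0, 0.0)
    return np.stack([h, s, v], axis=-1)

def connected_components(mask):
    """Yield a boolean mask for each 4-connected component of `mask`."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    for y0, x0 in zip(*np.nonzero(mask)):
        if seen[y0, x0]:
            continue
        comp = np.zeros((h, w), dtype=bool)
        queue = deque([(y0, x0)])
        seen[y0, x0] = True
        while queue:                               # breadth-first flood fill
            y, x = queue.popleft()
            comp[y, x] = True
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        yield comp

def erode(mask):
    """One binary erosion step with a 3x3 cross; outside pixels are background."""
    p = np.pad(mask, 1, constant_values=False)
    return p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]

def touches_border(mask):
    return bool(mask[0].any() or mask[-1].any() or mask[:, 0].any() or mask[:, -1].any())

def is_thick_enough(mask, erosions):
    """Assumed thickness test: something must survive `erosions` erosion steps."""
    for _ in range(erosions):
        mask = erode(mask)
    return bool(mask.any())

def extract_random_shape(image, rng, min_area=200, erosions=2, max_tries=50):
    """Return a boolean shape mask from a uint8 RGB image, or None if none is found."""
    rgb = image.astype(np.float64) / 255.0
    channels = np.concatenate([rgb, rgb_to_hsv(rgb)], axis=-1)   # R, G, B, H, S, V
    for _ in range(max_tries):
        chan = channels[..., rng.integers(6)]                    # (b) random channel
        mask = chan > rng.uniform(chan.min(), chan.max())        # (c) random threshold
        candidates = [
            comp for comp in connected_components(mask)          # (d) filter components
            if comp.sum() >= min_area
            and not touches_border(comp)
            and is_thick_enough(comp, erosions)
        ]
        if candidates:
            return candidates[rng.integers(len(candidates))]
    return None
```

A call such as `extract_random_shape(img, np.random.default_rng(0))` returns a mask with the same height and width as `img`. The pure-Python flood fill keeps the sketch dependency-free but is slow on large images; a labeling routine from an image-processing library would be the practical choice there.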