Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

Zhuchenyang Liu, Yao Zhang, Yu Xiao

Abstract

2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision-Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B–38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/

Paper Structure

This paper contains 68 sections, 6 equations, 6 figures, and 14 tables.

Figures (6)

  • Figure 1: One example per task type. T1–T4 test cross-depiction alignment at increasing difficulty; D1–D2 isolate video and assembly instruction understanding, respectively. Green borders mark correct options. Full prompt templates are in Suppl. A.
  • Figure 2: T1 accuracy vs. model size (log scale). Different families scale at different rates, and generational improvements at ~8B consistently exceed within-family scaling.
  • Figure 3: Average accuracy across 17 models per task and strategy. D1 is invariant to strategy (hard video ceiling), while D2 shows massive recovery with text (+23.6 pp).
  • Figure 4: Diagram influence on the LLM prediction representation, measured by cosine similarity between the last-token hidden state and diagram token hidden states. Adding text (V+T) reduces diagram influence in 3 of 4 models (n = 100 T1 questions per model). A sketch of this measurement, together with the attention-share measurement of Figure 5, follows this list.
  • Figure 5: Per-modality attention share in Qwen3-VL-8B on T1 (n = 100). Adding text halves diagram attention (−52%) and reduces video attention (−34%).
  • ...and 1 more figure
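
The two mechanistic probes behind Figures 4 and 5 amount to simple readouts of the language model's hidden states and attention maps. The sketch below is a rough illustration of how such probes could be computed from a transformers-style VLM forward pass; it is not the authors' released code, and the modality masks (`diagram_mask`, `video_mask`, `text_mask`), the function name, and the averaging choices are assumptions made for illustration. Returning attention weights also assumes an eager (non-flash) attention implementation.

```python
# Hedged sketch (not the paper's code) of the Figure 4 / Figure 5 style probes.
# Assumptions: `model` and `inputs` come from a Hugging Face transformers-style
# VLM (batch size 1), and diagram_mask / video_mask / text_mask are boolean
# tensors over the input sequence marking each modality's token positions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def modality_probes(model, inputs, diagram_mask, video_mask, text_mask):
    out = model(**inputs, output_hidden_states=True, output_attentions=True)

    # Figure 4-style metric: diagram influence on the prediction representation,
    # i.e. cosine similarity between the last-token hidden state (the position
    # that produces the answer) and the diagram token hidden states, averaged.
    last_hidden = out.hidden_states[-1][0]        # (seq_len, hidden_dim), batch 0
    pred_state = last_hidden[-1]                  # last-token hidden state
    diagram_states = last_hidden[diagram_mask]    # (n_diagram, hidden_dim)
    diagram_influence = F.cosine_similarity(
        pred_state.unsqueeze(0), diagram_states, dim=-1
    ).mean().item()

    # Figure 5-style metric: per-modality attention share, i.e. the fraction of
    # the last token's attention mass on each modality, averaged over layers/heads.
    attn = torch.stack(out.attentions)            # (layers, 1, heads, seq, seq)
    last_row = attn[:, 0, :, -1, :]               # attention from the last token
    total = last_row.sum(dim=-1, keepdim=True)
    share = lambda m: (last_row[..., m].sum(dim=-1, keepdim=True) / total).mean().item()

    return {
        "diagram_influence": diagram_influence,
        "diagram_attn_share": share(diagram_mask),
        "video_attn_share": share(video_mask),
        "text_attn_share": share(text_mask),
    }
```

In this reading, the Figure 4 metric is a single cosine-similarity score per example, and the Figure 5 metric is the fraction of the answer token's attention mass that lands on diagram, video, or text tokens, which is what makes the drop in diagram attention under the V+T strategy directly measurable.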