OrigamiBench: An Interactive Environment to Synthesize Flat-Foldable Origamis

Naaisha Agarwal, Yihan Wu, Yichang Jian, Yikuan Hu, Nishad Mansoor, Mohan Li, Yifei Peng, Wang-Zhou Dai, Yao-Xiang Ding, Emanuele Sansone

Abstract

Building AI systems that can plan, act, and create in the physical world requires more than pattern recognition. Such systems must understand the causal mechanisms and constraints governing physical processes in order to guide sequential decisions. This capability relies on internal representations, analogous to an internal language model, that relate observations, actions, and resulting environmental changes. However, many existing benchmarks treat visual perception and programmatic reasoning as separate problems, focusing either on visual recognition or on symbolic tasks. The domain of origami provides a natural testbed that integrates these modalities. Constructing shapes through folding operations requires visual perception, reasoning about geometric and physical constraints, and sequential planning, while remaining sufficiently structured for systematic evaluation. We introduce OrigamiBench, an interactive benchmark in which models iteratively propose folds and receive feedback on physical validity and similarity to a target configuration. Experiments with modern vision-language models show that scaling model size alone does not reliably produce causal reasoning about physical transformations. Models fail to generate coherent multi-step folding strategies, suggesting that visual and language representations remain weakly integrated.

Paper Structure (22 sections, 5 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Visual summary of OrigamiBench. In the top-left corner (data), examples of crease patterns (.fold) from the animal class are shown together with their corresponding rendered origami images (PNG), ordered by increasing complexity. The top-right corner (environment) illustrates a single-step transition in the execution environment. The state consists of a crease pattern, where blue and red lines denote mountain and valley folds, respectively, along with its rendered origami. After receiving the VLM's output, the execution engine performs a foldability check on the proposed action; if the check succeeds, it generates a new state. The VLM receives as input the initial prompt, the current environment state, and the target origami. Once the simulation is complete, the final state is compared against the target state using the proposed metrics, which also leverage the hidden target crease pattern. Finally, at the bottom (tasks), the two evaluation tasks are shown from the model’s perspective, highlighting the inputs and the corresponding desired outputs.
  • Figure 2: Examples for the one-step task. The top three rows and the bottom three rows show Associative and Causal task examples, respectively. For each task type, the three examples vary in complexity. Each task requires choosing the next folding step of the reference origami (first column) from four candidate options (A-D). The correct answer is highlighted with a green background. Coloured circles at the bottom-right corner of each option indicate the prediction made by each model.
  • Figure 3: Original rendered image (left) and its corresponding binary mask (right).
  • Figure 4: Examples of final origamis generated by the models for easy, medium, and hard targets after 10 and 25 inference steps. Both models produce origamis using only one or two fold actions and thus fail to construct a multi-step folding plan. This is mainly due to a lack of causal understanding of actions, as observed in the One-Step Multiple Choice evaluation.
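
The propose-check-update loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the OrigamiBench implementation: the names `State`, `is_foldable`, `run_episode`, `scripted_policy`, and `crease_jaccard` are hypothetical, the validity check is a simple placeholder rather than a real flat-foldability test, and the crease-set Jaccard overlap is a stand-in for the paper's metrics, which are not specified in this section.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A crease is (start point, end point, assignment), with 'M' = mountain, 'V' = valley.
Point = Tuple[float, float]
Crease = Tuple[Point, Point, str]


@dataclass(frozen=True)
class State:
    """Hypothetical environment state: the creases accepted so far."""
    creases: Tuple[Crease, ...] = ()


def is_foldable(state: State, action: Crease) -> bool:
    """Placeholder validity check (assumption): NOT a real flat-foldability test.

    Accepts a crease only if it has a valid assignment, non-zero length,
    lies inside the unit square, and is not already present in the state.
    """
    (x0, y0), (x1, y1), kind = action
    in_square = all(0.0 <= v <= 1.0 for v in (x0, y0, x1, y1))
    return (
        kind in ("M", "V")
        and (x0, y0) != (x1, y1)
        and in_square
        and action not in state.creases
    )


Policy = Callable[[State, State], Optional[Crease]]


def run_episode(policy: Policy, target: State, max_steps: int = 10):
    """Let the policy propose folds; apply only those that pass the check."""
    state = State()
    history: List[Tuple[Crease, bool]] = []
    for _ in range(max_steps):
        action = policy(state, target)
        if action is None:  # the policy has nothing more to propose
            break
        valid = is_foldable(state, action)
        if valid:
            state = State(state.creases + (action,))
        history.append((action, valid))  # feedback on validity per step
    return state, history


def scripted_policy(actions: List[Crease]) -> Policy:
    """Stand-in for a VLM: replays a fixed list of proposed creases."""
    remaining = iter(actions)

    def policy(state: State, target: State) -> Optional[Crease]:
        return next(remaining, None)

    return policy


def crease_jaccard(a: State, b: State) -> float:
    """Stand-in similarity (assumption): Jaccard overlap of the crease sets."""
    sa, sb = set(a.creases), set(b.creases)
    union = sa | sb
    return 1.0 if not union else len(sa & sb) / len(union)


if __name__ == "__main__":
    c1 = ((0.0, 0.0), (1.0, 1.0), "V")
    c2 = ((0.0, 1.0), (1.0, 0.0), "M")
    degenerate = ((0.0, 0.0), (0.0, 0.0), "V")  # zero length, so rejected
    target = State((c1, c2))
    final, history = run_episode(scripted_policy([c1, degenerate, c2]), target)
    print([valid for _, valid in history], crease_jaccard(final, target))
```

In this sketch a rejected action leaves the state unchanged but is still recorded in the history, mirroring the caption's description that a new state is generated only when the foldability check succeeds.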