Knot So Simple: A Minimalistic Environment for Spatial Reasoning
Zizhao Chen, Yoav Artzi
TL;DR
KnotGym presents a minimal yet challenging visual knot-manipulation environment for studying spatial reasoning over deformable ropes from image observations alone. It defines a measurable complexity axis through the number of crossings $\#X$ and Gauss codes $GC$, which supports a principled curriculum and generalization tests across train/test splits and increasing task difficulty. The paper benchmarks RL methods (PPO, DreamerV3, TD-MPC2) and prompting-based VLMs, finding that the unknot task is approachable, while the tie and convert tasks pose substantial generalization and data-efficiency challenges because agents must decode and act upon topological goals. By highlighting acute perception, continuous spatial reasoning, and a very large search space, KnotGym provides a rigorous testbed for future agents, ranging from world models to multi-modal reasoning systems, that fuse perception, planning, and grounded manipulation in deformable-object settings.
Abstract
We propose KnotGym, an interactive environment for complex, spatial reasoning and manipulation. KnotGym includes goal-oriented rope manipulation tasks with varying levels of complexity, all requiring acting from pure image observations. Tasks are defined along a clear and quantifiable axis of complexity based on the number of knot crossings, creating a natural generalization test. KnotGym has a simple observation space, allowing for scalable development, yet it highlights core challenges in integrating acute perception, spatial reasoning, and grounded manipulation. We evaluate methods of different classes, including model-based RL, model-predictive control, and chain-of-thought reasoning, and illustrate the challenges KnotGym presents. KnotGym is available at https://github.com/lil-lab/knotgym.
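To make the complexity axis concrete, the following minimal sketch counts the crossings encoded by a Gauss code. It assumes a common textbook convention that may differ from the one KnotGym uses: the code is a list of nonzero signed integers read along the rope, where `+k` marks passing over crossing `k` and `-k` marks passing under it, so each crossing appears exactly twice. The function name `num_crossings` is hypothetical and is not part of the KnotGym API.

```python
from collections import Counter


def num_crossings(gauss_code):
    """Count crossings in a Gauss code.

    Assumed convention (not confirmed against KnotGym): nonzero signed
    integers, +k for the over-strand and -k for the under-strand of
    crossing k, each appearing exactly once.
    """
    if any(c == 0 for c in gauss_code):
        raise ValueError("crossing labels must be nonzero")
    counts = Counter(gauss_code)
    labels = {abs(c) for c in gauss_code}
    for k in labels:
        # Each crossing must be visited once from above and once from below.
        if counts[k] != 1 or counts[-k] != 1:
            raise ValueError(f"crossing {k} must appear once as +{k} and once as -{k}")
    return len(labels)


# A standard Gauss code for the trefoil knot under this convention.
trefoil = [1, -2, 3, -1, 2, -3]
print(num_crossings(trefoil))  # 3
print(num_crossings([]))       # 0, a rope with no crossings
```

Under this convention the crossing count is simply half the length of a valid code, which is why it serves as a compact difficulty measure: a goal with more crossings has a longer Gauss code for the agent to decode and realize.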
