Differentiate-and-Inject: Enhancing VLAs via Functional Differentiation Induced by In-Parameter Structural Reasoning

Jingyi Hou, Leyu Zhou, Chenchen Jing, Jinghan Yang, Xinbo Yu, Wei He

TL;DR

iSTAR tackles the fragility of monolithic VLAs in long-horizon robotic manipulation by embedding a dynamic, in-parameter semantic structure into the model. It introduces a pre-action VLA concept extractor, a dynamic implicit concept graph with gating and relational reasoning, and a subtask prompt projector to create task-level commitments that guide action generation. Theoretical analyses show structured reasoning yields tighter generalization bounds than end-to-end policies in long horizons and low-information regimes, and empirical results on VIMA-Bench, LIBERO, and real-world UR3 tasks confirm improved task decomposition, robustness, and efficiency. This approach demonstrates that in-parameter semantic commitments can enable reusable, scalable task reasoning without external planners or prompt-based decomposition, with broad applicability across VLA backbones.

Abstract

As robots are expected to perform increasingly diverse tasks, they must understand not only low-level actions but also the higher-level structure that determines how a task should unfold. Existing vision-language-action (VLA) models struggle with this form of task-level reasoning. They either depend on prompt-based in-context decomposition, which is unstable and sensitive to linguistic variations, or end-to-end long-horizon training, which requires large-scale demonstrations and entangles task-level reasoning with low-level control. We present in-parameter structured task reasoning (iSTAR), a framework for enhancing VLA models via functional differentiation induced by in-parameter structural reasoning. Instead of treating VLAs as monolithic policies, iSTAR embeds task-level semantic structure directly into model parameters, enabling differentiated task-level inference without external planners or handcrafted prompt inputs. This injected structure takes the form of implicit dynamic scene-graph knowledge that captures object relations, subtask semantics, and task-level dependencies in parameter space. Across diverse manipulation benchmarks, iSTAR achieves more reliable task decompositions and higher success rates than both in-context and end-to-end VLA baselines, demonstrating the effectiveness of parameter-space structural reasoning for functional differentiation and improved generalization across task variations.

Paper Structure

This paper contains 61 sections, 5 theorems, 29 equations, 8 figures, and 14 tables.

Key Result

Proposition 4.1

Consider discrete action tokens $a_t \in \mathcal{A}$. Suppose the effective complexity of semantic commitment over a horizon $T$ scales as […], while an end-to-end policy predicting action tokens induces […]. If $\log|\mathcal{C}| < \log|\mathcal{A}|$ and the supervision budgets are of the same order, then there exists $T_0$ such that for all $T \ge T_0$, the generalization bound of the structured policy […]

Figures (8)

  • Figure 1: iSTAR transitions from a monolithic VLA with entangled reasoning to functionally differentiated modules, with task-level structure (dynamic scene graphs) injected into the model parameters. Through semantic-guided task resolution, iSTAR improves the long-horizon reliability of the VLA while maintaining end-to-end execution.
  • Figure 2: Overview of the proposed iSTAR framework.
  • Figure 3: Examples of real-world experiments. For each task, the prompt is shown along with the detailed execution stages, from initial state to final completion.
  • Figure 4: Example of demonstration segmentation and visual subtask classification. A demonstration of the task "put the white mug on the plate and put the chocolate pudding to the right of the plate" is segmented into four subtask-level segments. For each segment, eight representative keyframes are extracted and independently classified into one of the candidate subtasks. Segment-level predictions are merged in temporal order to form the final subtask sequence.
  • Figure 5: Hardware platform and sensing in our real-world experiments: one UR3 industrial robotic arm and two Intel RealSense D415i cameras. The left figure shows an overview of the experimental hardware platform, and the right figure provides a close-up view of the robotic arm.
  • ...and 3 more figures
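
The labeling procedure described in the Figure 4 caption (segment the demonstration, extract eight keyframes per segment, classify each keyframe independently, merge segment-level predictions in temporal order) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the evenly spaced keyframe sampling, and the majority vote that turns per-keyframe labels into a segment-level label are all assumptions, since the caption does not specify them. `classify_frame` stands in for whatever visual classifier is actually used.

```python
from collections import Counter
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def sample_keyframes(segment: Sequence[Frame], k: int = 8) -> List[Frame]:
    """Pick up to k evenly spaced frames (assumed sampling scheme)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(segment)
    if n <= k:
        return list(segment)
    if k == 1:
        return [segment[n // 2]]
    # Evenly spaced indices from the first frame to the last frame.
    indices = [round(i * (n - 1) / (k - 1)) for i in range(k)]
    return [segment[i] for i in indices]


def classify_segment(
    segment: Sequence[Frame],
    classify_frame: Callable[[Frame], str],
    k: int = 8,
) -> str:
    """Classify each keyframe independently, then take a majority vote.

    The majority vote is an assumption; the caption only says that
    keyframes are classified independently.
    """
    keyframes = sample_keyframes(segment, k)
    if not keyframes:
        raise ValueError("segment contains no frames")
    votes = Counter(classify_frame(frame) for frame in keyframes)
    return votes.most_common(1)[0][0]


def subtask_sequence(
    segments: Sequence[Sequence[Frame]],
    classify_frame: Callable[[Frame], str],
    k: int = 8,
) -> List[str]:
    """Merge segment-level predictions in temporal order.

    Segments are assumed to be given already sorted by time.
    """
    return [classify_segment(seg, classify_frame, k) for seg in segments]


if __name__ == "__main__":
    # Toy stand-in: each "frame" is already a (noisy) subtask label.
    toy_segments = [
        ["pick mug"] * 10 + ["place mug"] * 2,
        ["place mug"] * 12,
    ]
    print(subtask_sequence(toy_segments, classify_frame=lambda f: f))
```

Voting over several keyframes rather than trusting a single frame is one simple way to make the segment label robust to an occasional misclassified frame near a segment boundary; other aggregation rules would fit the same interface.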

Theorems & Definitions (5)

  • Proposition 4.1: Structured advantage for large horizons
  • Lemma 2.1: Structured Error Decomposition
  • Lemma 2.2: Action Decoding Error
  • Lemma 2.3: Realization Error
  • Lemma 2.4: Generalization of Semantic Graph Prediction