Differentiate-and-Inject: Enhancing VLAs via Functional Differentiation Induced by In-Parameter Structural Reasoning

Jingyi Hou; Leyu Zhou; Chenchen Jing; Jinghan Yang; Xinbo Yu; Wei He

TL;DR

iSTAR tackles the fragility of monolithic VLAs in long-horizon robotic manipulation by embedding a dynamic semantic structure directly into the model parameters. It introduces three components: a pre-action VLA concept extractor, a dynamic implicit concept graph with gating and relational reasoning, and a subtask prompt projector that creates task-level commitments to guide action generation. Theoretical analyses show that structured reasoning yields tighter generalization bounds than end-to-end policies over long horizons and in low-information regimes, and empirical results on VIMA-Bench, LIBERO, and real-world UR3 tasks confirm improved task decomposition, robustness, and efficiency. The approach shows that in-parameter semantic commitments enable reusable, scalable task reasoning without external planners or prompt-based decomposition, and that it applies broadly across VLA backbones.

Abstract

As robots are expected to perform increasingly diverse tasks, they must understand not only low-level actions but also the higher-level structure that determines how a task should unfold. Existing vision-language-action (VLA) models struggle with this form of task-level reasoning. They either depend on prompt-based in-context decomposition, which is unstable and sensitive to linguistic variations, or end-to-end long-horizon training, which requires large-scale demonstrations and entangles task-level reasoning with low-level control. We present in-parameter structured task reasoning (iSTAR), a framework for enhancing VLA models via functional differentiation induced by in-parameter structural reasoning. Instead of treating VLAs as monolithic policies, iSTAR embeds task-level semantic structure directly into model parameters, enabling differentiated task-level inference without external planners or handcrafted prompt inputs. This injected structure takes the form of implicit dynamic scene-graph knowledge that captures object relations, subtask semantics, and task-level dependencies in parameter space. Across diverse manipulation benchmarks, iSTAR achieves more reliable task decompositions and higher success rates than both in-context and end-to-end VLA baselines, demonstrating the effectiveness of parameter-space structural reasoning for functional differentiation and improved generalization across task variations.

Paper Structure (61 sections, 5 theorems, 29 equations, 8 figures, 14 tables)

Introduction
Related Work
Method
Concept Extraction
Dynamic Implicit Concept Graph Construction
Attribute gating.
Dynamic positional encoding.
Order gating and fusion.
Implicit relational reasoning.
Structure-learning objective.
Subtask Prompt Projector
Subtask Prompt Distillation and Supervision
Theory and Extensions
Structured Advantage in the Large-Horizon Regime
Proof Sketch.
...and 46 more sections

Key Result

Proposition 4.1

Consider discrete action tokens $a_t \in \mathcal{A}$. Suppose the effective complexity of semantic commitment over a horizon $T$ scales as […], while an end-to-end policy predicting action tokens induces […]. If $\log|\mathcal{C}| < \log|\mathcal{A}|$ and the supervision budgets are of the same order, then there exists $T_0$ such that for all $T \ge T_0$, the generalization bound of the structured policy is tighter than that of the end-to-end policy.
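
To make the existence of a threshold $T_0$ concrete, here is a minimal numerical sketch. The exact scalings are not given here, so the forms below are assumptions chosen only for illustration: the structured complexity is taken as $T \log|\mathcal{C}|$ plus a fixed overhead `c0`, the end-to-end complexity as $T \log|\mathcal{A}|$, and each bound as the square root of complexity divided by a shared supervision budget `n`. The function names and all numeric values are hypothetical.

```python
import math

def structured_bound(T, num_concepts, n, c0):
    """Assumed form sqrt((T*log|C| + c0) / n); not taken from the paper."""
    return math.sqrt((T * math.log(num_concepts) + c0) / n)

def end_to_end_bound(T, num_actions, n):
    """Assumed form sqrt(T*log|A| / n); not taken from the paper."""
    return math.sqrt(T * math.log(num_actions) / n)

def crossover_horizon(num_concepts, num_actions, n, c0, T_max=10_000):
    """Return the smallest horizon T at which the structured bound is
    strictly tighter, or None if no such T exists up to T_max."""
    for T in range(1, T_max + 1):
        if structured_bound(T, num_concepts, n, c0) < end_to_end_bound(T, num_actions, n):
            return T
    return None

# Hypothetical sizes: 16 concepts, 256 action tokens, 1000 samples, overhead 50.
T0 = crossover_horizon(num_concepts=16, num_actions=256, n=1000, c0=50.0)
print(T0)  # 19 under these assumed values
```

Under these assumed forms the gap between the two complexities grows linearly in $T$, so once the structured bound becomes tighter it stays tighter, which mirrors the "for all $T \ge T_0$" shape of the statement. If $\log|\mathcal{C}| \ge \log|\mathcal{A}|$, the search returns `None`.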

Figures (8)

Figure 1: iSTAR moves from a monolithic VLA with entangled reasoning to functionally differentiated modules in which task-level structure (dynamic scene graphs) is injected into the model parameters. Through semantically guided task resolution, iSTAR improves the long-horizon reliability of the VLA while maintaining end-to-end execution.
Figure 2: Overview of the proposed iSTAR framework.
Figure 3: Examples of real-world experiments. For each task, the prompt is shown along with the detailed execution stages, from initial state to final completion.
Figure 4: Example of demonstration segmentation and visual subtask classification. A demonstration of the task "put the white mug on the plate and put the chocolate pudding to the right of the plate" is segmented into four subtask-level segments. For each segment, eight representative keyframes are extracted and independently classified into one of the candidate subtasks. Segment-level predictions are merged in temporal order to form the final subtask sequence.
Figure 5: Hardware platform and sensing in our real-world experiments: one UR3 industrial robotic arm and two Intel RealSense D415i cameras. The left figure shows an overview of the experimental hardware platform, and the right figure provides a close-up view of the robotic arm.
...and 3 more figures
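
The segmentation-and-classification procedure in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the caption says keyframes are classified independently and segment-level predictions are merged in temporal order, but the majority vote used to aggregate keyframe labels and the collapsing of adjacent duplicate subtasks are assumptions. The function names, subtask labels, and timestamps are hypothetical.

```python
from collections import Counter

def classify_segment(keyframe_labels):
    """Aggregate per-keyframe predictions into one segment label.
    Majority vote is an assumed rule; on a tie, Counter.most_common
    returns the label that was encountered first."""
    return Counter(keyframe_labels).most_common(1)[0][0]

def merge_segments(segments):
    """segments: list of (start_time, keyframe_labels) pairs.
    Returns the subtask sequence in temporal order."""
    sequence = []
    for _, labels in sorted(segments, key=lambda seg: seg[0]):
        label = classify_segment(labels)
        # Assumption: adjacent segments with the same label form one subtask.
        if not sequence or sequence[-1] != label:
            sequence.append(label)
    return sequence

# Four segments with eight keyframe predictions each, given out of order.
segments = [
    (12.0, ["place pudding"] * 8),
    (0.0, ["pick mug"] * 7 + ["place mug"]),
    (4.5, ["place mug"] * 8),
    (8.0, ["pick pudding"] * 6 + ["place pudding"] * 2),
]
subtasks = merge_segments(segments)
print(subtasks)
# ['pick mug', 'place mug', 'pick pudding', 'place pudding']
```

Voting over several keyframes makes the segment label tolerant of a few misclassified frames, such as the stray "place mug" prediction in the first segment.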

Theorems & Definitions (5)

Proposition 4.1: Structured advantage for large horizons
Lemma 2.1: Structured Error Decomposition
Lemma 2.2: Action Decoding Error
Lemma 2.3: Realization Error
Lemma 2.4: Generalization of Semantic Graph Prediction
