Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control

Shunlei Li, Longsen Gao, Jin Wang, Chang Che, Xi Xiao, Jiuwen Cao, Yingbai Hu, Hamid Reza Karimi

TL;DR

Graph-Fused Vision-Language-Action (GF-VLA) is a framework that enables dual-arm robotic systems to perform task-level reasoning and execution directly from RGB and Depth human demonstrations. The authors report strong generalization and robustness across diverse spatial and semantic variations.

Abstract

Teaching robots dexterous skills from human videos remains challenging due to the reliance on low-level trajectory imitation, which fails to generalize across object types, spatial layouts, and manipulator configurations. We propose Graph-Fused Vision-Language-Action (GF-VLA), a framework that enables dual-arm robotic systems to perform task-level reasoning and execution directly from RGB and Depth human demonstrations. GF-VLA first extracts Shannon-information-based cues to identify hands and objects with the highest task relevance, then encodes these cues into temporally ordered scene graphs that capture both hand-object and object-object interactions. These graphs are fused with a language-conditioned transformer that generates hierarchical behavior trees and interpretable Cartesian motion commands. To improve execution efficiency in bimanual settings, we further introduce a cross-hand selection policy that infers optimal gripper assignment without explicit geometric reasoning. We evaluate GF-VLA on four structured dual-arm block assembly tasks involving symbolic shape construction and spatial generalization. Experimental results show that the information-theoretic scene representation achieves over 95 percent graph accuracy and 93 percent subtask segmentation, supporting the LLM planner in generating reliable and human-readable task policies. When executed by the dual-arm robot, these policies yield 94 percent grasp success, 89 percent placement accuracy, and 90 percent overall task success across stacking, letter-building, and geometric reconfiguration scenarios, demonstrating strong generalization and robustness across diverse spatial and semantic variations.

Paper Structure

This paper contains 45 sections, 17 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: An overview of the GF-VLA framework that performs policy transfer from a single human demonstration to a dual-arm robot manipulation task.
  • Figure 2: The relocation of a single object being manipulated over time. (a) The trajectory of a single object is shown, with the sliding window $\phi$ applied to the signal $\mathcal{X}(t)$ representing the object's position over time. (b) The entropy $\mathcal{H}^{\mathcal{X}(t)}$ is computed by sliding the window across $\mathcal{X}(t)$ and evaluating the entropy of the distribution of positions within each windowed segment. A bell-shaped curve emerges, highlighting periods of significant positional change.
  • Figure 3: The scenario of a hand moving an object over time. (a) The hand's position is denoted by $\mathcal{Y}(t)$ and the object's by $\mathcal{X}(t)$, with both signals captured simultaneously. (b) The mutual information $\boldsymbol{\varpi}_{\mathcal{X}|\mathcal{Y}}$ is calculated between the hand and object signals across the same windowed intervals. The resulting curve indicates time instances of coordinated motion, peaking when the hand and object move jointly.
  • Figure 4: (a) The conceptual representation of the dual-hand selection policy. The framework prioritizes whichever of the left hand $h_L$ or the right hand $h_R$ is optimal for moving the manipulated object $o_m$ to the target pose. (b) Coupled-Motion integration between only the left hand $h_L$ and one manipulated Jenga block $o_m$. (c) Docked interaction between $h_L$ and one Jenga block $o_m$. (d) E-OO interaction between the manipulated Jenga block $o_m$ and one background Jenga block $o_b$ on the table. (e) T-OO interaction between one manipulated Jenga block $o_m$ and the current background Jenga block $o_b$ when the hand shakes and the blocks shift slightly near the target position.
  • Figure 5: Policy transfer from a single human demonstration to a novel dual-arm robotic assembly task. The framework processes multimodal inputs, including language commands and visual scene data, using a SAM 2-based module to extract and project features into a shared embedding space. At its core, a unified Large Language Model (LLM) employs a dual-head structure: the LLM Head performs high-level semantic planning and validation using Chain-of-Thought (CoT) and self-verification, while the Action Head generates low-level, executable actions for manipulators. This integrated design enables the robot to de-tokenize abstract reasoning into physically grounded joint and gripper commands, translating high-level goals into robust, real-world execution.
  • ...and 6 more figures
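
The windowed entropy and mutual information described in the captions of Figures 2 and 3 can be illustrated with a minimal sketch. It assumes a 1-D position signal and a simple histogram (binning) estimator; the paper's actual estimator, window length $\phi$, and bin count are not given in this summary, so the function names (`quantize`, `sliding_entropy`, `sliding_mutual_information`) and the defaults below are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def quantize(signal, bins):
    """Map each sample to an integer bin index over the signal's global range."""
    lo, hi = min(signal), max(signal)
    if hi == lo:  # constant signal: everything falls in one bin
        return [0] * len(signal)
    scale = bins / (hi - lo)
    return [min(int((v - lo) * scale), bins - 1) for v in signal]


def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of `symbols`."""
    n = len(symbols)
    counts = Counter(symbols)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def sliding_entropy(signal, window, bins=8):
    """Entropy of the binned positions inside each sliding window (assumed estimator)."""
    q = quantize(signal, bins)
    return [entropy(q[i:i + window]) for i in range(len(q) - window + 1)]


def sliding_mutual_information(x, y, window, bins=8):
    """Per-window mutual information I(X;Y) = H(X) + H(Y) - H(X,Y) on binned signals."""
    qx, qy = quantize(x, bins), quantize(y, bins)
    values = []
    for i in range(len(qx) - window + 1):
        wx, wy = qx[i:i + window], qy[i:i + window]
        joint = list(zip(wx, wy))  # joint symbols for H(X,Y)
        values.append(entropy(wx) + entropy(wy) - entropy(joint))
    return values


if __name__ == "__main__":
    # Toy object trajectory: at rest, moved to a new position, at rest again.
    obj = [0.0] * 20 + [k / 19 for k in range(20)] + [1.0] * 20
    hand = [v + 0.1 for v in obj]  # toy hand that carries the object
    h = sliding_entropy(obj, window=10)
    mi = sliding_mutual_information(obj, hand, window=10)
    print("entropy peak at window", h.index(max(h)))
    print("mutual information peak at window", mi.index(max(mi)))
```

On this toy signal the entropy is zero while the object rests and rises only in windows that overlap the motion, which matches the bell-shaped curve described for Figure 2. The mutual information is likewise zero while hand and object are both still and rises when they move together, as described for Figure 3. Binning over the signal's global range is a design choice of this sketch: it keeps resting segments in a single bin so their entropy is exactly zero.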