Visual Language Models as Operator Agents in the Space Domain
Alejandro Carrasco, Marco Nedungadi, Enrico M. Zucchelli, Amit Jain, Victor Rodriguez-Fernandez, Richard Linares
TL;DR
Problem: enabling autonomous, context-aware control in space by fusing visual and textual reasoning.
Approach: an end-to-end framework with two pipelines that applies Vision-Language Models (VLMs) to software rendezvous in the Kerbal Space Program Differential Games (KSPDG) and to hardware inspection with a RealSense-equipped xArm 7. Observations are partitioned between the vision and language modules, and actions are discretized into a tractable space. A scoring function $dm\_lb^2 + \frac{a}{dm\_bg + b}$ guides the rendezvous objectives, and the hardware action model is $a_t = \mathcal{F}(I_t, S_t, P)$, defined over $\mathbb{R}^6 \times \{0,1\}$.
Findings: VLMs achieve competitive end-to-end performance against LLMs and traditional baselines, although latency remains the main bottleneck. Preliminary OpenVLA fine-tuning for robotic inspection shows feasible progress with limited data.
Contributions: (i) a concrete end-to-end multimodal framework for space tasks; (ii) a detailed comparison of the software and hardware operator pipelines; (iii) prompt strategies and data augmentations that exploit visual cues; and (iv) initial empirical results on latency and generalization.
Significance: the work demonstrates the practical viability of multimodal autonomous reasoning for rendezvous, docking, servicing, and satellite diagnostics, and outlines concrete steps for reducing latency and extending the approach to humanoid platforms.
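As a reading aid, the two formulas in the summary can be written out in Python. This is a minimal sketch and not the authors' implementation: the function name `rendezvous_score`, the class `ArmAction`, and its field names are hypothetical, and the constants `a` and `b` are not given in this section, so they are left as parameters. The sketch only evaluates the expression $dm\_lb^2 + \frac{a}{dm\_bg + b}$ as written and represents one element of $\mathbb{R}^6 \times \{0,1\}$; it does not assume what the six real components or the binary component mean.

```python
from dataclasses import dataclass
from typing import Tuple


def rendezvous_score(dm_lb: float, dm_bg: float, a: float, b: float) -> float:
    """Evaluate dm_lb^2 + a / (dm_bg + b) exactly as written in the summary.

    a and b are constants whose values are not stated in this section,
    so the caller must supply them (assumption of this sketch).
    """
    denom = dm_bg + b
    if denom == 0:
        raise ValueError("dm_bg + b must be non-zero")
    return dm_lb ** 2 + a / denom


@dataclass(frozen=True)
class ArmAction:
    """One element of R^6 x {0,1}: six real components plus one binary component.

    The field names are hypothetical; the section does not say what each
    component controls.
    """
    motion: Tuple[float, float, float, float, float, float]
    flag: int

    def __post_init__(self) -> None:
        if len(self.motion) != 6:
            raise ValueError("motion must have exactly six components")
        if self.flag not in (0, 1):
            raise ValueError("flag must be 0 or 1")
```

For example, with the illustrative values `dm_lb = 3.0`, `dm_bg = 1.0`, `a = 10.0`, and `b = 1.0`, the score is `3.0**2 + 10.0 / 2.0`. Making the action type immutable and validating it on construction keeps any output of $\mathcal{F}$ inside the stated action space.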
Abstract
This paper explores the application of Vision-Language Models (VLMs) as operator agents in the space domain, focusing on both software and hardware operational paradigms. Building on advances in Large Language Models (LLMs) and their multimodal extensions, we investigate how VLMs can enhance autonomous control and decision-making in space missions. In the software context, we employ VLMs within the Kerbal Space Program Differential Games (KSPDG) simulation environment, enabling the agent to interpret screenshots of the graphical user interface and perform complex orbital maneuvers. In the hardware context, we integrate VLMs with camera-equipped robotic systems to inspect and diagnose physical space objects, such as satellites. Our results demonstrate that VLMs can effectively process visual and textual data to generate contextually appropriate actions, performing competitively with traditional methods and non-multimodal LLMs in simulation tasks and showing promise in real-world applications.
