Visual Language Models as Operator Agents in the Space Domain

Alejandro Carrasco, Marco Nedungadi, Enrico M. Zucchelli, Amit Jain, Victor Rodriguez-Fernandez, Richard Linares

TL;DR

Problem: enabling autonomous, context-aware control in space by fusing visual and textual reasoning.

Approach: a dual-pipeline end-to-end framework that uses Vision-Language Models for software rendezvous in the Kerbal Space Program Differential Games and hardware inspection with a RealSense-enabled xArm 7. Observations are partitioned between vision and language modules and actions are discretized into a tractable space, with a scoring function $dm\_lb^2 + \frac{a}{dm\_bg + b}$ guiding rendezvous objectives and a hardware action model $a_t = \mathcal{F}(I_t, S_t, P)$ over $(\mathbb{R}^6 \times \{0,1\})$.

Findings: VLMs achieve competitive end-to-end performance versus LLMs and traditional baselines, though latency remains the main bottleneck; preliminary OpenVLA fine-tuning for robotic inspection shows feasible progress with limited data.

Contributions: (i) a concrete end-to-end multimodal framework for space tasks, (ii) a detailed comparison of software and hardware operator pipelines, (iii) design of prompt strategies and data augmentations that leverage visual cues, and (iv) initial empirical results highlighting latency and generalization aspects.

Significance: demonstrates practical viability of multimodal autonomous reasoning for rendezvous, docking, servicing, and satellite diagnostics, and outlines concrete steps for reducing latency and extending to humanoid platforms.
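To make the two formulas in the TL;DR concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that `dm_lb` and `dm_bg` are two scalar distance-like measurements and that `a` and `b` are tunable constants; the summary defines none of these, so the function name, argument names, and example values are hypothetical. The `HardwareAction` class is likewise an assumed container for one element of $\mathbb{R}^6 \times \{0,1\}$: six real components plus one binary flag, whose physical meaning is not stated here.

```python
from dataclasses import dataclass
from typing import Tuple


def rendezvous_score(dm_lb: float, dm_bg: float, a: float, b: float) -> float:
    """Evaluate dm_lb^2 + a / (dm_bg + b).

    The meaning of dm_lb, dm_bg, a and b is an assumption: two scalar
    measurements and two constants. The source only gives the expression.
    """
    denominator = dm_bg + b
    if denominator == 0:
        raise ValueError("dm_bg + b must be non-zero")
    return dm_lb ** 2 + a / denominator


@dataclass(frozen=True)
class HardwareAction:
    """One element of R^6 x {0, 1} (hypothetical container).

    `continuous` holds the six real-valued components and `flag` the
    binary component; what each component controls is not specified
    in the summary.
    """

    continuous: Tuple[float, float, float, float, float, float]
    flag: int

    def __post_init__(self) -> None:
        # Enforce the shape of the action space stated in the TL;DR.
        if len(self.continuous) != 6:
            raise ValueError("continuous part must have exactly 6 components")
        if self.flag not in (0, 1):
            raise ValueError("flag must be 0 or 1")


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    print(rendezvous_score(dm_lb=3.0, dm_bg=1.0, a=4.0, b=1.0))
    print(HardwareAction(continuous=(0.0, 0.0, 0.1, 0.0, 0.0, 0.0), flag=1))
```

Under these assumptions, the first term grows quadratically with `dm_lb`, while the second term shrinks as `dm_bg` grows, so the expression trades off the two quantities; the constant `b` keeps the second term bounded when `dm_bg` approaches zero.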

Abstract

This paper explores the application of Vision-Language Models (VLMs) as operator agents in the space domain, focusing on both software and hardware operational paradigms. Building on advances in Large Language Models (LLMs) and their multimodal extensions, we investigate how VLMs can enhance autonomous control and decision-making in space missions. In the software context, we employ VLMs within the Kerbal Space Program Differential Games (KSPDG) simulation environment, enabling the agent to interpret visual screenshots of the graphical user interface to perform complex orbital maneuvers. In the hardware context, we integrate VLMs with robotic systems equipped with cameras to inspect and diagnose physical space objects, such as satellites. Our results demonstrate that VLMs can effectively process visual and textual data to generate contextually appropriate actions, competing with traditional methods and non-multimodal LLMs in simulation tasks, and showing promise in real-world applications.

Paper Structure

This paper contains 22 sections, 4 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of the proposed approach to use an LLM (e.g., ChatGPT) as an autonomous spacecraft operator, with image prompts derived from in-game screenshots.
  • Figure 3: Overview of the LLM-based robotic control system for space hardware inspection and diagnosis.
  • Figure : Visual Input
  • Figure : Visual Input