Visual Language Models as Operator Agents in the Space Domain

Alejandro Carrasco; Marco Nedungadi; Enrico M. Zucchelli; Amit Jain; Victor Rodriguez-Fernandez; Richard Linares

TL;DR

Problem: Enabling autonomous, context-aware control in space by fusing visual and textual reasoning.

Approach: A dual-pipeline, end-to-end framework that uses Vision-Language Models (VLMs) for two tasks: rendezvous in software, within the Kerbal Space Program Differential Games (KSPDG), and inspection in hardware, with a RealSense-enabled xArm 7. Observations are partitioned between the vision and language modules, and actions are discretized into a tractable space. A scoring function $dm\_lb^2 + \frac{a}{dm\_bg + b}$ guides the rendezvous objectives, and the hardware action model is $a_t = \mathcal{F}(I_t, S_t, P)$ over $(\mathbb{R}^6 \times \{0,1\})$.

Findings: VLMs achieve end-to-end performance competitive with LLMs and traditional baselines, though latency remains the main bottleneck. Preliminary OpenVLA fine-tuning for robotic inspection shows feasible progress with limited data.

Contributions: (i) a concrete end-to-end multimodal framework for space tasks, (ii) a detailed comparison of software and hardware operator pipelines, (iii) prompt strategies and data augmentations that leverage visual cues, and (iv) initial empirical results on latency and generalization.

Significance: Demonstrates the practical viability of multimodal autonomous reasoning for rendezvous, docking, servicing, and satellite diagnostics, and outlines concrete steps for reducing latency and extending the approach to humanoid platforms.
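The scoring expression and the hardware action space quoted in the approach can be made concrete with a small sketch. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `rendezvous_score`, `discretize`, and `HardwareAction` are hypothetical. The constants `a` and `b` are left as parameters because the summary does not give their values. The three-level discretization is an assumed example. Reading $\mathbb{R}^6 \times \{0,1\}$ as a six-component continuous command plus one binary flag is also an assumption.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


def rendezvous_score(dm_lb: float, dm_bg: float, a: float, b: float) -> float:
    """Evaluate dm_lb^2 + a / (dm_bg + b), the expression quoted in the summary.

    What dm_lb and dm_bg measure, and the values of a and b, are not stated
    in the summary, so they are plain parameters here.
    """
    denominator = dm_bg + b
    if denominator == 0:
        raise ValueError("dm_bg + b must be non-zero")
    return dm_lb ** 2 + a / denominator


def discretize(value: float, levels: Sequence[float] = (-1.0, 0.0, 1.0)) -> float:
    """Snap a continuous command to the nearest allowed level.

    The level set is an assumption for illustration; the summary only says
    that actions are discretized into a tractable space.
    """
    return min(levels, key=lambda level: abs(level - value))


@dataclass(frozen=True)
class HardwareAction:
    """One element of R^6 x {0, 1} (interpretation assumed, see lead-in)."""

    command: Tuple[float, ...]  # six continuous components
    flag: int                   # binary component, 0 or 1

    def __post_init__(self) -> None:
        if len(self.command) != 6:
            raise ValueError("command must have exactly 6 components")
        if self.flag not in (0, 1):
            raise ValueError("flag must be 0 or 1")


if __name__ == "__main__":
    # Arbitrary example values, chosen only to show the arithmetic.
    print(rendezvous_score(dm_lb=3.0, dm_bg=1.0, a=10.0, b=1.0))
    print([discretize(x) for x in (0.7, -0.2, -0.9)])
    print(HardwareAction(command=(0.0, 0.0, 0.1, 0.0, 0.0, 0.0), flag=1))
```

Keeping the score as a pure function of its inputs makes it easy to evaluate offline against logged trajectories, and validating the action at construction time catches malformed model outputs before they reach a controller.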

Abstract

This paper explores the application of Vision-Language Models (VLMs) as operator agents in the space domain, focusing on both software and hardware operational paradigms. Building on advances in Large Language Models (LLMs) and their multimodal extensions, we investigate how VLMs can enhance autonomous control and decision-making in space missions. In the software context, we employ VLMs within the Kerbal Space Program Differential Games (KSPDG) simulation environment, enabling the agent to interpret visual screenshots of the graphical user interface to perform complex orbital maneuvers. In the hardware context, we integrate VLMs with robotic systems equipped with cameras to inspect and diagnose physical space objects, such as satellites. Our results demonstrate that VLMs can effectively process visual and textual data to generate contextually appropriate actions, competing with traditional methods and non-multimodal LLMs in simulation tasks, and showing promise in real-world applications.

Paper Structure (22 sections, 4 equations, 4 figures, 4 tables)

Introduction
Background
Large Language Models (LLMs)
Prompt Engineering
Zero-shot Prompting
Few-shot Prompting
Advanced Prompting Paradigms
Structured Outputs
Function Calling
Vision-Language Models (VLMs)
Kerbal Space Program Differential Games (KSPDG)
Pursuer-Evader Scenarios
Lady-Bandit-Guard Scenarios
VLMs as Software operators - use case in Kerbal Space Program
Prior Prompt Engineering
...and 7 more sections

Figures (4)

Figure 1: Overview of the proposed approach to use an LLM (e.g., ChatGPT) as an autonomous spacecraft operator, with image prompts derived from in-game screenshots.
Figure 3: Overview of the LLM-based robotic control system for space hardware inspection and diagnosis.
Figure : Visual Input
Figure : Visual Input
