NeuralOS: Towards Simulating Operating Systems via Neural Generative Models

Luke Rivard, Sun Sun, Hongyu Guo, Wenhu Chen, Yuntian Deng

Abstract

We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Beyond reproducing existing systems, NeuralOS shows that synthesized training data can teach the model to simulate applications that were never installed, as illustrated by a Doom application, and suggests a path toward learning user interfaces purely from synthetic demonstrations.
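The interaction between the two components described in the abstract can be sketched as a simple autoregressive loop. This is a minimal illustration, not the authors' implementation: `rollout`, `update_state`, and `render` are hypothetical names, and the two callables stand in for the RNN and the diffusion-based renderer, whose internals are not modeled here.

```python
from typing import Callable, Iterable, List, TypeVar

State = TypeVar("State")
Frame = TypeVar("Frame")
Event = TypeVar("Event")


def rollout(
    update_state: Callable[[State, Event, Frame], State],  # stand-in for the RNN (assumption)
    render: Callable[[State], Frame],  # stand-in for the diffusion renderer (assumption)
    initial_state: State,
    initial_frame: Frame,
    events: Iterable[Event],
) -> List[Frame]:
    """Generate one frame per input event, feeding each generated frame back in."""
    state, frame = initial_state, initial_frame
    frames: List[Frame] = []
    for event in events:
        # Recurrent step: fold the new user input and the last generated frame into the state.
        state = update_state(state, event, frame)
        # Rendering step: produce the next frame from the updated state.
        frame = render(state)
        frames.append(frame)
    return frames
```

Because each step consumes only the current state, the latest input, and the previous frame, the cost per step does not grow with the length of the interaction.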

Paper Structure

This paper contains 57 sections, 10 equations, 21 figures, 6 tables.

Figures (21)

  • Figure 1: Real image sequence predicted by NeuralOS, illustrating the model's ability to simulate realistic GUI interactions. The sequence shows key frames as a user (a–c) opens and closes the "Home" folder and then (d–f) launches and closes a Doom application that was trained into the model using synthetic demonstrations. Cursor positions are highlighted with red circles. Frames are generated autoregressively, conditioned on previous frames and user inputs.
  • Figure 2: NeuralOS Model Architecture. (a) High-level architecture of NeuralOS. At each timestep, an RNN tracks the operating system's internal state based on user inputs (cursor positions, mouse clicks, keyboard events) and previously generated frames. This state is then passed as context to a diffusion-based renderer (UNet) that generates the next graphical frame. (b) Detailed two-level RNN structure at timestep t. The lower-level LSTM encodes user inputs, and then integrates visual information from the previous frame using attention. Its output is passed to the upper-level LSTM, which further processes these attention-informed representations. Feedback from the upper-level LSTM to the lower-level LSTM ($U_{t-1}$) ensures that the lower-level LSTM maintains awareness of upper-level state context and previous attention results. The combined outputs of both LSTMs, and cursor position encoding, form the renderer context. This hierarchical structure maintains constant computational complexity per timestep and supports continuous state updates during inference, essential for real-time OS interface simulation.
  • Figure 3: Multi-stage training pipeline for NeuralOS. (1) RNN Pretraining: The RNN is pretrained to predict latent frames using a mean squared error (MSE) loss. (2) Joint Training: The pretrained RNN and the diffusion-based renderer are jointly optimized using a standard diffusion loss. (3) Scheduled Sampling: To mitigate error accumulation caused by exposure bias, the most recent input frame is occasionally replaced by a previously generated frame. (4) Context Length Extension: The input context is extended to enable the model to capture long-term dependencies.
  • Figure 4: (a) Heatmap illustrating predicted vs. ground-truth state transitions. Each cell represents the percentage of predictions assigned to a particular predicted cluster (x-axis), given a ground-truth cluster (y-axis). Only the top 16 clusters are displayed here; refer to \ref{fig:full_transition} for the complete heatmap. (b) Comparison of cursor position errors for NeuralOS (with cursor position map), NeuralOS without the cursor position map, and a random baseline. (c) Pixel RMSE of generated frames with versus without Stage 3 scheduled sampling training.
  • Figure 5: Doom interactions generated by NeuralOS. Doom was never installed in the underlying operating system used for data collection; instead, the model learned to simulate the application from synthesized training data, producing realistic walking and shooting behavior from user inputs.
  • ...and 16 more figures
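
The scheduled sampling stage in the Figure 3 caption (occasionally replacing the most recent input frame with a previously generated one) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `scheduled_sampling_context` and the parameter `replace_prob` are hypothetical, the captions give no replacement probability, and frames are treated as opaque objects.

```python
import random
from typing import List, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")


def scheduled_sampling_context(
    ground_truth_context: Sequence[Frame],
    generated_frame: Frame,
    replace_prob: float,  # hypothetical hyperparameter; not stated in the captions
    rng: Optional[random.Random] = None,
) -> List[Frame]:
    """Return a copy of the context whose most recent frame is, with
    probability replace_prob, swapped for a model-generated frame."""
    if not 0.0 <= replace_prob <= 1.0:
        raise ValueError("replace_prob must lie in [0, 1]")
    rng = rng or random.Random()
    context = list(ground_truth_context)  # copy so the caller's data is untouched
    if context and rng.random() < replace_prob:
        # Expose the model to its own output during training so that it
        # learns to recover from the errors it will see at inference time.
        context[-1] = generated_frame
    return context
```

Setting `replace_prob` to 0 recovers ordinary teacher forcing, in which the model only ever sees ground-truth frames during training.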