SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters

Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, Ziwei Liu

TL;DR

The paper tackles the challenge of enabling immersive, social interactions with 3D autonomous characters by proposing SOLAMI, an end-to-end social vision-language-action model. It combines a decoder-only LLM backbone with dedicated speech and motion tokenizers to produce synchronized multimodal outputs, and it introduces SynMSI, a synthetic dataset generated from existing motion-text data to enable effective training. Through a three-stage training pipeline and an immersive VR interface, SOLAMI achieves more precise and natural character responses with lower latency than baselines, as demonstrated by quantitative metrics and a user study. The work advances embodied AI by integrating perception, language, and action in a unified model and provides a practical route to scalable, interactive digital humans in VR environments.
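One way to picture how a decoder-only LLM can emit speech and motion alongside text is a shared token ID space, where each tokenizer's discrete codes are shifted into their own range. The sketch below illustrates that idea only; the vocabulary sizes, the offset scheme, and the function names are assumptions for illustration and are not taken from the paper.

```python
# Hypothetical sizes: the source does not state vocabulary or codebook sizes.
TEXT_VOCAB = 32000    # assumed size of the LLM's text vocabulary
SPEECH_CODES = 1024   # assumed size of the speech tokenizer codebook
MOTION_CODES = 512    # assumed size of the motion tokenizer codebook

# Each modality gets a contiguous ID range in one shared vocabulary.
OFFSETS = {
    "text": 0,
    "speech": TEXT_VOCAB,
    "motion": TEXT_VOCAB + SPEECH_CODES,
}
SIZES = {"text": TEXT_VOCAB, "speech": SPEECH_CODES, "motion": MOTION_CODES}


def to_unified(modality: str, code: int) -> int:
    """Map a modality-local code to an ID in the shared vocabulary."""
    if not 0 <= code < SIZES[modality]:
        raise ValueError(f"code {code} out of range for {modality}")
    return OFFSETS[modality] + code


def from_unified(token_id: int) -> tuple[str, int]:
    """Recover (modality, local code) from a shared-vocabulary ID."""
    for modality in ("motion", "speech", "text"):  # highest offset first
        if token_id >= OFFSETS[modality]:
            code = token_id - OFFSETS[modality]
            if code >= SIZES[modality]:
                raise ValueError(f"token id {token_id} out of range")
            return modality, code
    raise ValueError(f"negative token id {token_id}")


def split_response(token_ids: list[int]) -> dict[str, list[int]]:
    """Route a generated sequence back to per-modality code streams."""
    streams: dict[str, list[int]] = {"text": [], "speech": [], "motion": []}
    for token_id in token_ids:
        modality, code = from_unified(token_id)
        streams[modality].append(code)
    return streams
```

With a layout like this, the speech and motion streams returned by `split_response` would each be handed to the matching tokenizer's decoder to reconstruct audio and body motion.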

Abstract

Human beings are social animals. How to equip 3D autonomous characters with similar social intelligence that can perceive, understand and interact with humans remains an open yet fundamental problem. In this paper, we introduce SOLAMI, the first end-to-end Social vision-Language-Action (VLA) Modeling framework for Immersive interaction with 3D autonomous characters. Specifically, SOLAMI builds 3D autonomous characters from three aspects: (1) Social VLA Architecture: We propose a unified social VLA framework to generate multimodal responses (speech and motion) based on the user's multimodal input to drive the character for social interaction. (2) Interactive Multimodal Data: We present SynMSI, a synthetic multimodal social interaction dataset generated by an automatic pipeline using only existing motion datasets to address the issue of data scarcity. (3) Immersive VR Interface: We develop a VR interface that enables users to immersively interact with these characters driven by various architectures. Extensive quantitative experiments and user studies demonstrate that our framework leads to more precise and natural character responses (in both speech and motion) that align with user expectations with lower latency.

Paper Structure

This paper contains 29 sections, 3 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: SOLAMI enables the user to interact with 3D autonomous characters through speech and body language in an immersive VR environment via an end-to-end social vision-language-action model, which is trained on our synthesized multimodal dataset SynMSI.
  • Figure 2: Training pipeline of SOLAMI. We train SOLAMI through a three-stage process. In the pre-training stage, we train the model with motion-text and speech-text related tasks to align the speech and motion modalities with language. During the instruction tuning stage, we train the model with social multimodal multi-round interaction data, enabling it to generate multimodal responses that align with the character settings and the context of the topic.
  • Figure 3: SynMSI dataset generation. Our synthesizing pipeline consists of 4 steps. Based on numerous character-relevant topics and state-of-the-art LLMs (GPT-4o), we generate text scripts for multimodal dialogues. Using a large-scale motion database (Inter-X, HumanML3D, DLP), we retrieve the most appropriate motions and refine the speech scripts accordingly. Finally, we employ TTS/voice cloning (XTTS) to generate character-specific speech. This approach enables us to create multimodal interaction data of various characters using only existing motion datasets.
  • Figure 4: VR interface architecture. Our VR project consists of a Quest 3 client and a server. The Quest client captures and transmits the user's body motion and speech to the server. The server then generates the character's speech, body motion, and face blendshape parameters based on the selected method. The response is then sent back to the Quest client to drive the character.
  • Figure 5: Results of the user study with 95% confidence.
  • ...and 2 more figures
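
The motion retrieval step in the Figure 3 caption (matching a scripted motion description to the most appropriate clip in an existing motion-text database) can be illustrated with a small sketch. This is a sketch under stated assumptions, not the authors' method: the caption does not say how similarity is computed, so plain token-overlap (Jaccard) similarity stands in for whatever text matching the pipeline uses, and the database entries and function names are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a description and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_motion(query: str, database: dict[str, str]) -> tuple[str, float]:
    """Return the (motion_id, score) whose caption best matches the query.

    `database` maps a motion clip ID to its text caption. Token overlap is
    an assumed stand-in for the similarity measure used in the real pipeline.
    """
    if not database:
        raise ValueError("motion database is empty")
    query_tokens = tokenize(query)
    scored = (
        (motion_id, jaccard(query_tokens, tokenize(caption)))
        for motion_id, caption in database.items()
    )
    return max(scored, key=lambda item: item[1])


# Hypothetical captions in the style of a motion-text dataset.
MOTION_DB = {
    "m001": "a person waves the right hand",
    "m002": "a person sits down on a chair",
    "m003": "a person bows politely",
}

best_id, best_score = retrieve_motion("waves hand to greet", MOTION_DB)
```

In the pipeline the caption describes, the retrieved clip's caption would then be used to refine the speech script so that the words and the motion agree; that refinement step relies on an LLM and is not sketched here.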