LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding

Hongyu Li, Jinyu Chen, Ziyu Wei, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, Si Liu

TL;DR

LLaVA-ST tackles the challenge of unified fine-grained spatial–temporal understanding in multimodal large language models by introducing Language-Aligned Positional Embedding (LAPE) to embed coordinate tokens into the visual space and a Spatial-Temporal Packer (STP) to decouple temporal and spatial feature compression. A large ST-Align dataset (about 4.3M samples across 15 tasks) enables a progressive three-stage training pipeline—content alignment, coordinate alignment, and multi-task instruction tuning—resulting in end-to-end capabilities for Spatial-Temporal Video Grounding (STVG), Event Localization and Captioning (ELC), and Spatial Video Grounding (SVG). The approach achieves state-of-the-art performance on 11 benchmarks requiring fine-grained temporal, spatial, or interleaved understanding, including notable gains in STVG, SVG, TVG, and REC tasks, validating the benefits of preserving spatiotemporal detail and joint alignment. The work provides a practical path toward robust, end-to-end spatio-temporal multimodal reasoning with potential applications in video analysis, robotics, and interactive AI systems.

Abstract

Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special tokens into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose the ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-Align, we present a progressive training pipeline that aligns the visual and textual features through sequential coarse-to-fine stages. Additionally, we introduce an ST-Align benchmark to evaluate spatial-temporal interleaved fine-grained understanding tasks, which include Spatial-Temporal Video Grounding (STVG), Event Localization and Captioning (ELC), and Spatial Video Grounding (SVG). LLaVA-ST achieves outstanding performance on 11 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaved multimodal understanding. Our code, data and benchmark will be released at https://github.com/appletea233/LLaVA-ST.
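
To make the idea of textual coordinate special tokens more concrete, the following is a minimal Python sketch of how normalized box corners and timestamps could be quantized into discrete tokens. It is an illustration only: the bin count (`NUM_BINS`), the token spellings (`<w..>`, `<h..>`, `<t..>`), and the function names are all hypothetical and are not taken from the paper or its code release.

```python
NUM_BINS = 100  # assumption: illustrative bin count, not a value from the paper


def quantize(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a normalized coordinate in [0, 1] to an integer bin index."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"expected a value in [0, 1], got {value}")
    # min() keeps value == 1.0 inside the last bin instead of overflowing.
    return min(int(value * num_bins), num_bins - 1)


def box_to_tokens(box, num_bins: int = NUM_BINS) -> list[str]:
    """Turn a normalized (x1, y1, x2, y2) box into four hypothetical tokens."""
    x1, y1, x2, y2 = box
    return [
        f"<w{quantize(x1, num_bins)}>",
        f"<h{quantize(y1, num_bins)}>",
        f"<w{quantize(x2, num_bins)}>",
        f"<h{quantize(y2, num_bins)}>",
    ]


def span_to_tokens(start: float, end: float, duration: float,
                   num_bins: int = NUM_BINS) -> list[str]:
    """Turn a (start, end) time span in seconds into two hypothetical tokens."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return [
        f"<t{quantize(start / duration, num_bins)}>",
        f"<t{quantize(end / duration, num_bins)}>",
    ]


if __name__ == "__main__":
    print(box_to_tokens((0.1, 0.2, 0.5, 1.0)))
    print(span_to_tokens(5.0, 10.0, 20.0))
```

With a fixed vocabulary of this kind, each spatial or temporal coordinate becomes a single token whose embedding can be learned, rather than a multi-digit number spelled out in text. The abstract describes LAPE as embedding such tokens into the visual space, but how that is done is not covered by this sketch.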

Paper Structure

This paper contains 22 sections, 8 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: (Left) LLaVA-ST demonstrates high performance across various fine-grained multimodal understanding tasks and is the first MLLM capable of simultaneously processing spatial-temporal fine-grained understanding tasks. (Right) Examples of spatial-temporal interleaved fine-grained understanding tasks in the proposed ST-Align, which include Spatial-Temporal Video Grounding (STVG), Event Localization and Captioning (ELC), and Spatial Video Grounding (SVG).
  • Figure 2: The overall architecture of LLaVA-ST. In LLaVA-ST, we introduce discrete special tokens to represent spatio-temporal coordinates within the language modality. LAPE embeds these coordinate representations into the visual feature space. Furthermore, the STP module utilizes a two-stream packing mechanism to efficiently compress the features.
  • Figure 3: Details of the LAPE. LAPE leverages coordinate-related input text embeddings and features within the output layer matrix as visual positional embeddings.
  • Figure 4: The architecture of $\text{packer}_s$; $\text{packer}_t$ shares a similar architecture.
  • Figure 5: Prompt of instruction data generated from GranD. $<box>$ indicates the bounding box of the object and $<object>$ represents the corresponding language description of the object.
  • ...and 10 more figures