Spectral-Aware Text-to-Time Series Generation with Billion-Scale Multimodal Meteorological Data

Shijie Zhang

Abstract

Text-to-time-series generation is particularly important in meteorology, where natural language offers intuitive control over complex, multi-scale atmospheric dynamics. Existing approaches are constrained by the lack of large-scale, physically grounded multimodal datasets and by architectures that overlook the spectral-temporal structure of weather signals. We address these challenges with a unified framework for text-guided meteorological time-series generation. First, we introduce MeteoCap-3B, a billion-scale weather dataset paired with expert-level captions constructed via a Multi-agent Collaborative Captioning (MACC) pipeline, yielding information-dense and physically consistent annotations. Building on this dataset, we propose MTransformer, a diffusion-based model that enables precise semantic control by mapping textual descriptions into multi-band spectral priors through a Spectral Prompt Generator, which guides generation via frequency-aware attention. Extensive experiments on real-world benchmarks demonstrate state-of-the-art generation quality, accurate cross-modal alignment, strong semantic controllability, and substantial gains in downstream forecasting under data-sparse and zero-shot settings. Additional results on general time-series benchmarks indicate that the proposed framework generalizes beyond meteorology.
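
To make the phrase "multi-band spectral" concrete, the following is a minimal sketch of how the energy of a single series can be split into frequency bands with a plain FFT. It is an illustration only and not the paper's Spectral Prompt Generator: the function name `band_energy_fractions`, the band edges, and the synthetic hourly series are all assumptions introduced for this example.

```python
import numpy as np

def band_energy_fractions(x, band_edges):
    """Return the fraction of spectral energy in each band.

    band_edges are in cycles per sample (0 to 0.5) and are an
    illustrative choice, not values taken from the paper.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                       # remove the DC component
    power = np.abs(np.fft.rfft(x)) ** 2    # one-sided power spectrum
    freqs = np.fft.rfftfreq(x.size)        # 0 .. 0.5 cycles per sample
    total = power.sum()
    if total == 0.0:                       # constant series: no energy to split
        return np.zeros(len(band_edges) - 1)
    fractions = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        fractions.append(power[mask].sum() / total)
    return np.array(fractions)

# Hypothetical hourly series: a 24-step cycle plus a weaker 4-step oscillation.
t = np.arange(960)
series = np.sin(2 * np.pi * t / 24) + 0.3 * np.sin(2 * np.pi * t / 4)

# Three assumed bands: low, mid, high (upper edge nudged to include Nyquist).
edges = [0.0, 0.1, 0.3, 0.5 + 1e-9]
fractions = band_energy_fractions(series, edges)
print(fractions)
```

In this toy case most of the energy falls in the lowest band because the 24-step cycle dominates; a text-conditioned model would need some mechanism to steer such a distribution, which is the role the abstract assigns to the spectral priors.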

Paper Structure

This paper contains 15 sections, 4 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Overview of our Multi-agent Collaborative Captioning (MACC) pipeline for MeteoCap-3B construction. GPT-4o, Claude Sonnet 3.5, and Gemini-3-Flash are used in Phase 3, and DeepSeek V3 [liu2024deepseek] is used in Phase 4. The human experts are eight PhDs in meteorology.
  • Figure 2: Representative samples across three subsets of MeteoCap-3B.
  • Figure 3: Our MTransformer for time series generation. Qwen-3-Embedding-7B [yang2025qwen3] is used to obtain 1024-dimensional caption embeddings.
  • Figure 4: Qualitative analysis. Left: Qualitative time series reconstruction. Right: Power spectral density of real and generated time series.
  • Figure 5: Scaling behavior of MTransformer with respect to model size (at a sequence length of 96), data scale, and sequence length, evaluated on the Meteo-Volatile subset.
  • ...and 1 more figure
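
The Figure 4 caption refers to comparing the power spectral density (PSD) of real and generated series. A minimal sketch of such a comparison is given below, assuming a plain one-sided periodogram and a log-spectral distance as the summary number. The helper names `periodogram` and `log_spectral_distance`, the `eps` floor, and the synthetic series are assumptions made for this example; the paper's actual PSD estimator and metric are not stated in this excerpt.

```python
import numpy as np

def periodogram(x, fs=1.0):
    """One-sided periodogram PSD estimate of a real-valued series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                          # remove the mean before the FFT
    n = x.size
    psd = np.abs(np.fft.rfft(x)) ** 2 / (fs * n)
    # Double the bins that have a mirrored negative-frequency partner.
    if n % 2 == 0:
        psd[1:-1] *= 2                        # keep DC and Nyquist single
    else:
        psd[1:] *= 2                          # keep DC single
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, psd

def log_spectral_distance(real, generated, eps=1e-12):
    """RMS difference of log10 PSDs; eps is an assumed numerical floor."""
    _, p_real = periodogram(real)
    _, p_gen = periodogram(generated)
    diff = np.log10(p_real + eps) - np.log10(p_gen + eps)
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical data: a "real" series of length 96, a close imitation,
# and an unrelated white-noise series.
rng = np.random.default_rng(0)
t = np.arange(96)
real = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(96)
close = real + 0.05 * rng.standard_normal(96)
far = rng.standard_normal(96)

d_close = log_spectral_distance(real, close)
d_far = log_spectral_distance(real, far)
print(d_close, d_far)
```

A smaller distance means the generated spectrum tracks the real one more closely across frequencies, which is the property a PSD plot like the one described for Figure 4 lets a reader check by eye.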