LongCat-Image Technical Report

Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, Xunliang Cai, Yayong Guan, Jie Hu

TL;DR

LongCat-Image is a compact, open-source, bilingual (Chinese-English) diffusion model with a 6B-parameter core that achieves high photorealism and strong Chinese text rendering without relying on a massive parameter count. The authors attribute this to data quality, staged training (pre-training, mid-training, and post-training), and reinforcement learning guided by multiple curated reward models, which together yield state-of-the-art results among open-source models. A key contribution is the open-source ecosystem, which includes mid-training and post-training checkpoints and the full training code, so that researchers can reproduce and extend the work. The model performs well in both image generation and image editing while keeping VRAM usage and inference cost low.

Abstract

We introduce LongCat-Image, a pioneering open-source and bilingual (Chinese-English) foundation model for image generation, designed to address core challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility prevalent in current leading models. 1) We achieve this through rigorous data curation strategies across the pre-training, mid-training, and SFT stages, complemented by the coordinated use of curated reward models during the RL phase. This strategy establishes the model as a new state-of-the-art (SOTA), delivering superior text-rendering capabilities and remarkable photorealism, and significantly enhancing aesthetic quality. 2) Notably, it sets a new industry standard for Chinese character rendering. By supporting even complex and rare characters, it outperforms both major open-source and commercial solutions in coverage, while also achieving superior accuracy. 3) The model achieves remarkable efficiency through its compact design. With a core diffusion model of only 6B parameters, it is significantly smaller than the nearly 20B or larger Mixture-of-Experts (MoE) architectures common in the field. This ensures minimal VRAM usage and rapid inference, significantly reducing deployment costs. Beyond generation, LongCat-Image also excels in image editing, achieving SOTA results on standard benchmarks with superior editing consistency compared to other open-source models. 4) To fully empower the community, we have established the most comprehensive open-source ecosystem to date. We are releasing not only multiple model versions for text-to-image and image editing, including checkpoints after the mid-training and post-training stages, but also the entire training toolchain. We believe that the openness of LongCat-Image will provide robust support for developers and researchers, pushing the frontiers of visual content creation.
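To put the parameter-count comparison in point 3 in concrete terms, the following back-of-the-envelope sketch estimates the memory needed for model weights alone. It assumes 16-bit weights (2 bytes per parameter), which the abstract does not state, and it ignores the text encoder, VAE, activations, and any framework overhead, so actual VRAM usage will be higher. The function name `weight_memory_gib` is illustrative and not part of any released code.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GiB.

    bytes_per_param=2 assumes 16-bit storage (e.g. fp16/bf16); this is an
    assumption for illustration, not a figure taken from the report.
    """
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    # 6B is the core diffusion model size from the abstract; 20B is the
    # comparison point the abstract mentions for larger models.
    for label, params in [("6B core model", 6e9), ("20B comparison", 20e9)]:
        print(f"{label}: ~{weight_memory_gib(params):.1f} GiB of weights")
```

Under these assumptions the 6B core needs roughly 11 GiB for weights versus roughly 37 GiB for a 20B model, which is the kind of gap that decides whether a model fits on a single consumer GPU.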

Paper Structure

This paper contains 64 sections, 11 equations, 28 figures, 11 tables.

Figures (28)

  • Figure 2: High-fidelity text-to-image generation results.
  • Figure 3: Showcase of versatile capabilities in general image editing.
  • Figure 4: Showcase on complex and comprehensive editing scenarios. Beyond basic edits, LongCat-Image-Edit exhibits robust handling of intricate modifications and composite instructions.
  • Figure 5: Overview of training data.
  • Figure 6: Data curation pipeline. The pipeline consists of four stages: (1) Filtering: Raw data undergoes deduplication and quality assessment, including watermark and AIGC detection. (2) Meta Information Extraction: We extract comprehensive metadata, such as aesthetic scores, named entities, and OCR text. (3) Multi-Granularity Captioning: Leveraging the extracted metadata and prompt templates, a VLM generates captions ranging from entity-level tags to detailed photographic descriptions. (4) Stratification: The dataset is stratified into a pyramid structure based on style, quality, and content to support progressive training stages.
  • ...and 23 more figures
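
The four stages named in the Figure 6 caption (filtering, meta information extraction, multi-granularity captioning, stratification) can be sketched as a small pipeline. This is a toy illustration only, not the authors' implementation: every name here (`Sample`, `filter_samples`, `extract_meta`, `caption`, `stratify`, `curate`), the score fields, and the thresholds are hypothetical, and the metadata and captioning steps are stubs standing in for the aesthetic, OCR, entity, and VLM models a real system would call.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Sample:
    image_bytes: bytes
    quality: float               # hypothetical quality score in [0, 1]
    has_watermark: bool = False  # would come from a watermark detector
    is_aigc: bool = False        # would come from an AIGC detector
    meta: dict = field(default_factory=dict)
    captions: dict = field(default_factory=dict)
    tier: str = ""


def filter_samples(samples, min_quality=0.5):
    """Stage 1: drop exact duplicates, low-quality, watermarked, and AIGC samples."""
    seen, kept = set(), []
    for s in samples:
        digest = hashlib.sha256(s.image_bytes).hexdigest()
        if digest in seen or s.quality < min_quality or s.has_watermark or s.is_aigc:
            continue
        seen.add(digest)
        kept.append(s)
    return kept


def extract_meta(sample):
    """Stage 2: attach metadata (placeholder values instead of real model outputs)."""
    sample.meta = {"aesthetic": sample.quality, "entities": [], "ocr_text": ""}
    return sample


def caption(sample):
    """Stage 3: captions at two granularities, built from templates instead of a VLM."""
    entities = ", ".join(sample.meta["entities"]) or "unknown"
    sample.captions = {
        "tags": entities,
        "detailed": f"An image containing: {entities}. Visible text: '{sample.meta['ocr_text']}'.",
    }
    return sample


def stratify(sample, high=0.8):
    """Stage 4: assign a pyramid tier; the 0.8 threshold is made up."""
    sample.tier = "top" if sample.meta["aesthetic"] >= high else "base"
    return sample


def curate(samples):
    """Run all four stages in order."""
    return [stratify(caption(extract_meta(s))) for s in filter_samples(samples)]


if __name__ == "__main__":
    raw = [
        Sample(b"a", 0.9),
        Sample(b"a", 0.9),                      # exact duplicate
        Sample(b"b", 0.3),                      # below quality threshold
        Sample(b"c", 0.6, has_watermark=True),  # watermarked
        Sample(b"d", 0.6),
    ]
    for s in curate(raw):
        print(s.tier, s.captions["tags"])
```

Keeping each stage as a separate function mirrors the staged description in the caption: the tiers produced in the last step are what a progressive training schedule would draw from, with higher tiers reserved for later stages.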