360-LLaMA-Factory: Plug & Play Sequence Parallelism for Long Post-Training
Haosheng Zou, Xiaowei Lv, Shousheng Jia, Lin Li, Xiaochun Gong, Xiangzheng Zhang
TL;DR
The paper tackles the challenge of long-context training by integrating sequence parallelism into the LLaMA-Factory framework. It implements Ring-Attention and DeepSpeed-Ulysses within LLaMA-Factory and introduces Dummy Head Ulysses to overcome head-divisibility constraints, while extending support to vision-language models with a placeholder-token approach. Through initialization, data processing, and loss-computation details, the authors provide a practical, compatible pathway for post-training with long sequences and systematic comparisons across methods in terms of correctness, throughput, and maximum sequence length. The work demonstrates memory and efficiency benefits, offers guidance for deployment, and outlines future directions to broaden model coverage and optimize training workflows for multimodal and long-sequence settings.
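The head-divisibility constraint mentioned above can be illustrated with a little arithmetic. This is a minimal sketch, assuming that Ulysses-style sequence parallelism needs the attention head count to be divisible by the sequence-parallel size and that dummy heads are padded on to reach the next multiple. The function names `dummy_heads_needed` and `heads_per_rank` are hypothetical and are not taken from the 360-LLaMA-Factory code base.

```python
def dummy_heads_needed(num_heads: int, sp_size: int) -> int:
    """Return how many extra (dummy) heads make the head count divisible by sp_size.

    Assumption: padding up to the next multiple of sp_size is enough to
    satisfy the divisibility requirement. Names here are illustrative only.
    """
    if num_heads <= 0 or sp_size <= 0:
        raise ValueError("num_heads and sp_size must be positive")
    # Python's modulo of a negative number yields the non-negative gap
    # to the next multiple of sp_size (0 if already divisible).
    return (-num_heads) % sp_size


def heads_per_rank(num_heads: int, sp_size: int) -> int:
    """Heads handled by each sequence-parallel rank after dummy-head padding."""
    padded = num_heads + dummy_heads_needed(num_heads, sp_size)
    return padded // sp_size


if __name__ == "__main__":
    # Hypothetical configuration: 28 attention heads, sequence-parallel size 8.
    print(dummy_heads_needed(28, 8))
    print(heads_per_rank(28, 8))
```

Under these assumptions, a head count that is already divisible needs no padding, so the dummy heads only add overhead when the model's head count and the parallel size do not line up.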
Abstract
By adding sequence parallelism to LLaMA-Factory, we open-sourced 360-LLaMA-Factory at https://github.com/Qihoo360/360-LLaMA-Factory. 360-LLaMA-Factory has received wide recognition and has been used in models such as Light-R1 (arXiv:2503.10460), TinyR1 (arXiv:2503.04872), and Kaggle AIMO math models, as well as in large companies' training frameworks. This technical report delves deeper into the different sequence parallel modes behind 360-LLaMA-Factory and discusses our implementation insights.
