ParaDySe: A Parallel-Strategy Switching Framework for Dynamic Sequence Lengths in Transformer
Zhixin Ou, Peng Liang, Jianchen Han, Baihui Liu, Linbo Qiao
TL;DR
ParaDySe introduces a layer-wise adaptive parallel-strategy switching framework for Transformer training with dynamic sequence lengths. It unifies tensor layouts across mainstream parallel methods, builds sequence-aware hybrid time and memory cost models, and uses a heuristic switching module to select optimal strategies on-the-fly without tensor redistribution. The approach increases the maximum trainable sequence length (up to 624K tokens) and delivers substantial speedups across representative LLMs, mitigating the communication-parallelization cancellation (CPC) and out-of-memory (OOM) bottlenecks. This enables redistribution-free, memory-efficient LLM training on workloads with extremely long sequences and dynamic inputs.
Abstract
Dynamic sequences with varying lengths are widely used in training Transformer-based large language models (LLMs). However, current training frameworks adopt a pre-defined static parallel strategy for these sequences, causing either communication-parallelization cancellation (CPC) on short sequences or out-of-memory (OOM) errors on long sequences. To mitigate these issues, we propose ParaDySe, a novel adaptive Parallel-strategy switching framework for Dynamic Sequences. ParaDySe adopts the optimal strategy on-the-fly according to the immediate input sequence. It first implements modular function libraries for parallel strategies with unified tensor layout specifications, and then builds sequence-aware memory and time cost models with hybrid methods. Guided by these cost models, ParaDySe selects optimal layer-wise strategies for dynamic sequences via an efficient heuristic algorithm. By integrating these techniques, ParaDySe achieves seamless hot-switching of optimal strategies through its function libraries. We compare ParaDySe with baselines on representative LLMs using datasets with sequence lengths up to 624K. Experimental results indicate that ParaDySe addresses the OOM and CPC bottlenecks in LLM training by systematically integrating long-sequence optimizations with existing frameworks.
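To make the idea of cost-model-guided, layer-wise strategy selection concrete, the following is a minimal illustrative sketch and not ParaDySe's actual algorithm. It assumes each candidate strategy exposes a per-layer time model and a per-layer memory model as functions of sequence length. The strategy names, the linear cost functions, the memory budget, and the greedy rule (start with the fastest strategy everywhere, then move layers to the most memory-efficient strategy until the plan fits) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Strategy:
    """A hypothetical parallel strategy with per-layer cost models."""
    name: str
    time_cost: Callable[[int], float]    # predicted per-layer time at a sequence length
    memory_cost: Callable[[int], float]  # predicted per-layer memory at a sequence length


def select_layer_strategies(
    seq_len: int,
    num_layers: int,
    strategies: List[Strategy],
    memory_budget: float,
) -> Optional[List[str]]:
    """Greedy sketch: prefer the fastest strategy, fall back layer by layer
    to the most memory-efficient one until the predicted memory fits.
    Returns None if even the most memory-efficient plan exceeds the budget."""
    fast = min(strategies, key=lambda s: s.time_cost(seq_len))
    lean = min(strategies, key=lambda s: s.memory_cost(seq_len))

    plan = [fast] * num_layers
    total = num_layers * fast.memory_cost(seq_len)
    for i in range(num_layers):
        if total <= memory_budget:
            break
        plan[i] = lean
        total += lean.memory_cost(seq_len) - fast.memory_cost(seq_len)

    if total > memory_budget:
        return None
    return [s.name for s in plan]


# Toy cost models (assumed, not measured): one strategy is faster but uses
# more memory per token, the other is slower but more memory-efficient.
STRATEGIES = [
    Strategy("fast_high_mem", lambda s: 1.0 + s / 1000, lambda s: s / 100),
    Strategy("slow_low_mem", lambda s: 2.0 + s / 500, lambda s: s / 500),
]

short_plan = select_layer_strategies(1_000, 4, STRATEGIES, memory_budget=250)
long_plan = select_layer_strategies(10_000, 4, STRATEGIES, memory_budget=250)
too_long_plan = select_layer_strategies(100_000, 4, STRATEGIES, memory_budget=250)
```

In this toy setting, a short sequence keeps the fast strategy on every layer, a longer sequence moves only as many layers as needed to the memory-efficient strategy, and a sequence that cannot fit under any mix returns `None`. A real system would replace the toy lambdas with fitted or profiled cost models and would also need the unified tensor layouts described above so that mixing strategies across layers requires no redistribution.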
