
BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs

Nicolas Boizard, Théo Deschamps-Berger, Hippolyte Gisserot-Boukhlef, Céline Hudelot, Pierre Colombo

Abstract

Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.
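
The adaptation at the heart of this recipe is purely a masking change: a causal decoder already contains everything an encoder needs, and enabling bidirectional attention amounts to removing the causal mask so every token can attend to the full sequence. The following is a minimal PyTorch sketch of that difference (illustrative tensors only, not the paper's code); note that the final position is unaffected, since it already attended to the whole sequence under the causal mask.

```python
import torch
import torch.nn.functional as F

seq_len, n_heads, head_dim = 6, 4, 16
q = torch.randn(1, n_heads, seq_len, head_dim)
k = torch.randn(1, n_heads, seq_len, head_dim)
v = torch.randn(1, n_heads, seq_len, head_dim)

# Causal decoder: a lower-triangular mask blocks attention to future tokens.
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Bidirectional encoder: identical weights and layers, no causal mask.
bidir_out = F.scaled_dot_product_attention(q, k, v, is_causal=False)

# The last position is unchanged (it already saw the full sequence);
# every other position now differs because it can attend forward.
print(torch.allclose(causal_out[..., -1, :], bidir_out[..., -1, :], atol=1e-5))  # True
print(torch.allclose(causal_out, bidir_out, atol=1e-5))                          # False
```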



Figures (16)

  • Figure 1: Base (1): The original causal model. Bi+Base (2): The Base model with bidirectional attention enabled. Bi+MNTP (3): The Bi+Base model with an MNTP adaptation phase. Bi+Contrastive (4): The Bi+Base model with a contrastive adaptation phase. Bi+MNTP+Contrastive (5): The Bi+Base model adapted sequentially with MNTP followed by contrastive training (an MNTP-style objective is sketched after this list). Intermediate dashed blocks denote adaptation phases.
  • Figure 2: Performance comparison of model variants across downstream tasks. Bars illustrate the absolute performance change relative to the unmodified Base model. Exact point differences are annotated above or below each bar.
  • Figure 3: Evolution of model performance during long-run adaptation. Solid lines depict the absolute score change relative to the initial 10B adaptation, while dotted lines highlight the impact of complementary strategies for retaining general knowledge.
  • Figure 4: Model performance across merging ratios. The first four columns report task scores, while the rightmost column reports the model ranking based on average normalized performance across all tasks. The merging-ratio index sets the interpolation weight between the merged checkpoints (a minimal sketch of this interpolation follows the list).
  • Figure 5: Model performance across data mix ratios. The first four columns report task scores, while the rightmost column reports the model ranking based on average normalized performance across all tasks. The mix ratio specifies the proportion of multi-domain data.
  • ...and 11 more figures
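
For concreteness, the Bi+MNTP phase in Figure 1 refers to masked next-token prediction, the objective introduced by LLM2Vec: input tokens are masked as in BERT, but each masked token is predicted from the hidden state at the position just before it, keeping the loss aligned with the shifted next-token head used during causal pre-training. Below is a hedged sketch, assuming a Hugging Face-style decoder with bidirectional attention already enabled; `mask_token_id` and the 15% masking rate are illustrative defaults, not values from the paper.

```python
import torch
import torch.nn.functional as F

def mntp_loss(model, input_ids, mask_token_id, mask_prob=0.15):
    """MNTP sketch: mask tokens, predict each masked token from the previous position."""
    labels = input_ids.clone()
    # Sample mask positions; never mask position 0 (it has no predecessor).
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    mask[:, 0] = False
    labels[~mask] = -100                              # loss only on masked tokens
    corrupted = input_ids.masked_fill(mask, mask_token_id)

    logits = model(input_ids=corrupted).logits        # (batch, seq, vocab)
    # Shift as in causal pre-training: position i-1 predicts the token at i.
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```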
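Linear weight merging, used both to retain general knowledge during scaling and to compose specialized causal models, is a per-parameter interpolation: for ratio alpha, the merged weights are theta = (1 - alpha) * theta_a + alpha * theta_b, which is the quantity swept along the x-axis of Figure 4. A minimal sketch under the assumption that both checkpoints share the same architecture and parameter names (`linear_merge` is a hypothetical helper, not the paper's implementation):

```python
import torch

def linear_merge(state_a, state_b, alpha):
    """Per-parameter interpolation: theta = (1 - alpha) * theta_a + alpha * theta_b."""
    assert state_a.keys() == state_b.keys(), "checkpoints must share an architecture"
    return {
        name: torch.lerp(state_a[name].float(), state_b[name].float(), alpha)
        for name in state_a
    }

# Usage: merged = linear_merge(encoder.state_dict(), specialist.state_dict(), alpha=0.3)
```

At alpha = 0 the first checkpoint is kept unchanged and at alpha = 1 the second; intermediate values trade off between them, which is why the ranking column in Figure 4 identifies an interior optimum rather than either endpoint.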