What explains the success of cross-modal fine-tuning with ORCA?

Paloma García-de-Herreros, Vagrant Gautam, Philipp Slusallek, Dietrich Klakow, Marius Mosbach

TL;DR

The study interrogates the origins of ORCA's cross-modal transfer performance by conducting targeted ablations across its stages, varying proxy datasets, and comparing pre-training scales. It finds that 2D tasks do not benefit from embedder training, while 1D tasks show limited gains and potential downsides from excessive embedder updates; in all cases, fine-tuning the pre-trained model is essential. Pre-training helps only for certain tasks and scales, and in some cases is unnecessary, underscoring the need for strong no-pretraining baselines. Overall, the work provides a nuanced view of what drives cross-modal fine-tuning success and calls for careful baselines and broader datasets to validate such transfer methods.

Abstract

ORCA (Shen et al., 2023) is a recent technique for cross-modal fine-tuning, i.e., applying pre-trained transformer models to modalities beyond their training data. The technique consists primarily of training an embedder and fine-tuning the embedder and model. Despite its high performance on a variety of downstream tasks, we do not understand precisely how each of these components contributes to ORCA's success. Therefore, we run a series of ablations and find that embedder training does not help 2D tasks at all, contrary to what the original paper posits. In 1D tasks, some amount of embedder training is necessary but more is not better. In 4 out of 6 datasets we experiment with, it is model fine-tuning that makes the biggest difference. Through our ablations and baselines, we contribute a better understanding of the individual components of ORCA.
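
To make the ablation space described in the abstract concrete, the following is a minimal sketch that enumerates which components could be switched on or off: whether the embedder is trained before fine-tuning, and whether the embedder or the pre-trained model is frozen during fine-tuning. The names `AblationConfig`, `trainable_groups`, and `ablation_grid` are hypothetical and chosen for illustration only; they are not taken from the ORCA codebase, and the predictor head is left out for brevity.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AblationConfig:
    """One hypothetical ablation setting (illustrative names, not ORCA's API)."""
    train_embedder_first: bool  # run the embedder-training stage before fine-tuning?
    freeze_embedder: bool       # keep embedder weights fixed during fine-tuning?
    freeze_model: bool          # keep pre-trained model weights fixed during fine-tuning?

    def trainable_groups(self):
        """Return the parameter groups that would be updated during fine-tuning."""
        groups = []
        if not self.freeze_embedder:
            groups.append("embedder")
        if not self.freeze_model:
            groups.append("model")
        return groups


def ablation_grid():
    """Enumerate every combination of the three on/off switches."""
    return [AblationConfig(t, fe, fm)
            for t, fe, fm in product([False, True], repeat=3)]


# Assumed mapping of two named settings onto the switches:
# full pipeline = embedder training, then fine-tuning embedder and model.
FULL_PIPELINE = AblationConfig(train_embedder_first=True,
                               freeze_embedder=False, freeze_model=False)
# naive fine-tuning = no embedder training, everything fine-tuned.
NAIVE_FINE_TUNING = AblationConfig(train_embedder_first=False,
                                   freeze_embedder=False, freeze_model=False)

if __name__ == "__main__":
    for cfg in ablation_grid():
        print(cfg, "->", cfg.trainable_groups())
```

Under these assumptions, the full pipeline and naive fine-tuning differ only in the `train_embedder_first` switch, which is the comparison that isolates the contribution of embedder training.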

Paper Structure (26 sections, 10 figures, 2 tables)

Figures (10)

  • Figure 1: The ORCA pipeline. Stage 2 involves training the task-specific embedder. Stage 3 fine-tunes the embedder, the pre-trained encoder, and the predictor.
  • Figure 2: Per-epoch fine-tuning performance on 2D tasks (above) and 1D tasks (below) when the embedder is trained with different proxy datasets or not trained at all, i.e., naive fine-tuning.
  • Figure 3: Per-epoch embedder training comparing OTDD ($\downarrow$) (metric minimized during this stage) to downstream task performance ($\downarrow$).
  • Figure 4: Freezing just the embedder, just the model, or both, before full fine-tuning. We also evaluate the impact of training vs. not training the embedder before freezing.
  • Figure 5: Effect of different amounts of pre-training data on downstream performance.
  • ...and 5 more figures