What explains the success of cross-modal fine-tuning with ORCA?
Paloma García-de-Herreros, Vagrant Gautam, Philipp Slusallek, Dietrich Klakow, Marius Mosbach
TL;DR
The study interrogates the origins of ORCA's cross-modal transfer performance by conducting targeted ablations across its stages, varying proxy datasets, and comparing pre-training scales. It finds that 2D tasks do not benefit from embedder training, while 1D tasks show limited gains and potential downsides from excessive embedder updates; in all cases, fine-tuning the pre-trained model is essential. Pre-training helps only for certain tasks and scales, and in some cases is unnecessary, underscoring the need for strong no-pretraining baselines. Overall, the work provides a nuanced view of what drives cross-modal fine-tuning success and calls for careful baselines and broader datasets to validate such transfer methods.
Abstract
ORCA (Shen et al., 2023) is a recent technique for cross-modal fine-tuning, i.e., applying pre-trained transformer models to modalities beyond their training data. The technique consists primarily of training an embedder and fine-tuning the embedder and model. Despite its high performance on a variety of downstream tasks, we do not understand precisely how each of these components contributes to ORCA's success. Therefore, we run a series of ablations and find that embedder training does not help 2D tasks at all, contrary to what the original paper posits. In 1D tasks, some amount of embedder training is necessary but more is not better. In 4 out of 6 datasets we experiment with, it is model fine-tuning that makes the biggest difference. Through our ablations and baselines, we contribute a better understanding of the individual components of ORCA.
