Symmetric Multi-Similarity Loss for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2024
Xiaoqi Wang, Yi Wang, Lap-Pui Chau
TL;DR
The paper tackles visual-text retrieval under the EK-100 Multi-Instance Retrieval setting by exploiting a soft-label relevancy matrix that captures partial matches between video and text. It introduces a Symmetric Multi-Similarity Loss (SMS) that symmetrically optimizes positive and negative pairs in the presence of soft labels, building on and refining the traditional Multi-Similarity framework. Through an inference augmentation trick (frame flipping) and an ensemble of diverse models, the method achieves state-of-the-art performance on the EK-100 public leaderboard, surpassing prior baselines in average mAP and nDCG. The work provides code and demonstrates that careful objective design together with simple inference-time tricks yields practical gains in large-scale video-text retrieval with soft supervision.
Abstract
In this report, we present our champion solution for the EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge at CVPR 2024. This challenge differs from traditional visual-text retrieval tasks in that it provides a correlation matrix that serves as a set of soft labels for video-text pairs. However, existing loss functions have not fully exploited this information. Motivated by this, we propose a novel loss function, the Symmetric Multi-Similarity Loss, which offers a more precise learning objective. Together with additional tricks and ensemble learning, the model achieves 63.76% average mAP and 74.25% average nDCG on the public leaderboard, demonstrating the effectiveness of our approach. Our code will be released at: https://github.com/xqwang14/SMS-Loss/tree/main
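As a rough illustration of the inference-time side of the pipeline (frame flipping and model ensembling), the following is a minimal NumPy sketch and not the released implementation. It assumes that the similarity matrices computed from the original and horizontally flipped frames are averaged, and that ensemble members are combined by a weighted average of their video-text similarity matrices; the report summary does not specify either rule. The names `cosine_similarity_matrix`, `flip_augmented_similarity`, `ensemble_similarity`, and the `encode_video` callable are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(video_emb, text_emb):
    """Return a (num_videos, num_texts) matrix of cosine similarities."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return v @ t.T


def flip_augmented_similarity(encode_video, frames, text_emb):
    """Average the similarities from original and horizontally flipped frames.

    frames has shape (num_videos, num_frames, height, width, channels).
    encode_video is a hypothetical callable mapping frames to an array of
    shape (num_videos, dim). Averaging the two views is an assumption.
    """
    flipped = frames[:, :, :, ::-1, :]  # reverse the width axis
    sim_orig = cosine_similarity_matrix(encode_video(frames), text_emb)
    sim_flip = cosine_similarity_matrix(encode_video(flipped), text_emb)
    return 0.5 * (sim_orig + sim_flip)


def ensemble_similarity(sim_matrices, weights=None):
    """Combine per-model similarity matrices by a weighted average.

    Uniform weights are used when none are given (an assumption).
    """
    sims = np.stack(sim_matrices, axis=0)  # (num_models, videos, texts)
    if weights is None:
        weights = np.full(len(sim_matrices), 1.0 / len(sim_matrices))
    weights = np.asarray(weights, dtype=float)
    return np.tensordot(weights, sims, axes=1)  # (videos, texts)
```

Retrieval rankings for mAP and nDCG would then be obtained by sorting each row (video-to-text) or column (text-to-video) of the combined matrix in descending order.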
