PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation

Nina Hosseini-Kivanani

TL;DR

PolyFrame addresses multimodal idiom disambiguation in multilingual settings by ranking images or captions conditioned on idiomatic expressions, while keeping frozen vision–language encoders and using lightweight, interpretable modules. The approach combines idiom-aware paraphrasing, sentence-type classification, and robust rank fusion across SigLIP2 and BGE M3 representations to achieve large gains over a CLIP baseline, including strong zero-shot transfer to Portuguese and competitive multilingual performance. Key contributions include idiom synonym replacement, explicit sentence-type signals, and a flexible fusion scheme that operates without fine-tuning large models. The results demonstrate practical viability for cross-lingual figurative language understanding in resource-constrained settings and highlight directions for extending fusion strategies and encoder-based models.

Abstract

Multimodal models struggle with idiomatic expressions because their meanings are non-compositional, a challenge amplified in multilingual settings. We introduce PolyFrame, our system for the MWE-2026 AdMIRe 2 shared task on multimodal idiom disambiguation, featuring a unified pipeline for both image+text ranking (Subtask A) and text-only caption ranking (Subtask B). All model variants retain frozen CLIP-style vision–language encoders and the multilingual BGE M3 encoder, training only lightweight modules: a sentence-type predictor based on logistic regression and an LLM, idiom synonym substitution, distractor-aware scoring, and Borda rank fusion. Starting from a CLIP baseline (26.7% Top-1 on English dev, 6.7% on English test), adding idiom-aware paraphrasing and explicit sentence-type classification raised performance to 60.0% Top-1 on English and 60.0% Top-1 (0.822 NDCG@5) in zero-shot transfer to Portuguese. On the multilingual blind test, our systems achieved average Top-1/NDCG scores of 0.35/0.73 for Subtask A and 0.32/0.71 for Subtask B across 15 languages. Ablation results identify idiom-aware rewriting as the main contributor to performance, while sentence-type prediction and multimodal fusion improve robustness. These findings suggest that effective idiom disambiguation is feasible without fine-tuning large multimodal encoders.
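For readers unfamiliar with the two reported metrics, the following is a minimal sketch of Top-1 accuracy and NDCG@k in their textbook form. It assumes linear gain and a log2 position discount; the shared task's official scorer may use a different gain function, and the function names and toy data here are hypothetical.

```python
import math

def top1_accuracy(predicted, gold):
    """Fraction of items whose top-ranked candidate matches the gold top candidate."""
    hits = sum(1 for p, g in zip(predicted, gold) if p[0] == g[0])
    return hits / len(gold)

def ndcg_at_k(predicted, relevance, k=5):
    """NDCG@k with linear gain (an assumption; not confirmed as the task's scorer).

    predicted: candidate ids, best first.
    relevance: dict mapping candidate id -> graded relevance (higher is better).
    """
    dcg = sum(relevance.get(c, 0.0) / math.log2(i + 2)
              for i, c in enumerate(predicted[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: three candidates with graded relevance.
rel = {"a": 3, "b": 2, "c": 1}
perfect = ndcg_at_k(["a", "b", "c"], rel)
swapped = ndcg_at_k(["c", "b", "a"], rel)
acc = top1_accuracy([["a", "b"], ["b", "a"]], [["a", "b"], ["a", "b"]])
```

A perfect ordering scores exactly 1, and any ordering that places a less relevant candidate first scores lower, which is why NDCG can remain fairly high even when Top-1 accuracy is modest.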

Paper Structure (9 sections, 1 figure, 3 tables)

This paper contains 9 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Overview of the final PolyFrame pipeline: sentence typing via logistic regression, idiom replacement for idiomatic cases, three zero-shot similarity streams with SigLIP2 and BGE M3, and Borda fusion.
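
The final fusion step in the figure can be illustrated with a generic Borda count. This is a sketch of the standard technique, not the authors' implementation: the stream contents, candidate ids, and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import defaultdict

def borda_fuse(rankings):
    """Fuse several best-first rankings of the same candidates with Borda counts.

    Each ranking awards n-1 points to its top candidate, n-2 to the next,
    and so on down to 0 for the last of n candidates.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    # Highest total first; ties broken by candidate id (an arbitrary choice).
    return sorted(scores, key=lambda c: (-scores[c], c))

# Hypothetical rankings from three similarity streams over five candidates.
streams = [
    ["img3", "img1", "img2", "img5", "img4"],
    ["img1", "img3", "img5", "img2", "img4"],
    ["img3", "img2", "img1", "img4", "img5"],
]
fused = borda_fuse(streams)
print(fused)
```

Because Borda fusion uses only rank positions, similarity scores from different encoders never need to be calibrated against each other, which suits a setup where the encoders stay frozen.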