OTTER: Open-Tagging via Text-Image Representation for Multi-modal Understanding

Jieer Ouyang, Xiaoneng Xiang, Zheng Wang, Yangkai Ding

TL;DR

OTTER addresses open-set multi-modal tagging by combining a stable predefined taxonomy with flexible, user-defined open labels. It introduces a two-tier tagging framework and a large, hierarchically organized dataset annotated through a hybrid pipeline of automated vision-language labeling and human refinement, and it trains a multi-head attention model to align fixed and open-set label embeddings with fused visual-text representations. The method outperforms competitive baselines on the Otter and Favorite datasets, with overall F1 of 0.81 and 0.75 and open-set F1 of 0.99 and 0.97, while remaining competitive on predefined tags. These results point to practical potential for scalable, personalized tagging of diverse multimedia collections and provide a baseline for future open-set multi-modal classification research.

Abstract

We introduce OTTER, a unified open-set multi-label tagging framework that harmonizes the stability of a curated, predefined category set with the adaptability of user-driven open tags. OTTER is built upon a large-scale, hierarchically organized multi-modal dataset, collected from diverse online repositories and annotated through a hybrid pipeline combining automated vision-language labeling with human refinement. By leveraging a multi-head attention architecture, OTTER jointly aligns visual and textual representations with both fixed and open-set label embeddings, enabling dynamic and semantically consistent tagging. OTTER consistently outperforms competitive baselines on two benchmark datasets: it achieves an overall F1 score of 0.81 on Otter and 0.75 on Favorite, surpassing the next-best results by margins of 0.10 and 0.02, respectively. OTTER attains near-perfect performance on open-set labels, with F1 of 0.99 on Otter and 0.97 on Favorite, while maintaining competitive accuracy on predefined labels. These results demonstrate OTTER's effectiveness in bridging closed-set consistency with open-vocabulary flexibility for multi-modal tagging applications.
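
As a rough illustration of the attention design summarized in the Figure 2 caption, the sketch below uses label embeddings as queries over keys and values formed by summing visual and textual features, then applies average pooling and a sigmoid per label. It is a minimal sketch in plain Python, not the authors' implementation. The function name `label_attention_probs`, the absence of learned projection matrices, the equal token counts for both modalities, and the choice to pool each label's output vector down to a single logit are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def label_attention_probs(label_emb, visual, textual, num_heads=2):
    """Return one independent probability per label.

    label_emb: L x d label embeddings (fixed and open-set), used as queries.
    visual, textual: N x d features, assumed to share one embedding space.
    """
    d = len(label_emb[0])
    assert d % num_heads == 0, "embedding size must divide evenly across heads"
    head_dim = d // num_heads

    # Fuse the modalities by element-wise sum; the result is both keys and values.
    fused = [[v + t for v, t in zip(v_row, t_row)]
             for v_row, t_row in zip(visual, textual)]

    probs = []
    for q in label_emb:
        attended = []
        for h in range(num_heads):
            lo, hi = h * head_dim, (h + 1) * head_dim
            # Scaled dot-product scores between this label and every fused token.
            scores = [
                sum(a * b for a, b in zip(q[lo:hi], k[lo:hi])) / math.sqrt(head_dim)
                for k in fused
            ]
            weights = softmax(scores)
            # Weighted sum of the value slice belonging to this head.
            attended.extend(
                sum(w * k[j] for w, k in zip(weights, fused)) for j in range(lo, hi)
            )
        # Assumption: average pooling reduces the attended vector to one logit.
        logit = sum(attended) / len(attended)
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


# Usage with random placeholder features (hypothetical sizes).
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


labels = rand_matrix(8, 4)    # e.g. six fixed plus two open-set label embeddings
visual = rand_matrix(5, 4)    # five visual tokens
textual = rand_matrix(5, 4)   # five text tokens, same shape as the visual ones
probs = label_attention_probs(labels, visual, textual)
```

Because the final layer is a sigmoid and not a softmax, each label receives its own probability, which matches the multi-label setting: several fixed and open tags can be active for the same input.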

Paper Structure

This paper contains 23 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Illustration of the OTTER framework. An input image, a wedding photo, is tagged with both a fixed category (Life Moments) and a personalized open label (Wedding Planning). OTTER enables accurate assignment within a predefined category space while flexibly accommodating user-specific custom tags.
  • Figure 2: Illustration of the OTTER model architecture. The model employs a multi-head attention mechanism in which fixed and open-set label embeddings, encoded by a shared text encoder, serve as queries attending over fused visual and textual features derived from image or text inputs. Visual features from a vision backbone and textual features from OCR-based keyword extraction or direct text processing are aligned in a shared embedding space, summed to form keys and values, and processed through attention, adaptive average pooling, and a sigmoid layer to yield independent label probabilities.
  • Figure 3: Illustration of the OTTER training strategy. Each training input combines the six fixed predefined labels with a set of open-set labels: the true open labels plus labels sampled from a global label pool. The sampled labels differ from the ground truth and serve as negative examples.
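
The training-input construction described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. The helper name `build_label_inputs`, the number of sampled negatives, the shuffling of open labels, and every label name other than "Life Moments" and "Wedding Planning" (which appear in Figure 1) are hypothetical.

```python
import random


def build_label_inputs(fixed_labels, true_fixed, true_open, label_pool,
                       num_negatives, rng):
    """Build the candidate label list and binary targets for one example.

    fixed_labels: the predefined categories, always included in the input.
    true_fixed / true_open: ground-truth fixed and open labels for the example.
    label_pool: global pool of open labels from which negatives are sampled.
    """
    truth = set(true_fixed) | set(true_open)
    # Negatives must differ from the ground truth (and from the fixed labels).
    negative_pool = sorted(set(label_pool) - truth - set(fixed_labels))
    negatives = rng.sample(negative_pool, min(num_negatives, len(negative_pool)))

    # Assumption: shuffle open labels so position does not reveal which are true.
    open_labels = list(true_open) + negatives
    rng.shuffle(open_labels)

    labels = list(fixed_labels) + open_labels
    targets = [1 if label in truth else 0 for label in labels]
    return labels, targets


# Usage with placeholder names; only the first fixed label and the true open
# label come from the paper's Figure 1, the rest are invented for illustration.
fixed = ["Life Moments", "Fixed B", "Fixed C", "Fixed D", "Fixed E", "Fixed F"]
pool = ["Wedding Planning", "Recipe Ideas", "Travel Notes", "Gym Routine"]
labels, targets = build_label_inputs(
    fixed_labels=fixed,
    true_fixed=["Life Moments"],
    true_open=["Wedding Planning"],
    label_pool=pool,
    num_negatives=2,
    rng=random.Random(0),
)
```

Each resulting target vector pairs with the per-label sigmoid outputs of the model, so a binary cross-entropy style loss over all candidate labels would be a natural fit, although the captions shown here do not state which loss is used.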