
Teach CLIP to Develop a Number Sense for Ordinal Regression

Yao Du, Qiang Zhai, Weihang Dai, Xiaomeng Li

TL;DR

This work studies how well CLIP generalises to ordinal regression and traces its weaknesses to insufficient number-related pretraining. It introduces NumCLIP, a coarse-to-fine framework that maps numbers to linguistic concepts and applies a fine-grained cross-modal ranking regulariser to preserve both semantic and ordinal structure in CLIP's feature space. Empirically, NumCLIP reports results competitive with the state of the art on three benchmarks: an MAE of $2.08$ on MORPH II, an accuracy of $69.61\%$ on historical image dating, and an MAE of $0.23$ with an overall accuracy of $76.53\%$ on image aesthetics assessment, together with strong few-shot performance. The approach shows that language priors combined with ordinal-distance-aware regularisation can extend cross-modal contrastive learning to number-sensitive, ordered prediction tasks such as age estimation, historical dating, and aesthetics assessment.

Abstract

Ordinal regression is a fundamental problem in computer vision, typically addressed with customised models trained for specific tasks. While pre-trained vision-language models (VLMs) have exhibited impressive performance on various vision tasks, their potential for ordinal regression has received less attention. In this study, we first investigate CLIP's potential for ordinal regression, expecting that the model could generalise to different ordinal regression tasks and scenarios. Unfortunately, vanilla CLIP fails on this task, since current VLMs have a well-documented limitation in capturing compositional concepts such as number sense. We propose a simple yet effective method called NumCLIP to improve the quantitative understanding of VLMs. We decompose the exact image-to-number-specific-text matching problem into coarse classification and fine prediction stages. We discretize the numerical range and phrase each bin with a common language concept to better leverage the available pre-trained alignment in CLIP. To account for the inherently continuous nature of ordinal regression, we propose a novel fine-grained cross-modal ranking-based regularisation loss specifically designed to maintain both semantic and ordinal alignment in CLIP's feature space. Experimental results on three general ordinal regression tasks demonstrate the effectiveness of NumCLIP, with 10% and 3.83% accuracy improvements on the historical image dating and image aesthetics assessment tasks, respectively. Code is publicly available at https://github.com/xmed-lab/NumCLIP.
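The coarse-to-fine idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the NumCLIP code: the bin edges, the phrases, the function name `predict`, and the choice of a softmax-weighted expectation for the fine stage. In practice the coarse scores would come from image-text similarities; here they are plain arrays.

```python
import numpy as np

# Hypothetical age bins and phrases (illustrative only; not the paper's bins).
BINS = [
    (0, 12, "child"),
    (13, 19, "teenager"),
    (20, 39, "young adult"),
    (40, 64, "middle-aged adult"),
    (65, 100, "senior"),
]

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def predict(coarse_scores, fine_scores_per_bin):
    """Coarse stage: pick the bin (language concept) with the highest score.
    Fine stage: expected value over the integers inside that bin,
    weighted by a softmax of that bin's fine scores."""
    k = int(np.argmax(coarse_scores))
    lo, hi, phrase = BINS[k]
    values = np.arange(lo, hi + 1)
    probs = softmax(np.asarray(fine_scores_per_bin[k], dtype=float))
    return phrase, float(probs @ values)

# Usage: the third bin wins the coarse stage; uniform fine scores
# give the midpoint of that bin's range.
coarse = np.array([0.1, 0.2, 0.9, 0.3, 0.0])
fine = [np.zeros(hi - lo + 1) for lo, hi, _ in BINS]
phrase, value = predict(coarse, fine)
```

Splitting the prediction this way means the coarse stage only has to match an image against a handful of natural-language concepts, which is closer to what CLIP saw during pretraining than matching against raw numbers.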

Paper Structure (25 sections, 6 equations, 4 figures, 6 tables)


Figures (4)

  • Figure 1: The framework of NumCLIP, aiming to teach CLIP to develop a strong number sense for ordinal regression. Replacing pure numbers with common language descriptions allows better use of the pre-training knowledge, and cross-modal ranking-based feature regularisation ensures both semantic and ordinal alignment.
  • Figure 2: Mimicking human numerical cognition: mapping an image feature to a language concept first, and then reasoning about the number.
  • Figure 3: Fine-grained cross-modal ranking-based feature regularisation. The cross-modal negative samples are pushed away with ordinal label distance alignment.
  • Figure 4: t-SNE visualisation of the 512-D image features of CLIP on MORPH II.
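
The idea in the Figure 3 caption, pushing cross-modal negatives away according to ordinal label distance, can be sketched as a distance-weighted contrastive loss. This is only an illustrative stand-in and not the paper's loss: the function name `ordinal_weighted_contrastive`, the weighting `1 + normalised distance`, and the temperature `tau` are all assumptions.

```python
import numpy as np

def ordinal_weighted_contrastive(img, txt, labels, tau=0.07):
    """Image-to-text contrastive loss in which each negative pair is
    up-weighted by its ordinal label distance (hypothetical weighting)."""
    img = np.asarray(img, dtype=float)
    txt = np.asarray(txt, dtype=float)
    labels = np.asarray(labels, dtype=float)

    # Cosine similarities between every image and every text, scaled by tau.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = img @ txt.T / tau

    # Weight in [1, 2]: 1 for identical labels, 2 for the farthest pair.
    dist = np.abs(labels[:, None] - labels[None, :])
    w = 1.0 + dist / max(dist.max(), 1.0)

    # Weighted softmax: adding log(w) multiplies each exp(sim) term by w,
    # so distant-label negatives are penalised more in the denominator.
    logits = sim + np.log(w)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Usage: identical features, with and without ordinal label differences.
feats = np.eye(3)
loss_flat = ordinal_weighted_contrastive(feats, feats, [5, 5, 5], tau=1.0)
loss_ord = ordinal_weighted_contrastive(feats, feats, [0, 1, 2], tau=1.0)
```

With all labels equal, the weights collapse to 1 and the loss reduces to a standard softmax contrastive term; with distinct labels, the same features incur a larger loss because distant-label negatives count more.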