SciLT: Long-Tailed Classification in Scientific Image Domains

Jiahao Chen, Bing Su

Abstract

Long-tailed recognition has benefited from foundation models and fine-tuning paradigms, yet existing studies and benchmarks are mainly confined to natural image domains, where pre-training and fine-tuning data share similar distributions. In contrast, scientific images exhibit distinct visual characteristics and supervision signals, raising questions about the effectiveness of fine-tuning foundation models in such settings. In this work, we investigate scientific long-tailed recognition under a purely visual and parameter-efficient fine-tuning (PEFT) paradigm. Experiments on three scientific benchmarks show that fine-tuning foundation models yields limited gains, and reveal that penultimate-layer features play an important role, particularly for tail classes. Motivated by these findings, we propose SciLT, a framework that exploits multi-level representations through adaptive feature fusion and dual-supervision learning. By jointly leveraging penultimate- and final-layer features, SciLT achieves balanced performance across head and tail classes. Extensive experiments demonstrate that SciLT consistently outperforms existing methods, establishing a strong and practical baseline for scientific long-tailed recognition and providing valuable guidance for adapting foundation models to scientific data with substantial domain shifts.
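The abstract's core idea, fusing penultimate- and final-layer features adaptively and supervising two classifier heads jointly, could be sketched roughly as follows. This is a minimal illustration under assumed names: `scilt_head`, `dual_ce`, the scalar fusion `gate`, and the loss weight `lam` are all hypothetical and not taken from the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scilt_head(z_pen, z_fin, W1, W2, gate):
    """Adaptive fusion of penultimate (z_pen) and final (z_fin) features,
    plus two classifier heads (illustrative sketch, not the paper's code)."""
    a = softmax(gate)                      # learnable fusion weights, here a 2-vector
    fused = a[0] * z_pen + a[1] * z_fin    # convex combination of the two feature levels
    s1 = softmax(z_pen @ W1)               # classifier1: penultimate-layer features
    s2 = softmax(fused @ W2)               # classifier2: fused features
    return s1, s2

def dual_ce(s1, s2, y, lam=0.5):
    """Dual-supervision objective: weighted sum of two cross-entropy losses."""
    n = len(y)
    ce1 = -np.log(s1[np.arange(n), y] + 1e-12).mean()
    ce2 = -np.log(s2[np.arange(n), y] + 1e-12).mean()
    return lam * ce1 + (1 - lam) * ce2
```

In this sketch the fusion weights and both heads would be trained end-to-end; the paper's actual fusion mechanism and loss weighting may differ.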

Paper Structure

This paper contains 44 sections, 6 theorems, 36 equations, 7 figures, 12 tables.

Key Result

Lemma 5.1

(Bartlett & Mendelson, 2002) The hypothesis class of SciLT, $\mathcal{F}_{\text{SciLT}} = \mathcal{F}_{N-1} + \mathcal{F}_{N}$, satisfies $\mathfrak{R}_S(\mathcal{F}_{\text{SciLT}}) \le \mathfrak{R}_S(\mathcal{F}_{N-1}) + \mathfrak{R}_S(\mathcal{F}_{N})$, where $\mathfrak{R}_S(\cdot)$ denotes the empirical Rademacher complexity computed on a training set $S$ of $n$ samples drawn i.i.d. from the data distribution $\mathcal{D}$.
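For reference, this is an instance of the standard subadditivity of empirical Rademacher complexity for sums of hypothesis classes (Bartlett & Mendelson, 2002); a one-line sketch of the argument, with $\sigma_i$ the i.i.d. Rademacher variables:

```latex
\mathfrak{R}_S(\mathcal{F}_{N-1} + \mathcal{F}_{N})
= \frac{1}{n}\,\mathbb{E}_{\sigma}\Big[\sup_{f \in \mathcal{F}_{N-1},\, g \in \mathcal{F}_{N}} \sum_{i=1}^{n} \sigma_i \big(f(x_i) + g(x_i)\big)\Big]
\le \frac{1}{n}\,\mathbb{E}_{\sigma}\Big[\sup_{f \in \mathcal{F}_{N-1}} \sum_{i=1}^{n} \sigma_i f(x_i)\Big]
 + \frac{1}{n}\,\mathbb{E}_{\sigma}\Big[\sup_{g \in \mathcal{F}_{N}} \sum_{i=1}^{n} \sigma_i g(x_i)\Big]
= \mathfrak{R}_S(\mathcal{F}_{N-1}) + \mathfrak{R}_S(\mathcal{F}_{N}).
```

The inequality holds because taking the supremum over $f$ and $g$ separately can only enlarge the value relative to a joint supremum over the coupled sum.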

Figures (7)

  • Figure 1: Differences in fine-tuning foundation models for downstream tasks on natural and scientific images. Fine-tuning achieves strong generalization performance on natural images (highlighted in blue), whereas its effectiveness on scientific images (highlighted in green) remains underexplored, due to the discrepancy in visual characteristics between scientific and natural image domains.
  • Figure 2: Relative gain on (a) Places365-LT and (b) iNaturalist2018 datasets with "Many", "Medium", and "Few" classes, respectively. RAC and LPT results on iNaturalist2018 are partially unavailable and therefore not fully plotted.
  • Figure 3: Per-class performance curves on (a) NIH-Chest and (b) ISIC for models trained with CE and LA, both from scratch and via fine-tuning. Class indices are sorted by the number of samples per class. Curves are smoothed for better visualization.
  • Figure 4: The main architecture of SciLT. $z_{N-2}$, $s_1$, and $s_2$ denote the input hidden feature of the penultimate layer and the predictions of classifier1 and classifier2, respectively.
  • Figure 5: (caption unavailable)
  • ...and 2 more figures

Theorems & Definitions (9)

  • Lemma 5.1
  • Lemma 5.2
  • Theorem 5.3
  • Lemma 4.1 (restated)
  • Proof of Lemma 4.1
  • Lemma 4.2 (restated)
  • Proof of Lemma 4.2
  • Theorem 4.3 (restated)
  • Proof of Theorem 4.3