Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints -- Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers

Hiroki Kagiyama, Toru Nagasaka, Yukari Adachi, Takaaki Tachibana, Ryota Ito, Mitsugu Fujita, Kimihiro Yamashita, Yoshihiro Kakeji

TL;DR

For cell-level classification under extreme spatial constraints, task-specific architectures are more effective and efficient than foundation models once sufficient training data are available, and large pre-trained models offer limited benefit in the small-patch regime.

Abstract

Background and objective: Cell-level pathological image analysis requires working with extremely small image patches (40x40 pixels), far below standard ImageNet resolutions. It remains unclear whether modern deep learning architectures and foundation models can learn robust and scalable representations under this constraint. We systematically evaluated architectural suitability and data-scale effects for small-patch cell classification.

Methods: We analyzed 303 colorectal cancer specimens with CD103/CD8 immunostaining, generating 185,432 annotated cell images. Eight task-specific architectures were trained from scratch at multiple data scales (FlagLimit: 256 to 16,384 samples per class), and three foundation models were evaluated via linear probing and fine-tuning after resizing inputs to 224x224 pixels. Robustness to blur was assessed using pre- and post-resize Gaussian perturbations.

Results: Task-specific models improved consistently with increasing data scale, whereas foundation models saturated at moderate sample sizes. A Vision Transformer optimized for small patches (CustomViT) achieved the highest accuracy, outperforming all foundation models with substantially lower inference cost. Blur robustness was comparable across architectures, with no qualitative advantage observed for foundation models.

Conclusion: For cell-level classification under extreme spatial constraints, task-specific architectures are more effective and efficient than foundation models once sufficient training data are available. Higher clean accuracy does not imply superior robustness, and large pre-trained models offer limited benefit in the small-patch regime.
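
To make the distinction between pre- and post-resize Gaussian perturbations concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a blur applied before resizing is matched to a blur applied afterwards by scaling the standard deviation with the resize ratio. The function names, the `source_size` parameter, and the 3-sigma kernel truncation are illustrative assumptions; the abstract does not specify how the correction is computed.

```python
import math


def gaussian_kernel_1d(sigma, truncate=3.0):
    """Return a normalized 1-D Gaussian kernel.

    sigma is measured in pixels of the image the kernel is applied to.
    The kernel is cut off at truncate * sigma (an assumed convention).
    """
    if sigma <= 0:
        return [1.0]  # no blur: identity kernel
    radius = max(1, int(math.ceil(truncate * sigma)))
    weights = [
        math.exp(-(x * x) / (2.0 * sigma * sigma))
        for x in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def pre_resize_sigma(sigma_at_target, source_size, target_size=40):
    """Scale sigma so that a blur applied BEFORE resizing covers the same
    relative extent as sigma_at_target applied AFTER resizing.

    source_size is a hypothetical side length (in pixels) of the crop
    before it is resized to target_size.
    """
    return sigma_at_target * (source_size / target_size)
```

For instance, under these assumptions a post-resize blur of sigma = 1.6 on a 40-pixel patch would correspond to sigma = 3.2 on a hypothetical 80-pixel source crop, because the source has twice as many pixels per unit of tissue.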

Paper Structure (28 sections, 3 equations, 11 figures, 5 tables)

This paper contains 28 sections, 3 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: A schematic phylogenetic overview of major deep learning architectures. Early multilayer perceptrons gave rise to convolutional neural networks, which evolved through deeper residual designs and efficiency-oriented variants. In parallel, recurrent models led to attention-based transformers, from which Vision Transformer emerged as a direct adaptation to visual tasks. The diagram also illustrates how transformer design principles subsequently influenced modern convolutional architectures such as ConvNeXt, reflecting cross-paradigm convergence in recent model development.
  • Figure 2: Trade-off between inference speed and classification performance at flag_limit = 4096. Each point represents a model's macro-F1 score (y-axis) versus its inference time per patch (x-axis). CustomViT achieves the highest classification performance while maintaining inference costs more than an order of magnitude lower than foundation models such as UNI. Task-specific convolutional models (CNN, ResNet, NIN, SE-ResNet) offer sub-millisecond to few-millisecond inference but plateau at substantially lower performance. Foundation models (ResNet-RS50, CTransPath, UNI) incur markedly higher inference latency, with UNI requiring approximately 25 ms per patch despite strong fine-tuning performance. This demonstrates that CustomViT provides a favorable balance between accuracy and computational efficiency for large-scale patch-based analysis.
  • Figure 3: Scaling behavior of classification performance with increasing training set size (flag_limit). Macro-F1 scores under clean conditions are shown for task-specific models and foundation models with fine-tuning of the last layer (FT_last). CustomViT exhibits monotonic performance gains and surpasses foundation models at moderate data scales, whereas conventional convolutional networks saturate. EfficientNet shows early gains but was not evaluated at the largest scale due to excessive training time.
  • Figure 4: Robustness to strong blur ($\sigma = 1.6$) under a fixed annotation budget (flag_limit = 4096). Macro-F1 scores are shown for clean images (blue), post-resize blur (orange), and pre-resize blur (green). Blur is applied either after resizing to $40 \times 40$ pixels (post) or before resizing with resolution-corrected strength (pre). All models show pronounced performance degradation at this severity, with comparable sensitivity to pre- and post-resize blur.
  • Figure 5: Accuracy degradation rate with increasing blur severity. Bars indicate the slope of accuracy change per unit increase in $\log_2(1+\sigma)$, with negative values representing faster performance degradation. Pre-resize (optical) and post-resize (digital) blur are shown separately.
  • ...and 6 more figures
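
The degradation rate described for Figure 5 (accuracy change per unit increase in $\log_2(1+\sigma)$) can be illustrated with a short sketch. It assumes the slope is obtained by an ordinary least-squares line fit, which the caption does not state; the function name `degradation_slope` is hypothetical.

```python
import math


def degradation_slope(sigmas, accuracies):
    """Least-squares slope of accuracy against log2(1 + sigma).

    A negative return value means accuracy falls as blur increases.
    Assumes at least two distinct sigma values.
    """
    xs = [math.log2(1.0 + s) for s in sigmas]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    # Closed-form simple linear regression: slope = S_xy / S_xx
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    return s_xy / s_xx
```

The $\log_2(1+\sigma)$ transform maps the clean condition ($\sigma = 0$) to zero and compresses large blur levels, so a single slope summarizes the whole severity range. Computing the slope separately for pre-resize and post-resize blur would yield the two bars per model that the caption describes.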