CLIP-Guided Multi-Task Regression for Multi-View Plant Phenotyping

Simon Warmers, Muhammad Zawish, Fayaz Ali Dharejo, Steven Davy, Radu Timofte

TL;DR

A level-aware vision-language framework that jointly predicts plant age and leaf count with a single multi-task model built on CLIP embeddings. It replaces the conventional dual-model setup, simplifying the pipeline while improving robustness to missing views.

Abstract

Modeling plant growth dynamics plays a central role in modern agricultural research. However, learning robust predictors from multi-view plant imagery remains challenging due to strong viewpoint redundancy and viewpoint-dependent appearance changes. We propose a level-aware vision-language framework that jointly predicts plant age and leaf count using a single multi-task model built on CLIP embeddings. Our method aggregates rotational views into angle-invariant representations and conditions visual features on lightweight text priors that encode viewpoint level, which stabilizes prediction under incomplete or unordered inputs. On the GroMo25 benchmark, our approach reduces mean age MAE from 7.74 to 3.91 and mean leaf-count MAE from 5.52 to 3.08 compared to the GroMo baseline, corresponding to improvements of 49.5% and 44.2%, respectively. The unified formulation simplifies the pipeline by replacing the conventional dual-model setup while improving robustness to missing views. The models and code are available at: https://github.com/SimonWarmers/CLIP-MVP
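To make the aggregation idea concrete, the following is a minimal sketch of how rotational views could be pooled into an order-independent representation and passed to a shared two-output head. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the pooling operator, so mean pooling is assumed here, and `pool_views`, `predict`, the toy 2-dimensional embeddings, and the linear head weights are all hypothetical.

```python
from typing import List, Optional, Sequence


def pool_views(views: Sequence[Optional[List[float]]]) -> List[float]:
    """Average the embeddings of the views that are present.

    A missing view is passed as None and is skipped. Mean pooling is an
    assumption made for this sketch; it ignores view order by construction.
    """
    present = [v for v in views if v is not None]
    if not present:
        raise ValueError("at least one view is required")
    dim = len(present[0])
    return [sum(v[i] for v in present) / len(present) for i in range(dim)]


def predict(pooled: List[float],
            weights: List[List[float]],
            bias: List[float]) -> List[float]:
    """Hypothetical linear multi-task head: one weight row per task.

    Row 0 stands in for plant age, row 1 for leaf count.
    """
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]


# Toy 2-dimensional "embeddings" for three rotational views of one plant.
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shuffled = [views[2], views[0], views[1]]
with_missing = [views[0], None, views[2]]

pooled_full = pool_views(views)
pooled_shuffled = pool_views(shuffled)      # same result as pooled_full
pooled_missing = pool_views(with_missing)   # still defined with a view dropped

age, leaf_count = predict(pooled_missing, [[2.0, 0.0], [0.0, 4.0]], [1.0, 0.0])
```

Because the pooled vector does not depend on how many views arrive or in what order, a single head of this kind can be evaluated on incomplete inputs without changing its interface.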

Paper Structure (16 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: The proposed preprocessing pipeline based on Grounding DINO [liu2024grounding] and CLIP [chen2022contrastive], which encodes images into rich visual embeddings for downstream tasks.
  • Figure 2: The proposed multi-task pipelines: (a) the unimodal baseline based on CLIP vision embeddings; (b) CLIP conditioned on level priors for multimodal regression.
  • Figure 3: MAE of both approaches as a function of the percentage of viewpoint images removed at runtime.
  • Figure 4: Overall degradation from 0% removal to only 1 image remaining.
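The view-removal experiment summarized in Figures 3 and 4 can be sketched as follows. This is a hedged illustration of the general protocol only: the sampling scheme, the number of views per plant, and the helper names `drop_views`, `mae`, and `mae_under_removal` are assumptions for this sketch, and the toy model simply averages scalar "views" in place of a real regressor.

```python
import random
from typing import Callable, List, Sequence


def drop_views(views: Sequence[float], fraction: float,
               rng: random.Random) -> List[float]:
    """Randomly remove a fraction of views, always keeping at least one."""
    n_keep = max(1, round(len(views) * (1.0 - fraction)))
    return rng.sample(list(views), n_keep)


def mae(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean absolute error between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


def mae_under_removal(model: Callable[[List[float]], float],
                      plants: Sequence[Sequence[float]],
                      targets: Sequence[float],
                      fraction: float,
                      seed: int = 0) -> float:
    """MAE of `model` when a fraction of each plant's views is removed."""
    rng = random.Random(seed)
    preds = [model(drop_views(v, fraction, rng)) for v in plants]
    return mae(preds, targets)


def toy_model(views: List[float]) -> float:
    """Stand-in model: the prediction is the mean of the remaining views."""
    return sum(views) / len(views)


# Two hypothetical plants with 8 identical scalar views each.
plants = [[10.0] * 8, [20.0] * 8]
targets = [12.0, 19.0]

# Sweep from no removal down to a single remaining view per plant.
curve = {f: mae_under_removal(toy_model, plants, targets, f)
         for f in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

With identical views the toy curve is flat by construction; with a real model, the slope of such a curve is what indicates how gracefully performance degrades as views are removed.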