FruitNeRF++: A Generalized Multi-Fruit Counting Method Utilizing Contrastive Learning and Neural Radiance Fields

Lukas Meyer, Andrei-Timotei Ardelean, Tim Weyrich, Marc Stamminger

TL;DR

FruitNeRF++ addresses the need for general, shape-agnostic fruit counting from multi-view orchard imagery by integrating a neural radiance field with a neural instance field learned through contrastive training. It leverages vision foundation models to obtain per-fruit instance masks and fuses RGB, semantic, and instance information into a 3D representation that can be clustered to yield counts without fruit-specific templates. The approach introduces a cascaded training scheme, a tailored contrastive loss with fruit prototypes, and a two-stage multi-modal clustering pipeline (spatial partitioning followed by HDBSCAN) to achieve robust counting across diverse fruit types. Evaluation on synthetic data (six fruits) and a real Fuji dataset demonstrates favorable performance and practical advantages over prior, more specialized methods, with insights into embedding size, temperature, and distance weighting that guide practical deployment.

Abstract

We introduce FruitNeRF++, a novel fruit-counting approach that combines contrastive learning with neural radiance fields to count fruits from unstructured input photographs of orchards. Our work is based on FruitNeRF, which employs a neural semantic field combined with a fruit-specific clustering approach. Because that clustering must be adapted to each fruit type, FruitNeRF has limited applicability and is difficult to use in practice. To lift this limitation, we design a shape-agnostic multi-fruit counting framework that complements the RGB and semantic data with instance masks predicted by a vision foundation model. The masks are used to encode the identity of each fruit as instance embeddings in a neural instance field. By volumetrically sampling the neural fields, we extract a point cloud embedded with the instance features, which can be clustered in a fruit-agnostic manner to obtain the fruit count. We evaluate our approach using a synthetic dataset containing apples, plums, lemons, pears, peaches, and mangoes, as well as a real-world benchmark apple dataset. Our results demonstrate that FruitNeRF++ is easier to control and compares favorably to other state-of-the-art methods.
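To make the idea of clustering a point cloud that carries instance features more concrete, here is a minimal NumPy sketch. It is not the authors' pipeline. It assumes a pairwise distance that mixes Euclidean distance between 3D positions (weight `lam_e`) with cosine distance between instance embeddings (weight `lam_c`); both names and their meaning are assumptions made for illustration. To stay dependency-free, a simple threshold-based connected-components grouping is swapped in for HDBSCAN, and the spatial partitioning stage is omitted.

```python
import numpy as np

def combined_distance(xyz, emb, lam_e=1.0, lam_c=1.0):
    """Pairwise distance mixing 3D position and instance embedding.

    Assumption (not from the paper): lam_e scales the Euclidean distance
    between positions, lam_c scales the cosine distance between embeddings.
    """
    d_euc = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    d_cos = 1.0 - unit @ unit.T
    return lam_e * d_euc + lam_c * d_cos

def count_groups(dist, threshold):
    """Count connected components of the graph that links points closer
    than `threshold`. A deliberately simple stand-in for HDBSCAN."""
    n = dist.shape[0]
    labels = -np.ones(n, dtype=int)
    count = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue  # already assigned to an earlier group
        labels[seed] = count
        stack = [seed]
        while stack:
            i = stack.pop()
            neighbours = np.flatnonzero((dist[i] < threshold) & (labels == -1))
            labels[neighbours] = count
            stack.extend(neighbours.tolist())
        count += 1
    return count, labels

# Toy data: two touching "fruits" (4 points) with distinct embeddings.
xyz = np.array([[0.00, 0, 0], [0.01, 0, 0], [0.05, 0, 0], [0.06, 0, 0]])
emb = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]])

n_spatial_only, _ = count_groups(combined_distance(xyz, emb, 1.0, 0.0), 0.2)
n_multi_modal, labels = count_groups(combined_distance(xyz, emb, 1.0, 1.0), 0.2)
print(n_spatial_only, n_multi_modal, labels)
```

In this toy case the position-only distance merges the two touching fruits, while adding the embedding term keeps them apart, which is the motivation for clustering on instance features rather than on geometry alone.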

Paper Structure

This paper contains 21 sections, 9 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Rendering of RGB, semantic and instance images. For visualization of the results visit the project page: https://meyerls.github.io/fruit_nerfpp.
  • Figure 2: Pipeline of FruitNeRF++. For the input images, we recover both intrinsic and extrinsic camera parameters. We then extract semantic and instance masks for arbitrary fruit types using SAM [kirillov2023segany] and Detic [detic]. The data are used to train a neural radiance field with a neural appearance field, a semantic (fruit) field, and an instance field. By clustering the combined fruit and instance point cloud, we obtain a precise fruit count.
  • Figure 3: Overview of the FruitNeRF++ architecture, split into four components: density field, appearance field, fruit field, and instance field. The density field encodes the volume density $\sigma$, the appearance field the color $\textit{RGB}$, the fruit field the semantic information about the fruit in space, and the instance field a feature vector $\boldsymbol{i}$ encoding the instance group of a point in space. The dashed arrows indicate the flow direction of the gradient, and their colors correspond to the individual terms of the loss function in Eq. \ref{eq:loss_func}. The fields are trained with a cascaded scheme: first the density and $\textit{RGB}$ fields are trained alone, then the semantic fruit field is activated, and finally all three fields are frozen and only the instance field is trained. The figure is adapted from Özer et al. [thermalnerf].
  • Figure 4: Panel (a) visualizes the concept of local and global negatives. Local negatives are a collection of multiple fruits in close proximity (see Sec. \ref{ssec:pixelsampler}). By selecting these hard negatives, we force the features of neighbouring fruits to be distinct, which facilitates their separation during the clustering stage. Global (weak) negatives are then used to separate distant fruits. Panel (b) visualizes both pixel sampling and local negatives in detail. The pixels from the orange center are denoted as positive and all others as negative. By computing the contrastive loss between every pixel's feature vector and the mean feature vector, we attract positive features (yellow pixels) and repel negative features (pink pixels).
  • Figure 5: Left: counting results for varying embedding size $D$, with temperature $\tau = 0.2$, $\lambda_e=0$, and $\lambda_c=1$ to isolate the impact of the embedding size. Middle: sweep over the temperature $\tau$ with $\lambda_e = 1$ and $\lambda_c = 1$. Right: clustering performance for a parameter sweep over $\lambda_e$ with $\lambda_c=1$. The red area marks the feasible range for the parameters $D$ and $\tau$.
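
The attract/repel mechanism described in the Figure 4 caption can be illustrated with a small NumPy sketch. The paper's exact loss is not reproduced in this excerpt, so the code below is a generic InfoNCE-style loss against a mean-feature prototype, not the authors' formulation. The function name `prototype_contrastive_loss`, the use of cosine similarity, and the toy feature values are assumptions; only the temperature $\tau = 0.2$ is taken from the Figure 5 caption.

```python
import numpy as np

def prototype_contrastive_loss(features, is_positive, tau=0.2):
    """InfoNCE-style loss of positive pixels against their mean feature.

    features:    (N, D) per-pixel instance features
    is_positive: (N,) boolean mask, True for pixels of the anchor fruit
    tau:         temperature (0.2 as in the Figure 5 caption)
    Assumption: similarity is the cosine between a feature and the prototype.
    """
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    prototype = unit[is_positive].mean(axis=0)   # mean feature of positives
    prototype = prototype / np.linalg.norm(prototype)
    logits = unit @ prototype / tau              # similarity to the prototype
    pos, neg = logits[is_positive], logits[~is_positive]
    m = logits.max()                             # shift for numerical stability
    neg_sum = np.exp(neg - m).sum()
    # -log( exp(pos) / (exp(pos) + sum_n exp(neg_n)) ) for every positive pixel
    per_pixel = -(pos - m - np.log(np.exp(pos - m) + neg_sum))
    return per_pixel.mean()

mask = np.array([True, True, False, False])
# Negatives point away from the positives: features are already separated.
separated = np.array([[1.0, 0.0], [1.0, 0.1], [-1.0, 0.0], [-1.0, 0.1]])
# Negatives look like the positives: neighbouring fruits are confusable.
confusable = np.array([[1.0, 0.0], [1.0, 0.1], [1.0, 0.05], [1.0, -0.05]])

loss_separated = prototype_contrastive_loss(separated, mask)
loss_confusable = prototype_contrastive_loss(confusable, mask)
print(loss_separated, loss_confusable)
```

The loss is near zero when negatives are far from the prototype and grows when nearby negatives resemble the positives, which is why hard local negatives give a stronger training signal than distant ones.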