Tomato Maturity Recognition with Convolutional Transformers

Asim Khan, Taimur Hassan, Muhammad Shafay, Israa Fahmy, Naoufel Werghi, Lakmal Seneviratne, Irfan Hussain

TL;DR

This work tackles automatic tomato maturity recognition by introducing a convolutional-transformer architecture that combines a cascaded transformer block with a five-level encoder and a multi-branch decoder. A novel dataset, KUTomaData, captures tomatoes in UAE greenhouses under diverse lighting and occlusion, labeled into three maturity classes for robust segmentation and grading. Across KUTomaData and two public datasets (Laboro Tomato and Rob2Pheno Annotated Tomato), the proposed method achieves superior Dice, IoU, mAP, and AUC metrics, outperforming state-of-the-art baselines and demonstrating strong real-time applicability for robotic harvesting. The paper also presents an extensive ablation study on hyperparameters, backbone, and loss design, and releases a new dataset to advance research in tomato object segmentation and maturity classification.
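Dice and IoU, two of the segmentation metrics named above, can be illustrated for a single binary mask. This is a minimal sketch rather than the authors' evaluation code: the function name `dice_and_iou`, the smoothing term `eps`, and the toy masks are assumptions made for illustration. For the three maturity classes, the same computation could be applied to each class mask separately.

```python
import numpy as np

def dice_and_iou(pred, target, eps=1e-7):
    """Return (Dice, IoU) for two binary masks of the same shape.

    eps is a hypothetical smoothing term that avoids division by zero
    when both masks are empty.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()   # pixels in both masks
    union = np.logical_or(pred, target).sum()    # pixels in either mask
    total = pred.sum() + target.sum()            # |pred| + |target|
    dice = (2.0 * inter + eps) / (total + eps)
    iou = (inter + eps) / (union + eps)
    return float(dice), float(iou)

# Toy example: 2 overlapping pixels, 3 predicted, 3 in the ground truth.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
dice, iou = dice_and_iou(pred, target)
print(round(dice, 3), round(iou, 3))  # 0.667 0.5
```

Both scores lie in [0, 1]. For the same pair of masks, Dice is never lower than IoU, because the two are related by Dice = 2·IoU / (1 + IoU).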

Abstract

Tomatoes are a major crop worldwide, and accurately classifying their maturity is important for many agricultural applications, such as harvesting, grading, and quality control. In this paper, the authors propose a novel method for tomato maturity classification using a convolutional transformer, a hybrid architecture that combines the strengths of convolutional neural networks (CNNs) and transformers. The study also introduces a new tomato dataset, KUTomaData, explicitly designed to train deep-learning models for tomato segmentation and classification. KUTomaData comprises approximately 700 images, available for training and testing, sourced from a greenhouse in the UAE. The images were captured under various lighting conditions and viewing perspectives with different mobile camera sensors, which distinguishes the dataset from existing ones. The contributions of this paper are threefold. First, the authors propose a novel method for tomato maturity classification using a modular convolutional transformer. Second, they introduce a new tomato image dataset that contains images of tomatoes at different maturity levels. Third, they show that the convolutional transformer outperforms state-of-the-art methods for tomato maturity classification. The framework's ability to handle cluttered and occluded tomato instances was evaluated on two additional public benchmark datasets, Laboro Tomato and Rob2Pheno Annotated Tomato. Across the three datasets, the proposed framework surpasses the state-of-the-art by 58.14%, 65.42%, and 66.39% in terms of mean average precision scores for KUTomaData, Laboro Tomato, and Rob2Pheno Annotated Tomato, respectively.

Paper Structure (31 sections, 8 equations, 6 figures, 8 tables)

This paper contains 31 sections, 8 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: An architectural diagram of the proposed framework for tomato maturity level recognition and grading. The framework consists of transformer, encoder, and decoder blocks. The input scan is first passed to both the transformer and the encoder block. At the transformer end, the input scan is divided into a set of image patches, for which positional embeddings are computed. These positional embeddings are combined with linear projections of the image patches and passed to the $t$-layered transformer block, which generates the projectional features used to differentiate tomato grades. In parallel, the encoder block computes latent feature representations from the input scan using its residual and shape preservation blocks. These latent representations are then fused with the projectional features from the transformer end to boost the separation between different tomato grades. Finally, the decoder block removes extraneous elements through rescaling and max un-pooling operations, resulting in accurate segmentation and grading of tomato maturity levels.
  • Figure 2: The dataset of tomato images contains samples of tomatoes captured in different stages of ripeness and under varying lighting conditions and occlusion. The images in the dataset are organized into three columns. The first column showcases unripened tomatoes, the second column shows half-ripe and unripened tomatoes, and the third column presents fully-ripened tomatoes with some half-ripened and some unripened tomatoes. This division allows for clear differentiation and visual representation of the different ripeness stages of the tomatoes in the dataset.
  • Figure 3: Sub-figures (a) and (b) depict the loss and accuracy curves, respectively, for several network models during both training and validation stages. The models include the Proposed Model (Ours), UNet, PSPNet, and SegNet.
  • Figure 4: Examples of data augmentation techniques: (a) Original image, (b) Random brightness, (c) Horizontal flip, (d) Random rotation, (e) Salt & pepper, (f) Speckle effect, (g) Vertical variation, and (h) Zoom variation.
  • Figure 5: Qualitative comparison of the proposed framework with the best existing models. Raw test images from the authors' dataset are displayed in Column 1, the ground truth labels in Column 2, the results of the proposed framework in Column 3, and those of PSPNet, SegNet, and UNet in Columns 4-6, respectively.
  • ...and 1 more figure
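
The patch-and-position step described in the Figure 1 caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the patch size, embedding width, and function names (`patchify`, `embed_patches`) are hypothetical, the projection and positional embeddings are random stand-ins for learned parameters, and combining the two by addition is an assumption borrowed from the standard Vision Transformer recipe.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping, flattened patches.

    Returns an array of shape (num_patches, patch * patch * C),
    ordered row by row across the patch grid.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)          # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def embed_patches(image, patch, dim, rng):
    """Linearly project each patch and add a positional embedding.

    The weights here are random placeholders; in a trained model both
    the projection and the positional embeddings would be learned.
    """
    tokens = patchify(image, patch)
    proj = rng.normal(scale=0.02, size=(tokens.shape[1], dim))
    pos = rng.normal(scale=0.02, size=(tokens.shape[0], dim))
    return tokens @ proj + pos              # assumed: combine by addition

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))             # stand-in for an input scan
tokens = embed_patches(image, patch=16, dim=32, rng=rng)
print(tokens.shape)  # (16, 32)
```

The resulting token sequence is what a transformer block would consume. The fusion with the encoder's latent features and the decoder stages are not sketched here, since the caption does not specify them in enough detail.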