Multimodal Fish Feeding Intensity Assessment in Aquaculture

Meng Cui, Xubo Liu, Haohe Liu, Zhuangzhuang Du, Tao Chen, Guoping Lian, Daoliang Li, Wenwu Wang

TL;DR

This work introduces U-FFIA, a unified model that processes audio, visual, or audio-visual input with a single network, trained with modality dropout and knowledge distillation from pre-trained single-modality models.

Abstract

Fish feeding intensity assessment (FFIA) aims to evaluate changes in fish appetite during feeding, which is crucial in industrial aquaculture applications. Existing FFIA methods are limited by poor robustness to noise, high computational complexity, and the lack of public datasets for model development. To address these issues, we first introduce AV-FFIA, a new dataset containing 27,000 labeled audio and video clips that capture different levels of fish feeding intensity. We then introduce multimodal approaches for FFIA that take models pre-trained on individual modalities and combine them with data fusion methods. We perform benchmark studies of these methods on AV-FFIA and demonstrate the advantages of the multimodal approach over single-modality approaches, especially in noisy environments. However, compared to methods developed for individual modalities, the multimodal approaches may incur higher computational costs because each modality needs its own encoder. To overcome this issue, we further present a novel unified mixed-modality method for FFIA, termed U-FFIA. U-FFIA is a single model capable of processing audio, visual, or audio-visual modalities, trained with modality dropout and knowledge distillation from models pre-trained on a single modality. We demonstrate that U-FFIA achieves performance better than or on par with state-of-the-art modality-specific FFIA models at significantly lower computational overhead, enabling robust and efficient FFIA for improved aquaculture management.
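
The modality dropout idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_mode` and `apply_modality_dropout`, the three input modes, and the uniform sampling weights are all assumptions made for illustration. The sketch only shows the general pattern of choosing, at each training step, which modalities a shared model receives.

```python
import random

# Hypothetical input modes. The source only states that the model accepts
# audio, visual, or audio-visual input; the names here are illustrative.
MODES = ("audio", "video", "audio-visual")


def sample_mode(rng, weights=(1.0, 1.0, 1.0)):
    """Pick which input mode the shared model sees on this training step.

    The uniform default weights are an assumption, not a value from the paper.
    """
    return rng.choices(MODES, weights=weights, k=1)[0]


def apply_modality_dropout(audio_feats, video_feats, mode):
    """Return the (audio, video) pair with the dropped modality set to None."""
    if mode == "audio":
        return audio_feats, None
    if mode == "video":
        return None, video_feats
    return audio_feats, video_feats


# Toy training loop: the feature lists stand in for encoder outputs.
rng = random.Random(0)
audio, video = [0.1, 0.2], [0.3, 0.4]
for step in range(5):
    mode = sample_mode(rng)
    kept_audio, kept_video = apply_modality_dropout(audio, video, mode)
    print(step, mode, kept_audio is not None, kept_video is not None)
```

Because the same model is trained on every mode, it can be queried at inference time with whichever modality is available, which is the property the abstract attributes to U-FFIA.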

Paper Structure

This paper contains 39 sections, 9 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Overview of our proposed method. We first downsample the original video and audio, then use a video encoder and an audio encoder to extract the video and audio features, respectively. The audio encoder is a pre-trained MobileNetV2 and the video encoder is simply a linear projection layer. We cut the video features into non-overlapping 16 $\times$ 16 patches and use the whole audio features. Modality dropout randomly selects one modality at each training step. The shared Transformer encoder has 6 layers with 8 heads, an embedding dimension of 768, and an FFN dimension of 1024, using the pre-norm residual connection setup. For single-modality tasks, we employ knowledge distillation with pre-trained S3D (video) and CNN10 (audio) as teachers. For audio-visual fusion, we use cross-attention to incorporate audio cues into video frame representations.
  • Figure 2: Experimental system for data collection. A hydrophone was placed underwater, and the camera was mounted on a tripod at a height of about two meters to capture the video data.
  • Figure 3: Video frames and mel-spectrogram visualizations of four fish feeding intensity levels: "Strong", "Medium", "Weak" and "None".
  • Figure 4: Visualization of the impact of SimPFs on the mel-spectrogram of an FFIA audio clip with a 50% compression factor.
  • Figure 5: Real aquaculture environment for Tilapia rearing. The turbidity of the water and low lighting conditions make it challenging to visually observe the fish. (a) shows a video frame during fish feeding, and (b) shows the mel-spectrogram of the corresponding audio clip.
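
The Figure 1 caption mentions knowledge distillation from pre-trained teachers into the shared model. The exact loss is not given in this summary, so the sketch below uses the common temperature-softened formulation (KL divergence between teacher and student distributions, mixed with cross-entropy on the label) purely as an assumption. The logits, the temperature of 2.0, and the mixing weight `alpha` are hypothetical values; only the four class names come from the Figure 3 caption.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def cross_entropy(logits, label_index):
    """Standard cross-entropy against a hard label."""
    return -math.log(softmax(logits)[label_index])


# Feeding intensity classes named in the Figure 3 caption.
CLASSES = ("Strong", "Medium", "Weak", "None")

# Hypothetical logits for one clip; a real teacher would be a frozen
# pre-trained single-modality model.
teacher = [2.0, 0.5, -1.0, -2.0]
student = [1.0, 1.0, 0.0, -1.0]
label = CLASSES.index("Strong")

alpha = 0.5  # hypothetical weight between label loss and distillation loss
loss = alpha * cross_entropy(student, label) + (1 - alpha) * distillation_loss(student, teacher)
print(f"combined loss: {loss:.4f}")
```

The distillation term is zero when the student reproduces the teacher's softened distribution exactly, so minimizing it pulls the unified model toward the behavior of the modality-specific teacher.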