Audio-Visual Class-Incremental Learning for Fish Feeding intensity Assessment in Aquaculture

Meng Cui, Xianghu Yue, Xinyuan Qian, Jinzheng Zhao, Haohe Liu, Xubo Liu, Daoliang Li, Wenwu Wang

TL;DR

This paper tackles the challenge of scaling fish feeding intensity assessment (FFIA) to new fish species and environments. It introduces AV-CIL-FFIA, a large audio-visual dataset designed for class-incremental learning, and proposes HAIL-FFIA, a hierarchical exemplar-free framework. HAIL-FFIA combines dual-encoder audio-visual (AV) fusion, a two-tier representation that separates general intensity patterns from species-specific cues, and a prototype-based memory system that mitigates forgetting without storing raw data. The approach uses closed-form ridge updates and a dynamic modality balancing mechanism to adaptively fuse audio and visual information as new species are introduced. It achieves state-of-the-art performance with low storage overhead (roughly 0.1% of raw data) and reduced forgetting in incremental FFIA. Experimental results show that HAIL-FFIA outperforms both exemplar-free and exemplar-based baselines in the AV, audio-only, and visual-only settings, which makes it practical for resource-constrained aquaculture monitoring systems and provides a benchmark for future multimodal continual learning in FFIA.
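The closed-form ridge updates mentioned above can be illustrated with generic ridge regression. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class name `IncrementalRidge`, the regularisation strength `lam`, and the random features are all hypothetical. It shows why a closed-form solution suits exemplar-free learning: only the accumulated matrices `A` and `B` are kept between steps, yet the result matches training on all data at once.

```python
import numpy as np

def ridge_closed_form(X, Y, lam=1.0):
    """Batch ridge solution W = (X^T X + lam * I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

class IncrementalRidge:
    """Hypothetical helper: keeps only sufficient statistics, never raw samples."""

    def __init__(self, dim, lam=1.0):
        self.A = lam * np.eye(dim)  # running X^T X + lam * I
        self.B = None               # running X^T Y

    def update(self, X, Y):
        # Y is one-hot over all classes seen so far (old and new).
        if self.B is None:
            self.B = np.zeros((X.shape[1], Y.shape[1]))
        elif Y.shape[1] > self.B.shape[1]:
            # New classes arrived: add zero columns for them.
            pad = np.zeros((self.B.shape[0], Y.shape[1] - self.B.shape[1]))
            self.B = np.hstack([self.B, pad])
        self.A += X.T @ X
        self.B += X.T @ Y
        return np.linalg.solve(self.A, self.B)

# Two incremental steps with made-up 8-dimensional features.
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(20, 8)), rng.normal(size=(30, 8))
Y1 = np.eye(3)[rng.integers(0, 3, size=20)]  # step 1: classes 0-2
Y2 = np.eye(6)[rng.integers(3, 6, size=30)]  # step 2: classes 3-5

model = IncrementalRidge(dim=8, lam=0.5)
model.update(X1, Y1)
W_inc = model.update(X2, Y2)

# Reference: joint training on all data at once.
Y1_padded = np.hstack([Y1, np.zeros((20, 3))])
W_batch = ridge_closed_form(np.vstack([X1, X2]),
                            np.vstack([Y1_padded, Y2]), lam=0.5)
print(np.allclose(W_inc, W_batch))
```

In this sketch the stored state has a fixed size that depends only on the feature dimension and the number of classes, not on the number of clips seen.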

Abstract

Fish Feeding Intensity Assessment (FFIA) is crucial in industrial aquaculture management. Recent multi-modal approaches have shown promise in improving FFIA robustness and efficiency. However, these methods face significant challenges when adapting to new fish species or environments due to catastrophic forgetting and the lack of suitable datasets. To address these limitations, we first introduce AV-CIL-FFIA, a new dataset comprising 81,932 labelled audio-visual clips capturing feeding intensities across six different fish species in real aquaculture environments. Then, we pioneer audio-visual class-incremental learning (CIL) for FFIA and demonstrate through benchmarking on AV-CIL-FFIA that audio-visual CIL significantly outperforms single-modality methods. Existing CIL methods rely heavily on historical data. Exemplar-based approaches store raw samples, creating storage challenges, while exemplar-free methods avoid data storage but struggle to distinguish subtle feeding intensity variations across different fish species. To overcome these limitations, we introduce HAIL-FFIA, a novel audio-visual class-incremental learning framework that bridges this gap with a prototype-based approach, achieving exemplar-free efficiency while preserving essential knowledge through compact feature representations. Specifically, HAIL-FFIA employs hierarchical representation learning with a dual-path knowledge preservation mechanism that separates general intensity knowledge from fish-specific characteristics. Additionally, it features a dynamic modality balancing system that adaptively adjusts the importance of audio versus visual information based on feeding behaviour stages. Experimental results show that HAIL-FFIA is superior to SOTA methods on AV-CIL-FFIA, achieving higher accuracy with lower storage needs while effectively mitigating catastrophic forgetting in incremental fish species learning.

Paper Structure

This paper contains 35 sections, 21 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The proposed Audio-Visual Class-Incremental Learning framework. (a) Audio-Visual Fusion Backbone Training uses cross-modal attention to integrate complementary features from both modalities. (b) Hierarchical Representation Learning separates general feeding intensity knowledge from species-specific features, creating distinct prototype banks. FE (Feature Expansion) increases the representational capacity of the model through a linear transformation that projects features into a higher-dimensional space. (c) During incremental learning at steps $k-1$ and $k$, the framework preserves knowledge through prototype memory banks while adapting to new species without storing raw examples. EMA (Exponential Moving Average) refers to our prototype updating mechanism, which uses a weighted average of previous and new prototypes, allowing gradual adaptation to new fish species while preserving knowledge of previous ones.
  • Figure 2: Experimental setup for data collection. A hydrophone was placed underwater, and the camera was mounted on a tripod at a height of about two meters to capture the video data.
  • Figure 3: Comparison of video frames (top) and audio mel-spectrograms (bottom) across the six fish species in the AV-CIL-FFIA dataset: (a) Tilapia, (b) Lotus Carp, (c) Black Perch, (d) Sunfish, (e) Jade Perch, and (f) Red Tilapia. The video frames demonstrate the challenging visual conditions in real aquaculture environments, while the mel-spectrograms reveal distinct acoustic signatures characteristic of each species during feeding.
  • Figure 4: Testing accuracy at each incremental step on AV-CIL-FFIA. The results show that as the incremental step increases, our methods generally outperform other state-of-the-art incremental learning methods.
  • Figure 5: Effect of prototype count on model performance and storage requirements. The blue line (left y-axis) shows average accuracy across all fish species, while the red line (right y-axis) indicates storage requirements as a percentage of raw data. Five prototypes per intensity level provide an optimal balance between accuracy and memory efficiency.
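The EMA prototype update described in the Figure 1 caption (a weighted average of previous and new prototypes) can be sketched as follows. This is an illustrative sketch only: the function name `ema_update`, the coefficient name `alpha`, its value of 0.9, and the convention that `alpha` weights the previous prototype are assumptions, not details taken from the paper.

```python
import numpy as np

def ema_update(old_proto, new_proto, alpha=0.9):
    """Blend a stored prototype with a newly computed one.

    alpha (assumed name and value) is the weight kept on the previous
    prototype; 1 - alpha is the weight given to the new one.
    """
    old_proto = np.asarray(old_proto, dtype=float)
    new_proto = np.asarray(new_proto, dtype=float)
    return alpha * old_proto + (1.0 - alpha) * new_proto

# Toy 2-dimensional prototypes for a single intensity level.
previous = np.array([1.0, 0.0])  # prototype stored from earlier species
incoming = np.array([0.0, 1.0])  # prototype computed from the new species
updated = ema_update(previous, incoming, alpha=0.9)
print(updated)
```

Under this convention, a larger `alpha` keeps the prototype closer to what was learned from earlier species, while a smaller `alpha` adapts faster to the new species.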