ConvDTW-ACS: Audio Segmentation for Track Type Detection During Car Manufacturing

Álvaro López-Chilet, Zhaoyi Liu, Jon Ander Gómez, Carlos Alvarez, Marivi Alonso Ortiz, Andres Orejuela Mesa, David Newton, Friedrich Wolf-Monheim, Sam Michiels, Danny Hughes

TL;DR

ConvDTW-ACS automatically segments audio recorded on a standardized car test track, delimiting the boundaries between surface types. It combines CNN classification of spectrogram chunks with an Acoustic Constrained DTW postprocessing step that enforces the known surface order and minimum per-surface durations, producing boundary timestamps. On real-world data from the Ford manufacturing plant in Valencia, it achieves a mean boundary error of approximately 166 ms and shows how the ACS constraints, the choice of spectrogram, and data augmentation affect segmentation performance, with implications for automated quality inspection in Industry 4.0. The approach could enable faster, more precise surface-aware quality analysis and could be integrated with downstream anomaly detection to improve manufacturing fault detection.

Abstract

This paper proposes a method for Acoustic Constrained Segmentation (ACS) in audio recordings of vehicles driven through a production test track, delimiting the boundaries of surface types in the track. ACS is a variant of classical acoustic segmentation where the sequence of labels is known, contiguous and invariable, which is especially useful in this work as the test track has a standard configuration of surface types. The proposed ConvDTW-ACS method utilizes a Convolutional Neural Network for classifying overlapping image chunks extracted from the full audio spectrogram. Then, our custom Dynamic Time Warping algorithm aligns the sequence of predicted probabilities to the sequence of surface types in the track, from which timestamps of the surface type boundaries can be extracted. The method was evaluated on a real-world dataset collected from the Ford Manufacturing Plant in Valencia (Spain), achieving a mean error of 166 milliseconds when delimiting, within the audio, the boundaries of the surfaces in the track. The results demonstrate the effectiveness of the proposed method in accurately segmenting different surface types, which could enable the development of more specialized AI systems to improve the quality inspection process.
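
The alignment step described in the abstract can be illustrated with a small dynamic program. The following is a minimal sketch, not the authors' ACS-DTW algorithm: it assumes the classifier outputs a matrix of per-chunk probabilities, that the surfaces are visited exactly once in the order 0..K-1, and that every surface spans at least `min_len` chunks. The function name `constrained_align` and the parameter `min_len` are hypothetical, chosen only for illustration.

```python
import numpy as np

def constrained_align(probs, min_len=1):
    """Segment T chunks into K surfaces visited in the fixed order 0..K-1.

    probs: array of shape (T, K) with per-chunk class probabilities
           (assumed classifier output).
    min_len: assumed minimum number of chunks per surface.
    Returns a list with the first chunk index of each surface.
    """
    probs = np.asarray(probs, dtype=float)
    T, K = probs.shape
    if T < K * min_len:
        raise ValueError("recording too short for the duration constraint")

    # Local cost of assigning a chunk to a surface (negative log-probability).
    cost = -np.log(probs + 1e-12)
    # Prefix sums: csum[t, k] is the cost of labelling chunks [0, t) as k.
    csum = np.vstack([np.zeros((1, K)), np.cumsum(cost, axis=0)])

    # best[k, t]: minimal cost of covering chunks [0, t) with surfaces 0..k.
    best = np.full((K, T + 1), np.inf)
    back = np.zeros((K, T + 1), dtype=int)
    best[0, min_len:] = csum[min_len:, 0]

    for k in range(1, K):
        for t in range((k + 1) * min_len, T + 1):
            # Candidate first chunks of surface k that respect min_len.
            starts = np.arange(k * min_len, t - min_len + 1)
            cand = best[k - 1, starts] + (csum[t, k] - csum[starts, k])
            j = int(np.argmin(cand))
            best[k, t] = cand[j]
            back[k, t] = starts[j]

    # Backtrack from the end of the recording to recover the boundaries.
    bounds = [0] * K
    t = T
    for k in range(K - 1, 0, -1):
        t = int(back[k, t])
        bounds[k] = t
    return bounds
```

Because the path may only move forward through the label sequence, an isolated misclassified chunk cannot create a spurious segment. Multiplying the returned chunk indices by the time step between consecutive chunks would turn them into boundary timestamps.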

Paper Structure (16 sections, 1 equation, 2 figures, 5 tables, 1 algorithm)

Figures (2)

  • Figure 1: Diagram of the proposed ConvDTW-ACS method. First, the raw waveform is preprocessed to extract the spectrogram chunks. Then, a CNN model classifies each chunk among all the surfaces. Finally, our ACS-DTW algorithm aligns the model predictions to the order of the surfaces in the track. This produces the final segmentation, which consists of the boundary timestamps between surfaces.
  • Figure 2: Diagram of the data preprocessing pipeline using the Mel spectrogram for feature extraction. It shows how the input samples (chunks) for the CNN classifier are extracted from the raw audio waveform. The dimensions of each chunk image are $(channels=1, height=Mel\ bands, width=chunk\_size)$. Each chunk is assigned a surface label based on its middle frame, which generates the ground truth for the CNN classifier.
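
The chunking step in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's preprocessing code: the spectrogram is taken as an already computed array of shape (Mel bands, frames), per-frame surface labels are assumed to be available, and the names `extract_chunks` and `hop` (the stride between consecutive chunks, in frames) are hypothetical.

```python
import numpy as np

def extract_chunks(spec, frame_labels, chunk_size, hop):
    """Cut a spectrogram into fixed-width chunks labelled by their middle frame.

    spec: array of shape (n_mels, n_frames), assumed precomputed.
    frame_labels: array of shape (n_frames,) with one surface label per frame.
    chunk_size: chunk width in frames.
    hop: stride between chunk starts; chunks overlap when hop < chunk_size.
    Returns (chunks, labels) with chunks of shape
    (n_chunks, 1, n_mels, chunk_size).
    """
    spec = np.asarray(spec)
    frame_labels = np.asarray(frame_labels)
    n_mels, n_frames = spec.shape
    if n_frames < chunk_size:
        raise ValueError("spectrogram is shorter than one chunk")

    starts = range(0, n_frames - chunk_size + 1, hop)
    # Stack the chunks and add a singleton channel axis for the CNN input.
    chunks = np.stack([spec[:, s:s + chunk_size] for s in starts])[:, None, :, :]
    # The label of each chunk is the label of its middle frame.
    labels = np.array([frame_labels[s + chunk_size // 2] for s in starts])
    return chunks, labels
```

With `hop` smaller than `chunk_size`, neighbouring chunks share frames, which matches the overlapping chunks mentioned in the abstract. The actual chunk width and stride used in the paper are not stated in this section.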