Mobile Recording Device Recognition Based Cross-Scale and Multi-Level Representation Learning

Chunyan Zeng, Yuhao Zhao, Zhifeng Wang

TL;DR

This paper introduces a modeling approach based on multi-level global processing across two feature scales, short-term frame-level and long-term sample-level. It also investigates the transferability of the model, which achieves 87.9% accuracy in a classification task on a new dataset.

Abstract

This paper introduces a modeling approach based on multi-level global processing that covers two feature scales: short-term frame-level and long-term sample-level. In the shallow feature extraction stage, multi-level features are extracted at several scales, including Mel-Frequency Cepstral Coefficients (MFCC) and the pre-Fbank log energy spectrum. The recognition network treats the two-dimensional temporal input features at both the frame level and the sample level. Specifically, the model first uses a Convolutional Long Short-Term Memory (ConvLSTM) network built on one-dimensional convolution to fuse spatiotemporal information and extract short-term frame-level features. A Bidirectional Long Short-Term Memory (BiLSTM) network then learns long-term sample-level sequential representations. Next, a transformer encoder performs cross-scale, multi-level processing on the global frame-level and sample-level features, enabling deep feature representation and fusion at both levels. Finally, recognition results are obtained through Softmax. The method achieves 99.6% recognition accuracy on the CCNU_Mobile dataset, an improvement of 2% to 12% over the baseline system. The transferability of the model is also investigated: it achieves 87.9% accuracy in a classification task on a new dataset.
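
The two feature scales and the final Softmax step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it replaces MFCC and Fbank features with a plain per-frame log energy, omits the ConvLSTM, BiLSTM, and transformer stages entirely, and all names and values (sampling rate, frame length, hop, class scores) are assumptions chosen for illustration.

```python
import math

def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames (the short-term scale)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def log_energy(frame, eps=1e-10):
    """Log energy of one frame; eps avoids log(0) on silent frames."""
    return math.log(sum(x * x for x in frame) + eps)

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy recording: 1 second of a 440 Hz tone at an assumed 8 kHz sampling rate.
sr = 8000
signal = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]

# Frame level: 25 ms frames with a 10 ms hop (assumed values).
frames = frame_signal(signal, frame_len=200, hop=80)
frame_features = [log_energy(f) for f in frames]

# Sample level: the ordered sequence of frame features for the whole
# recording is what a sequence model would consume as one sample.
sample_length = len(frame_features)

# Stand-in class scores for three hypothetical devices; in the paper these
# would come from the network, here they are made up.
probs = softmax([2.0, 1.0, 0.1])
predicted = probs.index(max(probs))
```

In this sketch each recording yields a sequence of frame-level values, and a classifier would turn the sequence into one score per candidate device before Softmax converts the scores into probabilities.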

Paper Structure

This paper contains 27 sections, 39 equations, 10 figures, 9 tables, and 3 algorithms.

Figures (10)

  • Figure 1: Flowchart of mobile recording device recognition problem.
  • Figure 2: Overall framework of the proposed cross-scale and multi-level representation learning based mobile recording device recognition method.
  • Figure 3: Process of MFCC-based timing tandem feature extraction.
  • Figure 4: Schematic of the dynamic range of the timing tandem feature.
  • Figure 5: Schematic diagram of the module for processing short-time frame-level features.
  • ...and 5 more figures