Evaluation of Audio Compression Codecs

Thien T. Duong, Jan P. Springer

TL;DR

The paper addresses the need to evaluate audio codecs not only on compression efficiency but also on perceptual quality. It combines traditional metrics, visual spectral analyses, and PEAQ-based scores to compare FLAC, MP3, AAC, Vorbis, and an AI-based RVQGAN, revealing that lossy codecs often trade fidelity for size, with Vorbis performing notably well among lossy codecs. AI-driven RVQGAN delivers extreme compression but shows poor perceptual quality and playback compatibility, highlighting current limitations and the need for psychoacoustic integration. The results provide practical guidance for codec selection and underscore the importance of perceptual assessment in real-world applications, especially as AI-based approaches mature and streaming contexts grow.

Abstract

Perceptual quality of audio is the combination of aural accuracy and listener-perceived sound fidelity. It is how humans respond to the accuracy, intelligibility, and fidelity of aural media. Today this fidelity is also heavily influenced by the use of audio compression codecs for storing aural media in digital form. We argue that, when choosing an audio compression codec, users should not only look at compression efficiency but also consider the sonic perceptual quality properties of available audio compression codecs. We evaluate several commonly used audio compression codecs in terms of compression performance as well as their sonic perceptual quality via codec performance measurements, visualizations, and PEAQ scores. We demonstrate how perceptual quality is affected by digital audio compression techniques, providing insights for users in the process of choosing a digital audio compression scheme.

Paper Structure

This paper contains 18 sections, 2 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Visualization of perceptual audio quality. The Audacity-generated spectrum shows peak loudness within the highest frequency (48 kHz); the x-axis shows time, the y-axis shows frequency, and the colored spectrum indicates loudness in dB. The Spek-generated spectrogram shows the loudness range (0 dB to -120 dB) across the audio signal's frequency range; the x-axis shows time, the y-axis shows the frequency range, and the colored spectrum indicates loudness in dB. The sound-field visualization, generated by Audacity using the Insight 2 plug-in, shows the width, depth, and height of the sound stage as well as the location of individual sound sources within it. The half-circle line indicates the front of the sound stage, while bright dots indicate stereo images. All visualizations are based on Dreams by Fleetwood Mac (2001 remaster vinyl quality, 96 kHz sample rate).
  • Figure 2: Spectra for all tested audio codecs (exported from Audacity). The x-axis denotes the frequency range up to 24 kHz. The y-axis shows the loudness range from 0 dB to -150 dB. The color-coded graphs refer to uncompressed (black), FLAC level 6 (red), MP3 CBR 320 kbps (yellow), MP3 CBR 128 kbps (green), AAC CBR 256 kbps (orange), AAC VBR level 5 (teal), and Vorbis VBR level 7 (pink). The graphs represent the encoded audio signal output for the respective audio codec. Note that the spectrum for the uncompressed audio signal is completely aligned with the spectrum of FLAC level 6, showing that FLAC-encoded audio exhibits exactly the same loudness as the uncompressed audio signal across the entire frequency range.
  • Figure 3: Audible loudness comparison for the track Iron Man by Black Sabbath using different encoding schemes. (a) shows the spectrogram of the uncompressed signal, while (b) to (g) show spectrograms for FLAC, MP3, AAC, and Vorbis, respectively. Note that MP3 requires a version of the track scaled down from 96 kHz to 48 kHz; the diagram for MP3 has been scaled in turn by a factor of two. All x-axes show the duration of the tested audio signal, while the y-axes show its frequency range. Spectrograms indicate loudness in dB.
  • Figure 4: Sound-field visualizations for the track Iron Man by Black Sabbath compressed using different audio encoding schemes. (a) shows the sound-field visualization from 3:00 to 3:30 min for the uncompressed signal, and (b) to (g) show the sound-field visualization for the same time interval encoded with FLAC, the MP3 variants, the AAC variants, and Vorbis, respectively. The patterns of dots should lean to the right, capturing the stereo image of the electric guitar's sound in the recording. Compared to the uncompressed audio signal's stereo image, the FLAC-encoded audio signal accurately captures the stereo image, while audio signals from the lossy encoders show some inaccuracies.
  • Figure 5: RVQGAN's spectrum (red) together with the uncompressed signal's spectrum (blue) (exported from Audacity). The x-axis shows the frequency range of the sample track in Hz; the y-axis shows the loudness range from 0 dB to -125 dB. Compared to the uncompressed audio signal, the RVQGAN-encoded audio signal shows a slight increase in loudness before dropping off at around 10 kHz, gradually reducing in volume towards higher frequencies.
  • ...and 2 more figures
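
The spectra in Figures 2 and 5 plot loudness in dB against frequency, up to half the sample rate (24 kHz for a 48 kHz signal). The following is a minimal sketch of how such a spectrum could be computed, not the procedure Audacity or the authors use. The function name `spectrum_db`, the Hann window, the -150 dB floor, and the synthetic 1 kHz test tone are all assumptions made for illustration. A naive DFT keeps the example dependency-free; an FFT library would be the practical choice for full-length tracks.

```python
import cmath
import math


def spectrum_db(samples, sample_rate, floor_db=-150.0):
    """Return (frequencies in Hz, levels in dB) for a mono signal in [-1, 1].

    Hypothetical helper for illustration only. Levels are scaled so that a
    full-scale sine centred on a frequency bin reads about 0 dB.
    """
    n = len(samples)
    # Periodic Hann window to reduce leakage between neighbouring bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    scale = sum(window) / 2.0
    floor = 10 ** (floor_db / 20.0)  # clamp so that silence does not hit log(0)
    freqs, levels = [], []
    for k in range(n // 2 + 1):  # bins from 0 Hz up to the Nyquist frequency
        acc = sum(
            s * w * cmath.exp(-2j * math.pi * k * i / n)
            for i, (s, w) in enumerate(zip(samples, window))
        )
        freqs.append(k * sample_rate / n)
        levels.append(20.0 * math.log10(max(abs(acc) / scale, floor)))
    return freqs, levels


# Synthetic stand-in for an audio excerpt: 10 ms of a full-scale 1 kHz sine.
sample_rate = 48_000
n_samples = 480
tone = [math.sin(2 * math.pi * 1000.0 * i / sample_rate) for i in range(n_samples)]

freqs, levels = spectrum_db(tone, sample_rate)
peak_hz = freqs[levels.index(max(levels))]
print(f"peak at {peak_hz} Hz, highest bin at {freqs[-1]} Hz")
```

Overlaying the output of such a function for an uncompressed signal and its decoded counterpart gives the kind of comparison the figures describe: a lossless codec should yield matching curves, while a lossy codec typically diverges, often at higher frequencies.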