Almost Right: Making First-Layer Kernels Nearly Orthogonal Improves Model Generalization

Colton R. Crum, Adam Czajka

TL;DR

This work tackles poor open-set generalization in CNNs, especially in biometric and medical domains, by introducing Almost Right, a lightweight, architecture-agnostic regularizer that encourages near-orthogonality among a subset of the first-layer kernels. The method is a differentiable, loss-based penalty on pairwise kernel relationships; because it flattens each kernel across its channels, it applies to arbitrary CNNs and even Vision Transformers without heavy architectural changes. The authors define a combined loss with a cross-entropy term and a soft orthogonality term. They evaluate it on three open-set domains (iris PAD, chest X-ray anomalies, and synthetic face detection) and five architectures, reporting consistent gains over state-of-the-art orthogonal and saliency-based regularizers. Improvements in open-set AUROC are most notable for the ConvNeXt, ViT, and Swin architectures, and ablations suggest that a balanced regularization strength around $\alpha=0.5$ yields robust improvements. The authors provide source code to facilitate replication and further research in open-set generalization.
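The pairwise, flattened-kernel idea summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the use of squared cosine similarity as the pairwise measure, and the convex weighting `alpha * ce + (1 - alpha) * penalty` are all assumptions made for illustration. The summary states only that a cross-entropy term is combined with a soft orthogonality term and that a strength around $\alpha=0.5$ worked well.

```python
import numpy as np


def pairwise_orthogonality_penalty(kernels, subset=None):
    """Mean squared cosine similarity over all distinct kernel pairs.

    kernels: array of shape (out_channels, in_channels, kh, kw).
    subset:  optional list of kernel indices to regularize (hypothetical
             parameter; the paper regularizes only a subset of kernels).
    Returns 0.0 when the selected kernels are mutually orthogonal.
    """
    selected = kernels if subset is None else kernels[subset]
    n = selected.shape[0]
    if n < 2:
        return 0.0  # no pairs to compare

    # Flatten each kernel across channels and spatial dims into one vector.
    flat = selected.reshape(n, -1)

    # Normalize so the Gram matrix holds cosine similarities.
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.maximum(norms, 1e-12)
    gram = unit @ unit.T

    # Keep only the strict upper triangle: each unordered pair once.
    rows, cols = np.triu_indices(n, k=1)
    return float(np.mean(gram[rows, cols] ** 2))


def combined_loss(ce_loss, kernels, alpha=0.5, subset=None):
    """Assumed weighting: alpha * cross-entropy + (1 - alpha) * penalty."""
    penalty = pairwise_orthogonality_penalty(kernels, subset)
    return alpha * ce_loss + (1.0 - alpha) * penalty
```

In an actual training loop the same computation would be written in an autodiff framework so that the penalty contributes gradients to the first-layer weights; the NumPy version only shows what quantity is being driven toward zero.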

Abstract

Despite several algorithmic advances in the training of convolutional neural networks (CNNs) over the years, their generalization capabilities are still subpar across several pertinent domains, particularly within open-set tasks often found in biometric and medical contexts. In contrast, humans have an uncanny ability to generalize to unknown visual stimuli. The efficient coding hypothesis posits that early visual structures (retina, Lateral Geniculate Nucleus, and primary visual cortex) transform inputs to reduce redundancy and maximize information efficiency. This mechanism of redundancy minimization in early vision was the inspiration for CNN regularization techniques that force convolutional kernels to be orthogonal. However, existing works rely upon matrix projections, architectural modifications, or specific weight initializations, which frequently overly constrain the network's learning process and excessively increase the computational load during loss function calculation. In this paper, we introduce a flexible and lightweight approach that regularizes a subset of first-layer convolutional filters by making them pairwise-orthogonal, which reduces the redundancy of the extracted features while avoiding excessive constraints on the network. We evaluate the proposed method on three open-set visual tasks (anomaly detection in chest X-ray images, synthetic face detection, and iris presentation attack detection) and observe an increase in the generalization capabilities of models trained with the proposed regularizer compared to state-of-the-art kernel orthogonalization approaches. We offer source code along with the paper.

Paper Structure

This paper contains 35 sections, 3 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: A sample of the three open-set recognition tasks evaluated in this paper: iris presentation attack detection (left column), chest X-ray abnormalities (middle column), and synthetic face detection (right column). The top row shows live (iris PAD), normal (chest X-ray), and real (face) samples, whereas the bottom row shows spoof, abnormal, and fake samples, respectively.