Deep Layer Aggregation

Fisher Yu, Dequan Wang, Evan Shelhamer, Trevor Darrell

TL;DR

This work introduces Deep Layer Aggregation (DLA), a backbone-agnostic framework that deeply fuses semantic information and spatial resolution through iterative (IDA) and hierarchical (HDA) aggregation. By replacing shallow skip connections with tree- and stage-aware fusion, DLA yields more accurate representations while using fewer parameters across image classification, fine-grained recognition, semantic segmentation, and boundary detection. The approach demonstrates consistent gains over ResNet/ResNeXt and competitive performance against DenseNet, with notable efficiency advantages. The authors report experiments on ImageNet, Cityscapes, CamVid, and boundary datasets, and release the code at https://github.com/ucbdrive/dla.

Abstract

Visual recognition requires rich representations that span levels from low to high, scales from small to large, and resolutions from fine to coarse. Even with the depth of features in a convolutional network, a layer in isolation is not enough: compounding and aggregating these representations improves inference of what and where. Architectural efforts are exploring many dimensions for network backbones, designing deeper or wider architectures, but how to best aggregate layers and blocks across a network deserves further attention. Although skip connections have been incorporated to combine layers, these connections have been "shallow" themselves, and only fuse by simple, one-step operations. We augment standard architectures with deeper aggregation to better fuse information across layers. Our deep layer aggregation structures iteratively and hierarchically merge the feature hierarchy to make networks with better accuracy and fewer parameters. Experiments across architectures and tasks show that deep layer aggregation improves recognition and resolution compared to existing branching and merging schemes. The code is at https://github.com/ucbdrive/dla.

Paper Structure

This paper contains 17 sections, 5 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Deep layer aggregation unifies semantic and spatial fusion to better capture what and where. Our aggregation architectures encompass and extend densely connected networks and feature pyramid networks with hierarchical and iterative skip connections that deepen the representation and refine resolution.
  • Figure 2: Different approaches to aggregation. (a) composes blocks without aggregation as is the default for classification and regression networks. (b) combines parts of the network with skip connections, as is commonly used for tasks like segmentation and detection, but does so only shallowly by merging earlier parts in a single step each. We propose two deep aggregation architectures: (c) aggregates iteratively by reordering the skip connections of (b) such that the shallowest parts are aggregated the most for further processing and (d) aggregates hierarchically through a tree structure of blocks to better span the feature hierarchy of the network across different depths. (e) and (f) are refinements of (d) that deepen aggregation by routing intermediate aggregations back into the network and improve efficiency by merging successive aggregations at the same depth. Our experiments show the advantages of (c) and (f) for recognition and resolution.
  • Figure 3: Deep layer aggregation learns to better extract the full spectrum of semantic and spatial information from a network. Iterative connections join neighboring stages to progressively deepen and spatially refine the representation. Hierarchical connections cross stages with trees that span the spectrum of layers to better propagate features and gradients.
  • Figure 4: Interpolation by iterative deep aggregation. Stages are fused from shallow to deep to make a progressively deeper and higher resolution decoder.
  • Figure 5: Evaluation of DLA on ILSVRC. DLA/DLA-X have ResNet/ResNeXt backbones, respectively. DLA achieves the highest accuracies with fewer parameters and less computation.
  • ...and 2 more figures
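
The aggregation orders described in the Figure 2 caption can be made concrete with a small symbolic sketch. This is not the authors' implementation (see the linked repository for that): the function names and the string-building `symbolic_node` are hypothetical, chosen only to show which inputs get merged and in what order. The iterative variant folds stages from shallowest to deepest, as in (c). The tree variant is a plain balanced merge loosely in the spirit of (d); the refinements in (e) and (f) are not modeled.

```python
from functools import reduce
from typing import Callable, Sequence

# Hypothetical stand-in for an aggregation node: it only records the
# merge structure as a string instead of computing on feature maps.
Node = Callable[[str, str], str]


def symbolic_node(a: str, b: str) -> str:
    """Return a textual record of merging two inputs."""
    return f"N({a}, {b})"


def iterative_aggregate(stages: Sequence[str], node: Node) -> str:
    """Merge stages one at a time, starting from the shallowest.

    Stages are ordered shallow to deep, so the shallowest inputs pass
    through the most aggregation nodes.
    """
    if not stages:
        raise ValueError("need at least one stage")
    return reduce(node, stages)


def tree_aggregate(blocks: Sequence[str], node: Node) -> str:
    """Merge blocks pairwise in a balanced binary tree (assumed layout)."""
    if not blocks:
        raise ValueError("need at least one block")
    if len(blocks) == 1:
        return blocks[0]
    mid = len(blocks) // 2
    left = tree_aggregate(blocks[:mid], node)
    right = tree_aggregate(blocks[mid:], node)
    return node(left, right)


stages = ["x1", "x2", "x3", "x4"]
print(iterative_aggregate(stages, symbolic_node))  # N(N(N(x1, x2), x3), x4)
print(tree_aggregate(stages, symbolic_node))       # N(N(x1, x2), N(x3, x4))
```

In the iterative output, `x1` sits inside three nested nodes while `x4` sits inside one, which matches the caption's statement that the shallowest parts are aggregated the most. Swapping `symbolic_node` for a function over real feature tensors would keep the same merge order.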