A two-stream network with global-local feature fusion for bone age assessment

Qiong Lou, Han Yang, Fang Lu

TL;DR

BoNet+ addresses the trade-off between global skeletal context and local bone detail in bone age assessment (BAA) with a two-stream architecture: a Transformer-based global feature extractor and an RFAConv-based local feature extractor, whose outputs are fused and refined by Inception-V3. It achieves MAEs of 3.81 months on RSNA and 5.65 months on RHPE, comparable to the state of the art, and ablation studies confirm that the two modules are complementary. Grad-CAM analyses show interpretable shifts in attention, suggesting practical potential to reduce clinician workload. Overall, the work advances automated BAA by integrating global and local information in a clinically aligned framework.

Abstract

Bone Age Assessment (BAA) is a widely used clinical technique that can accurately reflect an individual's level of growth and development, as well as maturity. Although deep learning has advanced bone age assessment in recent years, existing methods struggle to balance global features and local skeletal details efficiently. This study aims to develop an automated bone age assessment system based on a two-stream deep learning architecture to achieve higher accuracy. We propose the BoNet+ model, which incorporates global and local feature extraction channels. A Transformer module is introduced into the global feature extraction channel to strengthen global feature extraction through a multi-head self-attention mechanism. An RFAConv module is incorporated into the local feature extraction channel to generate adaptive attention maps within multiscale receptive fields, enhancing local feature extraction. Global and local features are concatenated along the channel dimension and optimized by an Inception-V3 network. The proposed method has been validated on the Radiological Society of North America (RSNA) and Radiological Hand Pose Estimation (RHPE) test datasets, achieving mean absolute errors (MAEs) of 3.81 and 5.65 months, respectively. These results are comparable to the state of the art. The BoNet+ model reduces clinical workload and achieves automatic, high-precision, and more objective bone age assessment.
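
The fusion step described in the abstract, concatenating global and local features along the channel dimension, can be illustrated with a minimal sketch. This is not the authors' implementation: the nested-list feature maps, the helper names (`shape`, `concat_channels`, `constant_map`), and the channel and spatial sizes are all assumptions chosen only to show what channel-wise concatenation does to the tensor shape before a downstream network (Inception-V3 in the paper) processes it.

```python
from typing import List, Tuple

# A feature map indexed as [channel][row][col]; a stand-in for a real tensor.
FeatureMap = List[List[List[float]]]


def shape(fm: FeatureMap) -> Tuple[int, int, int]:
    """Return (channels, height, width) of a nested-list feature map."""
    return (len(fm), len(fm[0]), len(fm[0][0]))


def concat_channels(global_fm: FeatureMap, local_fm: FeatureMap) -> FeatureMap:
    """Stack two feature maps along the channel axis.

    Both maps must share the same spatial size; only channel counts may differ.
    """
    _, gh, gw = shape(global_fm)
    _, lh, lw = shape(local_fm)
    if (gh, gw) != (lh, lw):
        raise ValueError("spatial sizes differ: %s vs %s" % ((gh, gw), (lh, lw)))
    # List concatenation on the outermost axis is channel-wise concatenation.
    return global_fm + local_fm


def constant_map(channels: int, height: int, width: int, value: float) -> FeatureMap:
    """Build a feature map filled with one value (hypothetical toy data)."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


# Toy stand-ins for the two streams' outputs; sizes are illustrative,
# not taken from the paper.
global_features = constant_map(4, 2, 2, 1.0)  # e.g. Transformer stream
local_features = constant_map(3, 2, 2, 2.0)   # e.g. RFAConv stream

fused = concat_channels(global_features, local_features)
print(shape(fused))  # (7, 2, 2)
```

The fused map keeps the spatial size and has as many channels as the two streams combined, so the global channels come first and the local channels follow. In a tensor library the same operation is a single concatenation call along the channel axis.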

Paper Structure

This paper contains 14 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overview of the pipeline used in BoNet+.
  • Figure 2: The sample number distribution across different age groups in two public datasets (including training and validation sets).
  • Figure 3: The Transformer module ref12.
  • Figure 4: The RFAConv module ref11.
  • Figure 5: Statistical results of the proposed method in bone age assessment. (a) Actual age and predicted age on the RSNA validation dataset. (b) Actual age and deviation on the RSNA validation dataset. (c) Actual age and predicted age on the RHPE validation dataset. (d) Actual age and deviation on the RHPE validation dataset.
  • ...and 1 more figure
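
The per-sample deviation plotted in Figure 5 and the MAE reported in the abstract are related in a simple way, sketched below. The sign convention (predicted minus actual), the function names, and the sample ages are assumptions for illustration only; none of the numbers come from the RSNA or RHPE datasets.

```python
from typing import List, Sequence


def deviations(predicted_months: Sequence[float],
               actual_months: Sequence[float]) -> List[float]:
    """Per-sample deviation, assumed here to be predicted minus actual age."""
    if len(predicted_months) != len(actual_months):
        raise ValueError("predicted and actual must have the same length")
    return [p - a for p, a in zip(predicted_months, actual_months)]


def mean_absolute_error(predicted_months: Sequence[float],
                        actual_months: Sequence[float]) -> float:
    """Mean of the absolute deviations, in the same unit as the inputs (months)."""
    devs = deviations(predicted_months, actual_months)
    if not devs:
        raise ValueError("at least one sample is required")
    return sum(abs(d) for d in devs) / len(devs)


# Hypothetical bone ages in months, made up for this example.
actual = [120.0, 96.0, 150.0, 60.0]
predicted = [124.0, 93.0, 150.0, 65.0]

print(deviations(predicted, actual))           # [4.0, -3.0, 0.0, 5.0]
print(mean_absolute_error(predicted, actual))  # 3.0
```

Because the absolute value is taken before averaging, over- and under-estimates do not cancel. An MAE of 3.81 months therefore means predictions miss the reference age by 3.81 months on average, in either direction.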