Combining Microscopy Data and Metadata for Reconstruction of Cellular Traction Forces Using a Hybrid Vision Transformer-U-Net

Yunfei Huang; Elena Van der Vorst; Alexander Richard; Benedikt Sabass

Abstract

Traction force microscopy (TFM) is a widely used technique for quantifying the forces that cells exert on their surrounding extracellular matrix. Although deep learning methods have recently been applied to TFM data analysis, several challenges remain, particularly achieving reliable inference across multiple spatial scales and integrating additional contextual information, such as cell type, to improve accuracy. In this study, we propose ViT+UNet, a robust deep learning architecture that integrates a U-Net with a Vision Transformer. Our results demonstrate that this hybrid model outperforms both standalone U-Net and Vision Transformer architectures in predicting traction force fields. Furthermore, ViT+UNet exhibits superior generalization across diverse spatial scales and varying noise levels, enabling its application to TFM datasets obtained from different experimental setups and imaging systems. By appropriately structuring the input data, our approach also allows the inclusion of metadata, in our case cell-type information, to enhance prediction specificity and accuracy.

Paper Structure (13 sections, 3 equations, 5 figures, 1 table)

Introduction
Materials and methods
U-Net
Vision Transformer
Hybrid Model: ViT+UNet
Training
Evaluation Metrics
Results
The hybrid ViT+UNet architecture consistently outperforms both standalone U-Net and Vision Transformer models in predicting traction force fields
The ViT+UNet Architecture Exhibits Superior Generalization Across Diverse Spatial Scales
ViT+UNet demonstrates robust generalized prediction performance under different additional noise levels
Adding Cell-Type Information as Input Data Enhances Prediction Specificity and Accuracy
Conclusion

Figures (5)

Figure 1: Schematic diagram of traction force microscopy and the architectures of deep learning models used for traction force analysis. (A) Experimental setup: a single cell adheres to the surface of an elastic gel, and the traction force field is calculated from the measured displacement field. (B) Architecture of U-Net, consisting of downsampling, a middle block, and upsampling stages. The black arrows indicate 2D convolutions; red arrows represent downsampling through 2D convolutions; blue arrows denote 2D transposed convolutions used for upsampling; and purple arrows indicate skip connections (copy operations). The input and output of the model correspond to the displacement field and traction force field, respectively. (C) Architecture of Vision Transformer (ViT), which includes image pre-processing, a transformer encoder, and a convolutional decoder. Note that the transformer encoder contains $L$ layers. (D) Architecture of the hybrid deep learning model ViT+UNet, which combines a Vision Transformer with a U-Net framework. In this design, the middle block of U-Net is replaced by the Vision Transformer module.
Figure 2: The hybrid ViT+UNet architecture consistently outperforms both standalone U-Net and Vision Transformer models in predicting traction force fields. (A) Representative cell image and its corresponding displacement field used as input (i). The reference traction force field, calculated using Bayesian Fourier-transform traction cytometry (BFTTC), is shown in (ii). Note that the vector field is plotted only for regions where the traction magnitude exceeds $50\,\mathrm{Pa}$, skipping every 15 indices for clarity. (B, i–iii) Comparison of predicted traction force fields obtained from U-Net, Vision Transformer (ViT), and ViT+UNet models. (C) Zoomed-in views corresponding to the colored boxes in panel (A-ii). (D, i–iii) Joint probability distributions of force magnitudes for the predictions and references, $p(F_\text{Ref}, F_\text{NN})$, obtained from U-Net, ViT, and ViT+UNet models. Only forces larger than $150\,\mathrm{Pa}$ are considered; for better predictions, more of the probability density lies along the diagonal dotted line. (E) Statistical comparison of normalized root mean square error (NRMSE; i) and correlation coefficients (ii) across 34 test cases. These two metrics are defined in Eqs. (\ref{eq:NRMSE}) and (\ref{eq:correlation}). Lower NRMSE values indicate more accurate predictions, while correlation coefficients closer to 1 reflect stronger agreement with the reference data. Error bars represent standard deviations.
Figure 3: The ViT+UNet architecture exhibits superior generalization across diverse spatial scales. (A) Schematic diagram illustrating the inference procedure for different spatial scales. The scale ratio $s$ is defined as $s = l / L$, where $l$ is the scaled length used during inference and $L$ is the original length used during training, as shown in (i). Inference cases with $s < 1$ and $s > 1$ are referred to as zoom-in and zoom-out scenarios, respectively, illustrated in (ii) and (iii). (B) Example of an original cell image with its corresponding displacement field (i) and traction force field obtained using Bayesian Fourier-transform traction cytometry (BFTTC) in (ii). (C) Generalization inference results for the zoom-in ($s = 0.6$; ii–iv) and zoom-out ($s = 1.67$; vi–viii) scenarios obtained from U-Net, ViT, and ViT+UNet models compared with the reference data shown in panels (i, v). (D) Statistical summary showing normalized root mean square error (NRMSE; i) and correlation coefficients (ii) for ViT+UNet, U-Net, and ViT across 34 test cases over scale ratios ranging from $s = 0.25$ to $2.3$. Error bars represent standard deviations.
Figure 4: The ViT+UNet model generalizes well in predicting traction forces from displacement data under varying noise levels. (A-i) Dimensionless displacement field showing both magnitude (color map) and vector directions. (A-ii) Histogram of variance values and their average computed from dimensionless displacement fields in the training dataset. (A-iii) Gaussian noise magnitude and corresponding vector field at an 8% noise level. (A-iv) Dimensionless displacement field after adding 8% Gaussian noise, illustrating both magnitude (color map) and vector directions. (B) Reference traction force field obtained using Bayesian Fourier-transform traction cytometry (BFTTC); colors indicate traction force magnitudes. (C) Zoomed-in views of traction force fields under noise-free conditions (ii–iv) and at an 8% noise level (v–vii) predicted by the three models (U-Net, ViT, and ViT+UNet), corresponding to the green box in panel (B) and compared with the reference region shown in (i). (D) Plots of normalized root mean square error (NRMSE; i) and correlation coefficients (ii) for all three models at different noise levels ranging from 0 to 9%. Error bars represent standard deviations.
Figure 5: Incorporating cell-type information improves prediction specificity and accuracy. (A) Architecture of the hybrid deep learning model ViT+UNet, which can also embed cell-type text information as an additional input. (B) Representative cell image with its corresponding displacement field and annotated cell type (C2C12) in panel (i). The reference traction force field obtained using Bayesian Fourier-transform traction cytometry (BFTTC) is shown in panel (ii). (C, i–iv) Zoomed-in views of predicted traction force fields obtained from ViT, ViT+cell-type, ViT+UNet, and ViT+UNet+cell-type models within the blue box region indicated in panel (B-ii). (D) Statistical comparison of normalized root mean square error (NRMSE; i) and correlation coefficients (ii) among the four methods: ViT, ViT+cell-type, ViT+UNet, and ViT+UNet+cell-type. Error bars represent standard deviations. (E, i–iv) Joint distributions of force magnitudes $p(F_\text{Ref}, F_\text{NN})$ binned for the deep learning models ViT, ViT with added cell-type input (ViT+cell-type), ViT+UNet, and ViT+UNet with added cell-type input (ViT+UNet+cell-type).
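
The plotting conventions stated in the Figure 2 caption (vectors drawn only where the traction magnitude exceeds $50\,\mathrm{Pa}$ on a strided grid, and a joint distribution $p(F_\text{Ref}, F_\text{NN})$ restricted to forces above $150\,\mathrm{Pa}$) can be illustrated with a minimal NumPy sketch. This is not the authors' code. The function names `quiver_mask` and `joint_distribution`, the stride of 15, the bin count, and the choice to apply the $150\,\mathrm{Pa}$ threshold to the reference magnitude only are all assumptions made for illustration.

```python
import numpy as np


def quiver_mask(tx, ty, threshold=50.0, step=15):
    """Boolean mask of grid points at which to draw traction vectors.

    Assumption: a point is drawn if it lies on every `step`-th row and
    column AND its traction magnitude exceeds `threshold` (in Pa).
    """
    magnitude = np.hypot(tx, ty)
    on_stride = np.zeros(magnitude.shape, dtype=bool)
    on_stride[::step, ::step] = True
    return on_stride & (magnitude > threshold)


def joint_distribution(f_ref, f_nn, f_min=150.0, bins=50):
    """Estimate p(F_Ref, F_NN) as a normalized 2D histogram.

    Assumption: a pixel is kept when the *reference* magnitude exceeds
    `f_min`; the caption does not say which field is thresholded.
    Returns the density and the bin edges along each axis.
    """
    f_ref = np.ravel(f_ref)
    f_nn = np.ravel(f_nn)
    keep = f_ref > f_min
    if not keep.any():
        raise ValueError("no force magnitudes above f_min")
    f_max = max(f_ref[keep].max(), f_nn[keep].max())
    density, ref_edges, nn_edges = np.histogram2d(
        f_ref[keep],
        f_nn[keep],
        bins=bins,
        range=[[f_min, f_max], [0.0, f_max]],
        density=True,
    )
    return density, ref_edges, nn_edges
```

For a perfect prediction, all of the probability mass returned by `joint_distribution` would fall in bins on the diagonal $F_\text{NN} = F_\text{Ref}$, which is the dotted line referred to in the caption.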
