Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design

Markus J. Buehler

TL;DR

Cephalo presents a family of open-source multimodal vision-language models designed for bio-inspired materials analysis and design by fusing a vision encoder with an autoregressive transformer and training on image-text data sourced from Wikipedia and scientific literature. The work explores multiple model sizes and architectures, including 8b base models, larger merged 10b/12b variants, and 4b Phi-based models, plus a sparse Mixture-of-Experts (MoE) extension, all validated on tasks ranging from fracture mechanics to bio-inspired design. A robust dataset-generation pipeline extracts image-caption pairs from both papers and Wikipedia, with specialized training to predict stress-field statistics and crack dynamics, achieving high $R^2$ values (up to ≈0.98) and strong crack-initiation accuracy (≈0.98). The authors demonstrate an image-to-text-to-image/3D workflow, enabling visualization of novel designs, while emphasizing open-source accessibility and the potential for autonomous lab workflows, edge deployment, and educational use. Together, Cephalo advances integrated visual-text reasoning in materials science, enabling rapid interpretation of microstructures, quantitative fracture analysis, and bio-inspired material design through scalable MoE and model-merging strategies.
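
As a point of reference for the $R^2$ figures quoted above, the following is a minimal sketch of the coefficient of determination, assuming the standard definition $R^2 = 1 - SS_\mathrm{res}/SS_\mathrm{tot}$. The summary does not state the exact evaluation procedure, so the function name `r_squared` and the sample values are illustrative assumptions, not the authors' code.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination, R^2 = 1 - SS_res / SS_tot.

    Assumption: this is the standard definition; the source does not
    specify how its R^2 values were computed.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the ground truth around its mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical ground-truth vs. predicted stress statistics.
    truth = [1.0, 2.0, 3.0, 4.0]
    preds = [1.1, 1.9, 3.2, 3.8]
    print(f"R^2 = {r_squared(truth, preds):.3f}")
```

A value of 1 means the predictions match the ground truth exactly, while 0 means they do no better than always predicting the mean.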

Abstract

We present Cephalo, a series of multimodal vision large language models (V-LLMs) designed for materials science applications, integrating visual and linguistic data for enhanced understanding. A key innovation of Cephalo is its advanced dataset generation method. Trained on integrated image and text data from thousands of scientific papers and science-focused Wikipedia pages, Cephalo can interpret complex visual scenes, generate precise language descriptions, and answer queries about images effectively. The combination of a vision encoder with an autoregressive transformer supports multimodal natural language understanding, which can be coupled with other generative methods to create an image-to-text-to-3D pipeline. To develop more capable models from smaller ones, we report both mixture-of-experts methods and model merging. We examine the models in diverse use cases that incorporate biological materials, fracture and engineering analysis, protein biophysics, and bio-inspired design based on insect behavior. Generative applications include bio-inspired designs, such as pollen-inspired architected materials, as well as the synthesis of bio-inspired material microstructures from a photograph of a solar eclipse. Additional model fine-tuning with a series of molecular dynamics results demonstrates Cephalo's enhanced capabilities to accurately predict statistical features of stress and atomic energy distributions, as well as crack dynamics and damage in materials.
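
To illustrate the general idea behind building a sparse mixture-of-experts model from smaller component models, here is a minimal sketch of top-k gating in plain Python. It is an assumption-laden toy: the router logits are passed in directly, the experts are simple callables, and the names `softmax` and `sparse_moe` are hypothetical. The abstract does not describe the authors' actual routing implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_moe(x, experts, gate_logits, k=2):
    """Combine the outputs of the k highest-scoring experts.

    Assumptions (not from the source): gate_logits come from some router,
    every expert maps a vector to a vector of the same length, and the
    selected gate probabilities are renormalized to sum to 1.
    """
    probs = softmax(gate_logits)
    # Indices of the k experts with the largest gate probability.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        weight = probs[i] / norm
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out, top


if __name__ == "__main__":
    # Three toy "experts" standing in for separately trained models.
    experts = [
        lambda v: [2.0 * a for a in v],
        lambda v: [a + 1.0 for a in v],
        lambda v: [0.0 for _ in v],
    ]
    output, chosen = sparse_moe([1.0, 2.0], experts, [0.0, 0.0, -100.0], k=2)
    print(chosen, output)
```

Because only the top-k experts run for each input, the compute per token grows more slowly than the total parameter count, which is the usual motivation for sparse routing.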

Paper Structure (37 sections, 2 equations, 20 figures, 3 tables)

Introduction
Background and motivation
Outline of this paper
Results and discussion
Cephalo-8b model series
Model merging to create deeper, more expressive models: Cephalo-10b/12b model series
Cephalo-4b model series
Image to text to image and 3D modalities
Mixture of Experts modeling: Constructing Large Models from Smaller Trained Component Models
Model tuning to predict stress field statistics and crack dynamics
Prediction of stress and atomic energy distribution statistics in materials with defects
Prediction of crack dynamics
Conclusions
Summary of key contributions
Outlook and future research
...and 22 more sections

Figures (20)

Figure 1: Overall approach used to develop the multi-modal vision LLM. Panel a: The model consists of a vision encoder (left) that produces image tokens, which are combined with text tokens in the autoregressive transformer model (center) to yield flexible outputs (right). Panel b: Development of the dataset used to train the model, transforming raw data into valuable insights. The training data consists of text-only data (taken from Luu2023BioinspiredLLM:Materials and Buehler2023MechGPTModalities_fixed), newly created image-text datasets, and data from molecular dynamics modeling. In the process, raw data undergoes summarization and reasoning steps, evolving from scattered pieces of information into interconnected knowledge. This transformation enables deeper understanding and effective decision-making, highlighting the model's capability to synthesize complex data into practical, actionable insights (e.g., designing a material such that it does not fail under mechanical stress).
Figure 2: Visualization of the overall approach to generate datasets for training the vision model. Reproductions of two representative pages of the scientific article (here, Spivak2011CategoryNetworks, reproduced with permission from PLOS ONE via a Creative Commons License).
Figure 3: Histogram of the number of tokens for the image-text dataset, showing the source captions from Wikipedia (a) and the paper corpus (b). Panels c-e show the results processed with different vision-text models. Panel c shows the histogram of token counts for the processed image descriptions of the Wikipedia dataset (generated using Idefics-2). Panels d and e show the results for the paper corpus dataset, processed using Idefics-2 (d) and GPT-4o (e). The GPT-4o dataset generally yields much longer descriptions. A detailed analysis of the content shows that it provides substantially enhanced reasoning and more nuanced explanations of the image content. All tokenization was done using the Phi-3-Vision tokenizer (abdin2024phi3).
Figure 4: Histogram of the image resolutions extracted from Wikipedia (a) and the paper corpus (b), for the $X$ and $Y$ directions (left and right columns, respectively). All tokenization was done using the Phi-3-Vision tokenizer (abdin2024phi3).
Figure 5: Histogram of the number of tokens for the text-only dataset, showing questions only (a), answers only (b), and combined question-answer pairs (c). This dataset includes a corpus of knowledge extracted from scientific papers, books, and other sources in the areas of biological materials, mechanics, and materials science. All tokenization was done using the Phi-3-Vision tokenizer.
...and 15 more figures
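
The token-count histograms described in Figures 3 and 5 can be approximated with a few lines of Python. This is a minimal sketch under stated assumptions: a whitespace split stands in for the Phi-3-Vision tokenizer named in the captions, and the function name `token_length_histogram`, the bin width, and the sample captions are all hypothetical.

```python
from collections import Counter


def token_length_histogram(texts, bin_width=50, tokenize=str.split):
    """Count how many texts fall into each token-length bin.

    Assumption: `tokenize` defaults to whitespace splitting as a stand-in;
    the captions state that the Phi-3-Vision tokenizer was used, which
    would produce different (typically larger) counts.
    Returns a dict mapping the lower edge of each bin to a count.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts = Counter()
    for text in texts:
        n_tokens = len(tokenize(text))
        # Lower edge of the bin that contains n_tokens.
        counts[(n_tokens // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Hypothetical image descriptions of different lengths.
    captions = [
        "SEM image of a nacre cross-section",
        "Stress field around a crack tip in a brittle solid under tension",
        "Pollen grain",
    ]
    for lower_edge, count in token_length_histogram(captions, bin_width=5).items():
        print(f"{lower_edge:>3}-{lower_edge + 4:<3} {'#' * count}")
```

Passing a different `tokenize` callable allows the same binning to be reused with a real tokenizer.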
