ArchSeek: Retrieving Architectural Case Studies Using Vision-Language Models

Danrui Li, Yichao Shi, Yaluo Wang, Ziying Shi, Mubbasir Kapadia

TL;DR

ArchSeek tackles architectural case search by fusing visual and textual data through vision-language models and cross-modal embeddings, enabling text and image queries with in-session recommendations. It introduces a survey-informed database and three user modes (text, image, and interactive recommendations), using cosine similarity between embeddings and a Reciprocal Rank Fusion approach to combine modalities. Evaluation includes a 77-query quantitative study with ablations and a four-task user study, showing superior retrieval performance and positive usability feedback while highlighting diversity and interface improvements as future work. The approach promises more efficient, personalized precedent discovery in architecture and could generalize to other visually driven design domains.
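The two retrieval ingredients named above, cosine similarity between embeddings and Reciprocal Rank Fusion (RRF), can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-dimensional toy embeddings, and the case identifiers are hypothetical, and the constant k = 60 is a commonly used RRF default rather than a value taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, embeddings):
    """Return case ids ordered from most to least similar to the query embedding."""
    scores = {cid: cosine_similarity(query, emb) for cid, emb in embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings; each list adds 1 / (k + rank) to an item's score."""
    fused = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            fused[cid] = fused.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical toy data: one embedding per design case and per modality.
text_embeddings = {"case_a": [1.0, 0.0], "case_b": [0.6, 0.8], "case_c": [0.0, 1.0]}
image_embeddings = {"case_a": [0.0, 1.0], "case_b": [1.0, 0.0], "case_c": [0.6, 0.8]}
query = [1.0, 0.0]

text_ranking = rank_by_similarity(query, text_embeddings)
image_ranking = rank_by_similarity(query, image_embeddings)
fused_ranking = reciprocal_rank_fusion([text_ranking, image_ranking])
print(text_ranking, image_ranking, fused_ranking)
```

In this toy example, case_b ranks second by text and first by image, so the fusion places it ahead of case_a, which ranks first by text but last by image. Because RRF works on ranks rather than raw scores, similarity values from different modalities do not need to be on the same scale before they are combined.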

Abstract

Efficiently searching for relevant case studies is critical in architectural design, as designers rely on precedent examples to guide or inspire their ongoing projects. However, traditional text-based search tools struggle to capture the inherently visual and complex nature of architectural knowledge, often leading to time-consuming and imprecise exploration. This paper introduces ArchSeek, an innovative case study search system with recommendation capability, tailored for architecture design professionals. Powered by the visual understanding capabilities from vision-language models and cross-modal embeddings, it enables text and image queries with fine-grained control, and interaction-based design case recommendations. It offers architects a more efficient, personalized way to discover design inspirations, with potential applications across other visually driven design fields. The source code is available at https://github.com/danruili/ArchSeek.

Paper Structure

This paper contains 16 sections, 4 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: The framework of ArchSeek, showing the database construction stage and the query stage for a single design case. In the database construction stage (left), all media files of the design case are augmented by a vision-language model, which generates architecture design reviews from various aspects. Their embeddings are then generated for later use. In the query and recommendation stage (right), all three types of user interaction are converted into embeddings and compared with the embeddings of the database items.
  • Figure 2: User attention distribution on different topics of a design case when using architecture design case recommender systems.
  • Figure 3: Using the vision-language model to extract analysis text from design case images. (left) The text prompt used when calling the model. (right) A snippet of an example output.
  • Figure 4: The user interface of ArchSeek in Image Query mode. The interface displays image analysis and adjustable weight parameters via slider bars, followed by the retrieved design cases. The thumbnails of the design cases are partially masked for Fair Use Policy compliance.
  • Figure 5: Top five retrieved design cases for queries from various perspectives. Each retrieved design case comes with a similarity score (shown in a light gray rounded rectangle below the title) and the most closely related description and image in the database. The thumbnails of the design cases are partially masked for Fair Use Policy compliance.
  • ...and 3 more figures