Speaker: Dominik Wolf
Abstract
This thesis investigates semi-automatic visual concept discovery in unlabeled image collections. Exploratory analysis of large image collections is challenging because relevant concepts are often not known in advance and the data typically lack explicit labels or taxonomies. While vision-language models can represent images and text in a shared embedding space, and clustering can reveal groups of visually similar images, the resulting clusters still need to be interpreted and refined by users.
To address this problem, the thesis proposes SaCoDi, a semi-automatic concept discovery workflow that combines multimodal image embeddings, clustering, cluster labeling, and interactive refinement. Images are embedded using a vision-language model and grouped into clusters. The resulting clusters are labeled using three alternative strategies: a WordNet-based strategy, a multimodal large language model strategy based on image samples, and a hybrid strategy that combines WordNet-derived candidate labels with language-model-based abstraction. Users can further refine the concept structure through split and merge operations, which trigger relabeling of the modified clusters.
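The split and merge refinement described above can be illustrated with a minimal sketch. It is not the SaCoDi implementation: the class ConceptStructure, the function majority_label, and the TOY_TAGS table are hypothetical names introduced only for this example, and the majority vote over toy tags merely stands in for the WordNet-based, multimodal-language-model, and hybrid labeling strategies. The sketch assumes that cluster assignments already exist (for example, from clustering image embeddings) and shows only that each split or merge relabels the clusters it creates.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical toy annotations; a real system would derive label evidence
# from WordNet or a multimodal language model instead.
TOY_TAGS = {"a": "cat", "b": "cat", "c": "cat", "d": "dog", "e": "dog", "f": "car"}


def majority_label(members: List[str]) -> str:
    """Stand-in labeler: return the most frequent toy tag among the members."""
    return Counter(TOY_TAGS[m] for m in members).most_common(1)[0][0]


class ConceptStructure:
    """Clusters of image ids with labels that are recomputed after each edit."""

    def __init__(self, assignments: Dict[str, int],
                 labeler: Callable[[List[str]], str]) -> None:
        self.labeler = labeler
        self.clusters: Dict[int, List[str]] = {}
        for image_id, cid in assignments.items():
            self.clusters.setdefault(cid, []).append(image_id)
        self.labels = {cid: labeler(m) for cid, m in self.clusters.items()}
        self._next_id = max(self.clusters, default=-1) + 1

    def _add(self, members: List[str]) -> int:
        # Every newly created cluster is labeled immediately.
        cid = self._next_id
        self._next_id += 1
        self.clusters[cid] = members
        self.labels[cid] = self.labeler(members)
        return cid

    def _remove(self, cid: int) -> None:
        del self.clusters[cid]
        del self.labels[cid]

    def merge(self, a: int, b: int) -> int:
        """Replace clusters a and b with one relabeled cluster."""
        if a == b:
            raise ValueError("cannot merge a cluster with itself")
        members = self.clusters[a] + self.clusters[b]
        self._remove(a)
        self._remove(b)
        return self._add(members)

    def split(self, cid: int, part: Iterable[str]) -> Tuple[int, int]:
        """Move the ids in `part` out of cluster cid; relabel both halves."""
        chosen = set(part)
        members = self.clusters[cid]
        first = [m for m in members if m in chosen]
        second = [m for m in members if m not in chosen]
        if not first or not second:
            raise ValueError("split must produce two non-empty clusters")
        self._remove(cid)
        return self._add(first), self._add(second)


# Usage: cluster 1 mixes dogs with a car, so the user splits off the car
# and then merges the remaining dogs into the cat cluster.
structure = ConceptStructure(
    {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}, majority_label
)
initial_labels = dict(structure.labels)
small, rest = structure.split(1, ["f"])
merged = structure.merge(0, rest)
```

Relabeling only the clusters touched by an operation keeps the labels of untouched clusters stable, which matches the description that split and merge trigger relabeling of the modified clusters.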