Interactive Sampling for Class Discovery in Unstructured Data | TU Wien

Information

Publication Type: Master Thesis
Workgroup(s)/Project(s):
- Joint Human-Machine Data Exploration
Date: 2026
Date (End): June 2026
TU Wien Library: AC17909432
Second Supervisor: Stefan Neumann
Open Access: yes
First Supervisor: Manuela Waldner
Pages: 106
Keywords: Sampling, Human-AI Collaboration, Visual Analysis, Unstructured Data

Abstract

Analyzing unstructured data such as images is a major challenge for exploratory data analysis because of its high dimensionality. The data must first be transformed into embeddings, a lower-dimensional representation in which the data are more closely grouped. Sampling is then essential to make these datasets understandable for humans through visualization. This work explores how interactive systems can quickly provide representative and interpretable samples, even from large and unbalanced image datasets. Standard methods such as random sampling can reach their limits and often fail to capture rare classes, which biases interpretation. Within a standardized interaction protocol, various data-driven strategies (e.g., farthest sampling and Dα sampling) and model-aware strategies (e.g., min-margin and disagreement) are compared with random sampling. The goal is to investigate which strategies offer the best balance between fast class discovery, high model accuracy, and low latency within a defined interaction budget. The results show that data-driven methods are strong in the early stages of the iterative process, as they explore the data space and lead to faster discovery of new classes. In contrast, model-aware methods offer advantages in later stages, as they refine the decision boundaries and efficiently increase accuracy as labeled data become available. The superiority of targeted sampling over random sampling is particularly evident with unbalanced datasets. Furthermore, the work shows that GPU acceleration reduces latency in the iterative cycle, keeping each selection step below the critical threshold of one second and enabling smooth interaction.
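
To make the two strategy families named in the abstract concrete, the following is a minimal sketch and not the thesis implementation. It assumes that "farthest sampling" refers to greedy farthest-point sampling over embedding vectors and that "min-margin" selects the items with the smallest gap between the two highest predicted class probabilities. The function names, parameters, and the pure-Python CPU formulation are illustrative assumptions; the thesis reports GPU acceleration, which is not reproduced here.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Data-driven strategy (assumed reading of "farthest sampling").

    Greedily picks k indices so that each new pick is the point farthest
    from everything selected so far. `points` is a list of equal-length
    coordinate tuples (e.g. embeddings); `start` is the first pick.
    """
    if not points or k <= 0:
        return []
    k = min(k, len(points))
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # The next pick maximizes the distance to the current selection.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


def min_margin_sampling(probabilities, k):
    """Model-aware strategy (assumed reading of "min-margin").

    Returns the indices of the k items whose two highest class
    probabilities are closest, i.e. where the model is least decided.
    Each row of `probabilities` needs at least two classes.
    """
    def margin(probs):
        top = sorted(probs, reverse=True)
        return top[0] - top[1]

    order = sorted(range(len(probabilities)),
                   key=lambda i: margin(probabilities[i]))
    return order[:k]


if __name__ == "__main__":
    embeddings = [(0, 0), (0.1, 0), (10, 0), (5, 0), (0, 0.2)]
    print(farthest_point_sampling(embeddings, 3))

    predicted = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25],
                 [0.5, 0.5, 0.0], [0.6, 0.3, 0.1]]
    print(min_margin_sampling(predicted, 2))
```

The first function needs no labels or model, which matches the abstract's observation that data-driven strategies help early, when few or no classes are known. The second depends on class probabilities from a trained model and therefore only becomes useful once some labeled data exist.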

BibTeX

@mastersthesis{fitz-2026-isc,
  title =      "Interactive Sampling for Class Discovery in Unstructured
               Data",
  author =     "Lukas Fitz",
  year =       "2026",
  abstract =   "Analyzing unstructured data such as images is a major
               challenge for exploratory data analysis because of its high
               dimensionality. The data must first be transformed into
               embeddings, a lower-dimensional representation in which the
               data are more closely grouped. Sampling is then essential to
               make these datasets understandable for humans through
               visualization. This work explores how interactive systems
               can quickly provide representative and interpretable
               samples, even from large and unbalanced image datasets.
               Standard methods such as random sampling can reach their
               limits and often fail to capture rare classes, which biases
               interpretation. Within a standardized interaction protocol,
               various data-driven strategies (e.g., farthest sampling and
               Dα sampling) and model-aware strategies (e.g., min-margin
               and disagreement) are compared with random sampling. The
               goal is to investigate which strategies offer the best
               balance between fast class discovery, high model accuracy,
               and low latency within a defined interaction budget. The
               results show that data-driven methods are strong in the
               early stages of the iterative process, as they explore the
               data space and lead to faster discovery of new classes. In
               contrast, model-aware methods offer advantages in later
               stages, as they refine the decision boundaries and
               efficiently increase accuracy as labeled data become
               available. The superiority of targeted sampling over random
               sampling is particularly evident with unbalanced datasets.
               Furthermore, the work shows that GPU acceleration reduces
               latency in the iterative cycle, keeping each selection step
               below the critical threshold of one second and enabling
               smooth interaction.",
  pages =      "106",
  address =    "Favoritenstrasse 9-11/E193-02, A-1040 Vienna, Austria",
  school =     "Research Unit of Computer Graphics, Institute of Visual
               Computing and Human-Centered Technology, Faculty of
               Informatics, TU Wien",
  keywords =   "Sampling, Human-AI Collaboration, Visual Analysis,
               Unstructured Data",
  URL =        "https://www.cg.tuwien.ac.at/research/publications/2026/fitz-2026-isc/",
}