Visual Active Learning for News Stream Classification

Abstract

In many domains, the sheer quantity of text documents that have to be parsed increases daily. To keep up with this continuous text stream, a considerable amount of time has to be invested. We developed a classification interface for text streams that learns user-specific topics from the user’s labeling process and partitions the incoming data into these topics. Current approaches that try to derive content categorization from a vast number of unstructured text documents use pre-trained learning models to perform text classification. These models assign predefined categories to the text according to its content. Depending on the use case, a user’s interests might not coincide with the given categories. Moreover, such a model cannot adapt to changing terminology that was not present during training. In addition, users often do not trust pre-trained models because these models are a black box to them. To address these problems, our application lets users define a classification problem and train a learning model through interaction with a Star Coordinates visualization. The approach that makes this interaction efficient is a variant of active learning, which rests on the premise that a learning model can achieve greater accuracy with fewer labeled training instances if the user purposefully selects the data from which it learns. We adapted this strategy for text stream classification by visualizing the topic affiliation probabilities of the learning model and providing novel interaction tools to enhance the model’s performance iteratively. By simulating different selection strategies common in active learning, we found that our visual selection strategies correspond closely to the classic active learning selection strategies. Further, in our preliminary user study, users performed on par with the best simulated selection strategies. From our evaluation, we conclude that incorporating information visualization into the active learning process is beneficial.
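
The two ideas the abstract combines can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the example probabilities, and the choice of least-confidence sampling (one common active learning selection strategy) are assumptions made purely for illustration. The sketch projects each document's topic probabilities to 2D with the standard Star Coordinates mapping, a sum of evenly spaced axis vectors scaled by the attribute values, and then picks the document the model is least sure about as the next labeling candidate.

```python
import math


def star_coordinates(probs):
    """Project a vector of topic probabilities to 2D with Star Coordinates.

    Axis i is a unit vector at angle 2*pi*i/k (k = number of topics); the
    projected point is the sum of the axis vectors, each scaled by the
    corresponding probability.
    """
    k = len(probs)
    x = sum(p * math.cos(2 * math.pi * i / k) for i, p in enumerate(probs))
    y = sum(p * math.sin(2 * math.pi * i / k) for i, p in enumerate(probs))
    return x, y


def least_confident(prob_rows):
    """Return the index of the row whose highest topic probability is lowest."""
    return min(range(len(prob_rows)), key=lambda i: max(prob_rows[i]))


# Hypothetical topic probabilities for three unlabeled documents and three topics.
docs = [
    [0.90, 0.05, 0.05],  # confidently topic 0
    [0.34, 0.33, 0.33],  # nearly uniform: the model is unsure
    [0.10, 0.80, 0.10],  # confidently topic 1
]

query = least_confident(docs)                  # candidate to label next
points = [star_coordinates(p) for p in docs]   # 2D positions in the plot
```

With evenly spaced axes, a document with near-uniform topic probabilities lands close to the center of the plot, while confidently classified documents are pulled toward their topic's axis. Under these assumptions, picking points near the center by eye would select much the same documents as least-confidence sampling.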

BibTeX

@mastersthesis{mazurek-2018-vac,
  title =      "Visual Active Learning for News Stream Classification",
  author =     "Michael Mazurek",
  year =       "2019",
  abstract =   "In many domains, the sheer quantity of text documents that
               have to be parsed increases daily. To keep up with this
               continuous text stream, a considerable amount of time has to
               be invested. We developed a classification interface for
               text streams that learns user-specific topics from the
               user’s labeling process and partitions the incoming data
               into these topics. Current approaches that try to derive
               content categorization from a vast number of unstructured
               text documents use pre-trained learning models to perform
               text classification. These models assign predefined
               categories to the text according to its content. Depending
               on the use case, a user’s interests might not coincide
               with the given categories. Moreover, such a model cannot
               adapt to changing terminology that was not present during
               training. In addition, users often do not trust pre-trained
               models because these models are a black box to them. To
               address these problems, our application lets users define a
               classification problem and train a learning model through
               interaction with a Star Coordinates visualization. The
               approach that makes this interaction efficient is a variant
               of active learning, which rests on the premise that a
               learning model can achieve greater accuracy with fewer
               labeled training instances if the user purposefully selects
               the data from which it learns. We adapted this strategy for
               text stream classification by visualizing the topic
               affiliation probabilities of the learning model and
               providing novel interaction tools to enhance the model’s
               performance iteratively. By simulating different selection
               strategies common in active learning, we found that our
               visual selection strategies correspond closely to the
               classic active learning selection strategies. Further, in
               our preliminary user study, users performed on par with the
               best simulated selection strategies. From our evaluation, we
               conclude that incorporating information visualization into
               the active learning process is beneficial.",
  month =      oct,
  address =    "Favoritenstrasse 9-11/E193-02, A-1040 Vienna, Austria",
  school =     "Research Unit of Computer Graphics, Institute of Visual
               Computing and Human-Centered Technology, Faculty of
               Informatics, TU Wien",
  URL =        "/research/publications/2019/mazurek-2018-vac/",
}