Visual Comparison of Natural Language Processing Pipelines

Type: BA/PR/DA
Persons: 1

Description

When creating visual summaries of text corpora, such as tag clouds (e.g., [Viegas et al., 2009]), document landscapes (e.g., [Wise et al., 1995; Berger et al., 2017]), or document cards [Strobelt et al., 2009], a natural language processing (NLP) pipeline is necessary to convert the raw text into structured input for the visualization. Such an NLP pipeline can have a varying number of steps (e.g., stopword removal, lemmatization, or part-of-speech selection), and each step can have several parameters. Selecting and configuring appropriate NLP pipeline steps is crucial for creating an expressive visualization. However, there is little guidance on what such a pipeline should look like.
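
To make the notion of a configurable pipeline concrete, the following is a minimal Python sketch, not a prescribed implementation. It assumes a toy stopword list and two hypothetical parameters (`remove_stopwords`, `min_length`), and uses raw term frequency as a stand-in for relevance. A real project would likely add lemmatization and part-of-speech selection via an NLP library.

```python
from collections import Counter
from itertools import product
import re

# Tiny illustrative stopword list (an assumption; real lists are much larger).
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}


def run_pipeline(text, remove_stopwords=True, min_length=1, top_n=3):
    """Run one pipeline configuration and return its top_n (term, count) pairs."""
    # Step 1: lowercase and tokenize on alphabetic runs.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Step 2 (optional): stopword removal.
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    # Step 3: drop tokens shorter than min_length characters.
    tokens = [t for t in tokens if len(t) >= min_length]
    # Relevance is approximated by raw term frequency in this sketch.
    return Counter(tokens).most_common(top_n)


def enumerate_configs():
    """Enumerate every combination of the hypothetical step parameters."""
    grid = {"remove_stopwords": [True, False], "min_length": [1, 4]}
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]
```

Calling `run_pipeline(text, **config)` for each entry of `enumerate_configs()` yields one key-term list per configuration. Even this small grid already produces four lists to compare.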

Tasks

The goal of this project is to create a visual analysis environment to interactively compare a large number of possible NLP pipeline configurations for visualizing text corpora. The resulting visualization should show a selection of the most relevant key terms extracted across all pipeline configurations. In contrast to a classic word cloud, which encodes the relevance of the N selected top key terms through font size, this word cloud should reveal the variance of a key term’s relevance across multiple NLP pipelines, as well as the number of NLP pipelines for which a key term would be picked for visualization. A master’s thesis should also include the design and implementation of interactive exploration techniques so that the user can efficiently find the best pipeline configuration for a given text document.
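
The two quantities the word cloud should encode can be sketched as follows. This is an illustrative Python example under stated assumptions: each pipeline is represented as a hypothetical mapping from term to relevance score, a term missing from a pipeline is counted with relevance 0.0, and population variance is used. None of these choices is fixed by the project description.

```python
from statistics import mean, pvariance


def summarize_terms(results, top_n=2):
    """Summarize key terms across pipeline configurations.

    results maps a configuration name to a dict of term -> relevance score.
    Returns term -> {"mean", "variance", "picked"}, where "picked" is the
    number of configurations that rank the term among their top_n terms.
    """
    all_terms = set().union(*(scores.keys() for scores in results.values()))
    summary = {}
    for term in sorted(all_terms):
        # Assumption: a term a pipeline does not produce has relevance 0.0.
        values = [scores.get(term, 0.0) for scores in results.values()]
        picked = sum(
            term in sorted(scores, key=scores.get, reverse=True)[:top_n]
            for scores in results.values()
        )
        summary[term] = {
            "mean": mean(values),
            "variance": pvariance(values),
            "picked": picked,
        }
    return summary
```

In a visualization, `mean` could drive font size, while `variance` and `picked` could be mapped to additional visual channels such as color or an attached glyph. That mapping is a design decision left to the project.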

Requirements

  • Strong interest in information visualization and natural language processing
  • Programming experience
  • Experience with web technologies (JavaScript, d3, …) and/or Python is advantageous

Environment

This work can build upon existing NLP projects, available as Python (recommended) or JavaScript code. The recommended front end for the visual analysis environment is d3.js or a Python visualization library such as bokeh, depending on the student's preference.

Contact

For more information, please contact Manuela Waldner (waldner@cg.tuwien.ac.at).