A common way to visualize text corpora is to create “term spatializations” or “term landscapes”, where the text content is visualized through a 2D embedding of the most relevant terms used in the documents, similar to classic word clouds. These term spatializations are usually static snapshots of a corpus and cannot convey how the corpus content evolves over time.
The goal of this project is to create a term landscape of a large, dynamically changing corpus (e.g., a news feed or a publication collection) that shows how the content evolves over time. One way to achieve this goal is to create aggregate term spatializations, where related terms are grouped into semantically meaningful clusters, and these clusters are connected by directed links indicating the temporal transitions between them. The major challenge will be to find a suitable dynamic layout of terms that lets the user interactively zoom in space (i.e., untangling clusters of terms) and in time (i.e., restricting the visualization to a certain time span).
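To make the idea of directed links between clusters more concrete, the following minimal Python sketch links term clusters in consecutive time windows whenever they share enough terms. It is only an illustration under stated assumptions: the clusters are assumed to be given already as sets of terms per time window, Jaccard overlap is one possible similarity measure among many, and the names `jaccard`, `temporal_links`, `restrict_time`, and the `threshold` parameter are hypothetical and not part of the existing implementation.

```python
def jaccard(a, b):
    """Jaccard similarity of two term sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def temporal_links(windows, threshold=0.3):
    """Link clusters in consecutive time windows.

    `windows` is a time-ordered list; each entry is a list of term sets
    (the clusters found in that window). Returns directed links as
    ((t, i), (t + 1, j), similarity) tuples, where i and j are cluster
    indices in windows t and t + 1.
    """
    links = []
    for t in range(len(windows) - 1):
        for i, src in enumerate(windows[t]):
            for j, dst in enumerate(windows[t + 1]):
                sim = jaccard(src, dst)
                if sim >= threshold:  # assumed cut-off, to be tuned
                    links.append(((t, i), (t + 1, j), sim))
    return links


def restrict_time(windows, start, end):
    """Zoom in time: keep only windows with index in [start, end)."""
    return windows[start:end]


if __name__ == "__main__":
    # Toy data: two time windows with two clusters each.
    example = [
        [{"election", "vote", "poll"}, {"storm", "flood"}],
        [{"election", "vote", "result"}, {"flood", "rescue", "storm"}],
    ]
    for source, target, sim in temporal_links(example):
        print(source, "->", target, round(sim, 2))
```

In a sketch like this, the resulting links could be passed to the front-end as the edges of a graph whose nodes are the clusters; how clusters are actually computed and laid out is left open, as it is the core of the project.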
- Strong interest in information visualization, natural language processing, and human-computer interaction
- Solid programming skills
The visualization is suitable for different document sources. Consequently, this project can be embedded into different environments, such as:
- A web browser extension to visualize the history of information visited on the web (an implementation that manually extracts and stores information from web pages in a browser database is available).
- An online RSS reader or news portal for streaming data (an implementation that reads and processes RSS feeds is available).
- An online visualization of Twitter texts or other social media postings.
The recommended development environment is d3.js for the front end and Python for natural language processing. The project can build upon existing work implemented in Python and d3.js. However, the choice of tools can be adjusted depending on the preference of the student.