Word embeddings, such as those produced by the word2vec neural network, are trained on large amounts of text to infer the meaning of a word from its co-occurring words. They are based on the distributional hypothesis: words that occur in similar contexts tend to have similar meanings. Establishing meaningful semantic similarities therefore requires a large training corpus. One useful application is information retrieval, where semantically similar documents can be found for a user query.
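As a minimal illustration of how trained embeddings encode similarity, the sketch below compares toy vectors with cosine similarity, the measure commonly used with word2vec-style embeddings. The vectors and words are made up for illustration; real vectors would come from a trained model.

```python
import numpy as np

# Toy embedding vectors; in practice these would come from a trained
# word2vec model (the values below are invented for illustration).
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words that co-occur in similar contexts end up with similar vectors,
# so their cosine similarity is high.
sim_related = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_unrelated = cosine_similarity(embeddings["king"], embeddings["apple"])
print(sim_related, sim_unrelated)
```

With these toy vectors, the "king"/"queen" pair scores noticeably higher than "king"/"apple", mirroring what a trained model produces at scale.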
Word embeddings can be projected to two dimensions with a dimensionality reduction technique and visualized as 2D scatterplots (see, for instance, the projection of a trained word2vec model in TensorFlow’s Embedding Projector), making semantically similar groups of words visually apparent.
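A simple choice of dimensionality reduction is PCA, which can be computed with a singular value decomposition; the Embedding Projector also offers t-SNE and UMAP. The sketch below, using random data in place of trained vectors, projects points onto their first two principal components.

```python
import numpy as np

# Random data stands in for trained word vectors: 6 points in 5 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))

def pca_2d(X: np.ndarray) -> np.ndarray:
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)                          # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                             # 2D coordinates

points = pca_2d(X)
print(points.shape)  # (6, 2)
```

Each row of `points` is a 2D coordinate that could be drawn in a scatterplot, with nearby points corresponding to similar high-dimensional vectors.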
In this project, this concept should be reversed: instead of visualizing the training results of a word embedding in 2D, users manually express semantic similarity by moving text fragments close to each other on a 2D plane. This 2D proximity information should then be used to learn a distance function. Distance metrics are needed for clustering and classification, as well as for recommending similar content. A personalized distance metric essentially expresses how semantically similar words are for the user who trained it.
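One possible formulation of this learning step, sketched below under assumed inputs, is a diagonally weighted distance: each text fragment has a feature vector (e.g. an averaged word embedding) and a user-chosen 2D position, and per-dimension weights are fitted by gradient descent so that the weighted squared feature distance matches the squared 2D distance. The feature vectors, positions, and hyperparameters are all invented for illustration; the actual project may use a different metric-learning approach.

```python
import numpy as np

# Hypothetical inputs: a feature vector per text fragment and the 2D
# position the user dragged that fragment to (all values invented).
features = np.array([[1.0, 0.0, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.0, 0.0, 1.0]])
positions = np.array([[0.0, 0.0],
                      [0.1, 0.0],
                      [2.0, 2.0]])

def learn_weights(features, positions, lr=0.01, epochs=500):
    """Fit nonnegative per-dimension weights w so that, for each pair
    (i, j), w @ (f_i - f_j)**2 approximates the squared 2D distance."""
    n, d = features.shape
    w = np.ones(d)
    for _ in range(epochs):
        for i in range(n):
            for j in range(i + 1, n):
                diff2 = (features[i] - features[j]) ** 2
                target = np.sum((positions[i] - positions[j]) ** 2)
                grad = 2 * (w @ diff2 - target) * diff2
                w = np.maximum(w - lr * grad, 0.0)  # keep weights >= 0
    return w

w = learn_weights(features, positions)

def personalized_distance(a, b):
    """Distance under the learned, user-specific metric."""
    return float(np.sqrt(w @ (a - b) ** 2))
```

After training, fragments the user placed close together are close under `personalized_distance`, so the metric can drive clustering or recommendation of similar content.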
This project should build upon an existing software infrastructure that allows users to mark text fragments on web pages, captures the selected text, and places it on a freely pannable and zoomable 2D plane (see figure). This infrastructure is implemented as a Chrome extension with a d3.js front-end.