Visual Readability Analysis Application

Group:

SuperUberMegaGroup

Members:

Cerman Martin, 0625040, mcerman@prip.tuwien.ac.at

Original Paper:

Visual Readability Analysis: How to Make Your Writings Easier to Read [1]

Screenshot:

Source Code Location:

Source

Executable Location:

Executable

Abstract

This documentation describes the functionality of the application we developed for the exercise part of the lecture "Visualization 2". Section 1 describes the aim of the application. Section 2 is split into three parts, one for each of the views provided by our application. Section 3 explains the statistical analysis system behind our visualization tool, and Section 4 describes the technology used.

1. Application Purpose and Description

Our application is an almost complete 1:1 implementation of the application described in the paper published by Oelke et al. [1]. The purpose of this tool is to let a writer revise a text from a statistical and structural viewpoint. This includes aspects like contextual coherency, avoiding unknown vocabulary or difficult grammatical structures, and misspellings. After the text has been analyzed, the tool provides detailed feedback about individual passages and sentences.

As the authors mention in their paper, the statistical properties of a text that are measured include word length, number of words in a sentence / paragraph, use of easy or hard words, or number of stop words. In this context, stop words are words that carry no information of their own. Concrete examples are "and", "to", "then" and many others. Stop words are very often filtered out by search engines such as Google, and also in textual searches in a database. The second type of properties that can be measured in a text are the structural properties. These include, for example, the depth of the parse tree of a sentence, the number of sentences in passive voice, and others. Because the structural properties require an extensive analysis of the sentences and are therefore out of the scope of this exercise (the main purpose is the visualization aspect), we only implemented the statistical properties. After an evaluation of the different available statistical and structural properties, the authors decided to use 5 of these properties in their application. The properties used are:

Word Length
Sentence Length
Nominal Forms
Vocabulary Complexity
Sentence Structure Complexity

As can be seen, only the first two properties are actually statistical properties. Before implementing our analysis module, we prepared the UI for all five named properties. Later, however, we realized that only two of them could be implemented effectively in the available time. We therefore decided to implement a third analysis property that is based on a vocabulary, and to leave the remaining two properties as stubs. The third implemented statistic is the number of stop words, and the remaining two properties contain purely randomized values.

To implement the proposed application effectively, we needed to represent the text in a hierarchical structure, namely a tree. The whole text is first divided into paragraphs, these are further divided into sentences, and these are in turn divided into words and stored in a list. The sentences and paragraphs are modelled as individual classes (entities) and store the computed analytical properties. After an analysis has been performed, the computed properties are visualized in three different views, which are described in detail in Section 2.

Our rating scheme uses a normalized scale in the interval [0, 1]. Here, 0.5 means that a measured property is average, 0.0 is the best achievable value, and 1.0 is the worst possible value. For example, when looking at the word length in a paragraph, the scale is normalized between the mean word length minus three times the standard deviation (0.0) and the mean word length plus three times the standard deviation (1.0) in the corresponding paragraph. The sentence length is measured analogously to the word length.

The third measure, the number of stop words, is meant to capture the information content of a sentence or a paragraph. For this, we divide the number of non-stop words in a sentence / paragraph by the total number of words in that sentence / paragraph. The stop words are stored in a dictionary defined by the class StopWordDictionary. This class contains a HashSet of these words and checks during the analysis whether each word of the sentence is included in the HashSet. We specifically chose a HashSet as the data structure for the stop words because its "contains()" function runs in O(1) time on average.
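The paragraph / sentence / word hierarchy described in this section can be sketched as follows. This is only an illustration under stated assumptions, not the application's actual code: the names TextTree, ParagraphNode and SentenceNode are hypothetical, and the splitting rules (blank lines between paragraphs, sentences ending in ".", "!" or "?") are a guess at what a very simple splitter might do.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; all class and method names here are hypothetical.
final class TextTree {

    /** A sentence stores its words in a list. */
    static final class SentenceNode {
        final List<String> words = new ArrayList<>();
    }

    /** A paragraph stores its sentences in a list. */
    static final class ParagraphNode {
        final List<SentenceNode> sentences = new ArrayList<>();
    }

    /**
     * Assumed splitting rules: paragraphs are separated by blank lines,
     * sentences end with '.', '!' or '?', words are separated by whitespace.
     */
    static List<ParagraphNode> parse(String text) {
        List<ParagraphNode> paragraphs = new ArrayList<>();
        for (String rawParagraph : text.split("\\R\\s*\\R")) {
            ParagraphNode paragraph = new ParagraphNode();
            for (String rawSentence : rawParagraph.split("(?<=[.!?])\\s+")) {
                SentenceNode sentence = new SentenceNode();
                for (String token : rawSentence.trim().split("\\s+")) {
                    // Strip surrounding punctuation so "read." is stored as "read".
                    String word = token.replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
                    if (!word.isEmpty()) {
                        sentence.words.add(word);
                    }
                }
                if (!sentence.words.isEmpty()) {
                    paragraph.sentences.add(sentence);
                }
            }
            if (!paragraph.sentences.isEmpty()) {
                paragraphs.add(paragraph);
            }
        }
        return paragraphs;
    }
}
```

In a fuller version, each node would also hold the ratings computed for it, so that the views can read them directly from the tree.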

The main window has a very simple UI. The Analysis View on the left side visualizes the results of the analysis. The Text View in the middle of the main window contains the main text to be analyzed. We only used a very simple algorithm to separate the individual sentences and paragraphs of the text, therefore a warning appears in the Text View if the user attempts to analyze a syntactically wrong sentence. The Analysis Button on the right side starts the analysis after a text has been written or pasted into the Text View. The UI of the main window can be seen in the picture below.

After a text has been written in the Text View, the Analysis View is updated by default and shows a miniature version of the Text View to give the user a better idea of the structure of the text. Dragging the separating pane between the Analysis View and the Text View magnifies or demagnifies the miniature. Examples of this process can be seen in the two pictures below.

2. Views

In this Section, we provide a detailed overview of the three available views for visualizing the results of the analysis. The first view provides a demagnified thumbnail in which the text is highlighted in colors that correspond to the analysis results and our rating scheme. The second view is a detailed view of all five measured properties of each paragraph or sentence, displayed in a subdivided rectangle next to the text. The last view shows a small "pixelized" version of the text, where each rectangle corresponds to a paragraph or a sentence.

2.1 Thumbnail

When the user selects the Thumbnail View in the radio group that appears on the right side of the main window after the Analysis Button has been pressed, a smaller version of the Text View is shown. In this smaller version, the text is highlighted in the color that corresponds to the computed text properties - red being 1.0 and blue 0.0. An additional image of this rating scheme is shown to the left of the Analysis View for better user feedback. The user can also choose whether paragraphs or sentences should be analyzed by clicking on the appropriate radio button on the right side of the main window. Examples of this Thumbnail View are given in the two pictures below, the first showing a paragraph view and the second a sentence view.
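The mapping from a rating to a highlight color can be sketched as below. The text only fixes the two end points (blue at 0.0, red at 1.0); the linear interpolation in RGB space and the name RatingColor are assumptions made for this example, and the application's actual color scale may look different in between.

```java
// Illustrative sketch; RatingColor and toRgb are hypothetical names.
final class RatingColor {

    /**
     * Maps a rating in [0, 1] to a packed 0xRRGGBB value.
     * 0.0 gives pure blue and 1.0 gives pure red; values in between are
     * interpolated linearly, which is an assumption of this sketch.
     */
    static int toRgb(double rating) {
        // Clamp ratings that fall outside the normalized interval.
        double t = Math.max(0.0, Math.min(1.0, rating));
        int red = (int) Math.round(255 * t);
        int blue = (int) Math.round(255 * (1.0 - t));
        return (red << 16) | blue;
    }
}
```

With this mapping, an average rating of 0.5 yields an even mix of red and blue.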

The user can also resize the Thumbnail View by dragging the pane between the Analysis View and the Text View to the right or left. This functionality is always enabled, in each of the possible analysis views. Additionally, the user can choose which analyzed property is shown in the Thumbnail View by clicking on the corresponding radio button on the right side of the main window.

The Thumbnail View has been implemented by using a trick in the UI library we used (Qt Jambi). Our application implements two different text edit fields - one visible in the middle of the screen (Text View) and one invisible, which serves as a kind of back buffer. When the user is viewing the Thumbnail View, the contents of the Text View are copied to the invisible text edit field, the text is highlighted and a screenshot of this text edit field is taken. Afterwards, this screenshot is simply shown in the Analysis View.

2.2 Detail View

The Detail View is a more concise visualization of the analyzed properties than the Thumbnail View. Here, next to the beginning of each paragraph or sentence, a large rectangle is shown that is divided into five equal parts, one for each of the properties. Each of the subrectangles is colored based on the value of the corresponding measured property. The user can again enlarge or reduce the size of the Analysis View and thus also change the size of the rectangle showing the results. Examples of the Detail View for a paragraph and a sentence view are shown in the two pictures below.

We also implemented a small window that is shown after the user clicks on one of the subrectangles. This window states which measured property corresponds to this rectangle and the value of the measurement. Additionally, a black line is drawn on top of the rating scheme image to give the user an idea of where the measured value lies in the interval of possible values. Examples of this functionality are depicted in the two pictures below.

The implementation of this type of view is straightforward. We determine the width of the Analysis View and compute the resulting width of the rectangles in the Detail View. Afterwards, we split this rectangle into five parts and draw five colored rectangles inside it with the help of the functionality provided by the GUI library. When the Analysis View is resized, the sizes of the rectangles are recomputed and the rectangles are redrawn.
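The geometry step of the Detail View, splitting one rectangle into five parts, could look like the sketch below. It is independent of the GUI library and only computes coordinates. The name DetailLayout and the decision to place the parts side by side along the width are assumptions; the text does not say in which direction the rectangle is divided.

```java
// Illustrative sketch; DetailLayout and split are hypothetical names.
final class DetailLayout {

    /**
     * Splits the rectangle (x, y, width, height) into `parts` cells placed
     * side by side. Each cell is returned as {x, y, width, height}.
     * Cell borders are computed from the total width, so the cells always
     * cover the full rectangle even when width is not divisible by parts.
     */
    static int[][] split(int x, int y, int width, int height, int parts) {
        int[][] cells = new int[parts][];
        for (int i = 0; i < parts; i++) {
            int left = x + (int) ((long) width * i / parts);
            int right = x + (int) ((long) width * (i + 1) / parts);
            cells[i] = new int[] {left, y, right - left, height};
        }
        return cells;
    }
}
```

Calling split again with the new width whenever the Analysis View is resized gives the recomputed rectangles that are then redrawn.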

2.3 Fingerprint

The last type of view that is implemented is the Fingerprint View. This visualization technique is the most concise one and consists of five large rectangles. Each of these rectangles corresponds to one of the measured properties. Each rectangle in turn contains a set of small squares that represent the paragraphs or sentences in the text. As always, the color visualizes the value of each measurement. This type of view should give the user a general impression of the text. We additionally implemented functionality that automatically arranges the main rectangles in the Fingerprint View. When the width of the Analysis View is above 300px, the rectangles are grouped so that two rectangles are always next to each other. When the width of the Analysis View is below 300px, all of the rectangles are drawn above each other. Examples showing Fingerprints of paragraphs and sentences in both forms are shown in the four pictures below.
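The width-dependent arrangement of the five main rectangles can be sketched as follows. Only the 300px threshold comes from the text. The name FingerprintLayout, the use of square blocks, and the treatment of a width of exactly 300px as the stacked case are assumptions of this example.

```java
// Illustrative sketch; FingerprintLayout and its methods are hypothetical names.
final class FingerprintLayout {

    /** Threshold from the text: wider views use two columns. */
    static final int TWO_COLUMN_THRESHOLD_PX = 300;

    /** Returns 2 for views wider than the threshold, otherwise 1 (assumed for exactly 300px). */
    static int columnsFor(int viewWidth) {
        return viewWidth > TWO_COLUMN_THRESHOLD_PX ? 2 : 1;
    }

    /**
     * Places `blockCount` square blocks row by row.
     * Each block is returned as {x, y, side}; square blocks are an assumption.
     */
    static int[][] place(int viewWidth, int blockCount) {
        int columns = columnsFor(viewWidth);
        int side = viewWidth / columns;
        int[][] blocks = new int[blockCount][];
        for (int i = 0; i < blockCount; i++) {
            int column = i % columns;
            int row = i / columns;
            blocks[i] = new int[] {column * side, row * side, side};
        }
        return blocks;
    }
}
```

For a 400px wide view and five blocks, this gives two rows of two blocks and a third row with a single block. For a 250px wide view, all five blocks are stacked in one column.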

3. Statistical Analysis System

As already mentioned in Section 1, we implemented only two statistical analysis methods and one vocabulary-based analysis method. The remaining two measurements are randomly generated floating point numbers in the interval [0, 1]. In this Section, we briefly describe the implemented analysis methods.

3.1 Word Length

The first implemented statistical measurement is the word length. This measurement is computed in two runs - the first run computes the average word length in the whole text, and the second run updates the measures. This functionality is partly implemented in the TextAnalyzer class, and partly in the Sentence and Paragraph classes. While the whole text is being broken down into tokens (words), the average word length is computed as well. Afterwards, the application iterates through all the sentences and paragraphs and computes the final measurement. Here, the standard deviation of the word length in each sentence / paragraph is estimated and then used to update the measures. If the word length lies more than three standard deviations above or below the mean, the rating is set to 1.0 or 0.0 respectively. Otherwise, the word length is scaled linearly.
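The scaling described above, linear between the mean minus and plus three standard deviations with clamping outside that range, can be written down as a short sketch. The name RatingScale, the use of the population standard deviation, and the choice to return 0.5 when the standard deviation is zero are assumptions of this example; the text does not state how these cases are handled.

```java
// Illustrative sketch; RatingScale and its methods are hypothetical names.
final class RatingScale {

    /** Arithmetic mean of the given values. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Population standard deviation around the given mean (assumed variant). */
    static double stdDev(double[] values, double mean) {
        double squared = 0.0;
        for (double v : values) {
            squared += (v - mean) * (v - mean);
        }
        return Math.sqrt(squared / values.length);
    }

    /**
     * Maps value to [0, 1]: mean - 3*stdDev gives 0.0, mean gives 0.5,
     * mean + 3*stdDev gives 1.0. Values outside that range are clamped.
     */
    static double rate(double value, double mean, double stdDev) {
        if (stdDev == 0.0) {
            return 0.5; // assumption: no spread means "average"
        }
        double scaled = (value - (mean - 3.0 * stdDev)) / (6.0 * stdDev);
        return Math.max(0.0, Math.min(1.0, scaled));
    }
}
```

For a mean of 5 and a standard deviation of 1, a value of 6.5 is rated 0.75, and any value of 8 or more is rated 1.0.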

3.2 Word Count

The second implemented statistical measurement is the word count in a sentence or paragraph. Like the word length, this measurement is computed in two runs. The results are again computed by estimating the mean and variance of the word count. We first tried to define the rating scheme between the minimal and maximal word length in a sentence / paragraph, however the final results turned out to be too homogeneous (always very near to 0.5). This is why we chose a different measure - one that models the word length and count using a normal distribution.

3.3 Number of stop words

The third implemented method is based on a vocabulary. In the beginning we intended to implement a domain-specific vocabulary to estimate the vocabulary difficulty, however we settled instead on a combined measurement, which also partly analyzes the semantic value of the text. In this context, if a sentence / paragraph contains many stop words, then the semantic value of this sentence / paragraph is low. As mentioned in Section 1, stop words are a specific set of words that are ignored in search engines like Google, or in database searches. We modelled this vocabulary with the class StopWordDictionary, defined the words manually and put them into a HashSet. As already described, we used this data structure because the computational complexity of the "contains()" method is O(1) on average. The stop words that were used in our implementation were found on this page.
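A minimal sketch of such a dictionary and of the ratio from Section 1 (non-stop words divided by all words) is given below. The class name StopWordDictionary comes from the text, but the constructor, the methods isStopWord and informationContent, the case-insensitive lookup and the result for an empty word list are assumptions made for this example. How the ratio is mapped onto the 0.0 (best) to 1.0 (worst) rating scale is not stated in the text and is left out here.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch; the method names and the constructor are hypothetical.
final class StopWordDictionary {

    // HashSet lookups take constant time on average.
    private final Set<String> stopWords = new HashSet<>();

    StopWordDictionary(String... words) {
        for (String word : words) {
            stopWords.add(word.toLowerCase(Locale.ROOT));
        }
    }

    /** Case-insensitive membership test (case handling is an assumption). */
    boolean isStopWord(String word) {
        return stopWords.contains(word.toLowerCase(Locale.ROOT));
    }

    /** Number of non-stop words divided by the total number of words. */
    double informationContent(List<String> words) {
        if (words.isEmpty()) {
            return 0.0; // assumption: an empty sentence carries no information
        }
        long informative = words.stream().filter(w -> !isStopWord(w)).count();
        return (double) informative / words.size();
    }
}
```

With the stop words "and", "to" and "then", the six-word sentence "Cats and dogs like to play" contains four non-stop words, so its ratio is 4 / 6.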

4. Technical Details

Before the implementation, we evaluated the requirements of our application. The application can be divided into two main parts - the GUI and the program logic. The program logic consists purely of the analysis of the text, and it is executed only when the Analysis Button is pressed. As this part does not need to run in real time, we decided that we do not need to implement it in native code. The second important point is that the results of the analysis are displayed in the GUI. The visualization is purely 2-dimensional, and therefore no API for 3-dimensional graphics like OpenGL or Direct3D is needed. As a result, we chose to use Java in conjunction with a GUI library called Qt Jambi. This GUI library is a Java port of the original C++ Qt library. It offers very extensive functionality that meets all of our requirements, and Java itself is a relatively easy language from a syntactical viewpoint. Qt Jambi additionally provides functions and classes that are easy to extend. We used this advantage, for example, to implement the highlighting of the text in the Thumbnail View.

References

[1] D. Oelke, D. Spretke, A. Stoffel, D. A. Keim. Visual Readability Analysis: How to Make Your Writings Easier to Read, IEEE Conference on Visual Analytics, Science and Technology 2010, October, 2010, pp. 123-130