Text Analyzer

Paper description

The paper by Oelke et al. [1] analyzes the readability of text documents. It shows which parts are easy to read, which parts are harder to read, and, above all, why. The authors use a semi-automatic approach to select five readability measures from 141 candidate text features. They assess the complexity of the vocabulary used in a document by consulting a dictionary. They also take into account sentence length and structure, word length, and the use of nominal forms. The longer a sentence or word is, the more sub-sentences a sentence contains, or the more nominal forms occur, the more difficult the text is to read. From these measures, the authors calculate an overall score and visualize the result in three views. The detail view analyzes each sentence in terms of readability, while the block view analyzes whole text blocks or sections. The corpus view gives an overview of all pages.

Task description

TextAnalyzer is a tool that analyzes the readability of PDF text documents. The user selects a PDF file and weights each feature by adjusting sliders. The results are displayed as a detailed analysis in the user interface and as an annotated PDF file.

Download Program Source Code

Implementation

TextAnalyzer is written in Python 3.7 (64-bit). First, the user selects a PDF file to analyze. The program then extracts the characters from the PDF file with PDFMiner and segments them into sentences. Next, it evaluates the readability of each sentence by computing six features: the complexity of the vocabulary used, the word length, the nominal forms, the sentence length, the sentence structure, and the use of imprecise formulations.
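As a rough illustration of the sentence segmentation and per-sentence feature computation described above, the following sketch splits already extracted text into sentences and computes two of the simpler features. It is not taken from the TextAnalyzer source: the function names `split_sentences` and `sentence_features`, the regular expressions, and the choice of word count and mean word length as measures are all assumptions made for this example.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_features(sentence):
    """Compute two illustrative raw features for one sentence.

    sentence_length: number of words in the sentence
    word_length:     mean number of characters per word
    """
    words = re.findall(r"[A-Za-z']+", sentence)
    n_words = len(words)
    mean_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    return {"sentence_length": n_words, "word_length": mean_len}


# Text as it might look after extraction from a PDF (hypothetical input).
text = "This is short. Readability analysis considers vocabulary complexity!"
sentences = split_sentences(text)
features = [sentence_features(s) for s in sentences]
```

A real splitter would also have to handle abbreviations, decimal numbers, and line breaks introduced by the PDF layout, which this sketch ignores.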

The system computes a score for each feature and for each sentence as a whole, and derives an overall score for the document. Colors are interpolated from these scores; they are shown in the user interface, which is built with PyQt5, and are used to highlight the PDF file. The user can adjust the weight of each feature with its slider, which triggers a recalculation of the sentence scores. Once the weights are set, the user can generate a highlighted PDF file, which is produced with PyMuPDF. This PDF contains more detailed comments intended to help the user improve the readability of the document. To see how changing a particular sentence affects the readability score, the user can edit sentences directly in the user interface. This triggers a recalculation of all feature scores, the sentence score, and the overall score. For more detail, see the module page, the index page, or the PDF documentation.
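The weighting and color steps can be sketched as follows. This is a minimal example under stated assumptions, not the program's actual formula: it assumes feature scores normalized to the range 0 to 1 (with 1 meaning easy to read), combines them as a weighted mean using the slider values, and interpolates linearly from red to green. The names `weighted_score` and `interpolate_color` are hypothetical.

```python
def weighted_score(feature_scores, weights):
    """Weighted mean of feature scores (assumed to lie in [0, 1]).

    feature_scores: dict mapping feature name to score
    weights:        dict mapping feature name to slider weight
    """
    total_weight = sum(weights[name] for name in feature_scores)
    if total_weight == 0:
        return 0.0  # all sliders at zero: no feature contributes
    weighted_sum = sum(feature_scores[n] * weights[n] for n in feature_scores)
    return weighted_sum / total_weight


def interpolate_color(score, low=(255, 0, 0), high=(0, 255, 0)):
    """Linearly interpolate an RGB color between `low` and `high`.

    The score is clamped to [0, 1]; the red-to-green scale is an assumption.
    """
    t = min(max(score, 0.0), 1.0)
    return tuple(round(lo + t * (hi - lo)) for lo, hi in zip(low, high))


# Hypothetical scores and slider weights for one sentence.
scores = {"word_length": 1.0, "sentence_length": 0.0}
weights = {"word_length": 1, "sentence_length": 1}
sentence_score = weighted_score(scores, weights)
sentence_color = interpolate_color(sentence_score)
```

With a weighted mean, moving a slider changes only the weights, so the sentence scores can be recomputed without re-extracting the per-feature scores.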

Program controls

Reference

[1] Daniela Oelke, David Spretke, Andreas Stoffel, Daniel Keim. Visual readability analysis: How to make your writings easier to read. IEEE Visual Analytics Science and Technology (VAST), pp. 123–130, 2010.