Welcome to Viz2 2021 Lee et al LDA’s documentation!

LDA class methods:

class LDA_simplified.LDA(k, path)[source]

Initialize the class: load and clean the data, then run the first LDA clustering with the given number of clusters k.

Parameters
  • k – desired number of clusters used in LDA; must be at least 2 and at most 10

  • path – load path of the data file

build_bag_of_words_model()[source]

Build the bag of words model from the cleaned data and the dictionary of the unique words

Returns

dictionary of all words, bag of words model of the documents
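
The documentation does not say which library builds the model, so the following is only a minimal pure-Python sketch of the idea: a dictionary mapping each unique word to an id, and each document represented as (word_id, count) pairs. The function name `build_bag_of_words` and the sample documents are hypothetical.

```python
from collections import Counter

def build_bag_of_words(cleaned_docs):
    """Return (dictionary, corpus) for a list of tokenized documents."""
    # Assign an integer id to every unique word, in order of first appearance.
    dictionary = {}
    for tokens in cleaned_docs:
        for token in tokens:
            dictionary.setdefault(token, len(dictionary))
    # Represent each document as sorted (word_id, count) pairs.
    corpus = [
        sorted(Counter(dictionary[t] for t in tokens).items())
        for tokens in cleaned_docs
    ]
    return dictionary, corpus

# Hypothetical, already cleaned and tokenized documents.
docs = [["topic", "model", "topic"], ["model", "graph"]]
dictionary, corpus = build_bag_of_words(docs)
```

A topic-modeling library would normally provide its own dictionary and corpus types; the plain dict and list here only illustrate the data shape.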

build_term_higlights(doc_input)[source]

Set up the document view by highlighting the most relevant terms together with their topic correspondence

Parameters

doc_input – Id of the document

Returns

str: the document as a string, with the proper HTML tags assigned to specific terms (used in the visualization by dash.DangerouslySetInnerHTML())
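
As an illustration of what such tagging could look like, the sketch below wraps every known term in a colored `<span>` and escapes everything else. The helper name `highlight_terms`, the term-to-color mapping, and the choice of a `background-color` style are assumptions, not the class's actual implementation.

```python
import html
import re

def highlight_terms(text, term_colors):
    """Wrap each known term in a <span> carrying its topic color."""
    pieces = []
    # re.split with a capture group keeps both the words and the separators.
    for piece in re.split(r"(\w+)", text):
        escaped = html.escape(piece)
        color = term_colors.get(piece.lower())
        if color is not None:
            escaped = f'<span style="background-color: {color}">{escaped}</span>'
        pieces.append(escaped)
    return "".join(pieces)

# Hypothetical term-to-color mapping for a single highlighted term.
marked = highlight_terms("Topic models & graphs",
                         {"topic": "rgba(31, 119, 180, 0.4)"})
```

Escaping the untouched text matters here because the result is rendered as raw HTML.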

calculate_cosine_similarity()[source]

Calculate the cosine similarity between the documents with the sklearn implementation. Duplicates are removed by keeping only the upper-triangle elements of the result, in order to avoid duplicate edges in the graph.

Returns

cosine_sim_matrix: np.array of shape (number_of_documents, number_of_documents)
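
The upper-triangle step can be sketched as follows. The method itself is documented as using sklearn; this example computes the same quantity with plain NumPy to stay dependency-light. The function name and the sample vectors are hypothetical, and dropping the diagonal (self-similarity) is an assumption.

```python
import numpy as np

def upper_triangle_cosine_similarity(doc_vectors):
    """Pairwise cosine similarity with each document pair kept once."""
    vectors = np.asarray(doc_vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against all-zero (empty) documents to avoid division by zero.
    unit = vectors / np.where(norms == 0, 1.0, norms)
    sim = unit @ unit.T
    # k=1 zeroes the diagonal and the lower triangle (assumption: no self-loops).
    return np.triu(sim, k=1)

# Hypothetical document vectors (for example, topic distributions).
sim = upper_triangle_cosine_similarity([[1, 0], [1, 1], [0, 2]])
```

Because cosine similarity is symmetric, the upper triangle already contains every unordered document pair exactly once.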

clean_lemmatize_data()[source]
Clean the data as described in the paper, with some further useful steps:
  • tokenization

  • removal of numeric characters

  • removal of punctuation

  • removal of stopwords

  • lemmatization of the cleaned data

Returns

dict<document_number, title> , dict<title, cleaned_text>
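
The cleaning steps listed above could be chained roughly as in the sketch below. Everything here is illustrative: the stopword set is a tiny hand-picked subset, and `toy_lemmatize` is only a placeholder for a real lemmatizer, which the documentation does not name.

```python
import re
import string

STOPWORDS = {"the", "of", "and", "a", "in", "for"}  # illustrative subset only

def toy_lemmatize(token):
    # Placeholder: strips a plural "s". A real pipeline would use a proper lemmatizer.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def clean_text(text):
    tokens = text.lower().split()                             # tokenize
    tokens = [re.sub(r"\d", "", t) for t in tokens]           # remove numeric characters
    strip_punct = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(strip_punct) for t in tokens]       # remove punctuation
    tokens = [t for t in tokens if t and t not in STOPWORDS]  # remove stopwords
    return [toy_lemmatize(t) for t in tokens]                 # lemmatize

cleaned = clean_text("The 3 models of topics, in 2010!")
```

Applying this per document and keying the results by title would give a mapping of the same shape as the documented `dict<title, cleaned_text>`.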

color_assign_to_topic(x)[source]
Parameters

x – topic id

Returns

color assigned to topic x (dict)

color_assign_to_topic_with_opacity(x)[source]

Assign color to topics with opacity

Parameters

x – topic id

Returns

opaque color for term highlighting
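
One plausible way to attach an opacity to a topic color is to convert a hex color into a CSS `rgba(...)` string, as sketched below. The function name, the hex input format, and the default opacity of 0.4 are assumptions for illustration.

```python
def hex_to_rgba(hex_color, opacity=0.4):
    """Convert '#rrggbb' to a CSS 'rgba(r, g, b, a)' string."""
    value = hex_color.lstrip("#")
    # Parse the three two-digit hexadecimal channels.
    r, g, b = (int(value[i:i + 2], 16) for i in (0, 2, 4))
    return f"rgba({r}, {g}, {b}, {opacity})"

rgba = hex_to_rgba("#1f77b4", opacity=0.4)
```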

delete_cluster()[source]

Remove a cluster from the LDA model; this modifies the state of the model

Returns

LDA model with the cluster removed

property filter_data

Filter the data for the paper-defined time interval (1994-2010) and separate the title from the document description.

The result is stored in dictionary format: k: title, v: document text

Returns

Filtered data stored in dictionary<title, document text>

filter_parall_coords_topic_contribution(value)[source]

Filter the parallel coordinates based on the input value (rows with a contribution greater than value are kept). The document-topic dataframe is filtered as well, so that the same filter applies in Cytoscape.

Parameters

value – parallel coordinates filter threshold

Returns

filtered parallel coordinates and topic dataframe
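
A threshold filter of this kind can be sketched with pandas as below. The column names are borrowed from the dataframe documented for format_topics_sentence(); the helper name `filter_by_contribution` and the sample rows are hypothetical.

```python
import pandas as pd

def filter_by_contribution(df, value, column="Topic_Perc_Contrib"):
    """Keep only the rows whose contribution is strictly greater than value."""
    return df[df[column] > value].reset_index(drop=True)

# Hypothetical document-topic rows.
docs = pd.DataFrame({
    "Document_No": [0, 1, 2],
    "Dominant_Topic": [1, 0, 1],
    "Topic_Perc_Contrib": [0.91, 0.42, 0.67],
})
kept = filter_by_contribution(docs, 0.5)
```

Running the same function over both dataframes would keep the parallel coordinates view and the graph in sync.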

format_topics_sentence()[source]

Build a pandas dataframe with several useful pieces of information: document-topic membership, contribution, assigned color, keywords

Returns

pd.DataFrame with columns ‘Document_No’, ‘Dominant_Topic’, ‘Topic_Perc_Contrib’, ‘Keywords’, ‘Text’, ‘Title’, ‘color’
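
To make the table shape concrete, the sketch below derives the dominant topic per document from a list of (topic_id, probability) pairs. The input format, the helper name, and the subset of columns shown are assumptions; the real method also fills the keyword, text, and color columns.

```python
import pandas as pd

def dominant_topic_table(doc_topics, titles):
    """One row per document with its highest-probability topic."""
    rows = []
    for doc_no, topic_probs in enumerate(doc_topics):
        # Pick the (topic_id, probability) pair with the largest probability.
        topic_id, prob = max(topic_probs, key=lambda pair: pair[1])
        rows.append({
            "Document_No": doc_no,
            "Dominant_Topic": topic_id,
            "Topic_Perc_Contrib": round(prob, 4),
            "Title": titles[doc_no],
        })
    return pd.DataFrame(rows)

# Hypothetical per-document topic distributions.
doc_topics = [[(0, 0.2), (1, 0.8)], [(0, 0.7), (1, 0.3)]]
table = dominant_topic_table(doc_topics, ["Paper A", "Paper B"])
```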

get_col()[source]
Returns

cluster colors (extracted from matplotlib colors)

get_color_with_opacity(id, is_node_id)[source]
Parameters
  • id – id of the node / cluster

  • is_node_id – bool, whether the selected Cytoscape element is a document node or a cluster node

Returns

opaque color for the background

get_colormap_for_cluster()[source]

Build colormap for wordcloud

Returns

colormap for the current cluster (derived from the color already assigned to that cluster)

get_document_nodes()[source]

Build the dictionary of document nodes for the Cytoscape network visualization

Returns

dict<document_id,(document_title, document_color: color of the cluster,cluster)>

get_filtered_edges()[source]

Get the visible edges (edges between document nodes whose cosine similarity is over the threshold)

Returns

[(node_0, node_1, cosine_similarity value),…]
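
Extracting such an edge list from a similarity matrix might look like the sketch below. The function name, the sample matrix, and the threshold value are hypothetical; only the output shape follows the documented return value.

```python
import numpy as np

def edges_above_threshold(sim_matrix, threshold):
    """Return (node_0, node_1, similarity) for each pair above the threshold."""
    sim = np.asarray(sim_matrix, dtype=float)
    # Only look at the strict upper triangle so each pair appears once.
    upper = np.triu(np.ones_like(sim, dtype=bool), k=1)
    rows, cols = np.nonzero(upper & (sim > threshold))
    return [(int(i), int(j), float(sim[i, j])) for i, j in zip(rows, cols)]

# Hypothetical symmetric similarity matrix for three documents.
sim_matrix = [[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.6],
              [0.1, 0.6, 1.0]]
edges = edges_above_threshold(sim_matrix, 0.5)
```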

get_k()[source]
Returns

number of clusters specified

get_lda()[source]

Build the LDA model with the given number of clusters and the previously built bag of words model

Returns

lda model

get_most_relevant_topics()[source]

Extract the 4 most relevant terms for each topic (used for the topic node representation)

Returns

dict<cluster_id, [top_4_terms_for_cluster]>

get_parall_coord_df()[source]

Build the pandas dataframe for the parallel coordinates view

Returns

pandas dataframe for documents with their dominant topic

get_top_n_word_probs_for_topic_i(topic_id, n=10)[source]

Term weight table input: extract the top n words for the currently selected topic

Parameters
  • topic_id – id of the cluster

  • n – number of words to be extracted

Returns

dataframe with the words and the related probabilities
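
Selecting the top n words from a topic's word probabilities can be sketched with pandas as below. The column names `Word` and `Probability`, the helper name, and the sample vocabulary are assumptions made for this example.

```python
import pandas as pd

def top_n_words(word_probs, vocabulary, n=10):
    """Return the n most probable words of one topic as a dataframe."""
    table = pd.DataFrame({"Word": vocabulary, "Probability": word_probs})
    # nlargest sorts by probability in descending order and keeps n rows.
    return table.nlargest(n, "Probability").reset_index(drop=True)

# Hypothetical word probabilities for a single topic.
top = top_n_words([0.1, 0.5, 0.15, 0.25],
                  ["graph", "topic", "node", "model"], n=2)
```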

get_top_topic_for_words()[source]

Build the topic-word-probability dataframe with the related opaque color, used for term highlighting in the document view

Returns

pandas.DataFrame(‘Word’, ‘Color’)

property get_topic_nodes

Build the dictionary of topics that serves as input for the Cytoscape topic nodes; a random position is generated for each topic as well

Returns

dict<topic_id, (color, position)>

get_word_probabilities()[source]

Extract all the word probabilities from the LDA model for each cluster

Returns

dict<topic_id, [word_probabilities]>

merge_cluster(cluster_ids)[source]

Merge the clusters selected in the checklist by summing up their probabilities row-wise

Parameters

cluster_ids – cluster ids selected in checklist

Returns

model with merged clusters
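
The row-wise summation can be illustrated on a small matrix. This sketch assumes a topic-word matrix with one row per cluster, which the documentation does not confirm; the function name and the sample values are hypothetical.

```python
import numpy as np

def merge_topic_rows(topic_word, cluster_ids):
    """Replace the selected rows by a single row holding their sum."""
    topic_word = np.asarray(topic_word, dtype=float)
    selected = set(cluster_ids)
    # Sum the selected cluster rows element by element.
    merged_row = topic_word[list(cluster_ids)].sum(axis=0)
    keep = [i for i in range(topic_word.shape[0]) if i not in selected]
    # Unselected rows stay in order; the merged row is appended last.
    return np.vstack([topic_word[keep], merged_row])

# Hypothetical topic-word probabilities: 3 clusters, 3 words.
topic_word = [[0.5, 0.5, 0.0],
              [0.1, 0.2, 0.7],
              [0.3, 0.3, 0.4]]
merged = merge_topic_rows(topic_word, [0, 2])
```

After summation the merged row no longer sums to 1. Whether the class renormalizes it is not stated in the documentation, so the sketch leaves the values as they are.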

read_data()[source]

Read the original dataset with the BeautifulSoup XML extractor

Returns

data stored in a BeautifulSoup instance

remove_document(value)[source]

Remove documents by clicking on the delete document button

Parameters

value – id of the document node

Returns

updated dictionary of nodes, by removing the marked document

reset_settings()[source]

Reset the view and the LDA class itself to the original cluster number

set_indexed_topic_node_df()[source]

Build the topic dataframe in indexed format, for faster searching

Returns

indexed df

update_cosine_sim()[source]

Update the cosine similarities

update_lda()[source]

Re-cluster with the new cluster number (triggered by the “update” button in the app).

Also updates the relevant class components after cluster merge, cluster delete, and re-clustering steps.

Visualization methods:

visualization.build_cluster_merge_list()[source]

Prepare the checklist for the merge cluster functionality

Returns

dash checklist content with proper label

visualization.build_cluster_summary_view()[source]

Prepare the data for the cluster summary view

Returns

data in a format that can be fed to the cluster summary view

visualization.get_doc_topic_edges()[source]
Returns

invisible edges between document nodes and their dominant topic

visualization.get_graph_cos_sim_edges()[source]
Returns

edges between documents based on cosine similarity

visualization.get_graph_document_nodes()[source]
Returns

graph document nodes with the proper coloring in cytoscape format

visualization.get_graph_topic_nodes()[source]

Extract the color from the class settings (it will be defined in the stylesheet); the label is the top 4 words of the topic

Returns

topic nodes in cytoscape format
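
For orientation, Cytoscape elements in Dash are plain dictionaries with `data`, `position`, and `classes` keys. The sketch below builds topic nodes in that shape; the id and class naming scheme (`topic_<id>`), the helper name, and the sample inputs are assumptions.

```python
def topic_nodes_to_elements(top_terms, positions):
    """Build one Cytoscape node element per topic."""
    elements = []
    for topic_id, terms in top_terms.items():
        x, y = positions[topic_id]
        elements.append({
            # The label is the topic's top terms joined by spaces.
            "data": {"id": f"topic_{topic_id}", "label": " ".join(terms)},
            "position": {"x": x, "y": y},
            # The class name lets a stylesheet assign the topic color.
            "classes": f"topic_{topic_id}",
        })
    return elements

# Hypothetical top-4 terms and positions for two topics.
elements = topic_nodes_to_elements(
    {0: ["graph", "node", "edge", "layout"], 1: ["topic", "model", "term", "text"]},
    {0: (10, 20), 1: (200, 50)},
)
```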

visualization.plot_wordcloud(number_of_words=20)[source]

Wordcloud plot

Parameters

number_of_words – number of words to be plotted (default: 20)

Returns

Wordcloud plot (a static, non-interactive image in which the words do not overlap)

visualization.update_stylesheet()[source]

Update the stylesheet: define the class settings (colors etc.) for all the new clusters

Returns

Updated graph stylesheet
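
A Cytoscape stylesheet in Dash is a list of dictionaries, each with a `selector` and a `style`. A minimal sketch of regenerating it from the current cluster colors follows; the class naming scheme (`topic_<id>`), the helper name, and the sample colors are assumptions.

```python
def build_stylesheet(topic_colors):
    """Create one class rule per cluster on top of a generic node rule."""
    # Generic rule: show each node's label taken from its data.
    stylesheet = [{"selector": "node", "style": {"label": "data(label)"}}]
    for topic_id, color in topic_colors.items():
        stylesheet.append({
            "selector": f".topic_{topic_id}",
            "style": {"background-color": color},
        })
    return stylesheet

# Hypothetical cluster colors.
stylesheet = build_stylesheet({0: "#1f77b4", 1: "#ff7f0e"})
```

Rebuilding the whole list after a merge or delete keeps stale cluster classes from lingering in the graph.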
