An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Visualizing Correlated Topics

You’ve run your topic model, and you’re now interested in exploring the results. One useful thing to look at is how the various topics correlate with one another. Perhaps there are patterns in which topics are positively or negatively correlated (that is, the presence of one topic makes another more likely, a positive correlation, or less likely, a negative one). Assuming you’ve done your topic modeling in R and have transformed the composition file so that you’ve got topics down the rows and documents across the columns, here’s one way of calculating, and then visualizing, those correlations.

1. Look at the topics-docs file in Excel.
2. In Excel, select data analysis, and then correlation. Since topics are down the left side (rows), select data in rows, with labels in first column.
3. The result shows how topics are correlated with one another. This is useful, and can be represented as a graph.
4. You’ll notice a number of blank cells. Fill these cells with zeros. (Each blank cell is the mirror-image position of a value already shown, and Excel leaves it blank for clarity.)
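
Steps 1 to 4 can also be carried out outside Excel. What follows is a minimal Python sketch, not the authors' method. It assumes the topics-docs data is already loaded as a list of rows, one row of document weights per topic. The function names `pearson` and `lower_triangle_correlations` and the toy data are hypothetical. It computes the Pearson correlation for every pair of topics and returns a lower-triangular layout with the zeros already filled in above the diagonal.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences of numbers.
    Assumes neither sequence is constant (a constant row would divide by zero)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def lower_triangle_correlations(topic_rows):
    """Return an n x n matrix: correlations on and below the diagonal, zeros above."""
    n = len(topic_rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            matrix[i][j] = 1.0 if i == j else pearson(topic_rows[i], topic_rows[j])
    return matrix

# Toy data, made up for illustration: 3 topics across 4 documents.
topics = [
    [0.1, 0.2, 0.3, 0.4],
    [0.2, 0.4, 0.6, 0.8],  # rises whenever the first topic rises
    [0.4, 0.3, 0.2, 0.1],  # falls whenever the first topic rises
]
corr = lower_triangle_correlations(topics)
```

In this toy case the second topic is perfectly positively correlated with the first, and the third is perfectly negatively correlated with it, so `corr[1][0]` is close to 1 and `corr[2][0]` is close to -1.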

To import this data as a ‘matrix’ into Gephi, we have to do some massaging:

5. Open a new file in Notepad++.
6. Put the following at the top:

dl
n = 30
format = fullmatrix
labels:

The spaces around the ‘=’ sign, and the lack of space after the word ‘labels’, are important. Without these, the file breaks. This header information tells Gephi (or Ucinet, or any other network visualization or graph software) that it is dealing with a ‘dl’ file (which is a workhorse format for network analysis). The next line, ‘n = 30’, tells Gephi how many nodes to expect. ‘Fullmatrix’ means that a value is given for every cell of the matrix, so there will be information describing node A’s tie to B, and B’s tie to A. Our data as presented here (remember those blank cells?) are not symmetrical, but we’ll deal with that in a moment. The line after ‘labels:’ contains our node names. These have to be in a single row, set off from one another by commas.

7. Copy the data values from your spreadsheet to Notepad++ below your labels row, under a line reading ‘data:’. You’ll see that there’s a largish space between the values. This is a tab, and there’ll be one of these to mark off each column value from your spreadsheet.
8. Highlight that first tab space. Press ctrl+f. This opens a find dialogue box, with the tab in the ‘find’ box. In the ‘replace’ box, tap the space bar once, to replace it with a single space. Click on replace all.
9. Save this file as text, but type .dl after your file name in the ‘save as’ box.
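
The find-and-replace in step 8 amounts to swapping every tab for a single space. Here is a minimal Python sketch of the same operation, assuming the rows pasted from the spreadsheet are held in a string. The name `tabs_to_spaces` and the sample values are illustrative only.

```python
def tabs_to_spaces(pasted_text):
    """Replace every tab with one space, line by line, dropping trailing whitespace."""
    return "\n".join(
        line.replace("\t", " ").rstrip() for line in pasted_text.splitlines()
    )

# Two rows as they might arrive from a spreadsheet, with tabs between the cells.
pasted = "1\t0\t0\n0.341435562\t1\t0\n"
cleaned = tabs_to_spaces(pasted)
```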

This is how it might look:

dl
n = 30
format = fullmatrix
labels:
topic1, topic2, topic3, [and so on]
data:
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.341435562 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.050431406 -0.027694806 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.0558177 0.051223372 -0.052275421 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[... and so on]
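
A file in this layout can also be assembled programmatically. The sketch below is one hedged way to do it in Python, following the layout shown above. `build_dl` is a hypothetical helper, the three-topic matrix is cut down from the example values for illustration, and the output file name in the closing comment is an assumption.

```python
def build_dl(labels, matrix):
    """Return the text of a fullmatrix .dl file for the given labels and square matrix."""
    n = len(labels)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix must be n x n, with one row and one column per label")
    lines = [
        "dl",
        f"n = {n}",             # spaces around '='
        "format = fullmatrix",
        "labels:",              # no space between 'labels' and the colon
        ", ".join(labels),      # node names on a single row, separated by commas
        "data:",
    ]
    for row in matrix:
        # One matrix row per line, values separated by single spaces.
        lines.append(" ".join(str(value) for value in row))
    return "\n".join(lines) + "\n"

labels = ["topic1", "topic2", "topic3"]
matrix = [
    [1, 0, 0],
    [0.341435562, 1, 0],
    [-0.050431406, -0.027694806, 1],
]
dl_text = build_dl(labels, matrix)

# To save it, write the text to a file whose name ends in .dl, for example:
# with open("topics.dl", "w") as handle:
#     handle.write(dl_text)
```

The size check is there because a row with too few or too many values will not match the node count declared in the header.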

Now to Gephi:

10. In Gephi, select file >> open graph file >> select the .dl file you just made.
11. There is no directionality implied in this data, so although Gephi thinks it is directed (as you can read on the info panel as it loads your data), select ‘undirected’. Gephi can only deal with a limited number of .dl file variants. Fullmatrix is the easiest one to work with given this particular data. The 0s that you filled in to the spreadsheet make the matrix full, and treating the data as ‘undirected’ means they are ignored. This is a hack.
12. Your graph should load up. Every topic will be connected to every other topic, so you’ll want to display *just* the positive correlations, or *just* the negative ones, or perhaps you’re interested in particular strengths. Select the ‘filter’ option, and filter by edge weights.
13. You should delete edges with weight = 1, as these are just topics 100% positively correlated with themselves. Click on the data laboratory window, highlight the edges with weight 1, and hit delete.
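
Steps 12 and 13 boil down to two operations on a weighted edge list: drop each topic's correlation with itself, then keep only the weights of interest. The sketch below illustrates that logic in plain Python rather than through Gephi's interface. The function name `matrix_to_edges`, the sample matrix, and the 0.3 strength threshold are all hypothetical.

```python
def matrix_to_edges(labels, matrix):
    """Read the lower triangle of a correlation matrix as undirected weighted edges,
    skipping the diagonal (each topic's correlation of 1 with itself)."""
    edges = []
    for i in range(len(labels)):
        for j in range(i):  # j < i, so the diagonal is never visited
            edges.append((labels[i], labels[j], matrix[i][j]))
    return edges

labels = ["topic1", "topic2", "topic3"]
matrix = [
    [1, 0, 0],
    [0.341435562, 1, 0],
    [-0.050431406, -0.027694806, 1],
]
edges = matrix_to_edges(labels, matrix)

# Filter by edge weight: positive only, negative only, or by strength.
positive = [edge for edge in edges if edge[2] > 0]
negative = [edge for edge in edges if edge[2] < 0]
strong = [edge for edge in edges if abs(edge[2]) >= 0.3]  # threshold is arbitrary
```

Reading only the lower triangle also sidesteps the filled-in zeros, which is the same effect as telling Gephi to treat the graph as undirected.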

You now have a graph showing the pattern of correlations. How best to lay this graph out is not immediately apparent. That will be the subject of a subsequent section.


Source: http://www.themacroscope.org/?page_id=453