An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Topic Modeling with the Stanford TMT

Different tools give different outputs, even if they are all still ‘topic modeling’ writ large. One tool that historians should become familiar with is the Stanford Topic Modeling Toolbox (TMT), because its outputs make it easy to see how topics play out over time (it ‘slices’ its output by the dates associated with each document). Earlier, we showed how one could use Paper Machines to find patterns in broad swathes of John Adams’ diaries. The TMT allows us to track the evolution of topics within a corpus at a much finer level: that of each individual entry.

Return, for a moment, to one of the online pages showing Adams’ diary entries. Right-click on the page, select ‘view source’, and examine the tags used to mark up the HTML. Every so often, you will see something like:

<div class="entry">
<div> <b> <span title="1761-03-21">

Using Outwit Hub (or a similar piece of software, or a script you’ve written for yourself following the precepts of The Programming Historian), we point our scraper at everything that falls within the “entry” class. Outwit Hub is smart enough to recognize the data in the ‘title’ span as its own column. We can tweak what Outwit Hub will look for by specifying the various tags we are interested in. In the illustration below, we have set up a scraper to create a CSV file from the HTML markup for one of the diary pages. It produces two columns, extracting dates and entries:

[Figure: Screenshot of the scraper screen in Outwit Hub]


The resulting output will look like this:
_________________
1753-06-08 | At Colledge. A Clowdy ; Dull morning and so continued till about 5 a Clock when it began to rain ; moderately But continued not long But remained Clowdy all night in which night I watched with Powers.
1753-06-09 | At Colledge the weather still remaining Clowdy all Day till 6 o’Clock when the Clowds were Dissipated and the sun brake forth in all his glory.
1753-06-10 | At Colledge a clear morning. Heard Mr. Appleton expound those words in I.Cor.12 Chapt. 7 first verses and in the afternoon heard him preach from those words in 26 of Mathew 41 verse watch and pray that ye enter not into temptation.
__________________
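
Readers who would rather script this step than use Outwit Hub could approximate it with Python's standard library. The sketch below is only an illustration: the class name `EntryParser`, the helper `rows_to_csv`, and the sample markup are our own assumptions, modelled on the `entry` class and `title` span shown above. The real diary pages may nest their tags differently, so expect to adjust it.

```python
import csv
import io
from html.parser import HTMLParser


class EntryParser(HTMLParser):
    """Collect (date, text) pairs from <div class="entry"> blocks."""

    def __init__(self):
        super().__init__()
        self.rows = []             # finished (date, text) pairs
        self.depth = 0             # div nesting inside the current entry; 0 = outside
        self.date = None           # first span title found in the current entry
        self.in_date_span = False  # True while inside the span that carries the date
        self.chunks = []           # text fragments of the current entry

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif "entry" in (attrs.get("class") or "").split():
                self.depth, self.date, self.chunks = 1, None, []
        elif tag == "span" and self.depth > 0 and self.date is None:
            self.date = attrs.get("title")
            self.in_date_span = self.date is not None

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_date_span = False
        elif tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace so each entry is one line of text.
                text = " ".join(" ".join(self.chunks).split())
                self.rows.append((self.date, text))

    def handle_data(self, data):
        # Skip the heading text inside the date span; keep everything else.
        if self.depth > 0 and not self.in_date_span:
            self.chunks.append(data)


def rows_to_csv(rows, handle):
    """Write the (date, text) pairs as a two-column CSV."""
    csv.writer(handle).writerows(rows)


# Hypothetical markup, invented for this example.
sample = """
<div class="entry">
<div> <b> <span title="1753-06-08">8 FRIDAY.</span> </b> </div>
<p>At Colledge. A Clowdy ; Dull morning.</p>
</div>
<div class="entry">
<div> <b> <span title="1753-06-09">9 SATURDAY.</span> </b> </div>
<p>At Colledge the weather still remaining Clowdy.</p>
</div>
"""

parser = EntryParser()
parser.feed(sample)
parser.close()
rows = parser.rows

buffer = io.StringIO()
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass `rows_to_csv` a handle opened with `open(path, "w", newline="", encoding="utf-8")`.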

We then need to insert a column at the beginning of this file, so that each diary entry gets its own unique record number:
_________________
1 | 1753-06-08 | At Colledge. A Clowdy ; Dull morning and so continued till about 5 a Clock when it began to rain ; moderately But continued not long But remained Clowdy all night in which night I watched with Powers.
2 | 1753-06-09 | At Colledge the weather still remaining Clowdy all Day till 6 o’Clock when the Clowds were Dissipated and the sun brake forth in all his glory.
3 | 1753-06-10 | At Colledge a clear morning. Heard Mr. Appleton expound those words in I.Cor.12 Chapt. 7 first verses and in the afternoon heard him preach from those words in 26 of Mathew 41 verse watch and pray that ye enter not into temptation.
__________________
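
A spreadsheet will do this, but the numbering can also be scripted. The following is a minimal sketch using Python's `csv` module; the function names `add_record_ids` and `number_csv` and the two-row sample are our own, and we assume the scraped file has exactly the two columns shown above.

```python
import csv
import io


def add_record_ids(rows, start=1):
    """Prefix each row with a sequential record number (as text)."""
    return [[str(i), *row] for i, row in enumerate(rows, start=start)]


def number_csv(src, dst):
    """Copy CSV rows from src to dst, adding an ID column in front."""
    writer = csv.writer(dst)
    for record in add_record_ids(csv.reader(src)):
        writer.writerow(record)


# In-memory stand-ins for the scraped file and the numbered output.
scraped = io.StringIO(
    "1753-06-08,At Colledge.\r\n"
    "1753-06-09,At Colledge the weather still remaining Clowdy.\r\n"
)
numbered = io.StringIO()
number_csv(scraped, numbered)
print(numbered.getvalue())
```

With files on disk, open both handles with `newline=""` so that the `csv` module controls the line endings.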

We save this as ‘johnadamsscrape.csv’. With our data extracted, we turn to the toolbox. Installing the TMT is a matter of downloading, unzipping, and then double-clicking the file tmt-0.4.0.jar, which brings up this interface:

[Figure: TMT opening interface]

The TMT operates by running scripts that the user creates. This allows a lot of flexibility, but it can seem daunting to the first-time user. It is not as scary as it might first seem. The TMT uses scripts written in the Scala language; for our purposes, we can simply modify the sample scripts provided by the TMT team.

The first script you’ll need is this one: http://nlp.stanford.edu/software/tmt/tmt-0.4/examples/example-2-lda-learn.scala .

Download this script, and make sure to save it with the .scala file extension. Save it in the same folder as your data csv.

The critical line is line #15:

val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);

This line tells the TMT where to find your data, and that the first column holds the unique ID number. Change the example file name to whatever you called your data file (in our case, “johnadamsscrape.csv”).

The next line to examine is line #26, in this block:

val text = {
  source ~>                              // read from the source file
  Column(3) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(4) ~>   // filter terms in <4 docs
  TermDynamicStopListFilter(60) ~>       // filter out 60 most common terms
  DocumentMinimumLengthFilter(5)         // take only docs with >=5 terms
}

That is, the line stating ‘Column(3)’. This tells the TMT to look in the third column for the actual text we wish to topic model. The block also filters out rare and very common words that might create noise; whether or not 60 is an appropriate number of common words to drop for this corpus is something with which we should experiment.
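
To get a feel for what the filtering steps in the block above do before experimenting with their numbers, it can help to imitate them on a toy corpus. The Python sketch below is not the TMT's implementation. It is our own approximation, with the hypothetical name `filter_corpus`, and it assumes that the ‘most common’ terms are ranked by the number of documents they appear in, which may not match the TMT's ranking.

```python
from collections import Counter


def filter_corpus(docs, min_doc_count=4, n_most_common=60, min_doc_length=5):
    """docs is a list of token lists; returns the filtered token lists."""
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))

    # 1. Drop terms that occur in fewer than min_doc_count documents.
    kept = {term for term, n in doc_freq.items() if n >= min_doc_count}

    # 2. Drop the n_most_common remaining terms (ties broken alphabetically).
    ranked = sorted(kept, key=lambda term: (-doc_freq[term], term))
    kept -= set(ranked[:n_most_common])

    # 3. Keep only documents that still have at least min_doc_length terms.
    filtered = [[term for term in doc if term in kept] for doc in docs]
    return [doc for doc in filtered if len(doc) >= min_doc_length]


# A toy corpus with small thresholds so the effect is visible.
docs = [
    ["the", "court", "sat", "the", "judge"],
    ["the", "court", "rose"],
    ["the", "wind", "sail", "court"],
    ["the", "wind"],
]
result = filter_corpus(docs, min_doc_count=2, n_most_common=1, min_doc_length=2)
print(result)
```

Here ‘the’ is removed as the single most common term, the words found in only one document are removed as too rare, and only the third document keeps enough terms to survive. Raising or lowering the three numbers on a sample of your own entries gives a rough sense of how much text each setting discards.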

Finally, you may wish to examine line 39:
val params = LDAModelParams(numTopics = 30, dataset = dataset,

Changing numTopics allows you to change, well, the number of topics fitted in the topic model. Save your script (we used the name “johnadams-topicmodel.scala”). In the TMT interface, select File >> open script. Select your script. Your TMT interface will now look like this:

[Figure: Script loaded into TMT]

We find it useful to increase the available memory by changing the number in the Memory box. Click ‘run’, and you should soon have a new folder in your directory named something along the lines of “lda-afbfe5c4-30-4f47b13a”.

Open this folder. There are many subfolders, each one corresponding to a progressive iteration of the topic model. Open the last one, labeled ‘01000’. There are a variety of .txt files; you will want to examine the one called ‘summary.txt’. Here’s what we found when we opened that file:

Topic 00        454.55948070466843
company    24.665848270843448
came    21.932877059649453
judge    14.170879738912884
who    13.427171229569737
sir    10.826043463533079
court    10.621417032296238
large    9.058726324185434
smith    8.620197844724935
five    7.882410867138
gentleman    7.675483167471634
dinner    7.583714936353056
first    6.65132894649075
began    6.291534160289178
among    6.291111232529454
young    5.966725682973136
learned    5.944359777213496
three    5.443460073918396
england    5.341208927199161
afternoon    5.203749175860598
indies    5.072753107504141

Topic 01        502.4277262791061
you    30.658841887225073
man    19.181634029977502
certain    15.644384186357204
ever    12.381172625305108
your    12.25406398752824
can    11.902589119940561
people    11.732468403208458
government    9.908493705531717
sentiments    9.823770672269507
some    8.918181912129057
thing    8.83725027351356
yet    8.190124087540463
truth    7.853835541378251
hard    7.845192177833083
religion    7.729692134126026
against    7.068338535593137
mind    7.0620208346987114
times    6.964056570613583
things    6.832061662591681
well    6.65447491844415

Topic 02        331.37962468930994
wind    15.5707545428336
air    13.720380035394792
sail    10.106641945001083
over    8.99708890744929
too    7.951593434798237
hope    7.913553658711964
yesterday    7.516737835490574
own    7.313213026923483
water    6.977090373858353
winter    6.831908927860477
cutting    6.214408929903847
tobacco    5.980872297509475
carry    5.624876811077536
afternoon    5.374917164383244
pleasure    5.136067588436411
tea    5.06992603176156
lay    4.9088691229339245
whole    4.869729134203029
round    4.839995131413627
clock    4.743872091113478

Topic 03        448.284256156574
land    17.778012915126133
whole    13.123036968936766
shall    9.514616949571051
gave    9.494894659553903
hollis    9.29197925448871
own    8.593135071924731
took    8.469199778137252
should    8.077970235305811
paid    7.848347239525517
money    7.210870558399691
deed    6.961527039137438
note    6.7934221741957534
made    6.513517724924776
large    6.25033592564551
set    6.177065855377116
given    6.096479925682025
did    6.0569638705150135
suppose    6.03240282569139
thayer    6.000101321903717
agreement    5.93673392235739

Topic 04        413.9978480329246
cattle    13.916489343705752
here    12.048529145962895
trees    9.310496087436924
found    8.219660996273614
spring    7.864942644117993
years    7.32296021451369
worth    6.7151843915303235
father    6.278988636393422
horses    6.233138869485303
where    5.933235113162917
history    5.877151768910737
clerks    5.8764829093672555
turned    5.829377832486527
sell    5.823865635062695
lodged    5.4818260491169015
find    5.45930198495912
excellent    5.427548496141824
method    5.289937866182177
sheep    4.949177711570194
round    4.947129278257872

Topic 05        398.57185926580644
without    16.940951355012423
governor    11.820477359244915
himself    10.54195539936789
conversation    7.922777157243907
commission    7.865160808300328
council    7.680323215347758
states    7.424164407384594
terms    7.375082333710888
open    6.924607742693697
might    6.786397687589714
deal    6.580647184023226
act    6.5511679874569575
before    6.509521703658348
set    6.062097236121528
should    5.944137880395839
continue    5.871578815195353
agreed    5.146795324113523
courts    5.041669581121881
opposition    4.788450782762402
concerning    4.784925656700257

Topic 06        416.030593178196
night    14.240966416313256
attention    12.347996066723644
most    11.241743472905751
fine    10.037808800075435
character    8.872132251013642
office    8.464699696642223
son    7.986495814607395
concerning    7.863195389706481
another    7.698129167308344
gentlemen    7.590692589761336
chamber    6.761286415037169
things    6.529246398663952
pleasant    6.405096090603263
meeting    6.087139655478849
worcester    6.067063940358273
court    5.991195896488669
distress    5.978779642557402
nor    5.943415477358717
before    5.656345276419348
return    5.378429636401525

Topic 07        518.4200467768803
she    17.948033384881477
family    14.47647830596949
whole    13.282304631535421
men    11.478515807353983
way    11.467328119355724
people    10.23820875559242
how    9.056573531589867
health    8.72228963025797
give    8.510787458446739
pleasure    8.350948144136094
only    8.337945293190678
every    8.313200130485287
year    7.6069541997921295
cannot    7.295520326946436
well    7.294433635481967
first    7.166731591259783
life    7.025728832622939
take    6.678066800606009
property    6.6567079690449145
loss    6.599303978668974

Topic 08        435.3651185613488
well    15.933935827778972
some    14.89132212941205
who    11.721459256113768
too    11.705821475731934
order    10.46390066524771
over    9.874549672150856
under    9.789506730044103
late    7.942186743938908
laid    7.240330438278974
other    7.197987789733989
tree    6.857290683013494
scarcely    5.87849127159789
kings    5.735650349699925
sent    5.735488290743638
ones    5.667276249254262
america    5.278853130850817
king    5.239148704066663
thousand    5.173559584477483
taken    5.107587972737042
country    5.073454576430829

Topic 09        413.7691586969443
river    16.76066627555275
other    15.865238822473223
few    10.367895938003862
most    9.8177816475414
where    7.930038196382615
road    7.845434738091708
seen    7.238243177252182
town    7.0736317721681505
last    6.9276568571785795
conversation    6.727922940925923
along    6.643011439308292
leagues    5.939537912151064
beautifull    5.927932570555253
view    5.927357396519654
several    5.770426315370302
grand    5.6395059670503604
sometimes    5.585584510231214
country    5.3870156568933645
within    5.326483794799042
houses    5.2865598026976794

Topic 10        498.75611573955666
town    16.69292048824643
miles    13.89718543152377
tavern    12.93903988493706
through    9.802415532979191
place    9.276480769212077
round    9.048239826246022
number    8.488670753159125
passed    7.799961749511179
north    7.484974235702917
each    6.744740259678034
captn    6.605002560249323
coll    6.504975477980229
back    6.347642624820879
common    6.272711370743526
congress    6.1549912410407135
side    6.058441654893633
village    5.981146620989283
dozen    5.963423616121272
park    5.898152600754463
salem    5.864463108247379

Topic 11        505.70629519579563
man    18.543955003561155
otis    15.134095435246232
did    13.49061932482415
other    13.034134942370102
coll    12.771087411258437
cushing    12.539941191477332
gave    11.416192609504742
adams    10.2548411977269
account    8.62911176000436
order    8.270318738633634
justice    8.057753458366147
give    7.996351742039744
cooper    7.984752401504277
spear    7.848741297095081
both    7.7964873896771145
find    7.655020226895261
seems    7.450959154727162
case    6.868324310811807
province    6.2433968276824885
judgment    5.944743344550242

Topic 12        403.1405559869719
next    15.181292202954456
before    14.794244933211804
who    12.50683436588712
agreed    11.597987844687957
went    11.388710809056008
called    10.082936138015882
never    8.981110578904996
franklin    7.592379151237804
commerce    7.169791737796643
son    6.764144513278713
together    5.986701317643771
laurens    5.976038798299178
goods    5.960985973026959
meet    5.957903745664021
grass    5.891087518954537
chase    5.698858061177832
week    5.521761714701976
her    5.120357383906495
produce    5.006837981387013
struck    4.949303119504711

Topic 13        409.8536943988823
read    34.85293526238413
last    15.683389440428762
night    13.236763279435175
colledge    11.894474054416943
book    11.080038375445453
law    11.001840658185927
arose    9.713622154950869
sun    8.951007109618203
every    8.743049061993299
clowdy    6.973787964898399
fire    6.747294355178253
conversation    6.49731635378895
chamber    6.263108831031562
first    6.164242880383288
reign    5.965111531260565
o’clock    5.956393575239489
reading    5.8508018932836325
since    5.232505047349898
thro    5.134359771314469
rise    4.96429835119514

Topic 14        421.3781735038306
hill    16.759580367760492
wall    12.852094658004125
billings    11.980000326520402
meadow    11.977296849370191
load    11.908907922048261
seaweed    8.997718825323341
brought    8.386088329672473
dollars    7.880892988461636
green    7.063882209941714
compost    6.99415897646728
sullivan    6.992674001674024
bushes    6.989315287102038
heap    6.989260091647422
trask    6.976853837394429
loads    6.973188147531885
bass    6.96109998720739
wood    6.864584579627929
penns    5.956070052593047
earth    5.067616000732246
thomas    5.0437693656607925

Topic 15        377.279139869195
should    14.714242395918141
may    11.427645785723927
being    11.309756818192291
congress    10.652337301569547
children    8.983289013109097
son    8.449087061231712
well    8.09746455155195
first    7.432256959926409
good    7.309576510891309
america    7.213459745318859
shall    6.9669200007792345
thus    6.941222002768462
state    6.830011194555543
private    6.688248638768475
states    6.546277272369566
navy    5.9781329069165015
must    5.509903082873842
news    5.462992821996899
future    5.105010412312934
present    4.907616840233855

Topic 16        493.77313011717285
here    21.163008393934177
must    15.263246924041349
could    13.448182874872714
come    13.1154629648676
people    12.207438378563111
her    12.032322225589951
thought    11.934271905103078
where    11.674268073284
did    11.499755185405471
never    10.247165614778545
take    9.545499332661917
good    9.258554085792984
against    9.045246831900286
untill    8.922284734969633
feel    8.62595049189018
things    7.35420806731146
let    7.162446814351331
letters    7.044813314090684
every    7.034301339720992
think    6.600187420349309

Topic 17        533.6126604382996
world    21.14121144226815
shall    16.6873882355237
ever    12.840450739003245
cause    12.757482657483397
good    12.22125850724624
public    11.916757266755615
think    11.16106317194544
better    10.59979601485501
myself    10.05947936099857
people    9.626368868759396
found    8.436685518018368
another    7.984455777429125
wit    7.951420891810835
ambition    7.832742489538909
seems    6.95534655267291
can    6.865943334030529
years    6.588874214825393
happy    6.547554571952693
sense    6.466336470580343
own    6.128423779955848

Topic 18        385.6024287288036
french    18.243384948219443
written    15.919785193963612
minister    12.110373497509345
available    10.615420801791679
some    9.903407524395778
who    9.245823795980353
made    8.445444930945051
congress    8.043713670428902
other    7.923965049197159
character    7.1039611800997005
king    7.048852185761656
english    6.856574786621914
governor    6.762114646057875
full    6.520903036682074
heard    6.255137288426042
formed    5.870660807641354
books    5.837244336904303
asked    5.83306916947137
send    5.810249556108117
between    5.776470078486788

Topic 19        463.8399814672987
like    16.669498616970372
thing    13.919424849237771
seen    11.574629657201303
just    10.98068782802169
before    10.39563422370793
same    10.321733495143041
every    9.318964489423447
thinks    9.301816234302077
how    9.241021309790339
himself    8.958291863717678
let    8.894087936920917
make    8.434304091716399
off    8.213148276635565
nothing    7.984654863187227
journey    7.939405373496938
some    7.6577309066599
europe    6.893783103885575
soon    6.689044030622907
found    6.675398872507987
down    6.631600198396599

Topic 20        353.68531355484936
colonies    10.94191657030556
american    10.627625578418174
french    9.311630337530515
ships    9.116808999532285
america    8.789871319781325
officers    8.099674167548653
bay    7.957315892260785
army    6.612323421646989
islands    5.980286057910326
british    5.833357562638487
change    5.796278466222622
trade    5.710893400319914
wheat    4.954474874676769
city    4.75435507054217
war    4.656283829918383
weight    4.43961441864401
line    4.405369216755544
merchant    4.382697612837661
france    4.323760410259206
commerce    4.164120508677964

Topic 21        663.3328901020956
court    23.191854810819954
law    17.01800070632575
same    13.308438831923574
never    12.249789467474011
bar    10.7785741427619
years    10.474991411350114
can    10.31530827620346
life    10.310719978307912
may    10.011849107051194
neither    8.935302948298379
attended    8.899066544161498
own    8.57353477635038
made    8.215259482902361
nor    8.191565693362435
quincy    8.045395882625971
words    7.868507068709965
make    7.841121306376327
study    7.830758248581027
gridley    6.981327263570097
fortune    6.970659363647978

Topic 22        444.49335743654325
rode    24.536745253214907
went    17.80931444551627
where    15.326693159663119
fine    14.737504074200448
here    12.811959486557452
town    11.163377818801834
good    10.399965202601614
country    10.115954454965902
returned    9.946610159497896
sent    9.45932460140789
york    8.842942993140694
far    8.40455813537355
walked    8.10292712657968
found    7.960818829823978
who    7.942186012573014
yet    7.087640549392621
reading    6.1377342721443675
pretty    6.063885099014661
miles    5.93966560596631
boston    5.7082727907358315

Topic 23        543.6900374103357
know    18.191101006744333
whether    15.78450543968227
town    14.055022576890448
because    13.640431511727474
dont    13.290418405939754
may    12.52142656830343
right    11.835923268100672
action    11.614387869486448
your    11.253404670197705
boston    10.871248526446022
how    10.347444434974852
think    9.985622501576309
who    8.228755132810628
first    8.121719449190092
ought    7.3937899556626165
cannot    7.387174049140093
told    7.305589643935685
actions    6.935204500591668
indeed    6.801567815491358
school    6.492034884772648

Topic 24        505.47683569982905
man    23.217007004873825
wife    16.383200023415363
old    15.505727919375182
company    11.87135710162956
her    11.303080931825544
brother    11.264388608786312
most    10.554258634617554
daughter    9.953821896039962
father    8.516105804023132
soft    7.913266833838625
sensible    7.866031987768896
body    7.580389757060074
age    6.461576332785828
prayer    6.152341821891593
deacon    5.966469438272275
worthy    5.910603415191331
america    5.910111486476384
head    5.8441838418267515
she    5.829760334373758
table    5.810224931148112

Topic 25        502.83112116458926
common    22.520928972730687
matter    19.605533934244384
center    18.997621294958265
motion    13.8013885321518
laws    13.68498185905496
gravity    12.981821043042576
without    12.215676727302307
natural    11.80731336312271
could    11.189211661902323
law    9.998850926439658
civil    8.615985620228031
explained    7.925557291691585
may    7.677670945234565
quantity    7.4958610490928415
weight    7.4662831294923455
other    6.454521300129154
method    6.158264173805285
viz    5.979594240750123
nature    5.874506218971089
each    5.86651474112662

Topic 26        401.4115361890247
told    14.728227976657179
england    12.570574634735012
vergennes    11.716830646061664
franklin    10.243002127621843
some    8.865565601935621
tomorrow    8.852773240797053
king    8.576801905509637
war    8.001918987005851
comte    7.952987767458861
france    7.805047355908059
might    7.383532926335597
hartley    6.881418224713707
letter    6.821062344954338
came    6.789983747489053
question    6.53708542834884
take    6.078365525422842
oswald    5.964517547872566
spain    5.783274480561738
going    5.722509293785221
ought    5.58933550149227

Topic 27        657.012669751482
went    62.25767071441023
mrs    32.25314279659847
where    23.755124311257475
adams    17.45190308026506
drank    13.989243300471225
coll    13.896730677244562
saw    13.222930412662718
lee    12.766948515702163
over    11.158737393247662
returned    11.024386996977343
heard    10.751605600641943
boston    9.830226625893596
tea    9.398248400985807
meeting    9.130728190257793
came    8.933210236940583
quincy    8.742633768541735
general    8.601958003116273
lodged    8.426462498581724
clock    8.192175041979231
took    7.64842269106185

Topic 28        196.62298102607843
les    15.974912107587873
qui    12.993974227384623
est    10.921318971843203
des    9.940284808306902
monsieur    8.76715612256162
dans    7.990399412603722
avec    6.995117277170765
tres    5.999587058369549
par    5.998266612768523
beaucoup    5.9954609141058315
chez    5.993936120612361
paris    5.789266058146875
que    4.998832211462627
une    4.928851121632018
dinner    4.199176027269274
ceux    3.9995544808682206
pour    3.997874624547264
france    3.9715721279153993
abby    3.940561021177146
comedy    3.9039564105620252

Topic 29        391.69536987760665
men    18.586530036898395
board    15.091159128577802
who    14.099967935056736
weather    10.194710293899979
sea    9.919929748518475
penniman    7.973664774319444
came    7.792604426668765
put    7.712537705671322
arrived    7.220271660040034
sick    6.930924079248561
thence    6.325015441175995
captain    5.906742353480079
select    5.905775358459385
once    5.651983069768761
ship    5.168083696975586
never    5.095483872236855
voyage    5.089278272209983
end    4.97656430440148
deacon    4.96448590855028
taken    4.951013356485312

We have 30 topics, their top words, and the relative weight of these words per topic. If you return to the “lda-afbfe5c4-30-4f47b13a” folder, there is also a .csv file indicating the distribution of topics over each of the diary entries, which can be visualized or explored further in a variety of ways.
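
Reading thirty topics by eye is tedious, so it may be worth loading ‘summary.txt’ into a structure you can sort and search. The Python sketch below assumes the layout printed above: a header line of the form ‘Topic NN’ followed by a weight, then one whitespace-separated word and weight per line. The function name `parse_summary` and the dictionary keys are our own choices, not part of the TMT.

```python
def parse_summary(text):
    """Return a list of topics, each a dict with 'id', 'weight', and 'terms'."""
    topics = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # blank line between topics
        if parts[0] == "Topic" and len(parts) == 3:
            # Header line: "Topic 00   454.55..."
            topics.append({"id": int(parts[1]), "weight": float(parts[2]), "terms": []})
        elif len(parts) == 2 and topics:
            # Term line: "company   24.66..."
            topics[-1]["terms"].append((parts[0], float(parts[1])))
    return topics


# The first lines of the output shown above, used as sample input.
sample = (
    "Topic 00\t454.55948070466843\n"
    "company\t24.665848270843448\n"
    "came\t21.932877059649453\n"
    "\n"
    "Topic 01\t502.4277262791061\n"
    "you\t30.658841887225073\n"
)

topics = parse_summary(sample)
for topic in topics:
    words = ", ".join(word for word, _ in topic["terms"])
    print(topic["id"], round(topic["weight"], 1), words)
```

To run it on the real file, read the text with `open(path, encoding="utf-8").read()`, where `path` points at the ‘summary.txt’ inside the last iteration folder, and pass the result to `parse_summary`.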

In the next section, we will ‘slice’ the TMT topic model output to explore the development of these topics over time.



Source: http://www.themacroscope.org/?page_id=235