An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Sidebar: Advanced Text Mining

There are other, more advanced tools worth exploring as your familiarity with text mining grows. Some have large user bases; others have smaller ones. They include proprietary yet powerful programming languages, packages that have grown out of computational linguistics and Natural Language Processing (NLP), and the user-unfriendly yet important Software Environment for the Advancement of Scholarly Research. We do little more than sketch the existence of these three kinds of data mining tools, as they are largely beyond the purview of this book. The code provided in this book has generally been in Python or, as we move into the next section, the programming language R. Here, we take a brief detour.

The Software Environment for the Advancement of Scholarly Research, or SEASR, is an important software suite developed by a team of humanities researchers and developers at the University of Illinois at Urbana-Champaign. While SEASR was originally very difficult to install and execute, it is now offered in fully compiled, executable form at http://www.seasr.org/meandre/download/. It may require some basic command line knowledge, but the documentation is reasonably thorough (see note 1). Once it is downloaded, the various ‘start’ and ‘stop’ commands start or stop the environment, which runs in the background of your computer. On OS X, for example, one simply needs to unzip the download and then click on the following two files: ‘start-infrastructure.command’ and ‘start-workbench.command’. The workbench can then be accessed through your web browser at http://localhost:1712/Workbench.html. You may then need to add a basic set of default flows and components. To add these, click on ‘locations’ in the left-hand panel of the environment, and then add the following locations: http://repository.seasr.org/Meandre/Locations/Latest/Flows/demo-all/repository_components.rdf (for components) and http://repository.seasr.org/Meandre/Locations/Latest/Flows/demo-all/repository_flows.rdf (for flows).

Explicitly designed for digital humanists, SEASR operates as a module-based system: small components are joined together so that they execute as an overall program. One component might import textual data, for example, and then pass this data to a second component that converts it all to lower case. In this environment, a group of several components joined together is called a ‘flow.’ Consider the following demonstration flow, built into the platform:

[Figure: The ‘Demo Token Counts’ Flow in MEANDRE, Right after Hitting ‘Run Flow’]

Each component has ‘inputs’ and ‘outputs,’ except for a handful of ‘input’ components which take data from the Workbench itself. In the above flow, we can see in the red component that data is imported (a window pops up for you to enter a URL, paste some text, or upload a file). The data then passes through a series of steps: a component works out what kind of data it is, an extractor pulls out the text accordingly (from a website, for example), and the text is put into lower case, tokenized, and counted. The counts are then converted back into text, and eventually we have an output. This flow counts tokens, or words.
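The chain of steps described above can be sketched in Python, the language used elsewhere in this book. This is only an illustration of the component-and-flow idea, not how SEASR is implemented: the function names, the crude tokenizer, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def to_lower(text):
    """Component: put the text into lower case."""
    return text.lower()


def tokenize(text):
    """Component: split text into tokens (a crude tokenizer, assumed here)."""
    return re.findall(r"[a-z0-9']+", text)


def count_tokens(tokens):
    """Component: count how often each token occurs."""
    return Counter(tokens)


def to_text(counts):
    """Component: turn the counts back into text, one token per line."""
    return "\n".join(f"{token}\t{n}" for token, n in counts.most_common())


def run_flow(data, components):
    """Pass the output of each component to the next one, in order."""
    for component in components:
        data = component(data)
    return data


# A 'flow' is simply an ordered list of components.
token_count_flow = [to_lower, tokenize, count_tokens, to_text]

sample = "The cat saw the dog. The dog ran."  # made-up sample text
report = run_flow(sample, token_count_flow)
print(report)
```

Because each component only needs to accept the previous component's output, steps can be swapped or removed without touching the others, which is the appeal of the flow model.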

Where MEANDRE, the workflow engine underlying SEASR, shines is that fairly complicated NLP components are built right into it. Take part-of-speech (POS) tagging. Consider the more advanced flow below:

[Figure: The ‘Demo Part-of-Speech Tagging’ Flow]

This flow takes data, extracts the text, runs it through the open-source OpenNLP package, and eventually outputs every single word within the text tagged with the Penn Treebank POS tags. Here is an excerpt of John Adams’ diary after passing through this flow:

[Figure: Output from the Flow Laid out in the Previous Image]

The full legend for the Penn Treebank POS tags is available in several places online, such as http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html. For example, PRP is a personal pronoun; NN is a common noun, singular or mass; NNP is a proper noun; JJ is an adjective; and so forth. If you are only interested in a select few types, you can edit the component that divides the text into the POS tags. If you click on the OpenNLP POS Tagger component, various properties appear. By reading the documentation that appears at lower right, you can see that the filter_regex field needs to be filled out if you want to be selective. So if you want adjectives and proper nouns, the regex (as discussed previously in this chapter) would be JJ|NNP.

[Figure: Altering the Properties for the POS Tagger (note right panel)]
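To see what a filter such as JJ|NNP does, the following Python sketch applies the same regular expression to a short list of word and tag pairs. The tagged tokens are hand-written for illustration and are not output from SEASR, and the helper name filter_by_tag is hypothetical. We also assume here that the pattern must match the whole tag; how the OpenNLP POS Tagger component itself applies filter_regex should be checked against its documentation.

```python
import re

# Hand-tagged tokens using Penn Treebank tags (illustrative, not SEASR output).
tagged_tokens = [
    ("I", "PRP"),
    ("visited", "VBD"),
    ("old", "JJ"),
    ("Boston", "NNP"),
    ("yesterday", "NN"),
]


def filter_by_tag(tokens, pattern):
    """Keep only the (word, tag) pairs whose tag matches the pattern in full."""
    tag_regex = re.compile(pattern)
    return [(word, tag) for word, tag in tokens if tag_regex.fullmatch(tag)]


# Adjectives (JJ) and singular proper nouns (NNP) only.
selected = filter_by_tag(tagged_tokens, "JJ|NNP")
print(selected)
```

Under the whole-tag assumption, related tags such as JJR (comparative adjective) or NNPS (plural proper noun) are excluded; a broader pattern such as JJ.*|NNP.* would be needed to include them.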

This is just a brief introduction to some of the flows that are possible using MEANDRE. While we consider this an advanced tool, it is making great strides in usability and should be considered for more advanced flows and procedures. As you develop your skills and competencies throughout this book, remember that you may be able to use MEANDRE to join various components together into flows of your own.

  1. See http://www.seasr.org/meandre/documentation/installation/

Source: http://www.themacroscope.org/?page_id=385