8000 Canadians

Reading thirteen volumes from cover to cover is not lightly to be essayed. This author tried it for eight volumes; then, in the middle of volume IX came volume XIII in the mail. The sheer size of volume XIII broke the resolution that had stood staunch until then. It occurred to this author, as it might well have done before, that there must be other ways of enjoying and assessing the quality and range of the DCB than the straight route he had chosen. ((P.B. Waite, ‘Journeys through thirteen volumes: The Dictionary of Canadian Biography’ The Canadian Historical Review 76.3 (Sept. 1995):464-481))

The Dictionary of Canadian Biography (DCB) contains biographical essays, in both French and English, on over 8,000 Canadians (and sometimes non-Canadians whose exploits are of significant interest to Canada, including the Vikings Eric the Red and his son, Leif Ericsson). It originated in the bequest of James Nicholson, who in 1952 left his estate to the University of Toronto for the purpose of creating ‘a biographical reference work for Canada of truly national importance’. Today, it is hosted and curated by the University of Toronto and the Université Laval, with the support of the Government of Canada.

Every biographical essay is available online at www.biographi.ca. This makes it an excellent candidate as an example dataset for demonstrating a big history approach. What patterns will we see in this corpus of historical writing? The essays have been written over a period of 50 years, and so they span a number of different fashions in historiography, and are the work of hundreds of scholars. If there are global patterns in Canadian History writ large, then an examination of the DCB might discover them. Because the DCB is a series of biographical sketches, what we might extract is naturally constrained by the dictates of the form. For instance, we note later in this section the relative paucity of individuals from First Nations, but as our colleague Michel Hogue points out, this paucity could be explained not only in terms of the ways outsiders have regarded First Nations’ histories, or the nationalism implicit in a project of this nature, but also by the difficulty in crafting a biography of peoples with an oral tradition. ((Hogue, pers. comm.)) A sidebar to this section will discuss the historiography of the DCB and the implications of its editorial policies [and will appear on this website]. ((For now, it is worth pointing out that early reviewers of the DCB thought that its focus on a chronological arrangement (in this case, the year of death of an individual determining in which volume they would appear) rather than an alphabetical arrangement might reveal patterns in Canadian history that would otherwise be hidden; see for instance Conway, John. ‘Review of The Dictionary of Canadian Biography. Volume 1: 1000 to 1700’, The Catholic Historical Review, Vol. 55, No. 4 (Jan., 1970), pp. 645-648, at p. 646.
And of course every volume of the DCB will be a product of its time; senior figures in Canadian history have all contributed essays to the DCB, and it is works such as Berger’s The Writing of Canadian History: Aspects of English-Canadian Historical Writing: 1900-1970 (Oxford UP, 1976) that signpost the world(s) in which these contributors were working. Waite’s 1995 review of the 13 volumes to that point notes the various lacunae – women, for one – and contexts (the 1960s put their indelible stamp on volume 10) that flavour the DCB.))

There are a number of ways one could extract the information from the DCB. ((A plugin for Firefox called ‘Outwit Hub’ can for instance be used to extract the biographical text from each webpage, saving it in a csv spreadsheet (the free version of Outwit Hub is limited to 100 rows of information). Using Outwit Hub, one can examine the html source of a page, identify which tags embrace the information one is interested in, and then direct the program to automatically page through the website, scraping the text between those tags. More information about using Outwit Hub is provided in the example concerning John Adams’ Diaries)) In the case of the Dictionary of Canadian Biography, we did the following:

1. Used wget to download every biographical essay (which may currently be found at http://www.biographi.ca/en/bio/). The Programming Historian 2 has additional information on how to use wget in this manner. ((Ian Milligan, “Automated Downloading with Wget,” Programming Historian 2, August 2012, available online, http://programminghistorian.org/lessons/automated-downloading-with-wget.))

2. Stripped out the html. One could use a script in R such as Ben Marwick’s html2dtm.r or run a Python script such as that suggested in The Programming Historian. It is also possible to use Notepad++ to strip out everything before, and everything after, the text of interest across multiple files, using search and replace and the careful use of regex patterns. We left only the biographical essays, ignoring the bibliography and other metadata surrounding them.

3. Fitted a topic model.

4. Performed a network analysis to visualize the results.
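As an illustration of step 2, a minimal sketch of tag stripping using only Python’s standard library might look like the following. It is not the script used for this analysis: the names `TextExtractor` and `strip_html` are hypothetical, and the sketch removes all markup without trying to isolate the biographical essay from the bibliography, since that would require assumptions about the page structure of www.biographi.ca that we do not make here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the text of an HTML string with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Invented sample markup, for illustration only.
sample = "<html><body><p>fur  trader &amp; <b>seigneur</b></p><script>x()</script></body></html>"
print(strip_html(sample))
```

In practice one would loop over the downloaded files, apply `strip_html` to each, and save the result as plain text ready for topic modeling.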

There are now over 8,400 biographies on the site. However, for this analysis, we used a dataset of 8,000 biographies that were scraped and cleaned up prior to the website’s most recent redesign in 2011, when the site moved from Library and Archives Canada to its current home. Topic modeling should be done multiple times, in order to find the ‘right’ number of topics. In this case, we found that around 30 topics captured the spread of themes rather well. Using the R script detailed in the previous section, we found that topics clustered together rather neatly. In image xxx the chronological and thematic associations apparent in the labels mean that the diagram can sensibly be read from top left to bottom right.

Dendrogram of topics in the DCB

Broadly, there is a topic called ‘land township lands acres river’… and then everything else. In a land like Canada, it is unsurprising that so much of every biography should be connected with the idea of settlement, if we can take these words as indicative of a discourse surrounding the opening up of territory for new townships, especially given the importance of river transport (Harold Innis would not be surprised).

The next branch down divides neatly in two, with subbranches to the left neatly covering the oldest colonies (excluding Newfoundland, which didn’t enter Confederation until 1949). Take another step down to the right, and we have topics related to church, then to education, then to medicine. Taking the left branch of the next step brings us to discourses surrounding the construction of railways, relationships with First Nations, and government. Alongside the government branch is a topic that tells us exactly what flavour of government, too – Liberal (the Liberal party has governed Canada for the majority of years since Confederation).

Scanning along amongst the remaining branches of the dendrogram, we spot topics that clearly separate out military history, industry, French Canada, the Hudson’s Bay Company and exploration. At a level of the dendrogram equal to all of these we have two topics that betray the DCB’s genesis in a scholarly milieu in WASP Toronto of the 1950s – ‘published canadian history author work’ and ‘methodist canada united black christian’.

This dendrogram contains within it not just the broad lines of the thematic unity of Canadian History as practised by Canadian historians, but also its chronological periodisation. This is perhaps more apparent when we represent the composition of these biographies as a kind of a network. Recall that our topic modeling script in R also created a similarity matrix for all documents. The proportions of each topic in each document were correlated so that documents with a similar overall composition would be tied (with weighting) to other most similar documents. The script output this networked representation so that we could analyze it using the Gephi package.
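To make the idea of a correlation-based network concrete, here is a minimal sketch in Python (not the R script itself). It assumes Pearson correlation between topic-proportion vectors and an arbitrary cut-off of 0.5; both are assumptions for illustration, as are the function names and the three invented documents. The edge list is written with `Source`, `Target` and `Weight` columns, which Gephi’s spreadsheet importer recognizes.

```python
import csv
import io
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / (sx * sy)


def similarity_edges(doc_topics, threshold=0.5):
    """Tie each pair of documents whose topic proportions correlate at or above threshold."""
    edges = []
    for a, b in combinations(sorted(doc_topics), 2):
        r = pearson(doc_topics[a], doc_topics[b])
        if r >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges


def edges_to_csv(edges):
    """Serialize weighted edges as a CSV edge list."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()


# Invented topic proportions for three hypothetical biographies (three topics each).
doc_topics = {
    "farmer": [0.7, 0.2, 0.1],
    "seigneur": [0.6, 0.3, 0.1],
    "priest": [0.1, 0.1, 0.8],
}
edges = similarity_edges(doc_topics)
print(edges_to_csv(edges))
```

With these invented numbers only the farmer and the seigneur are tied together, because their lives are composed of the same topics in similar proportions.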

Positively correlated topics in the DCB

We take first the network of positively correlated topics, as in figure xxx. The edges connecting the topics together represent the strength of the correlation, i.e., themes that tend to appear together in the same life. The heavier the edge, the more often those topics go hand in hand. We can then ask Gephi to colour-code those nodes (topics) and edges (correlations) so that those having similar patterns of connectivity are coloured the same way. ‘Railway construction line industry works’ often appears in thematic space together with ‘canadian ottawa british victoria vancouver’, which makes sense to us knowing that British Columbia’s price for entry into Confederation was the construction of the transcontinental railway, a project which, with its associated scandals and controversies, took up many years and governments in the late 19th century. This network diagram is really not all that different from the dendrogram visualization we examined first. What we can do here, and cannot do with the dendrogram, is ask which topics tie the entire corpus together. Which topics do the heavy lifting, semantically? This is not the same thing as asking which topics are found most often. Rather, we are looking for the topic that most often is the pivot point on which the essay will hang. The metric for determining this is betweenness centrality, a measure of how often a node lies on the shortest paths between other pairs of nodes. In the figure above, the nodes are sized according to their relative betweenness scores. The topic that ties Canadian history together, on this reading, is ‘government political party election liberal’.
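For readers who want to see what betweenness centrality measures, the following is a minimal sketch of Brandes’ algorithm for an unweighted, undirected graph. It is offered as an illustration only: Gephi computes this metric for us, the sketch ignores edge weights, and the five-node graph with its topic labels is invented for the example.

```python
from collections import deque


def betweenness(graph):
    """Unnormalised betweenness for a graph given as {node: set_of_neighbours}."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in graph}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected path was counted once from each of its endpoints.
    return {v: c / 2 for v, c in score.items()}


# Invented toy network of topics.
toy = {
    "government": {"railway", "church", "military"},
    "railway": {"government", "industry"},
    "church": {"government"},
    "military": {"government"},
    "industry": {"railway"},
}
scores = betweenness(toy)
for topic in sorted(scores, key=scores.get, reverse=True):
    print(topic, scores[topic])
```

In this toy network ‘government’ scores highest not because it has the most text attached to it, but because most shortest routes between the other topics pass through it.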

This throws up some interesting questions. Is the Liberal Party of Canada really the glue that holds our sense of national identity together? Or do the authors of these essays feel they need to discuss Liberal party affiliations (for surely not every individual in the corpus was a member of the party) when they see them, out of a sense of class solidarity (the Liberal Party being chiefly a party of the centre-left, a traditional home for academics)? How representative of Canadian history is this corpus of 8000 people? Did those who joined the party enjoy a higher prominence in the Liberal-affiliated newspapers of the time (which are now the sources used by the academics)?

When we look at the network visualization of individuals, where ties represent similar proportions of similar topics, we see a very complicated picture indeed. Of the 8,000 individuals, some 4,361 (or 55%) tie together into a coherent clump. The other 45% are either isolated or participate in smaller clumps. Let us focus on that giant component. We can search for groupings within this clump. The image shows the clump coloured by modules. These subgroups again seem to make sense on periodization grounds: francophones from the New France era all clump together not because they are from New France or knew one another or had genuine social connections, but because the biographical sketches of this period all tend to tell us the same kinds of stories. They are stories about coureurs des bois, seigneurs, and government officials. What is perhaps more interesting will be the cases where individuals are grouped together out-of-time.
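The notion of a giant component can be sketched in a few lines of Python. The code below is an illustration under stated assumptions, not the procedure used for the figure: the six-node graph and the function names are invented, and the sketch covers only connected components, not the module (community) detection that Gephi performs.

```python
from collections import deque


def connected_components(graph):
    """Return the components of {node: set_of_neighbours}, largest first."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        members = [start]
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
                    members.append(w)
        components.append(members)
    return sorted(components, key=len, reverse=True)


def giant_share(graph):
    """Fraction of all nodes that belong to the largest component."""
    if not graph:
        return 0.0
    return len(connected_components(graph)[0]) / len(graph)


# Invented toy network: one clump of three, one pair, one isolate.
toy = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b"},
    "d": {"e"},
    "e": {"d"},
    "f": set(),
}
print(giant_share(toy))
```

Applied to the full network, the same calculation is what lies behind the figure quoted above: 4,361 of 8,000 individuals is a share of roughly 0.545.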

We can also determine the individuals whose lives contain the thematic glue that ties this clump together. Will they be the politicians and great men whom we might reasonably expect? After all, the network of topics itself points us towards a particular kind of political history, so this is not an untoward expectation. Here is what we find, when we rank by betweenness centrality:

GADOIS, PIERRE -> Montreal Island farmer

BLONDEAU, MAURICE-RÉGIS -> fur trader

PÉRINAULT (Perrinault), JOSEPH -> tailor

LEGARDEUR DE REPENTIGNY, PIERRE -> Governor Montmagny’s lieutenant

STUART, Sir JAMES -> lawyer

MARTEL DE MAGOS (Magesse), JEAN -> soldier

MONK, Sir JAMES -> lawyer

TASCHEREAU, THOMAS-JACQUES -> agent of the treasurers-general of the Marine

These men are for the most part footnotes in the broader story of Canadian history. Yet there is something about them that incorporates the warp and weft of that story. This network visualization, and these metrics, draw our attention away from the great men and instead direct our attention to individuals – a tailor! – whom we might not normally pay much attention to, in the broader scheme of things. If we look at the biography of that tailor, M. Joseph Perrinault, we find a quite exceptional life. Indeed, though the biography lists him by his first occupation (tailor), by the end of his life he was Justice of the Peace for Montreal, a representative in the legislature for Montreal West (with James McGill, the eponym for the university), a commissioner for the relief of the insane and foundlings, and a one-time fur trader.

Let us look at a name every student of Canadian history should know: Sir John A. Macdonald, the first Prime Minister of Canada. Surely he was as important to Canada as Washington was to the United States, and so, even if he is not in the top ten, perhaps he is in the top 100? That is, we imagine that our politicians and leaders should be capable men and women with broad experience of the many avenues of life. This is not the case, unfortunately for Sir John. He appears on this list in 2,523rd place. Perrinault was active in politics in the last quarter of the 18th century, while Macdonald was active around fifty years later. William Lyon Mackenzie King, who was Prime Minister during World War II, is second last. What if we looked at the betweenness centrality scores for every politician on this list, arranged by the period they were most active? Perhaps we might see a narrowing of horizons over time. Certainly the current Canadian Parliament (the 41st, with 310 members) has 112 former lawyers. Very few of those presumably had careers previous to the law.


Treating the biographies of 8,000 individuals who are spread over the centuries is a very distant way of looking at the patterns. Alternatively, we could have begun by dividing the source documents into ‘century’ bins, and topic modeling each group separately. If, however, we are happy with creating a single topic model of all 8,000, we can still examine patterns over chronology by sorting our individuals into distinct networks, filtering by attributes in Gephi (as discussed in the sidebar).
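Filtering a network by an attribute amounts to keeping only the nodes that carry a given value, together with the ties that run between them. A minimal sketch of that operation follows; the function name, the node identifiers and the period labels are all hypothetical, and this is a stand-in for the idea rather than a description of how Gephi implements its filters.

```python
def subgraph_for(graph, period_of, period):
    """Keep the nodes labelled with `period` and only the ties among them.

    graph:     {node: set_of_neighbours}
    period_of: {node: period_label}
    """
    keep = {node for node in graph if period_of.get(node) == period}
    return {node: graph[node] & keep for node in keep}


# Invented toy network: bio_1 and bio_2 share a period, bio_3 does not.
toy = {
    "bio_1": {"bio_2"},
    "bio_2": {"bio_1", "bio_3"},
    "bio_3": {"bio_2"},
}
period_of = {"bio_1": "early", "bio_2": "early", "bio_3": "late"}

early = subgraph_for(toy, period_of, "early")
print(early)
```

Note that the tie between bio_2 and bio_3 disappears from the filtered network, which is one reason the networks for each period can look more fragmented than the network as a whole.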

Broadly considered, let us divide the 8,000 into ‘17th century and earlier’, ‘the long 18th century’, and ‘the 19th century’ (which we’ll draw to a close with World War I). What does Canadian history through lives lived look like this way?

The biographies with the highest betweenness for the 17th century are:

LEGARDEUR DE REPENTIGNY, PIERRE; Governor Huault de Montmagny’s lieutenant

LAUSON, JEAN DE (Junior); grand seneschal of New France

BOURDON, JEAN (sometimes called M. de Saint-Jean or Sieur de Saint-François); seigneur, engineer, surveyor, cartographer, business man, procurator-syndic of the village of Quebec, head clerk of the Communauté des Habitants, explorer, attorney-general in the Conseil Souverain

DUBOIS DE COCREAUMONT ET DE SAINT-MAURICE, JEAN-BAPTISTE; esquire, artillery commander and staff officer in the Carignan-Salières regiment

LE VIEUX DE HAUTEVILLE, NICOLAS; lieutenant-general for civil and criminal affairs in the seneschal’s court at Quebec

MESSIER, MARTINE; wife of Antoine Primot; b. at Saint-Denis-le-Thiboult

LAUSON, JEAN DE (Senior); governor of New France

HÉBERT, JOSEPH; grandson of Canada’s first settler, only son of Guillaume Hébert and Hélène Desportes

EROUACHY (Eroachi, Esrouachit); known to the French as “La Ferrière,” “La Forière,” “La Fourière,” “La Foyrière”; chief of the Montagnais Indians around Tadoussac

ATIRONTA (Aëoptahon), JEAN-BAPTISTE; a captain in the Huron Indian village of Cahiagué (near Hawkestone, Ontario)

In these ten individuals, we have encapsulated the history of the French regime in North America – governors and seneschals, officers and aboriginal allies, and one woman. Martine Messier is primarily remembered for her courage when under attack by three Iroquois warriors (a courage retroactively imputed to her, perhaps, as her grandsons, the Le Moyne brothers, were famed adventurers).

The individuals whose stories tie the long 18th century together in what became Canada are:

GADOIS, PIERRE; Montreal Island farmer, armourer, gunsmith, witchcraft victim

DUNN, THOMAS; businessman, seigneur, office holder, politician, judge, and colonial administrator

MARTEL DE MAGOS (Magesse), JEAN; soldier, merchant, trader, seigneur, clerk in the king’s stores

McGILL, JAMES; merchant, office holder, politician, landowner, militia officer, and philanthropist

KERR, JAMES; lawyer, judge, and politician

GRANT, Sir WILLIAM; lawyer, militia officer, and office holder

TODD, ISAAC; businessman, office holder, militia officer, and landowner

DOBIE, RICHARD; fur trader, businessman, and militia officer

POWNALL, Sir GEORGE; office holder, politician, and justice of the peace

TASCHEREAU, THOMAS-JACQUES; agent of the treasurers-general of the Marine, councillor in the Conseil Supérieur, seigneur

In these lives, we see the concern with integrating the newly acquired Francophone colonists into the developing British world system. Echoes from the earlier regime, as evidenced by Gadois and Taschereau, still reverberate. There is no real reason why one would select the ‘top ten’ versus the ‘top twenty’ or ‘top one hundred’, but it is interesting that no Aboriginal individual appears on this list until the 230th place (of 2,954 individuals), suggesting perhaps the beginnings of the eclipse of First Nations’ history in the broader story of Canada (a supposition that would require deeper analysis to support or refute).

As we move through these broad periods, the overall network structure each time becomes more atomized, with more and more individuals whose (thematic) lives do not tie into the larger group. In the nineteenth century modern Canada is founded. For the first time, religious callings appear in the lives of the top ten individuals whose stories tie the network together:

GUIBORD, JOSEPH; typographer, member of the Institut Canadien

LANGEVIN, EDMOND (baptized Edmond-Charles-Hippolyte); priest and vicar general

GÉLINAS, ÉVARISTE; journalist, federal civil servant

BABY, LOUIS-FRANÇOIS-GEORGES; office holder, lawyer, politician, judge, and collector

STUART, GEORGE OKILL (O’Kill); lawyer, politician, and judge

SIMPSON, JOHN; government official and politician

DAY, CHARLES DEWEY; lawyer, politician, judge, and educationalist

CONNOLLY (Connelly), MARY, named Sister Mary Clare; member of the Sisters of Charity of Halifax and teacher

CREEDON, MARIANNE (Mary Ann), named Mother Mary Francis (Frances); member of the Congregation of the Sisters of Mercy, mother superior, and educator

WALSH, WILLIAM; Roman Catholic priest, archbishop, and author

Indeed, the top one hundred in this network are connected with the church (the women who appear are predominantly nuns or teachers or both), the state, or the law. While to call Guibord a typographer is correct, it hides what he was typesetting. Guibord’s life encapsulated battles within Catholicism over liberal learning (the Institut contained a library whose books placed it in opposition to mainstream Catholic teachings at the time). These individuals then speak to the playing out of battles within Catholicism and Protestantism, and between them, in the development of modern Canada. In the British North America Act of 1867, the spheres of religious influence are rigorously laid down, down to the specific rights and privileges that certain Anglophone (Protestant) ridings within the new province of Quebec were to have, and in other territories under the control of the new state. These decisions continue to have ramifications to this day; we can perhaps find their origins in these particular lives.

In these thirty lives, we see a picture of Canadian history, familiar and strange at the same time, that suggests deeper questions, further research, and other avenues to explore. We do not do this kind of data mining with the idea that we are trying to provide definitive, conclusive justification for the historical stories we are trying to tell. Rather, we are trying to generate new insights, and new kinds of questions. Explore this model and visualization for yourself at http://themacroscope.org/interactive/dcbnet/.

Exploring the DCB by Volume

This exploration of the Dictionary of Canadian Biography is of course an exploration of a historical document in its own right. Fashions in historiography change, and particular individuals leave or join the project. The temporal period covered by the various volumes changes over time as well. Volume one covered all those of interest who died prior to 1700; volume two, those who died from 1701-1740; volume three, 1741-1770; volume four, 1771-1800; volume five, 1801-1820. As the volumes have progressed, shorter and shorter spans of time have been covered while the rough number of individuals described remains the same. The volumes were not necessarily published chronologically either – volumes three, four, nine, and ten were published in the 1970s. ((The Dictionary of Canadian Biography, Volumes http://www.utoronto.ca/dcb-dbc/dcba/publications.htm))
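The volume boundaries just listed are enough to sort biographies into volumes by year of death. The sketch below covers volumes one to five only, since those are the ranges given above; the function name is hypothetical, and it assumes that a death in 1700 itself belongs to volume one.

```python
from bisect import bisect_left

# Last year of death covered by volumes one to five, as listed in the text above.
VOLUME_END_YEARS = [1700, 1740, 1770, 1800, 1820]


def volume_for(death_year):
    """Return the volume number (1-5) for a year of death, or None if later than 1820."""
    index = bisect_left(VOLUME_END_YEARS, death_year)
    return index + 1 if index < len(VOLUME_END_YEARS) else None


# Width in years of the span covered by volumes two to five.
spans = [end - start for start, end in zip(VOLUME_END_YEARS, VOLUME_END_YEARS[1:])]

print(volume_for(1667), volume_for(1741), volume_for(1813))
print(spans)
```

The `spans` list makes the narrowing visible: each successive volume covers the same number of years as its predecessor, or fewer.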

That being so, one can reasonably suggest that a topic model fitted across all biographies at once does not do justice to the materials contained therein. Using our R script, it is the work of a morning to organize the scraped files into their respective volumes, and to generate topic models by volume (and so, fifteen volumes). The results of a topic model fitted to individuals covered by volume I may be explored at themacroscope.org/interactive/dcbvol1. Examining that visualization (which uses a radial layout, grouping members of a community or module as spokes along a wheel), we see that this is primarily a volume about the men leading New France. If we take the inverse view, and look at the patterning of topics at themacroscope.org/interactive/dcbvol1-topics, the module containing the topics with the highest betweenness contains these topics:

  • made, years, return, back, man
  • order, gave, hand, religious, church
  • France, Frontenac, Talon, intendant, governor
  • Bay, governor, French, HBC, Hudson
  • Time, trade, men, colony, left
  • England, Newfoundland, London, Sir, William

Volume 1 is concerned principally with governance, which ties nicely to the overall vision of the Dictionary of Canadian Biography that we generated previously, as being concerned with the story of the governing of Canada (recall the topic with the highest betweenness across all 8,000 biographies).

We can explore the results of the topic models generated at the level of the individual volumes by visualizing the interrelationships of topic words and proportions as a series of dendrograms. In these dendrograms, the feature being graphed is the distribution of topic words over the corpus for that volume; topics with a similar distribution within biographies are grouped together. In the network visualization, the groups are formed on the basis of overall correlations, that is, of which topics seem to be positively associated with one another (the presence of one will imply the presence of another, to varying degrees). The dendrograms are presented online as a slideshow (at http://www.slideshare.net/DoctorG/topics-over-volumes-dcb), to allow the viewer to scroll through and compare. A quick sense of the ‘most important’ documents, from this perspective, can be had by noting the topic that always appears in the top left side of the dendrogram. Volume I constantly touches on the figure of La Salle; Volume V covers the period from 1801-1820, and the major topic is ‘british, American, new, York, great’ – a topic relating to the War of 1812.


All of these output files are provided via the Historian’s Macroscope GitHub page, and we encourage the reader to download them and explore the patterns for herself. After all, if the point of this work is to generate new kinds of insights, then we who create this new data must share it in order for others to have the tools to do so.