You are on the training database
You are on the training database
Principal investigator: Liam
Level: Personal research project
Topic modeling is very popular at the moment in the digital humanities. A recent tutorial on getting started with this tool explains them as tools for extracting topics or injecting semantic meaning into vocabularies: "Topic models represent a family of computer programs that extract topics from texts. A topic to the computer is a list of words that occur in statistically meaningful ways. A text can be an email, a blog post, a book chapter, a journal article, a diary entry - that is, any kind of unstructured text" (Graham, Weingart, and Milligan 2012).
In that tutorial, 'unstructured' means that there is no encoding in the text by which a computer can model any of its semantic meaning. Archaeological datasets are rich, largely unstructured bodies of text. While there are examples of archaeological datasets that are coded with semantic meaning through xml and Text Encoding Initiative practices, many of these are done after the fact of excavation or collection. In the field, things can be rather different, and this material can be considered to be 'largely unstructured' despite the use of databases, controlled vocabulary, and other means to maintain standardized descriptions of what is excavated, collected, and analyzed. This is because of the human factor. Not all archaeologists are equally skilled. Not all data gets recorded according to the standards. Where some see few differences in a particular clay fabric type, others might see many, and vice versa. Archaeological custom might call a particular vessel type a 'casserole', thus suggesting a particular use, only because in the 19th century when that vessel type was first encountered it reminded the archaeologist of what was in his kitchen - there is no necessary correlation between what we as archaeologists call things and what those things were originally used for. Further, once data is recorded (and the site has been destroyed through the excavation process), we tend to analyze these materials in isolation. That is, we write our analyses based on all of the examples of a particular type, rather than considering the interrelationships amongst the data found in the same context or locus.
David Mimno in 2009 turned the tools of data analysis on the databases of household materials recovered and recorded room by room at Pompeii. He considered each room as a 'document' and the artefacts therein as the 'tokens' or 'words' within that document, for the purposes of topic modeling. The resulting 'topics' of this analysis are what he calls 'vocabularies' of object types which when taken together can suggest the mixture of functions particular rooms may have had in Pompeii. He writes, 'the purpose of this tool is not to show that topic modeling is the best tool for archaeological investigation, but that it is an appropriate tool that can provide a complement to human analysis....mathematically concrete in its biases'. The 'casseroles' of Pompeii turn out to have nothing to do with food preparation, in Mimno's analysis. To date, this is the only example of topic modeling applied to archaeological data. As such, it is novel in the digital humanities for applying the tools of data mining not to texts, but to things. In this paper, I explore the use of topic models on another rich archaeological dataset, the Portable Antiquities Scheme database in the UK [subject to permission].
The Portable Antiquities Scheme is a project "to encourage the voluntary recording of archaeological objects found by members of the public in England and Wales". To date, there are over half a million unique records in the Scheme's database. I use topic modeling of this database to tease out archaeological patterns -the discourses of topic modeling, to use Ted Underwood's phrasing - over both time and space. In order to visualize these discourses, I map them both in geographic and relational space, using the network analysis program Gephi . The constellation of ideas (the resultant 'topics') that make up the various discourses in the data can be represented as nodes while the strengths of the associations suggested by the topic model can be represented as edges. This two-mode graph (words and 'topics' or 'discourses') can be queried for deeper structure. I look at the modularity of this graph to determine 'communities' of ideas or discourses. I then lay this network against real geographic space by time-slice [please note that this does not require the exact location of findspots; general parish or other larger-scale localizations can work] to understand changes over time and space in the Portable Antiquities Scheme data. I agree with Mimno's suggestion that this is an appropriate tool for the digital archaeologist, but try to understand the limitations, caveats, and lessons for digital humanities more generally, from this application.