Topic modelling through latent Dirichlet allocation (LDA)
We are in the early stages of using topic modelling to build up a picture of the EE corpus. Trying to 'categorize' letters is never going to do them justice, especially when so many of our documents cover a multitude of subjects. In just one letter a writer may discuss a scientific discovery, the weather, the rather bad state of their health, and then finish off with their next visit into town! With topic modelling, the software builds up a picture of which words are associated with a given topic and, by iterating over the whole corpus, it can start to identify similar letters.
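As a rough illustration of those mechanics, here is a minimal sketch using the gensim library. The three toy 'letters' and all parameter values are invented for the example; this is not our actual pipeline.

```python
# Minimal LDA sketch with gensim; toy data and parameters are
# invented for illustration only.
from gensim import corpora, models

letters = [
    "the printer sent a corrected copy of the new edition",
    "our ship weighed anchor and the fleet sailed with the wind",
    "the velocity of a falling body varies with the force of gravity",
]

# Tokenise each letter and build a word <-> id dictionary.
texts = [letter.lower().split() for letter in letters]
dictionary = corpora.Dictionary(texts)

# Represent each letter as a bag of words: (word id, count) pairs.
bow_corpus = [dictionary.doc2bow(text) for text in texts]

# Fit an LDA model; iterating over the corpus lets it learn which
# words tend to co-occur in the same topic.
lda = models.LdaModel(bow_corpus, num_topics=3, id2word=dictionary, passes=10)

# Inspect each topic as a weighted collection of words.
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```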
We want users to be able to approach the correspondence in EE from as many angles as possible and, so far, using latent Dirichlet allocation to group documents on similar topics has been very successful.
Topics as collections of words:
- printed, book, send, copy, edition, work, volume, print, books, copies, publish, printing, sheets, published, press, paper, volumes, bookseller, works
- ship, board, ships, sea, men, french, captain, fleet, war, land, boat, port, officers, service, sail, wind, island, expedition, vessel
- motion, line, body, weight, force, air, bodies, equal, water, point, earth, velocity, matter, speed, gravity, square, ball
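Once letters have topic distributions like these, grouping them becomes a nearest-neighbour query over those distributions. Purely as a sketch, reusing the hypothetical `lda`, `bow_corpus`, and `dictionary` objects from the example above:

```python
from gensim import similarities

# Index every letter by its LDA topic distribution (reuses the `lda`
# and `bow_corpus` objects from the sketch above; illustrative only).
index = similarities.MatrixSimilarity(lda[bow_corpus], num_features=lda.num_topics)

# Letters whose topic mixtures resemble the query score close to 1.0.
query = lda[bow_corpus[0]]  # topic distribution of the first letter
for doc_id, score in sorted(enumerate(index[query]), key=lambda pair: -pair[1]):
    print(doc_id, round(float(score), 3))
```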
See the following online materials for a more detailed discussion:
- Learning Author-Topic Models from Text Corpora, Michal Rosen-Zvi (IBM Research Lab, Haifa), Chaitanya Chemudugunta (University of California, Irvine), et al.;
- Reading the Topic Modeling Literature, Maryland Institute for Technology in the Humanities (August 19, 2011);
- Revealing the relationships between topics in a corpus, Ted Underwood;
- Topic Based Text Segmentation Goodies, ARTFL Project Research Blog (October 4, 2009);
- Topic Modeling in the Humanities: An Overview, Maryland Institute for Technology in the Humanities (August 1, 2011);
- Wikipedia article on Latent Dirichlet allocation.