Topic modelling through latent Dirichlet allocation (LDA)

We are in the early stages of using topic modelling to build up a picture of the EE corpus. Trying to 'categorize' letters is never going to do them justice, especially as many of our documents cover a multitude of subjects. In just one letter a writer may talk about a scientific discovery, the weather and the rather bad state of their health, and then finish off by discussing their next visit into town! With topic modelling, the software builds up a picture of which words are associated with a certain topic and then, by iterating over the whole corpus, can start to identify similar letters.
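As a rough illustration of the iteration described above, the sketch below fits LDA to a handful of toy "letters" using collapsed Gibbs sampling, one common way of fitting the model. It is not the software used for EE: the function name lda_gibbs, the parameter values and the toy letters (built from words in the topic lists on this page) are all assumptions made for the example.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenised documents.

    alpha and beta are the usual Dirichlet smoothing parameters; the
    default values here are illustrative assumptions, not tuned settings.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * n_topics for _ in docs]          # words per topic in each doc
    topic_word = [Counter() for _ in range(n_topics)]   # count of each word per topic
    topic_total = [0] * n_topics                        # total words per topic

    # Start by giving every word occurrence a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Repeatedly sweep the corpus, resampling the topic of each word.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight = (how much this doc uses topic k)
                #        * (how strongly topic k favours this word).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic mixture for each document.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    # Most frequent words assigned to each topic.
    top_words = [
        [w for w, c in tw.most_common(5) if c > 0] for tw in topic_word
    ]
    return theta, top_words


# Invented toy letters, already split into words.
letters = [
    "printed book copy edition volume bookseller".split(),
    "publish printing sheets press paper book".split(),
    "ship sea captain fleet wind sail".split(),
    "boat port officers ship vessel sea".split(),
    "motion force weight gravity velocity body".split(),
    "earth water velocity motion speed force".split(),
]
theta, top_words = lda_gibbs(letters, n_topics=3)
for k, words in enumerate(top_words):
    print("topic", k, ":", " ".join(words))
```

Each row of theta is one letter's mixture of topics, so a letter that covers several subjects simply gets weight on several topics rather than being forced into one category. On a corpus this small the topics will be unstable; the point is only to show the mechanics.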

We want users to be able to approach the correspondence in EE from as many angles as possible and, so far, using latent Dirichlet allocation to group documents on similar topics has been very successful.
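One simple way to find documents on similar topics, once each letter has a topic mixture, is to compare the mixtures directly. The sketch below ranks letters by cosine similarity to a chosen letter. The letter names and mixture values are made up for illustration, and cosine similarity is an assumed choice of measure rather than a description of how EE does it.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic mixtures of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def most_similar(target, mixtures):
    """Names of the other letters, most similar to `target` first."""
    others = [name for name in mixtures if name != target]
    return sorted(
        others,
        key=lambda name: cosine(mixtures[target], mixtures[name]),
        reverse=True,
    )


# Hypothetical topic mixtures over three topics (books, ships, physics).
mixtures = {
    "letter_a": [0.80, 0.15, 0.05],
    "letter_b": [0.70, 0.20, 0.10],
    "letter_c": [0.05, 0.90, 0.05],
    "letter_d": [0.10, 0.10, 0.80],
}
ranked = most_similar("letter_a", mixtures)
print(ranked)
```

Here letter_b comes first because, like letter_a, most of its weight sits on the first topic.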

Topics as a collection of words:

printed book send copy edition work volume print books copies publish printing sheets published press paper volumes bookseller works


ship board ships sea men french captain fleet war land boat port officers service sail wind island expedition vessel


motion line body weight force air bodies equal water point earth velocity matter speed gravity square ball

See the following online materials for a detailed discussion
