Working with Large Topic Models

The topic model we use to examine the EPSRC Grants on the Web (GOW) database of projects consists of 600 topics. We have had some success in introducing relevance scores to help navigate these by research area group / research area:

Topic Browser

However, it would be desirable to have a high-level overview of the topic model that describes the whole portfolio. Here we describe three possible approaches.

Small Topic Model of Large Topic Model

We treated the 20 words in each of our 600 topics as an individual document and generated a topic model from these 600 short documents. Because topics seek to cut across documents in the corpus to discover common components, the new model was highly general and said little about the subject matter of its member topics.

Small Topic Model of Large Topic Model (Extended)

We concatenated all documents associated with each topic in the 600-topic model, giving a much larger and richer set of 600 documents which, by definition, contained all of the terms from the corpus from which the original model was derived.
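The concatenation step can be sketched roughly as follows. This is a minimal illustration rather than the code we ran: the function name, the toy grant texts and the `threshold` used to decide when a document counts as "associated" with a topic are all assumptions.

```python
from collections import defaultdict

def build_topic_documents(doc_texts, doc_topic_props, threshold=0.1):
    """Concatenate the text of every document associated with each topic.

    doc_texts: dict mapping document id -> text
    doc_topic_props: dict mapping document id -> {topic id: proportion}
    threshold: assumed cut-off for treating a document as "associated"
    """
    grouped = defaultdict(list)
    for doc_id in sorted(doc_texts):
        for topic_id, proportion in doc_topic_props.get(doc_id, {}).items():
            if proportion >= threshold:
                grouped[topic_id].append(doc_texts[doc_id])
    # One long pseudo-document per topic
    return {topic_id: " ".join(texts) for topic_id, texts in grouped.items()}

# Hypothetical toy data: three grants and two topics
doc_texts = {
    "g1": "laser optics photonics",
    "g2": "protein folding simulation",
    "g3": "optical fibre sensors",
}
doc_topic_props = {
    "g1": {0: 0.9, 1: 0.05},
    "g2": {1: 0.8, 0: 0.02},
    "g3": {0: 0.6, 1: 0.4},
}
topic_docs = build_topic_documents(doc_texts, doc_topic_props)
```

Note that a document with a substantial proportion in several topics (such as `g3` here) contributes its text to each of them, which is why the resulting pseudo-documents overlap.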

Building a 60-topic model from these documents was far more successful. As with documents and topics, this approach results in each topic from the large model being a member of one or more topics in the smaller model. The resulting model can be browsed here:

** Unavailable at the moment **
Super Topic Browser

Hierarchical Clustering

By taking the mean tf-idf vector (see this post for a description of tf-idf) of the documents associated with each topic, weighted by topic proportion, we can derive a tf-idf vector for each topic, and from these a similarity matrix built from the cosine distance between each pair of topic vectors.
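As a rough sketch of this calculation, assuming each document already has a tf-idf vector and a set of topic proportions (the function names and toy numbers are illustrative only). The sketch returns cosine similarity; the corresponding cosine distance is one minus this value.

```python
import math

def topic_vectors(doc_tfidf, doc_topic_props, n_topics):
    """Proportion-weighted mean tf-idf vector for each topic."""
    n_terms = len(next(iter(doc_tfidf.values())))
    sums = [[0.0] * n_terms for _ in range(n_topics)]
    weights = [0.0] * n_topics
    for doc_id, vector in doc_tfidf.items():
        for topic_id, proportion in doc_topic_props[doc_id].items():
            weights[topic_id] += proportion
            for j, value in enumerate(vector):
                sums[topic_id][j] += proportion * value
    # Divide each weighted sum by the total weight for that topic
    return [
        [s / w for s in row] if w > 0 else row
        for row, w in zip(sums, weights)
    ]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_matrix(vectors):
    return [[cosine_similarity(a, b) for b in vectors] for a in vectors]

# Hypothetical toy data: three documents, two terms, two topics
doc_tfidf = {"g1": [1.0, 0.0], "g2": [0.0, 1.0], "g3": [1.0, 1.0]}
doc_topic_props = {"g1": {0: 1.0}, "g2": {1: 1.0}, "g3": {0: 0.5, 1: 0.5}}

vectors = topic_vectors(doc_tfidf, doc_topic_props, n_topics=2)
sim = similarity_matrix(vectors)
```

In this toy case the shared document `g3` pulls the two topic vectors towards each other, so their similarity is well above zero even though `g1` and `g2` share no terms.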

Hierarchical cluster analysis was performed on this data. As the resulting dendrogram was skewed, cutting it into 60 groups resulted in one cluster containing some 492 of the 600 topics, so this was not a good means of providing an overview of the underlying model.
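The kind of imbalance described here can be reproduced on a toy scale. The sketch below is a naive agglomerative clustering over a distance matrix; average linkage is an assumption (this post does not record which linkage was used), and the six one-dimensional points are invented purely to show how a few outliers can leave one dominant cluster.

```python
from collections import Counter

def agglomerate(distance, n_clusters):
    """Naive average-linkage agglomerative clustering on a distance matrix.

    Returns one cluster label per item. Average linkage is an assumption.
    """
    clusters = [[i] for i in range(len(distance))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Mean pairwise distance between the two candidate clusters
                d = sum(distance[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = [0] * len(distance)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Hypothetical toy data: four close points and two distant outliers
points = [0, 1, 2, 3, 10, 20]
distance = [[abs(p - q) for q in points] for p in points]

labels = agglomerate(distance, n_clusters=3)
sizes = sorted(Counter(labels).values(), reverse=True)
```

Here `sizes` shows one large cluster and two singletons, a small-scale version of the skew we saw when cutting the real dendrogram.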


K-Means Clustering

Using the same similarity matrix derived in the section on Hierarchical Clustering, we can cluster the topics with the k-means approach. This met with more success in terms of both balance of cluster size (min: 2, max: 29, mean: 10) and cohesiveness of cluster content. The result of this analysis can be viewed by clicking the following link. Note that in this case each topic is a member of only one cluster.
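A minimal k-means sketch, under two assumptions: each row of the similarity matrix is treated as the feature vector for that topic, and the initial centroids are simply evenly spaced rows (a real run would more likely use random or k-means++ initialisation). The 4 x 4 matrix is invented for illustration.

```python
from collections import Counter

def kmeans(rows, k, max_iter=100):
    """Plain Lloyd's algorithm with squared Euclidean distance."""
    n = len(rows)
    # Assumed deterministic initialisation: evenly spaced rows
    centroids = [list(rows[i * n // k]) for i in range(k)]
    labels = [-1] * n
    for _ in range(max_iter):
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centroids[c])),
            )
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stable, so the algorithm has converged
        labels = new_labels
        for c in range(k):
            members = [rows[i] for i in range(n) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical similarity matrix for four topics forming two obvious pairs
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]

labels = kmeans(sim, k=2)
sizes = list(Counter(labels).values())
summary = (min(sizes), max(sizes), sum(sizes) / len(sizes))
```

Each topic receives exactly one label, and `summary` gives the same min / max / mean cluster-size figures quoted above for the real model.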

** Unavailable at the moment **
K-Means Topic Browser

In contrast with the extended small topic model of large topic model approach, the clustering approaches require us to choose an appropriate name or label for each cluster, which would ideally be agreed in collaboration with stakeholders.