Working with Large Topic Models

The topic model we use to examine the EPSRC Grants on the Web (GOW) database of projects consists of 600 topics. We have had some success in introducing relevance scores to help navigate these by research area group / research area:

Topic Browser

However, it would be desirable to have a high-level overview of the topic model describing the whole portfolio. Here we describe three possible approaches.

Small Topic Model of Large Topic Model

We used the 20 words in each of our 600 topics as individual documents and generated a topic model from these. Because topics cut across the documents in a corpus to discover common components, the topics in the new model were highly general and said little about the subject matter of their member topics.

Small Topic Model of Large Topic Model (Extended)

We concatenated all documents associated with each topic in the 600-topic model, giving a much larger and richer set of 600 documents which, by definition, contained all of the terms in the corpus from which the original model was derived.
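The concatenation step can be sketched roughly as follows. This is a minimal illustration rather than the code used on the project: the function name, the toy grant texts, and the `threshold` that decides when a document counts as "associated" with a topic are all assumptions made for the example.

```python
from collections import defaultdict


def build_topic_documents(doc_texts, doc_topics, threshold=0.05):
    """Concatenate the text of every document associated with each topic.

    doc_texts:  {doc_id: text}
    doc_topics: {doc_id: {topic_id: proportion}}
    threshold:  hypothetical cut-off for treating a document as
                "associated" with a topic (not taken from the original post)
    """
    parts = defaultdict(list)
    for doc_id in sorted(doc_texts):
        for topic_id, proportion in doc_topics.get(doc_id, {}).items():
            if proportion >= threshold:
                parts[topic_id].append(doc_texts[doc_id])
    # One large "document" per topic, ready to feed into a smaller topic model.
    return {topic_id: " ".join(texts) for topic_id, texts in parts.items()}


# Toy data standing in for grant descriptions and their topic proportions.
doc_texts = {
    "g1": "optical fibre sensors",
    "g2": "quantum optics experiments",
    "g3": "machine learning for sensors",
}
doc_topics = {
    "g1": {0: 0.7, 1: 0.3},
    "g2": {0: 0.9, 2: 0.02},
    "g3": {1: 0.6, 2: 0.4},
}
topic_documents = build_topic_documents(doc_texts, doc_topics)
```

Because a document can contribute to several topics, its text appears in more than one of the concatenated documents, which is what lets the smaller model place a large-model topic in more than one of its own topics.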

Making a 60-topic model from these documents was far more successful. Just as a document can belong to several topics, each topic from the large model can be a member of one or more topics in the smaller model. The resulting model can be browsed here:

(Unavailable at the moment)
Super Topic Browser

Hierarchical Clustering

For each topic, we take the mean of the tfidf vectors (see this post for a description of tfidf) of the documents associated with that topic, weighted by each document's topic proportion. This gives a tfidf vector for each topic, and from these a similarity matrix built from the cosine distance between each pair of topic vectors.
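As a rough sketch of this calculation, assuming the tfidf vectors are already available as sparse dictionaries of term weights: the function names and toy values below are illustrative only, and the matrix holds cosine similarities (the corresponding cosine distance would be one minus each entry).

```python
import math


def topic_vector(doc_vectors, doc_weights):
    """Mean of document tfidf vectors, weighted by topic proportion.

    doc_vectors: {doc_id: {term: tfidf weight}}
    doc_weights: {doc_id: proportion of the topic in that document}
    """
    total = sum(doc_weights.values())
    if total == 0:
        return {}
    vector = {}
    for doc_id, weight in doc_weights.items():
        for term, value in doc_vectors[doc_id].items():
            vector[term] = vector.get(term, 0.0) + weight * value / total
    return vector


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy tfidf vectors and topic proportions (made-up values).
doc_vectors = {
    "g1": {"laser": 1.0, "fibre": 1.0},
    "g2": {"laser": 1.0},
    "g3": {"protein": 2.0},
}
topic_weights = {
    0: {"g1": 0.5, "g2": 0.5},
    1: {"g2": 1.0},
    2: {"g3": 0.8},
}

topic_ids = sorted(topic_weights)
vectors = {t: topic_vector(doc_vectors, topic_weights[t]) for t in topic_ids}
similarity = [
    [cosine_similarity(vectors[i], vectors[j]) for j in topic_ids]
    for i in topic_ids
]
```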

Hierarchical clustering was performed on this data. Because the resulting dendrogram was skewed, cutting it into 60 groups resulted in one cluster containing some 492 of the 600 topics, so this was not a good means of providing an overview of the underlying model.

dendrogram

K-Means Clustering

Using the same similarity matrix derived in the section on Hierarchical Clustering, we can cluster the topics using the k-means approach. This met with more success in terms of balance of cluster size (min: 2, max: 29, mean: 10) and cohesiveness of cluster content. The result of this analysis can be viewed by clicking the following link. Note that in this case each topic is a member of only one cluster.

(Unavailable at the moment)
K-Means Topic Browser
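The k-means step can be illustrated with a small, self-contained sketch. It assumes that each row of the similarity matrix is used as the feature vector for its topic, and it uses a deterministic farthest-point seeding so the toy result is repeatable; both choices are assumptions for the example, not a record of how the analysis above was run.

```python
from collections import Counter


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm with deterministic farthest-point seeding."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy 4 x 4 topic similarity matrix (made-up values).
similarity = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
labels = kmeans(similarity, k=2)
cluster_sizes = Counter(labels)
```

Each topic receives exactly one label, and `cluster_sizes` gives the kind of size summary (min, max, mean) quoted above.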

In contrast with the small topic model of large topic model (extended) approach, the clustering approaches require us to choose some appropriate name or label for the cluster, which would ideally be based on collaboration by stakeholders.




Two More Similarity by TFIDF Features (GOW++)

We have recently added two more features driven by TFIDF similarity to the GOW++ applications. The first is to provide the 20 most similar people on the person details page. This is based on treating the grants each person is associated with as a single document and calculating the distance between each pair of these documents using tfidf.
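A minimal sketch of the person-similarity idea is given below. The tokenisation, the particular tfidf weighting, the use of cosine similarity as the measure, and all names and data are assumptions made for illustration; the GOW++ implementation may differ.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: {name: text}. Returns {name: {term: tfidf weight}}."""
    tokens = {name: text.lower().split() for name, text in documents.items()}
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    n_docs = len(documents)
    vectors = {}
    for name, words in tokens.items():
        counts = Counter(words)
        vectors[name] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine_similarity(a, b):
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(name, vectors, n=20):
    """Rank every other person by similarity to `name`, best first."""
    scores = [
        (other, cosine_similarity(vectors[name], vector))
        for other, vector in vectors.items()
        if other != name
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


# Hypothetical people and grant texts; each person's grants become one document.
grants_by_person = {
    "person_a": ["laser optics sensors", "optical fibre laser"],
    "person_b": ["laser optics imaging"],
    "person_c": ["protein folding simulation"],
    "person_d": ["fibre sensors networks"],
}
documents = {person: " ".join(texts) for person, texts in grants_by_person.items()}
vectors = tfidf_vectors(documents)
ranked = most_similar("person_a", vectors, n=20)
```

The same approach would apply to research areas by grouping grant texts per research area instead of per person and keeping the top 5.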

The second is to list the 5 most similar research areas on the grants page when it is reached from the research areas page. The following example shows that the Optical Devices and Subsystems research area of the ICT theme is also closely related to the Light matter interaction and optical phenomena research area of the Physical Sciences theme and the Sensors and Instrumentation research area of the Engineering theme…


GOW++

The EPSRC has made a database of all current research grant descriptions available to the ICT Perspectives project. It has also been agreed that monthly updates to this data will be made available in the future.

We are working on adapting our existing apps and perspectives to work with the new data, as well as developing the processing steps such that updates can be processed with as much automation as possible.

There are significant differences between the structure and schema of our previous database and the database supplied by EPSRC. This will delay the availability of our apps, but they will be supplied as quickly as our limited resources allow.

The new database and apps have been labelled GOW++ and can be viewed by clicking on the menu item on the page header.

Please consider GOW++ features to be at risk of downtime while we work on this.




Topic Plotting

A new feature has been added to visualise and examine the topic model. The user selects two topics, and grants belonging to both topics are plotted on a scatter chart, with each axis representing the proportion that topic contributes to the grant. It is easy to distinguish grants from different awarding agencies, and individual agencies can be switched on and off by clicking the key. Hovering over a grant shows the grant number and topic proportions, and clicking a grant navigates to the grant details page. The plot can also be zoomed by selecting an area.
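The data behind such a plot can be prepared along these lines. This is only a sketch: the record layout, grant identifiers, the second agency name, and the `min_proportion` cut-off for a grant "belonging" to a topic are all hypothetical.

```python
from collections import defaultdict


def plot_points(grants, topic_x, topic_y, min_proportion=0.0):
    """Return one (x, y, grant_id, agency) tuple per grant that has both topics."""
    points = []
    for grant in grants:
        x = grant["topics"].get(topic_x, 0.0)
        y = grant["topics"].get(topic_y, 0.0)
        if x > min_proportion and y > min_proportion:
            points.append((x, y, grant["id"], grant["agency"]))
    return points


# Hypothetical grant records with per-topic proportions.
grants = [
    {"id": "GRANT-001", "agency": "EPSRC", "topics": {12: 0.40, 87: 0.25}},
    {"id": "GRANT-002", "agency": "EPSRC", "topics": {12: 0.60}},
    {"id": "GRANT-003", "agency": "Agency B", "topics": {12: 0.10, 87: 0.55, 3: 0.20}},
]

points = plot_points(grants, topic_x=12, topic_y=87)

# Group by awarding agency so each agency can be drawn (and toggled) as its own series.
series = defaultdict(list)
for x, y, grant_id, agency in points:
    series[agency].append((x, y, grant_id))
```

Each series can then be passed to whichever charting library is in use, with the grant identifier supplying the hover text.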

You can select your own topics to plot by launching the Topic Plotter