Topic Modelling Resources

We have begun topic modelling of the EPSRC ICT portfolio using Mallet. Our initial data can be downloaded below. The *-data files use the grant number as the filename (with ‘/’ replaced by ‘_’), and each file contains the grant abstract with no processing applied.
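As an illustration of the filename convention, here is a minimal Python sketch. The function names are hypothetical, and the reverse mapping assumes that grant numbers never contain an underscore and that the files carry no extension, neither of which is confirmed here.

```python
def grant_to_filename(grant_ref: str) -> str:
    """Map a grant number to its data filename by replacing '/' with '_'."""
    return grant_ref.replace("/", "_")


def filename_to_grant(filename: str) -> str:
    """Reverse the mapping (assumption: grant numbers contain no '_')."""
    return filename.replace("_", "/")


if __name__ == "__main__":
    print(grant_to_filename("EP/F02553X/1"))
```

Grant numbers without a ‘/’, such as the NSF references, pass through unchanged.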

Our output is based on importing the data into Mallet with the ‘--stoplist-file stops.txt’ option (see the stops.txt file below, which contains the Cornell stop words). We have processed the data for 50, 100, 150, 200, 250 and 300 topics.

Principal Component Analysis of ICT Terms

This application / visualisation is new and unfinished, so it is referenced here while we finesse it. The 3D model plots the ICT grants after dimensionality reduction using PCA. Clicking on a data point loads that grant in your browser. Two versions are available:

Interactive PCA Plot
Animated PCA Plot

Grants in the plot are colour-coded by funding body: CORDIS, EPSRC & NSF.

You will need an X3D plugin to view these plots.

Interactive Dendrogram

A hierarchy of significant grant terms for each level in the EPSRC Browser and ICT Browser Applications can be displayed by clicking on any word cloud representing a collection of grants. The interactive dendrogram application used to display the data was developed by David Robb as his masters project and is used here with his kind permission.

Here are some examples demonstrating the hierarchical data (click on the word cloud to see it):
* Unavailable at the moment *

Data: TFIDF Vectors & Similarity Matrices

This blog entry shares details of the tfidf and similarity data for the corpus.

The file ‘_terms.csv’ gives the term mapping used in the ‘*-tfidf.csv’ files. Its seven columns are the term, t, T, D, d, log(D/d) and t/T * log(D/d) respectively.
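The last column can be reproduced from the four counts. The sketch below is a minimal Python version of that formula. It assumes the usual reading of the symbols (t occurrences of the term, T total term occurrences, D documents in total, d documents containing the term) and a natural logarithm; the page defines neither, so check against a few rows of ‘_terms.csv’ before relying on it.

```python
import math


def tfidf(t: int, T: int, D: int, d: int) -> float:
    """Return t/T * log(D/d).

    Assumed meanings (not stated in the source):
      t = occurrences of the term, T = total term occurrences,
      D = total number of documents, d = documents containing the term.
    The log base is assumed to be e.
    """
    return (t / T) * math.log(D / d)


if __name__ == "__main__":
    # A term seen 3 times among 100 terms, present in 10 of 1000 documents.
    print(tfidf(3, 100, 1000, 10))
```

A term that appears in every document (d equal to D) scores zero, which is why very common words drop out of the word clouds.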

Each pair of ‘*-refs.csv’ and ‘*-tfidf.csv’ files gives the grant reference numbers and the tfidf vectors respectively for one level of the database. Levels are named with the convention ‘awardingAgency-programme-subprogramme’, with a single ‘x’ acting as the wildcard. ‘LEVELS-refs.csv’ and ‘LEVELS-tfidf.csv’ give the tfidf vectors for the concatenated documents within each level (each concatenation treated as a single document). Finally, there are similarity matrices calculated from the tfidf vectors using the cosine distance function, for the entire database of grants (‘x-x-x-sim.csv’) and for the concatenated levels (‘LEVELS-sim.csv’).
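Because some programme names contain hyphens themselves (for example ‘Cross-Discipline Interface’), splitting a level name on every ‘-’ does not work. The Python sketch below shows one way to read a level name; the `Level` type and `parse_level` function are hypothetical, and it assumes the agency and subprogramme parts never contain a hyphen.

```python
from typing import NamedTuple, Optional


class Level(NamedTuple):
    agency: Optional[str]        # None means wildcard ('x')
    programme: Optional[str]
    subprogramme: Optional[str]


def parse_level(name: str) -> Level:
    """Parse 'awardingAgency-programme-subprogramme' into its three parts.

    Assumption: only the programme part may contain hyphens, so the
    agency is everything before the first '-' and the subprogramme is
    everything after the last '-'.
    """
    agency, rest = name.split("-", 1)
    programme, subprogramme = rest.rsplit("-", 1)

    def wild(part: str) -> Optional[str]:
        return None if part == "x" else part

    return Level(wild(agency), wild(programme), wild(subprogramme))


if __name__ == "__main__":
    print(parse_level("EPSRC-Cross-Discipline Interface-x"))
```

To apply this to a filename, strip the ‘-refs.csv’, ‘-tfidf.csv’ or ‘-sim.csv’ suffix first.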

_terms.csv [2,159 KB]
EPSRC-Basic Technology-x-refs.csv [1 KB]
EPSRC-Basic Technology-x-tfidf.csv [2,746 KB]
EPSRC-Chemistry-x-refs.csv [3 KB]
EPSRC-Chemistry-x-tfidf.csv [15,071 KB]
EPSRC-Complexity-x-refs.csv [0 KB]
EPSRC-Complexity-x-tfidf.csv [724 KB]
EPSRC-Crime Prevention-x-refs.csv [0 KB]
EPSRC-Crime Prevention-x-tfidf.csv [862 KB]
EPSRC-Cross-Discipline Interface-x-refs.csv [2 KB]
EPSRC-Cross-Discipline Interface-x-tfidf.csv [9,880 KB]
EPSRC-Digital Economy-x-refs.csv [1 KB]
EPSRC-Digital Economy-x-tfidf.csv [3,338 KB]
EPSRC-Energy Multidisciplinary Applications-x-refs.csv [1 KB]
EPSRC-Energy Multidisciplinary Applications-x-tfidf.csv [5,346 KB]
EPSRC-Energy Research Capacity-x-refs.csv [1 KB]
EPSRC-Energy Research Capacity-x-tfidf.csv [5,883 KB]
EPSRC-Energy-x-refs.csv [1 KB]
EPSRC-Energy-x-tfidf.csv [4,943 KB]
EPSRC-Engineering and Physical Sciences Research Council-x-refs.csv [2 KB]
EPSRC-Engineering and Physical Sciences Research Council-x-tfidf.csv [8,871 KB]
EPSRC-Engineering-x-refs.csv [5 KB]
EPSRC-Engineering-x-tfidf.csv [24,640 KB]
EPSRC-High Performance Computing-x-refs.csv [0 KB]
EPSRC-High Performance Computing-x-tfidf.csv [134 KB]
EPSRC-IDEAS Factory-x-refs.csv [0 KB]
EPSRC-IDEAS Factory-x-tfidf.csv [2,255 KB]
EPSRC-Information and Communications Technology-x-refs.csv [11 KB]
EPSRC-Information and Communications Technology-x-tfidf.csv [53,627 KB]
EPSRC-Infrastructure and Environment-x-refs.csv [0 KB]
EPSRC-Infrastructure and Environment-x-tfidf.csv [1,809 KB]
EPSRC-Infrastructure and International-x-refs.csv [1 KB]
EPSRC-Infrastructure and International-x-tfidf.csv [4,324 KB]
EPSRC-Innovative Manufacturing-x-refs.csv [0 KB]
EPSRC-Innovative Manufacturing-x-tfidf.csv [1,387 KB]
EPSRC-Integrated Knowledge-x-refs.csv [0 KB]
EPSRC-Integrated Knowledge-x-tfidf.csv [135 KB]
EPSRC-Life Sciences Interface-x-refs.csv [1 KB]
EPSRC-Life Sciences Interface-x-tfidf.csv [4,294 KB]
EPSRC-Materials, Mechanical and Medical Engineering-x-refs.csv [5 KB]
EPSRC-Materials, Mechanical and Medical Engineering-x-tfidf.csv [22,334 KB]
EPSRC-Materials-x-refs.csv [4 KB]
EPSRC-Materials-x-tfidf.csv [17,182 KB]
EPSRC-Mathematical Sciences-x-refs.csv [5 KB]
EPSRC-Mathematical Sciences-x-tfidf.csv [23,685 KB]
EPSRC-Nanoscience through engineering to application-x-refs.csv [1 KB]
EPSRC-Nanoscience through engineering to application-x-tfidf.csv [3,466 KB]
EPSRC-Next Generation Healthcare-x-refs.csv [0 KB]
EPSRC-Next Generation Healthcare-x-tfidf.csv [1,006 KB]
EPSRC-Physical Sciences-x-refs.csv [8 KB]
EPSRC-Physical Sciences-x-tfidf.csv [40,950 KB]
EPSRC-Physics-x-refs.csv [2 KB]
EPSRC-Physics-x-tfidf.csv [9,308 KB]
EPSRC-Postgraduate Training-x-refs.csv [7 KB]
EPSRC-Postgraduate Training-x-tfidf.csv [34,403 KB]
EPSRC-Process Environment and Sustainability-x-refs.csv [3 KB]
EPSRC-Process Environment and Sustainability-x-tfidf.csv [13,768 KB]
EPSRC-Public Engagement-x-refs.csv [1 KB]
EPSRC-Public Engagement-x-tfidf.csv [4,251 KB]
EPSRC-Technology Programme-x-refs.csv [1 KB]
EPSRC-Technology Programme-x-tfidf.csv [3,163 KB]
EPSRC-UNCLASSIFIED-x-refs.csv [491 KB]
EPSRC-UNCLASSIFIED-x-tfidf.csv [2,346,319 KB]
EPSRC-User Led Skills and Knowledge Flow-x-refs.csv [3 KB]
EPSRC-User Led Skills and Knowledge Flow-x-tfidf.csv [13,709 KB]
EPSRC-User-Led Research-x-refs.csv [5 KB]
EPSRC-User-Led Research-x-tfidf.csv [22,995 KB]
LEVELS-refs.csv [6 KB]
LEVELS-sim.csv [144 KB]
LEVELS-tfidf.csv [16,479 KB]
x-Information and Communications Technology-x-refs.csv [93 KB]
x-Information and Communications Technology-x-tfidf.csv [503,097 KB]
x-x-x-refs.csv [648 KB]
x-x-x-sim.csv [17,000,026 KB]
x-x-x-tfidf.csv [3,156,279 KB]

Shared Terms

A project stakeholder suggested highlighting shared terms in grant details pages where the grant details are displayed as a result of following a ‘most similar’ link. This has now been implemented and is illustrated by the following examples.

NSF1028831 Click on the first four ‘most similar’ grants in turn (NSF1028888 NSF1029025 NSF1029030 NSF1029783)
These five grants share an identical abstract so all words are highlighted.

EP/F02553X/1 Click on the first ‘most similar’ link (EP/E001769/1)
Highlighted terms: automated broad centre century cost development enable essential industry innovative key lower manufacturing photonic production provide renewable research strategies technology
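
The idea behind the highlighting can be sketched in a few lines of Python. This is not the code used in the application: the tokenisation (lower-cased runs of letters), the optional stop word set and the square-bracket markers are all assumptions made for illustration.

```python
import re


def terms(text, stop_words=frozenset()):
    """Lower-cased alphabetic tokens of a text, minus any stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words}


def shared_terms(abstract_a, abstract_b, stop_words=frozenset()):
    """Alphabetical list of terms that occur in both abstracts."""
    return sorted(terms(abstract_a, stop_words) & terms(abstract_b, stop_words))


def highlight(text, shared):
    """Wrap every occurrence of a shared term in square brackets."""
    shared_set = set(shared)

    def mark(match):
        word = match.group(0)
        return f"[{word}]" if word.lower() in shared_set else word

    return re.sub(r"[A-Za-z]+", mark, text)


if __name__ == "__main__":
    a = "Low cost photonic manufacturing"
    b = "Photonic production at lower cost"
    print(highlight(a, shared_terms(a, b, {"at"})))
```

Two grants with identical abstracts share every term, which matches the behaviour described for the NSF example above.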

Other Investigators

A project stakeholder suggested extending the person search to find co-investigators and other investigators associated with grants. Because of the way our database was obtained, this is only possible for the EPSRC grants, and the requested functionality has now been added for them. These individuals are now listed on grant details pages, and person details pages list the grants on which the person is a principal investigator, co-investigator or other investigator. As a result, the relevant table has grown from 12,603 to 19,895 records.

Person Search

We have introduced a Person Search application where users can search for Principal Investigators associated with grants across the whole database. Results are supplied in two categories: EPSRC Results and Other Results (currently CORDIS and NSF).

Click here to launch the Person Search.

We have also improved the out-links from Grant Details pages so that a Person Details page, where users can browse other grants held by that person, is always available.

Ten (or so) Most Similar Grants

We have constructed a similarity matrix describing the relationship between all 48,175 grants in the Perspectives database. This information has been used to link each grant to the ten or so most similar grants in the database. These are available to users when they are viewing grant details for any grant. The similarity matrix was constructed using the cosine distance function on the tfidf vectors.
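
As a rough illustration of this step, the Python sketch below computes the cosine measure between sparse tfidf vectors and picks the k best matches for one grant. The dictionary representation, the function names and the choice of cosine similarity (rather than a distance, whose exact convention is not specified here) are assumptions, and a pairwise loop like this would need a more efficient approach for 48,175 grants.

```python
import math
from heapq import nlargest


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v[term] for term, w in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(ref, vectors, k=10):
    """Return the k grant references most similar to `ref`, best first."""
    target = vectors[ref]
    scored = (
        (cosine_similarity(target, vec), other)
        for other, vec in vectors.items()
        if other != ref
    )
    return [other for _, other in nlargest(k, scored)]


if __name__ == "__main__":
    demo = {
        "A": {"x": 1.0, "y": 1.0},
        "B": {"x": 1.0, "y": 1.0},
        "C": {"x": 1.0},
        "D": {"z": 1.0},
    }
    print(most_similar("A", demo, k=2))
```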

Temporal Development of Portfolio

Two new word-cloud-based applications have been introduced to allow inspection of how the EPSRC and ICT (EPSRC, CORDIS & NSF) portfolios have evolved over the last five years. For each year there is a word cloud representing the 100 or so terms with the highest TFIDF values in the portfolio for that year, and two word clouds representing the 100 or so terms that differ most (increasing and decreasing) between that year and the previous one.
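
One plausible way to derive the terms for these clouds is sketched below in Python. It assumes each year is summarised as a {term: TFIDF} dictionary and that "difference" means the simple change in TFIDF value from one year to the next; how the applications actually compute the differences is not described here, and the function names are hypothetical.

```python
def top_terms(scores, n=100):
    """The n terms with the highest TFIDF value (ties broken alphabetically)."""
    return sorted(scores, key=lambda term: (-scores[term], term))[:n]


def year_on_year(previous, current, n=100):
    """Return (increasing, decreasing) term lists between two years.

    A term missing from one year is treated as having a TFIDF of 0.
    """
    delta = {
        term: current.get(term, 0.0) - previous.get(term, 0.0)
        for term in set(previous) | set(current)
    }
    rising = sorted((t for t in delta if delta[t] > 0), key=lambda t: (-delta[t], t))
    falling = sorted((t for t in delta if delta[t] < 0), key=lambda t: (delta[t], t))
    return rising[:n], falling[:n]


if __name__ == "__main__":
    year_1 = {"grid": 0.5, "cloud": 0.2, "web": 0.3}
    year_2 = {"cloud": 0.6, "web": 0.3, "quantum": 0.1}
    print(top_terms(year_2, n=2))
    print(year_on_year(year_1, year_2))
```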