We have begun topic modelling of the EPSRC ICT portfolio using Mallet. Our initial data can be downloaded below. The *-data files use the grant number as the filename (with ‘/’ replaced by ‘_’), and each file contains the grant abstract with no processing applied.
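As a minimal sketch of the naming convention described above, the following Python helpers convert between a grant number and its filename. The function names and the sample reference are illustrative only, and the reverse mapping assumes that grant numbers never contain an underscore themselves.

```python
def grant_to_filename(grant_number: str) -> str:
    """Turn a grant number into a filename by replacing '/' with '_'."""
    return grant_number.replace("/", "_")


def filename_to_grant(filename: str) -> str:
    """Recover the grant number from a filename.

    Assumption: the original grant number contains no '_' characters,
    so every '_' in the filename stands for a '/'.
    """
    return filename.replace("_", "/")


if __name__ == "__main__":
    # "EP/X000000/1" is a made-up reference used purely for illustration.
    example = "EP/X000000/1"
    name = grant_to_filename(example)
    print(name)
    print(filename_to_grant(name))
```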
Our output is based on importing the data into Mallet using the ‘--stoplist-file stops.txt’ option (see the stops.txt file below, which contains the Cornell stop words). We have processed the data for 50, 100, 150, 200, 250 and 300 topics.
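The import step and the six topic counts could be scripted roughly as follows. This is a sketch under stated assumptions, not the exact commands we ran: the directory name ict-data, the corpus filename, and the output filenames are hypothetical, and only the stop-list option and the topic counts come from the description above. The script only builds and prints the command lines; it does not run Mallet.

```python
import shlex

# Topic counts listed in the post.
TOPIC_COUNTS = [50, 100, 150, 200, 250, 300]


def import_command(data_dir, corpus_file="abstracts.mallet", stoplist="stops.txt"):
    """Build a 'mallet import-dir' command line as a list of arguments.

    data_dir and corpus_file are hypothetical names. --keep-sequence is
    included because Mallet's topic trainer expects feature sequences.
    """
    return [
        "mallet", "import-dir",
        "--input", data_dir,
        "--output", corpus_file,
        "--keep-sequence",
        "--stoplist-file", stoplist,
    ]


def train_commands(corpus_file="abstracts.mallet"):
    """Build one 'mallet train-topics' command per topic count."""
    commands = []
    for n in TOPIC_COUNTS:
        commands.append([
            "mallet", "train-topics",
            "--input", corpus_file,
            "--num-topics", str(n),
            # Output filenames are assumptions chosen for illustration.
            "--output-topic-keys", f"topic-keys-{n}.txt",
            "--output-doc-topics", f"doc-topics-{n}.txt",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    print(shlex.join(import_command("ict-data")))
    for cmd in train_commands():
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch independent of where Mallet is installed; each printed line can be pasted into a shell once the paths have been adjusted.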