Monthly Archives: July 2011

Grant Vectorisation: Introducing TFIDF

A significant part of this project involves statistical language modelling. Essentially, this allows us to estimate the similarity of, or distance between, grants or collections of grants. We will use a widely adopted term-weighting approach called TFIDF (term frequency, inverse document frequency).

We first build a vocabulary of all terms found in the context we wish to examine. We then construct a vector for every document (grant) or collection of documents (grants), with one element per term in the vocabulary, where each element is calculated thus:

tfidf = (t / T) × log(D / d)

t = the number of occurrences of the term in a document (or collection of documents)
T = the total number of terms occurring in the document (or collection of documents)
D = the total number of documents
d = the total number of documents in which the term appears at least once
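
As a minimal sketch of the calculation above (not our production code), the following builds the vocabulary and one TFIDF vector per document. The function name tfidf_vectors, the toy grants and the choice of the natural logarithm are all assumptions made for illustration; the formula itself does not fix a log base.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) for a list of tokenised documents.

    Each vector element is (t / T) * log(D / d), using the natural log
    (an assumption: the formula does not specify a base).
    """
    # Vocabulary: every distinct term in the context, in a fixed order.
    vocabulary = sorted({term for doc in documents for term in doc})
    D = len(documents)  # total number of documents
    # d: number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))

    vectors = []
    for doc in documents:
        counts = Counter(doc)  # t for every term in this document
        T = len(doc)           # total number of terms in this document
        vectors.append([
            (counts[term] / T) * math.log(D / doc_freq[term]) if T else 0.0
            for term in vocabulary
        ])
    return vocabulary, vectors


# Made-up "grants", already lower-cased and split into terms.
grants = [
    ["protein", "folding", "protein", "simulation"],
    ["climate", "simulation", "model"],
    ["protein", "structure", "model"],
]
vocabulary, vectors = tfidf_vectors(grants)
```

Note that a term appearing in every document gets a weight of zero, since log(D / D) = 0; this is what lets TFIDF play down words that do not discriminate between grants.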

We can then estimate the pairwise distance between documents or collections of documents using the cosine distance function: one minus the cosine of the angle between the two vectors. A distance of 0 means the two vectors point in the same direction, regardless of their lengths.
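
The cosine distance can be sketched in a few lines of plain Python. The function names cosine_distance and pairwise_distances are hypothetical, and returning 1.0 for an all-zero vector is our own assumption, since the angle is undefined in that case.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined for a zero vector; treating it as
        # maximally distant is an assumption made for this sketch.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_distances(vectors):
    """Square matrix of cosine distances between every pair of vectors."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]


# Illustrative vectors only; real inputs would be TFIDF vectors.
example = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
distances = pairwise_distances(example)
```

Because the measure depends only on direction, a short grant and a long grant with the same mix of terms come out close together, which is usually what we want when comparing documents of very different lengths.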

A note about ‘Derived’ Terms

We will use the developer blog to describe notable terminology that may require some explanation. The first such entry concerns the terms stopping, stemming and deriving.

Term | Description
Stopping | The process of filtering out stop words from a piece of natural language data. The list of stop words is chosen by hand rather than generated automatically. Although there is no definitive list of stop words, the Cornell Stop Words are widely accepted by the natural language processing community as sufficient for most purposes.
Stemming | The process of reducing inflected (or sometimes derived) words to their stem, base or root form. This is important in document search or comparison, where we want related words to be compared directly. A widely accepted implementation is the Porter Stemming Algorithm, which we will adopt in this project.
Deriving | When visualising the results of analysis on stemmed text, we found that stems do not always ‘suggest’ the words they represent, and in some cases the stem is not a word at all. To counter this we have developed a lookup table so that, after any processing, each stem can be restored to a derived form: the word most commonly reduced to that stem across the whole dataset (all grants).
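
To make the three steps concrete, here is a rough sketch of how stopping, stemming and deriving fit together. Everything named in it is hypothetical: STOP_WORDS is a tiny made-up list rather than the Cornell list, and toy_stem is a crude suffix stripper standing in for the Porter stemmer, used only so that the example runs without extra libraries.

```python
from collections import Counter, defaultdict

# Hypothetical, tiny stop list for illustration (NOT the Cornell list).
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "for"}


def toy_stem(word):
    """Crude suffix stripper standing in for the Porter stemmer."""
    for suffix in ("ations", "ation", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_derivation_table(documents):
    """Map each stem to the most common original word that produced it."""
    usage = defaultdict(Counter)
    for doc in documents:
        for word in doc:
            if word in STOP_WORDS:  # stopping
                continue
            usage[toy_stem(word)][word] += 1  # stemming, with a tally
    # Deriving: pick the most frequent surface form for every stem.
    return {stem: counts.most_common(1)[0][0] for stem, counts in usage.items()}


# Made-up grant text, already lower-cased and split into words.
grants = [
    ["the", "simulation", "of", "protein", "folding"],
    ["simulations", "and", "folding", "of", "proteins"],
    ["simulations", "for", "protein", "folds"],
]
table = build_derivation_table(grants)
# The stem "simul" is not a word, but it is displayed as "simulations".
```

Analysis is carried out on the stems, and the table is consulted only when results are shown to a reader.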