My previous post laid out the basics for calculating term frequencies from text documents. This post will describe a method for supplementing term frequency that will make it more useful.
Why term frequency is not enough
The number of times a term appears in a document was perhaps the most obvious statistic we could derive from our bag of words model. Once you start gathering this stat from a variety of documents, you may realize that certain terms are ubiquitous. Because they appear in nearly every document, they’re almost useless as a signifier of a given document’s class.
There’s another useful statistic that can combat this problem by augmenting our term frequencies: inverse document frequency.
Term Frequency - Inverse Document Frequency
We can negatively weight a term's frequency relative to the number of documents in which it appears. For example, if your corpus is all about real estate, the word `address` may have an extraordinarily high document frequency. We want to negatively weight that word to decrease its importance.
Here's a function that will calculate the document frequency of a given term across a set of samples:

```fsharp
let docFreq allSamples term =
    let sampleHasTerm sample =
        sample.Frequencies |> Seq.exists (fun w -> fst w = term)
    allSamples
    |> Seq.filter sampleHasTerm
    |> Seq.length
```
And a simple function for calculating the inverse document frequency (adding 1 to `docFrequency` will avoid division by zero when a term isn't found in a set of documents):

```fsharp
// Math.Log lives in System, so this assumes `open System` at the top of the file
let calcInverseDocFreq sampleCount docFrequency =
    Math.Log (float sampleCount / (float docFrequency + 1.0))
```
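To get a feel for the weighting, here's a quick sanity check with hypothetical numbers (the function is repeated so the snippet stands alone):

```fsharp
open System

let calcInverseDocFreq sampleCount docFrequency =
    Math.Log (float sampleCount / (float docFrequency + 1.0))

// A term appearing in 99 of 100 documents is weighted to nearly zero:
calcInverseDocFreq 100 99 |> printfn "%f"  // log(100/100) = 0.0

// A term appearing in only 4 documents gets a much larger weight:
calcInverseDocFreq 100 4 |> printfn "%f"   // log(100/5) ≈ 3.0
```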
We'll define a new tuple type for this using `float`, since the TF-IDF value won't be a whole number:

```fsharp
type TFIDF = Term * float
```
Extremely common words like a, the, with, etc. aren't very useful for text classification purposes, regardless of your subject matter. These are commonly referred to as stop words. You would generally create a blacklist of these words to be filtered from your input before you calculate TF-IDF.
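A stop-word filter can be as simple as a set lookup. This is a sketch rather than code from the series; the word list and the `(term, frequency)` tuple shape of the input are assumptions:

```fsharp
// A minimal stop-word blacklist (you'd extend this for a real corpus)
let stopWords = set [ "a"; "an"; "the"; "and"; "with"; "of"; "to" ]

// Drop any (term, frequency) pair whose term is a stop word
let removeStopWords frequencies =
    frequencies
    |> Seq.filter (fun (term, _) -> not (Set.contains term stopWords))
```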
Using the functions we've just defined, we can analyze all our samples and create a map of IDF values for all our terms:

```fsharp
let getTermIDFMap allSamples terms =
    let docFreq = docFreq allSamples
    let sampleCount = Seq.length allSamples
    let idf = docFreq >> calcInverseDocFreq sampleCount
    terms
    |> Array.ofSeq
    |> Array.Parallel.map (fun t -> t, idf t)
    |> Map.ofSeq
```
In summary, we're taking a set of all samples and all terms as input, and returning a `Map` of each `Term` to its IDF value. Let's break down what's going on in this function:

- The `docFreq` function is being partially applied. This allows us to reference a shorthand version of the function that will always take `allSamples` as its first argument.
- We're defining an `idf` function that composes our partially applied `docFreq` with a partially applied `calcInverseDocFreq`. Functional composition allows us to pipeline these partially applied functions, creating a new function that takes only one argument of interest: `term`. Note that the order of function arguments plays an important role in function composition. This particular example wouldn't work if `term` wasn't the last argument to each composed function.
- Notice the use of `Array.Parallel.map`. This will perform our `map` function in parallel, greatly reducing the time required to compute IDF when multiple cores are available!
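If partial application and composition are new to you, the same pattern can be seen with simpler stand-in functions:

```fsharp
let add a b = a + b
let double x = x * 2

// Partially apply `add`: addFive now waits for a single argument
let addFive = add 5

// Compose the two; the argument flows left to right through the pipeline
let addFiveThenDouble = addFive >> double

addFiveThenDouble 10 |> printfn "%d"  // (10 + 5) * 2 = 30
```

This is why argument order matters: composition feeds the pipeline's input to the *last* remaining parameter of each partially applied function.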
Augmenting Term Frequency with Inverse Document Frequency
This function takes our `Map` of `Term` to IDF values and a `TermFrequency`, and gives us a weighted `TFIDF`:

```fsharp
let tfidf (idfs: Map<Term, float>) tf =
    let term, frequency = tf
    let hasWeight = idfs.ContainsKey term
    let idf = if hasWeight then idfs.[term] else 1.0
    term, float frequency * idf

let weightTermFreqs idfs = Seq.map (tfidf idfs)
```
Thanks to TF-IDF, common terms will now carry less weight in our model.
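Here's a hypothetical end-to-end use of the weighting step, with the IDF map and frequencies made up for illustration (it assumes `Term` is a string alias, and repeats the functions so the snippet stands alone):

```fsharp
type Term = string

let tfidf (idfs: Map<Term, float>) tf =
    let term, frequency = tf
    let idf = if idfs.ContainsKey term then idfs.[term] else 1.0
    term, float frequency * idf

let weightTermFreqs idfs = Seq.map (tfidf idfs)

// Made-up IDF weights: "address" is ubiquitous, "mortgage" is rarer
let idfs = Map.ofList [ "address", 0.05; "mortgage", 1.9 ]

weightTermFreqs idfs [ "address", 12; "mortgage", 3 ]
|> List.ofSeq
// "address" is down-weighted (12 * 0.05 = 0.6), while
// "mortgage" retains its influence (3 * 1.9 = 5.7)
```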
We’ve got the TF-IDF feature extraction covered. In the next post we’ll start comparing our extracted features to build a working classifier!