My previous post laid out the basics of calculating term frequencies from text documents. This post describes a statistic that supplements term frequency and makes it much more useful.

Why term frequency is not enough

The number of times a term appears in a document was perhaps the most obvious statistic we could derive from our bag-of-words model. Once you start gathering this statistic from a variety of documents, you may realize that certain terms are ubiquitous. Because they appear in nearly every document, they’re almost useless as a signifier of a given document’s class.

There’s another useful statistic that can combat this problem by augmenting our term frequencies: inverse document frequency.

Term Frequency - Inverse Document Frequency

We can weight a term’s frequency down in proportion to the number of documents it appears in across the corpus. For example, if your corpus is all about real estate, the word address may have an extraordinarily high document frequency. We want to down-weight that word to decrease its importance.

Here’s a function that will calculate the document frequency of a given term in a TrainingSample set:

let docFreq allSamples term =
    // Count the samples whose term frequencies contain the given term.
    let sampleHasTerm sample = sample.Frequencies |> Seq.exists (fun (t, _) -> t = term)
    allSamples |> Seq.filter sampleHasTerm |> Seq.length
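
For reference, here’s roughly what the types from my previous post are assumed to look like, along with a quick run of docFreq (the sample data below is made up purely for illustration):

type Term = string
type TermFrequency = Term * int

type TrainingSample =
    { Class: string
      Frequencies: TermFrequency list }

// A tiny hypothetical corpus of two samples.
let samples =
    [ { Class = "listing"; Frequencies = [ "address", 3; "garden", 1 ] }
      { Class = "listing"; Frequencies = [ "address", 2; "garage", 2 ] } ]

docFreq samples "address" |> printfn "%d"  // 2 (the term appears in both samples)
docFreq samples "garage" |> printfn "%d"   // 1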

And a simple function for calculating the inverse document frequency (adding 1 to docFrequency will avoid division by zero when a term isn’t found in a set of documents):

open System

// Math.Log is the natural log; the +1 in the denominator avoids division by zero.
let calcInverseDocFreq sampleCount docFrequency =
    Math.Log (float sampleCount / (float docFrequency + 1.0))
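
To get a feel for the numbers, suppose a corpus of 100 samples: a near-ubiquitous term scores close to zero, while a rare one scores much higher.

calcInverseDocFreq 100 90  // log(100 / 91) ≈ 0.094 (appears almost everywhere)
calcInverseDocFreq 100 2   // log(100 / 3)  ≈ 3.507 (rare, so heavily weighted)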

We’ll define a new tuple type for this using float, since the TF-IDF value won’t be a whole number.

type TFIDF = Term * float

Words like the, and, with, etc. aren’t very useful for text classification purposes, regardless of your subject matter. These are commonly referred to as stop words. You would generally create a blacklist of these words to be filtered from your input before you calculate TF-IDF.
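
As a minimal sketch of that filtering step (the stopWords set here is a hypothetical stand-in; real blacklists run to hundreds of entries):

let stopWords = Set.ofList [ "the"; "and"; "with"; "a"; "an"; "of"; "to" ]

// Remove stop words from a sample before its term frequencies are weighted.
let removeStopWords sample =
    { sample with
        Frequencies = sample.Frequencies |> List.filter (fun (t, _) -> not (Set.contains t stopWords)) }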

Crunching numbers

Using the functions we’ve just defined, we can analyze all our samples and create a map of IDF values for all our terms:

let getTermIDFMap allSamples terms =
    let docFreq = docFreq allSamples
    let sampleCount = Seq.length allSamples
    let idf = docFreq >> calcInverseDocFreq sampleCount
    terms
    |> Array.ofSeq
    |> Array.Parallel.map (fun t -> t, idf t)
    |> Map.ofSeq

In summary, we’re taking a set of all samples and all terms as input, and returning a Map of each Term to its IDF value. Let’s break down what’s going on in this function:

  1. The docFreq function is being partially applied. This gives us a shorthand version of the function with allSamples permanently bound as its first argument.
  2. We’re defining an idf function that composes our partially applied docFreq with a partially applied calcInverseDocFreq. Function composition lets us pipeline these partially applied functions, creating a new function that takes only the one argument we care about: term. Note that argument order plays an important role in composition; this particular example wouldn’t work if term weren’t the last argument of each composed function (see the sketch after this list).
  3. Notice the use of Array.Parallel.map. This will perform our map function in parallel, greatly reducing the time required to compute IDF when multiple cores are available!
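
To make the equivalence in step 2 concrete, here’s the composed idf from getTermIDFMap next to a hand-written version; both build a function that takes only a term (the demo wrapper exists purely for illustration):

let demoIdf allSamples =
    let docFreq = docFreq allSamples                  // Term -> int
    let sampleCount = Seq.length allSamples
    // Composed form: docFreq's output feeds straight into the
    // partially applied calcInverseDocFreq.
    let idfComposed = docFreq >> calcInverseDocFreq sampleCount
    // Equivalent explicit form.
    let idfExplicit term = calcInverseDocFreq sampleCount (docFreq term)
    idfComposed, idfExplicit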

Augmenting Term Frequency with Inverse Document Frequency

This function takes our Map of Term to IDF values, along with a single TermFrequency, and gives us back a weighted TFIDF:

let tfidf (idfs: Map<Term, float>) tf =
    let term, frequency = tf
    let hasWeight = idfs.ContainsKey term
    // A term with no known IDF keeps its raw frequency (neutral weight of 1.0).
    let idf = if hasWeight then idfs.[term] else 1.0
    term, float frequency * idf

// Map the weighting over a whole sequence of term frequencies.
let weightTermFreqs idfs = Seq.map (tfidf idfs)
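
Wiring it all together, a rough end-to-end pass might look something like this (samples is the hypothetical corpus from earlier, and the whole snippet is illustrative rather than prescriptive):

// Gather the distinct terms across the corpus.
let allTerms =
    samples
    |> Seq.collect (fun s -> s.Frequencies |> Seq.map fst)
    |> Seq.distinct

// Build the Term -> IDF map, then weight one sample's frequencies.
let idfs = getTermIDFMap samples allTerms
let weighted = weightTermFreqs idfs (Seq.head samples).Frequencies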

Thanks to TF-IDF, common terms will now carry less weight in our model.

Next steps

We’ve got the TF-IDF feature extraction covered. In the next post we’ll start comparing our extracted features to build a working classifier!