Simple text classification with a little F# (Part 2)
My previous post laid out the basics of calculating term frequencies from text documents. This post describes a way to supplement term frequency that makes it more useful.
Why term frequency is not enough
The number of times a term appears in a document was perhaps the most obvious statistic we could derive from our bag of words model. Once you start gathering this stat from a variety of documents, you may realize that certain terms are ubiquitous. Because they appear in nearly every document, they’re almost useless as a signifier of a given document’s class.
There’s another useful statistic that can combat this problem by augmenting our term frequencies: inverse document frequency.
Term Frequency - Inverse Document Frequency
We can negatively weight a term’s frequency relative to the number of documents it appears in across the corpus. For example, if your corpus is all about real estate, the word `address` may have an extraordinarily high document frequency. We want to negatively weight that word to decrease its importance.
Here’s a function that will calculate the document frequency of a given term in a `TrainingSample` set:
```fsharp
// Count how many samples contain at least one occurrence of the term.
let docFreq allSamples term =
    let sampleHasTerm sample =
        sample.Frequencies |> Seq.exists (fun w -> fst w = term)
    allSamples |> Seq.filter sampleHasTerm |> Seq.length
```
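As a quick sanity check, here’s `docFreq` at work on a tiny hand-built corpus. The `Term`, `TermFrequency`, and `TrainingSample` definitions below are only a sketch of what the previous post established, included so the example is self-contained:

```fsharp
// Sketch of the Part 1 types, assumed here purely for illustration.
type Term = string
type TermFrequency = Term * int
type TrainingSample = { Class: string; Frequencies: TermFrequency list }

// A toy corpus: two real estate listings and one news item.
let samples =
    [ { Class = "listing"; Frequencies = [ "address", 3; "garden", 1 ] }
      { Class = "listing"; Frequencies = [ "address", 2; "garage", 2 ] }
      { Class = "news";    Frequencies = [ "election", 4 ] } ]

docFreq samples "address" // 2 - found in two of the three samples
docFreq samples "garden"  // 1
```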
And a simple function for calculating the inverse document frequency (adding 1 to `docFrequency` will avoid division by zero when a term isn’t found in a set of documents):
```fsharp
open System // for Math.Log

// IDF = log(total samples / (documents containing the term + 1)).
let calcInverseDocFreq sampleCount docFrequency =
    Math.Log (float sampleCount / (float docFrequency + 1.0))
```
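To get a feel for the scale of these weights, here’s a quick sanity check (values rounded):

```fsharp
// With 100 samples, a near-ubiquitous term is weighted close to zero...
calcInverseDocFreq 100 90 // log(100.0 / 91.0) ≈ 0.094
// ...while a rare term receives a much larger weight.
calcInverseDocFreq 100 2  // log(100.0 / 3.0)  ≈ 3.507
```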
We’ll define a new tuple type for this using `float`, since the TF-IDF value won’t be a whole number.
```fsharp
type TFIDF = Term * float
```
Words like `the`, `and`, `with`, etc. aren’t very useful for text classification purposes, regardless of your subject matter. These are commonly referred to as stop words. You would generally create a blacklist of these words to be filtered from your input before you calculate TF-IDF.
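These posts don’t include a stop word list, but a minimal filtering step might look something like this sketch (the word set below is a tiny illustrative sample, not a real blacklist):

```fsharp
// A tiny illustrative stop word set; a real blacklist would be far larger.
let stopWords = Set.ofList [ "the"; "and"; "with"; "a"; "of"; "to" ]

// Drop stop words from a sample's term frequencies before computing TF-IDF.
let removeStopWords (freqs: TermFrequency seq) =
    freqs |> Seq.filter (fun (term, _) -> not (Set.contains term stopWords))
```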
Crunching numbers
Using the functions we’ve just defined, we can analyze all our samples and create a map of IDF values for all our terms:
```fsharp
let getTermIDFMap allSamples terms =
    let docFreq = docFreq allSamples
    let sampleCount = Seq.length allSamples
    let idf = docFreq >> calcInverseDocFreq sampleCount
    terms
    |> Array.ofSeq
    |> Array.Parallel.map (fun t -> t, idf t)
    |> Map.ofSeq
```
In summary, we’re taking a set of all samples and all terms as input, and returning a `Map` of each `Term` to its IDF value. Let’s break down what’s going on in this function:

- The `docFreq` function is being partially applied. This allows us to reference a shorthand version of the function that will always take `allSamples` as its first argument.
- We’re defining an `idf` function that composes our partially applied `docFreq` with a partially applied `calcInverseDocFreq`. Function composition allows us to pipeline these partially applied functions, creating a new function that takes only one argument of interest: `term`. Note that the order of function arguments plays an important role in composition; this particular example wouldn’t work if `term` weren’t the last argument to each composed function (the sketch after this list spells the composition out by hand).
- Notice the use of `Array.Parallel.map`. This performs the mapping in parallel, greatly reducing the time required to compute IDF when multiple cores are available!
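If the point-free style in `idf` is unfamiliar, it can help to see the composition written out by hand. The two standalone definitions below are equivalent (named here purely for illustration):

```fsharp
// Point-free: feed the document frequency straight into the IDF calculation.
let idfComposed allSamples sampleCount =
    docFreq allSamples >> calcInverseDocFreq sampleCount

// Explicit: the same two calls, spelled out with a named term argument.
let idfExplicit allSamples sampleCount term =
    calcInverseDocFreq sampleCount (docFreq allSamples term)
```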
Augmenting Term Frequency with Inverse Document Frequency
This function takes our `Map` of `Term` to IDF values and a `TermFrequency`, and gives us a weighted `TermFrequency`:
```fsharp
// Weight a single term frequency by its IDF, defaulting to 1.0
// when the term has no entry in the map.
let tfidf (idfs: Map<Term, float>) tf =
    let term, frequency = tf
    let hasWeight = idfs.ContainsKey term
    let idf = if hasWeight then idfs.[term] else 1.0
    term, float frequency * idf

let weightTermFreqs idfs = Seq.map (tfidf idfs)
```
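Putting the pieces together on the toy corpus from earlier (the `allTerms` helper is a hypothetical addition, not something from the original posts):

```fsharp
// Gather every distinct term that appears across all samples.
let allTerms =
    samples
    |> Seq.collect (fun s -> s.Frequencies |> Seq.map fst)
    |> Seq.distinct

let idfMap = getTermIDFMap samples allTerms

// Weight the first sample's raw term frequencies by IDF.
let firstSample = List.head samples
let weighted = weightTermFreqs idfMap firstSample.Frequencies
// "address" appears in two of the three samples, so its IDF, log(3.0 / 3.0) = 0.0,
// zeroes it out, while a rarer term like "garden" keeps a nonzero
// weight of log(3.0 / 2.0) ≈ 0.405.
```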
Thanks to TF-IDF, common terms will now carry less weight in our model.
Next steps
We’ve got the TF-IDF feature extraction covered. In the next post we’ll start comparing our extracted features to build a working classifier!