My previous post laid out the basics for calculating term frequencies from text documents. This post will describe a method for supplementing term frequency that will make it more useful.

## Why term frequency is not enough

The number of times a term appears in a document was perhaps the most obvious statistic we could derive from our bag of words model. Once you start gathering this stat from a variety of documents, you may realize that certain terms are ubiquitous. Because they appear in nearly every document, they’re almost useless as a signifier of a given document’s class.

There’s another useful statistic that can combat this problem by augmenting our term frequencies: inverse document frequency.

### Term Frequency - Inverse Document Frequency

We can negatively weight a term’s frequency relative to the number of documents it appears in. For example, if your corpus is all about real estate, the word `address` may have an extraordinarily high document frequency. We want to negatively weight that word to decrease its importance.

Here’s a function that will calculate the document frequency of a given term in a `TrainingSample` set:

```fsharp
let docFreq allSamples term =
    let sampleHasTerm sample =
        sample.Frequencies |> Seq.exists (fun w -> fst w = term)
    allSamples |> Seq.filter sampleHasTerm |> Seq.length
```

And a simple function for calculating the inverse document frequency (adding 1 to `docFrequency` will avoid division by zero when a term isn’t found in a set of documents):

```fsharp
open System

let calcInverseDocFreq sampleCount docFrequency =
    Math.Log (float sampleCount / (float docFrequency + 1.0))
```
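To get a feel for the weighting, here’s a quick illustration (the corpus size and document frequencies are made up for the example):

```fsharp
// With 1,000 samples, a near-ubiquitous term gets a weight close to zero,
// while a rare term gets a large one:
calcInverseDocFreq 1000 900   // log(1000.0 / 901.0) ≈ 0.10
calcInverseDocFreq 1000 3     // log(1000.0 / 4.0)   ≈ 5.52
```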

We’ll define a new tuple type for this using `float`, since a TF-IDF value generally won’t be a whole number.

```fsharp
type TFIDF = Term * float
```

Words like `the`, `and`, `with`, etc. aren’t very useful for text classification purposes, regardless of your subject matter. These are commonly referred to as stop words. You would generally create a blacklist of these words to be filtered from your input before you calculate TF-IDF.
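As a sketch, such a filter could look like this (the `stopWords` set here is a tiny illustrative sample, not a real stop word list):

```fsharp
// Illustrative only: a real stop word list would be much longer
let stopWords = set [ "the"; "and"; "with"; "a"; "an"; "of" ]

// Drop any (term, count) pair whose term is a stop word
let removeStopWords frequencies =
    frequencies |> Seq.filter (fun (term, _) -> not (stopWords.Contains term))
```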

### Crunching numbers

Using the functions we’ve just defined, we can analyze all our samples and create a map of IDF values for all our terms:

```fsharp
let getTermIDFMap allSamples terms =
    let docFreq = docFreq allSamples
    let sampleCount = Seq.length allSamples
    let idf = docFreq >> calcInverseDocFreq sampleCount
    terms
    |> Array.ofSeq
    |> Array.Parallel.map (fun t -> t, idf t)
    |> Map.ofSeq
```

In summary, we’re taking a set of all samples and all terms as input, and returning a `Map` of each `Term` to its IDF value. Let’s break down what’s going on in this function:

1. The `docFreq` function is being partially applied. This allows us to reference a shorthand version of this function that will always take `allSamples` as its first argument.
2. We’re defining an `idf` function that composes our partially applied `docFreq` with a partially applied `calcInverseDocFreq`. Functional composition allows us to pipeline these partially applied functions, creating a new function that takes only one argument of interest: `term`. Note that the order of function arguments plays an important role in function composition. This particular example wouldn’t work if `term` wasn’t the last argument to each composed function.
3. Notice the use of `Array.Parallel.map`. This performs the mapping in parallel, greatly reducing the time required to compute IDF when multiple cores are available!
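To make point 2 concrete: assuming `docFreq` has already been partially applied with `allSamples` (as it is inside `getTermIDFMap`), the composed `idf` function is just shorthand for explicitly threading the term through both functions:

```fsharp
// These two definitions are equivalent:
let idf = docFreq >> calcInverseDocFreq sampleCount
let idf' term = calcInverseDocFreq sampleCount (docFreq term)
```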

#### Augmenting Term Frequency with Inverse Document Frequency

This function takes our `Map` of `Term` to IDF values, and a `TermFrequency`, and gives us a weighted `TermFrequency`:

```fsharp
let tfidf (idfs: Map<Term, float>) tf =
    let term, frequency = tf
    let hasWeight = idfs.ContainsKey term
    let idf = if hasWeight then idfs.[term] else 1.0
    term, float frequency * idf

let weightTermFreqs idfs = Seq.map (tfidf idfs)
```

Thanks to TF-IDF, common terms will now carry less weight in our model.
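Putting it all together, a typical end-to-end pass might look like this (assuming, as in the previous post, that a `TrainingSample` exposes its term frequencies via a `Frequencies` field of `Term * int` pairs; adjust the field access to match your actual record):

```fsharp
// Collect the distinct vocabulary across all samples
let allTerms =
    allSamples
    |> Seq.collect (fun s -> s.Frequencies |> Seq.map fst)
    |> Seq.distinct

// Compute IDF once for the whole corpus, then weight each sample's frequencies
let idfs = getTermIDFMap allSamples allTerms
let weightedSamples =
    allSamples |> Seq.map (fun s -> weightTermFreqs idfs s.Frequencies)
```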

## Next steps

We’ve got the TF-IDF feature extraction covered. In the next post we’ll start comparing our extracted features to build a working classifier!