# Simple text classification with a little F# (Part 2)

My previous post laid out the basics for calculating term frequencies from text documents. This post will describe a method for supplementing term frequency that will make it more useful.

## Why term frequency is not enough

The number of times a term appears in a document was perhaps the most obvious statistic we could derive from our bag of words model. Once you start gathering this statistic across a variety of documents, you may notice that certain terms are *ubiquitous*. Because they appear in nearly every document, they’re almost useless as a signifier of a given document’s class.

There’s another useful statistic that can combat this problem by augmenting our term frequencies: inverse document frequency.

### Term Frequency - Inverse Document Frequency

We can negatively weight a term’s frequency relative to the number of documents it appears in across the *whole* corpus. For example, if your corpus is all about real estate, the word `address` may have an extraordinarily high document frequency. We want to negatively weight that word to decrease its importance.

Here’s a function that will calculate the document frequency of a given term across a set of `TrainingSample`s:

```
// Count how many samples contain at least one occurrence of the term.
let docFreq allSamples term =
    let sampleHasTerm sample = sample.Frequencies |> Seq.exists (fun w -> fst w = term)
    allSamples |> Seq.filter sampleHasTerm |> Seq.length
```

And a simple function for calculating the inverse document frequency (adding 1 to `docFrequency` avoids division by zero when a term isn’t found in any document):

```
open System

let calcInverseDocFreq sampleCount docFrequency =
    Math.Log (float sampleCount / (float docFrequency + 1.0))
```
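To see the effect, here’s a minimal sketch. The `TrainingSample` shape is assumed from the previous post, and the mini-corpus is invented for illustration; a term found in every document gets a much lower (here, even negative) IDF than a rarer one:

```
// Assumed shapes from Part 1 of this series.
type Term = string
type TrainingSample = { Category: string; Frequencies: (Term * int) list }

// Tiny invented corpus: "address" appears in every sample, "escrow" in just one.
let samples = [
    { Category = "listing";  Frequencies = [ "address", 2; "escrow", 1 ] }
    { Category = "listing";  Frequencies = [ "address", 1 ] }
    { Category = "mortgage"; Frequencies = [ "address", 3 ] }
]

let n = Seq.length samples
calcInverseDocFreq n (docFreq samples "address") // log(3/4): negative, the ubiquitous term is discounted
calcInverseDocFreq n (docFreq samples "escrow")  // log(3/2): positive, the rarer term keeps its weight
```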

We’ll define a new tuple type for this using `float`, since the TF-IDF value won’t be a whole number.

```
type TFIDF = Term * float
```

Words like `the`, `and`, `with`, etc. aren’t very useful for text classification purposes, regardless of your subject matter. These are commonly referred to as stop words. You would generally create a blacklist of these words to be filtered from your input before you calculate TF-IDF.
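A minimal version of that filter might look like the following (the stop word list here is a tiny illustrative sample, not an exhaustive one):

```
// A few common English stop words; a real blacklist would be much longer.
let stopWords = Set.ofList [ "the"; "and"; "with"; "a"; "an"; "of"; "to" ]

// Drop any (term, count) pair whose term is on the blacklist.
let removeStopWords frequencies =
    frequencies |> Seq.filter (fun (term, _) -> not (Set.contains term stopWords))
```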

### Crunching numbers

Using the functions we’ve just defined, we can analyze all our samples and create a map of IDF values for all our terms:

```
let getTermIDFMap allSamples terms =
    let docFreq = docFreq allSamples
    let sampleCount = Seq.length allSamples
    let idf = docFreq >> calcInverseDocFreq sampleCount
    terms
    |> Array.ofSeq
    |> Array.Parallel.map (fun t -> t, idf t)
    |> Map.ofSeq
```

In summary, we’re taking a set of all samples and all terms as input, and returning a `Map` of each `Term` to its IDF value. Let’s break down what’s going on in this function:

- The `docFreq` function is being partially applied. This gives us a shorthand version of the function that will always take `allSamples` as its first argument.
- We’re defining an `idf` function that composes our partially applied `docFreq` with a partially applied `calcInverseDocFreq`. Function composition allows us to pipeline these partially applied functions, creating a new function that takes only the one argument of interest: `term`. Note that the order of function arguments plays an important role in function composition; this particular example wouldn’t work if `term` weren’t the last argument to each composed function.
- Notice the use of `Array.Parallel.map`. This performs our `map` in parallel, greatly reducing the time required to compute IDF when multiple cores are available!
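Putting it together, you might collect the distinct terms and build the IDF map in one go. This wiring is a sketch; `allSamples` stands for whatever sample collection you built up in Part 1:

```
// Gather every distinct term seen across the corpus (hypothetical helper wiring).
let allTerms =
    allSamples
    |> Seq.collect (fun s -> s.Frequencies |> Seq.map fst)
    |> Seq.distinct

let idfMap = getTermIDFMap allSamples allTerms
// idfMap now maps each Term to its corpus-wide IDF weight.
```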

#### Augmenting Term Frequency with Inverse Document Frequency

This function takes our `Map` of `Term` to IDF values and a `TermFrequency`, and gives us a weighted frequency as a `TFIDF` tuple:

```
let tfidf (idfs: Map<Term, float>) tf =
    let term, frequency = tf
    let hasWeight = idfs.ContainsKey term
    let idf = if hasWeight then idfs.[term] else 1.0
    term, float frequency * idf

let weightTermFreqs idfs = Seq.map (tfidf idfs)
```
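Applied to a single document’s frequencies, the weighting looks like this (the IDF values and counts are invented for illustration):

```
// Hypothetical IDF map and raw term frequencies for one document.
let idfs = Map.ofList [ "address", 0.3; "escrow", 1.5 ]
let rawFreqs = [ "address", 4; "escrow", 2 ]

let weighted = weightTermFreqs idfs rawFreqs
// "address" (common, low IDF): 4.0 * 0.3 ≈ 1.2, damped
// "escrow" (rarer, high IDF): 2.0 * 1.5 = 3.0, boosted
```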

Thanks to TF-IDF, common terms will now carry less weight in our model.

## Next steps

We’ve got the TF-IDF feature extraction covered. In the next post we’ll start comparing our extracted features to build a working classifier!