My previous post described the TF-IDF technique. Now we’re going to put it all together to build a working text classifier.

Similarities

Our simple classifier is based on a similarity (or distance) function. All we need to do is compare an unknown document with each known document. Once we know which known document is most similar to our unknown document, we can make a pretty good guess about which class the unknown document belongs to.

Let’s define a similarity function:

let similarity unknown known =
    let getTerms = Seq.map fst >> Set.ofSeq
    let getWeights = Seq.map snd

    let unknownTerms = getTerms unknown
    let unknownWeights = getWeights unknown

    let knownTerms = getTerms known
    let knownWeights = getWeights known

    let commonTerms = Set.intersect unknownTerms knownTerms
    let isCommonTerm term = Set.contains (fst term) commonTerms
    let commonWeights tfidfs = tfidfs |> Seq.filter isCommonTerm |> getWeights
    
    let commonUnknownWeights = commonWeights unknown
    let commonKnownWeights = commonWeights known

    (Seq.sum commonKnownWeights + Seq.sum commonUnknownWeights)
    / (Seq.sum unknownWeights + Seq.sum knownWeights)

Now a play-by-play analysis:

  • We have two arguments: an unknown document and a known document, each a sequence of TFIDF values.
  • getTerms is a function to “decompose” a sequence of TFIDF values. A TFIDF is a tuple whose first (fst) item is a Term, so the output is a set of Term values.
  • getWeights is about the same, only it gets the second (snd) value from the pair: the actual TF-IDF weighting.
  • Next we’ll get the terms and weights of the unknown document, and then the known document: unknownTerms, unknownWeights, knownTerms, knownWeights.
  • commonTerms is the intersection set of both known and unknown Terms.
  • isCommonTerm is a function that tells us if a given Term is in the set of commonTerms.
  • commonWeights will take a sequence of TF-IDF values and only give back the ones that correspond to the set of common terms.
  • Next we’ll get the common weights for both the known and unknown document: commonUnknownWeights and commonKnownWeights.
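  • Finally, we calculate the similarity: the combined weight of the common terms divided by the combined weight of all terms in both documents. With non-negative weights this gives us a number between 0 and 1, where 0 means the documents share no terms and 1 means every term appears in both documents.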
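
To make the arithmetic concrete, here is a small worked example. The document shape (string terms paired with float weights) and the sample weights are assumptions made for illustration, and the function is a condensed restatement of similarity so that the snippet runs on its own.

```fsharp
// Assumed shapes for illustration: a term is a string and a
// document is a sequence of (term, weight) pairs.
let similarity (unknown: seq<string * float>) (known: seq<string * float>) =
    let terms = Seq.map fst >> Set.ofSeq
    let common = Set.intersect (terms unknown) (terms known)
    // Sum only the weights whose term appears in both documents.
    let sumCommon doc =
        doc |> Seq.filter (fun (t, _) -> Set.contains t common) |> Seq.sumBy snd
    (sumCommon known + sumCommon unknown)
    / (Seq.sumBy snd unknown + Seq.sumBy snd known)

// Made-up weights, chosen so the sums are easy to follow.
let unknownDoc = [ ("cat", 0.5); ("dog", 0.25); ("fish", 0.25) ]
let knownDoc   = [ ("cat", 0.5); ("dog", 0.5) ]

// Common terms: cat and dog.
// Numerator:   (0.5 + 0.5) + (0.5 + 0.25)        = 1.75
// Denominator: (0.5 + 0.25 + 0.25) + (0.5 + 0.5) = 2.0
let score = similarity unknownDoc knownDoc   // 0.875
```

The term fish only appears in the unknown document, so its weight counts toward the denominator but not the numerator, which is what pulls the score below 1.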

Finding a match

Now we can evaluate our similarity function between our unknown document and every known document:

let calcCategoryProbabilities (trainedData: TrainedData) query =        
    //calculate similarity score per category, using highest score from each
    let calcSim ws = similarity query ws
    let scoreSamples samples = samples |> Seq.map calcSim
    let scores =
        trainedData
        |> Seq.map (fun kvp -> kvp.Key, (scoreSamples kvp.Value) |> Seq.max)

    //sort scores descending
    scores |> Seq.sortBy snd |> List.ofSeq |> List.rev

  • Our two arguments are trainedData, representing the entirety of our training set, and query, our unknown document (a sequence of TFIDF values).
  • calcSim is shorthand for similarity with query already supplied as the first argument: it calculates the similarity between a weighted sample (ws) and our query document.
  • scoreSamples will give us a sequence of similarities between a sequence of weighted samples and our query document.
  • scores is the result of feeding our trainedData into a map function that will give us the highest similarity score per document category/class.
  • Finally, we sort the scores in descending order, giving us a list of document categories paired with how similar our query document is to each one. The first result is the document category that query is most similar to.
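
Here is a rough sketch of how the ranked list might be used to pick a category. The Sample and Trained types, the rank and overlap names, and the sample data are all hypothetical. The real TrainedData type is not shown in this post, so this sketch assumes a Map from category name to a list of samples, and it takes the similarity function as a parameter so that the snippet stands alone.

```fsharp
// Hypothetical shapes: category names are strings, and each category
// maps to a list of samples, each sample a list of (term, weight) pairs.
type Sample = (string * float) list
type Trained = Map<string, Sample list>

// Same ranking idea as above, with the similarity function passed in.
let rank (sim: Sample -> Sample -> float) (trained: Trained) (query: Sample) =
    trained
    |> Seq.map (fun kvp -> kvp.Key, kvp.Value |> Seq.map (sim query) |> Seq.max)
    |> Seq.sortBy snd
    |> List.ofSeq
    |> List.rev

// Stand-in similarity for this demo only:
// the fraction of query terms that also appear in the sample.
let overlap (query: Sample) (sample: Sample) =
    let sampleTerms = sample |> List.map fst |> Set.ofList
    let hits = query |> List.filter (fun (t, _) -> Set.contains t sampleTerms)
    float (List.length hits) / float (List.length query)

let trained : Trained =
    Map.ofList
        [ ("pets",   [ [ ("cat", 0.5); ("dog", 0.5) ]; [ ("fish", 1.0) ] ])
          ("sports", [ [ ("ball", 0.5); ("goal", 0.5) ] ]) ]

let ranked = rank overlap trained [ ("cat", 0.5); ("ball", 0.25); ("dog", 0.25) ]

// The head of the list is the best guess: pets scores 2/3, sports 1/3.
let bestCategory = ranked |> List.head |> fst   // "pets"
```

One thing to watch for: Seq.max throws on an empty sequence, so every category needs at least one sample.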

All done

We’ve built a very rudimentary document classification program. It was written for tutorial purposes, so it’s not highly optimized or sophisticated. You could replace the similarity/distance function with something fancier. You could extract different features besides TF-IDF. The best thing you could do is pick one of the many existing libraries for machine learning and learn how to use it instead.
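
As one example of a fancier function, here is a minimal sketch of cosine similarity, a common choice for comparing TF-IDF vectors. It is not part of the classifier above; the name cosineSimilarity and the (string, float) document shape are assumptions made for illustration.

```fsharp
// Cosine similarity over sparse (term, weight) documents.
// Assumes each term appears at most once per document.
let cosineSimilarity (a: seq<string * float>) (b: seq<string * float>) =
    let weightsA = Map.ofSeq a
    let weightsB = Map.ofSeq b
    // The dot product only needs the terms present in both documents.
    let dot =
        weightsA
        |> Map.toSeq
        |> Seq.sumBy (fun (term, w) ->
            match Map.tryFind term weightsB with
            | Some other -> w * other
            | None -> 0.0)
    // Euclidean length of a document's weight vector.
    let norm weights =
        weights |> Map.toSeq |> Seq.sumBy (fun (_, w) -> w * w) |> sqrt
    let denominator = norm weightsA * norm weightsB
    // Treat an empty or all-zero document as similar to nothing.
    if denominator = 0.0 then 0.0 else dot / denominator
```

Because it takes two documents and returns a single number, a function like this could stand in for similarity, provided your TFIDF values have the same (term, weight) shape assumed here.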

Hopefully this series of posts has given you some insight and maybe even inspiration to learn more about machine learning. Happy programming!

Get all the F# code for this series here.