Simple text classification with a little F# (Part 3)
My previous post described the TF-IDF technique. Now we’re going to put it all together to build a working text classifier.
Similarities
Our simple classifier is based on a similarity (or distance) function. All we need to do is compare an unknown document with each known document. When we know what known document is most similar to our unknown document, then we can make a pretty good guess about what class the unknown document belongs to.
Let’s define a similarity function:
let similarity unknown known =
let getTerms = Seq.map fst >> Set.ofSeq
let getWeights = Seq.map snd
let unknownTerms = getTerms unknown
let unknownWeights = getWeights unknown
let knownTerms = getTerms known
let knownWeights = getWeights known
let commonTerms = Set.intersect unknownTerms knownTerms
let isCommonTerm term = commonTerms |> Set.exists (fun w -> w = fst term)
let commonWeights tfidfs = tfidfs |> Seq.filter isCommonTerm |> getWeights
let commonUnknownWeights = commonWeights unknown
let commonKnownWeights = commonWeights known
(Seq.sum commonKnownWeights + Seq.sum commonUnknownWeights)
/ (Seq.sum unknownWeights + Seq.sum knownWeights)
Now a play by play analysis:
- We have two arguments: an
unknown
document, or a sequence ofTFIDF
values, and aknown
document, also a sequence ofTFIDF
values. getTerms
is a function to “decompose” a sequence ofTFIDF
values. ATFIDF
is a tuple with the first (fst
) item being aTerm
. This output will be a set ofTerm
values.getWeights
is about the same, only it gets the second (snd
) value from the pair: the actual TF-IDF weighting.- Next we’ll get the terms and weights of the
unknown
document, and then theknown
document:unknownTerms
,unknownWeights
,knownTerms
,knownWeights
. commonTerms
is the intersection set of both known and unknownTerms
.isCommonTerm
is a function that tells us if a givenTerm
is in the set ofcommonTerms
.commonWeights
will take a sequence of TF-IDF values and only give back the ones that correspond to the set of common terms.- Next we’ll get the common weights for both the known and unknown document:
commonUnknownWeights
andcommonKnownWeights
. - Finally, we calculate the similarity between the weights of both documents. This will give us a number between 0 and 1, with 1 meaning the two documents are identical.
Finding a match
Now we can evaluate our similarity function between our unknown document and every unknown document:
let calcCategoryProbabilities (trainedData: TrainedData) query =
//calculate similarity score per category, using highest score from each
let calcSim ws = similarity query ws
let scoreSamples samples = samples |> Seq.map calcSim
let scores =
trainedData
|> Seq.map (fun kvp -> kvp.Key, (scoreSamples kvp.Value) |> Seq.max)
//sort scores descending
scores |> Seq.sortBy snd |> List.ofSeq |> List.rev
- Our two arguments are
trainedData
, representing the entirety of our training set, andquery
is our unknown document (a sequence ofTFIDF
). calcSim
is a partially applied function; a shorthand for calculating the similarity between each weighted sample (ws
) and ourquery
document.scoreSamples
will give us a sequence of similiarities between a sequence of weighted samples and ourquery
document.scores
is the result of feeding ourtrainedData
into amap
function that will give us the highest similarity score per document category/class.- Finally, we sort the
scores
in descending order, giving us a set of each document category and how similar ourquery
document is to each one. The first result would be the document category thatquery
is most similar to.
All done
We’ve built a very rudimentary document classification program. It was written for tutorial purposes, so it’s not highly optimized or sophisticated. You could replace the similarity/distance function with something fancier. You could extract different features besides TF-IDF. The best thing you could do is pick one of many existing libraries for machine learning and learn how to use it instead.
Hopefully this series of posts has given you some insight and maybe even inspiration to learn more about machine learning. Happy programming!
Get all the F# code for this series here.