My previous post described the TF-IDF technique. Now we’re going to put it all together to build a working text classifier.
Our simple classifier is based on a similarity (or distance) function. All we need to do is compare an unknown document with each known document. Once we know which known document is most similar to our unknown document, we can make a pretty good guess about what class the unknown document belongs to.
Let’s define a similarity function:

    let similarity unknown known =
        let getTerms = Seq.map fst >> Set.ofSeq
        let getWeights = Seq.map snd
        let unknownTerms = getTerms unknown
        let unknownWeights = getWeights unknown
        let knownTerms = getTerms known
        let knownWeights = getWeights known
        let commonTerms = Set.intersect unknownTerms knownTerms
        let isCommonTerm term = commonTerms |> Set.exists (fun w -> w = fst term)
        let commonWeights tfidfs = tfidfs |> Seq.filter isCommonTerm |> getWeights
        let commonUnknownWeights = commonWeights unknown
        let commonKnownWeights = commonWeights known
        (Seq.sum commonKnownWeights + Seq.sum commonUnknownWeights) / (Seq.sum unknownWeights + Seq.sum knownWeights)

Now a play-by-play analysis:
- We have two arguments: an `unknown` document, or a sequence of `TFIDF` values, and a `known` document, also a sequence of `TFIDF` values.
- `getTerms` is a function to “decompose” a sequence of `TFIDF` values into its terms. A `TFIDF` is a tuple with the first (`fst`) item being a `Term`. The output will be a set of `Term` values.
- `getWeights` is about the same, only it gets the second (`snd`) value from the pair: the actual TF-IDF weighting.
- Next we’ll get the terms and weights of the `unknown` document, and then the `known` document.
- `commonTerms` is the intersection set of both known and unknown terms.
- `isCommonTerm` is a function that tells us if a given `Term` is in the set of `commonTerms`. It takes the whole `TFIDF` pair and checks its first item.
- `commonWeights` will take a sequence of TF-IDF values and only give back the weights that correspond to the set of common terms.
- Next we’ll get the common weights for both the known and unknown document.
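- Finally, we calculate the similarity: the sum of the common weights from both documents divided by the sum of all weights from both documents. Assuming the weights are non-negative, this gives us a number between 0 and 1, with 1 meaning that every term in each document also appears in the other.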
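To make the arithmetic concrete, here is a small worked sketch of the steps above. It is a condensed restatement rather than the exact function from this post, and it assumes that a TF-IDF value is a `(string * float)` pair; the `TFIDF` alias, the sample documents, and their weights are made up for illustration.

```fsharp
// Assumption: a TF-IDF value is a (term, weight) pair with a float weight.
type TFIDF = string * float

// Condensed restatement of the similarity idea:
// (weight of shared terms in both documents) / (total weight of both documents).
let similarity (unknown: TFIDF list) (known: TFIDF list) : float =
    let terms (doc: TFIDF list) = doc |> List.map fst |> Set.ofList
    let commonTerms = Set.intersect (terms unknown) (terms known)
    // Sum only the weights whose term appears in both documents.
    let commonSum (doc: TFIDF list) =
        doc
        |> List.filter (fun (term, _) -> Set.contains term commonTerms)
        |> List.sumBy snd
    let totalSum (doc: TFIDF list) = doc |> List.sumBy snd
    (commonSum unknown + commonSum known) / (totalSum unknown + totalSum known)

// Hypothetical weights, chosen only to keep the arithmetic easy to follow.
let unknownDoc : TFIDF list = [ "cat", 0.5; "dog", 0.25; "fish", 0.25 ]
let knownDoc : TFIDF list = [ "cat", 0.5; "bird", 0.5 ]

// Common terms: {"cat"}. Common weights: 0.5 + 0.5 = 1.0. All weights: 1.0 + 1.0 = 2.0.
let score = similarity unknownDoc knownDoc
printfn "%.2f" score // prints 0.50
```

Comparing a document with itself gives 1.0, since every weight is then a common weight.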
Finding a match
Now we can evaluate our similarity function between our unknown document and every known document:

    let calcCategoryProbabilities (trainedData: TrainedData) query =
        //calculate similarity score per category, using highest score from each
        let calcSim ws = similarity query ws
        let scoreSamples samples = samples |> Seq.map calcSim
        let scores = trainedData |> Seq.map (fun kvp -> kvp.Key, (scoreSamples kvp.Value) |> Seq.max)
        //sort scores descending
        scores |> Seq.sortBy snd |> List.ofSeq |> List.rev

- Our two arguments are `trainedData`, representing the entirety of our training set, and `query`, our unknown document (a sequence of `TFIDF` values).
- `calcSim` is a partially applied function; a shorthand for calculating the similarity between each weighted sample (`ws`) and our `query`.
- `scoreSamples` will give us a sequence of similarities between a sequence of weighted samples and our `query`.
- `scores` is the result of feeding our `trainedData` to a `map` function that will give us the highest similarity score per document category/class.
- Finally, we sort the `scores` in descending order, giving us a list of each document category and how similar our `query` document is to each one. The first result would be the document category that `query` is most similar to.
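Here is a small end-to-end sketch of that ranking step. The shape of `TrainedData` is not shown in this post, so the sketch assumes it is a map from category name to a list of sample documents. The `TFIDF` alias, the category names, and the weights are hypothetical, and the functions are condensed restatements using lists rather than the exact code above.

```fsharp
// Assumptions: float-weighted (term, weight) pairs, and TrainedData as a map
// from category name to that category's sample documents.
type TFIDF = string * float
type TrainedData = Map<string, TFIDF list list>

// Condensed similarity: shared-term weight divided by total weight.
let similarity (unknown: TFIDF list) (known: TFIDF list) : float =
    let terms (doc: TFIDF list) = doc |> List.map fst |> Set.ofList
    let common = Set.intersect (terms unknown) (terms known)
    let commonSum (doc: TFIDF list) =
        doc |> List.filter (fun (t, _) -> Set.contains t common) |> List.sumBy snd
    let totalSum (doc: TFIDF list) = doc |> List.sumBy snd
    (commonSum unknown + commonSum known) / (totalSum unknown + totalSum known)

// Keep the best score per category, then sort the categories best-first.
let calcCategoryProbabilities (trainedData: TrainedData) (query: TFIDF list) =
    trainedData
    |> Map.toList
    |> List.map (fun (category, samples) ->
        category, (samples |> List.map (similarity query) |> List.max))
    |> List.sortByDescending snd

// Hypothetical training set and query.
let trained : TrainedData =
    Map.ofList [
        "pets", [ [ "cat", 0.5; "dog", 0.5 ]; [ "fish", 1.0 ] ]
        "finance", [ [ "bank", 0.5; "loan", 0.5 ] ]
    ]
let query : TFIDF list = [ "cat", 0.5; "dog", 0.25; "loan", 0.25 ]

let ranked = calcCategoryProbabilities trained query
ranked |> List.iter (fun (category, score) -> printfn "%s: %.3f" category score)
// prints:
// pets: 0.875
// finance: 0.375
```

Note that `List.max` raises an exception on an empty list, so this sketch assumes every category has at least one sample.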
We’ve built a very rudimentary document classification program. It was written for tutorial purposes, so it’s not highly optimized or sophisticated. You could replace the similarity/distance function with something fancier. You could extract different features besides TF-IDF. The best thing you could do is pick one of many existing libraries for machine learning and learn how to use it instead.
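As one example of a fancier similarity function, here is a minimal sketch of cosine similarity over TF-IDF weights. This is a different, commonly used measure, not the function used in this series. The sketch assumes `(string * float)` pairs with each term appearing at most once per document; the `TFIDF` alias and the sample data are made up for illustration.

```fsharp
// Assumption: a TF-IDF value is a (term, weight) pair, one entry per term.
type TFIDF = string * float

// Cosine similarity: dot product of the two weight vectors
// divided by the product of their Euclidean lengths.
let cosineSimilarity (a: TFIDF list) (b: TFIDF list) : float =
    let weightsB = Map.ofList b
    // Terms missing from either document contribute 0 to the dot product.
    let dot =
        a
        |> List.sumBy (fun (term, w) ->
            match Map.tryFind term weightsB with
            | Some w2 -> w * w2
            | None -> 0.0)
    let norm (doc: TFIDF list) = doc |> List.sumBy (fun (_, w) -> w * w) |> sqrt
    let denom = norm a * norm b
    // Guard against empty or all-zero documents.
    if denom = 0.0 then 0.0 else dot / denom

// Hypothetical documents.
let docA : TFIDF list = [ "cat", 3.0; "dog", 4.0 ]
let docB : TFIDF list = [ "cat", 1.0 ]
let docC : TFIDF list = [ "bird", 1.0 ]

printfn "%.1f" (cosineSimilarity docA docA) // prints 1.0
printfn "%.1f" (cosineSimilarity docA docB) // prints 0.6
printfn "%.1f" (cosineSimilarity docA docC) // prints 0.0
```

Unlike the ratio used earlier, this measure depends on how the weights of shared terms line up, not just on which terms are shared.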
Hopefully this series of posts has given you some insight and maybe even inspiration to learn more about machine learning. Happy programming!
Get all the F# code for this series here.