A Markov chain models a series of random transitions from one state to another, where the next state depends only on the current state. This idea has plenty of applications; one of them happens to be generating weird text, and we’ll do just that in this post.

The accompanying F# library project is here.

## Judging a book by every word between its covers

Imagine we have a large book, made of sentences, made of words. If we consider this book to be a sequence of words and word transitions, we can imagine building a Markov model from it.

Our first requirement is to be able to break a big book string into individual words:

``````
open System
open System.IO
open System.Text.RegularExpressions

let splitOnSpace text = Regex.Split(text, @"\s+")

let getWordPairs pairSize filePath =
    File.ReadAllText filePath
    |> splitOnSpace
    |> Seq.windowed pairSize
``````

Some of these techniques are also applied in my previous text classification posts. You may recognize that we’re forming n-grams from the text using a sliding window (`Seq.windowed`), where n is `pairSize`. If our book is really big then we may consider a streamed/lazily evaluated approach, but this is a blog post.
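To see what `Seq.windowed` does on its own, here’s a minimal snippet you can try in FSI:

``````
// slide a window of size 2 over a small list of words
["the"; "quick"; "brown"; "fox"]
|> Seq.windowed 2
|> List.ofSeq
// [[|"the"; "quick"|]; [|"quick"; "brown"|]; [|"brown"; "fox"|]]
``````

Each element of the result is an array of `n` consecutive items, advancing one item at a time.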

At this point you need a book. May I suggest one of the timeless classics available from Project Gutenberg? I’ll use Franz Kafka’s rather neurotic Metamorphosis, in which Gregor Samsa wakes one morning to find himself transformed into a monstrous insect.

Let us build a windowed sequence of roach poetry, with trigrams:

``````
let wordPairs = getWordPairs 3 "/Users/Taylor/Documents/Metamorphosis.txt"
``````

That’ll give you a sliding window sequence of arrays like so:

``````
[[|"One"; "morning,"; "when"|]; [|"morning,"; "when"; "Gregor"|];
 [|"when"; "Gregor"; "Samsa"|]; [|"Gregor"; "Samsa"; "woke"|]; ...]
``````

That windowed sequence represents the first words of the book: “One morning, when Gregor Samsa woke”.

We need a function to help us decompose those word arrays, which will give us a tuple where the first value is every word but the last, and the second value is the last word:

``````
// join a sequence of words into a single space-separated string
let joinWords words = String.concat " " words

let bisectWords (arr:_[]) =
    let len = Array.length arr
    let preds = arr |> Seq.take (len - 1)
    (preds |> joinWords, arr.[len - 1])

// example output
> wordPairs |> Seq.nth 0 |> bisectWords;;
val it : string * string = ("One morning,", "when")
``````

## Climbing Mt. Markov

For each distinct word, we can find the words directly after it throughout the book. We can construct a `Map<string, string list>`, i.e. each distinct word and a list of the words that can follow it.

This is probably the hardest part to grasp; maybe best to read it from bottom to top. We’re going to `fold` over the windowed sequence of words and construct our Markov `Map` as we go. That `Map` will be threaded through the `fold` operation as the accumulated value, and modified with each pass. In `updateMap`, if a particular word (or key) already exists in the `Map` then we replace the binding, otherwise we add a new binding.

``````
let updateMap (map:Map<_,_>) key value =
    if map.ContainsKey key then
        let existingValue = map.[key]
        // Map.add replaces the existing binding
        map |> Map.add key (value :: existingValue)
    else
        map |> Map.add key [value]

let mapBuilder map words = bisectWords words ||> updateMap map
let buildMap = Seq.fold mapBuilder Map.empty
``````

In an imperative language this might be written as an iterative loop that modifies a mutable dictionary, and we have that option in F#, but this approach uses immutable data structures.
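As a rough sketch of that imperative alternative (for comparison only; `buildMapMutable` is a hypothetical name, and the post sticks with the immutable `fold`), the same map could be built by looping and mutating a `Dictionary`:

``````
open System.Collections.Generic

// imperative equivalent: mutate a Dictionary in a loop instead of folding
let buildMapMutable wordPairs =
    let dict = Dictionary<string, string list>()
    for words in wordPairs do
        let key, value = bisectWords words
        match dict.TryGetValue key with
        | true, existing -> dict.[key] <- value :: existing
        | false, _ -> dict.[key] <- [value]
    dict
``````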

Let’s actually construct the map, which shouldn’t take more than a second:

``````
let map = buildMap wordPairs
``````

Now we’ll partition our map into two pieces:

1. Bindings that can start a sentence
2. Bindings that can follow the start of a sentence

We’ll need a function to tell us whether a word begins a sentence or not, and we’ll use it to partition the map:

``````
let isStartSentence = Regex(@"^[A-Z]").IsMatch // match any title-cased word
let startWords, otherWords = map |> Map.partition (fun k _ -> isStartSentence k)
``````

Now that we have a `Map` of only “start” words, we can randomly pick one to start a sentence:

``````
let startWordArray = startWords |> Seq.map (fun kvp -> kvp.Key) |> Array.ofSeq

let rng = Random()
let getRandomItem seq =
    let randIndex = rng.Next (Seq.length seq)
    seq |> Seq.nth randIndex
``````

## Chaining it up

We’re dangerously close to actually generating gibberish sentences. All we need is a function to tell us when we’ve reached the end of a sentence:

``````
// just look for ending punctuation, but don't match 'Mr.' or 'Mrs.'
let (|IsEndOfSentence|) = Regex(@"(?<!Mr(s)?)[\.\?\!]$").IsMatch
``````

And then we can write a recursive function to walk our Markov map, accumulating randomly chosen words for each state transition, and then joining them together to form a sentence:

``````
// we'll need this function to work with larger word pair sizes
let combineWords prev next =
    [prev; next]
    |> List.filter (fun s -> not (String.IsNullOrWhiteSpace s))
    |> joinWords

// recursive function to accumulate a Markov sentence until the end-condition is reached
let rec markovChain state acc =
    let nextChoice = map.[state] |> getRandomItem
    match nextChoice with
    | IsEndOfSentence true ->
        nextChoice :: acc
    | IsEndOfSentence false ->
        let currWords = state |> splitOnSpace |> Seq.skip 1 |> joinWords
        markovChain (combineWords currWords nextChoice) (nextChoice :: acc)

let getMarkovSentence startWord =
    markovChain startWord [startWord]
    |> List.rev
    |> joinWords
``````

This has some similarities to our `fold`-based construction of the Markov map, in that it works with a current state and an accumulator. `markovChain` keeps randomly choosing the next element and accumulating until it reaches its stop condition.

## Hot Kafka

How many sentences of this stuff do we want? The sky’s the limit. We can generate more Kafkaesque material in milliseconds than Franz himself could conjure up in a million years!

``````
let sentences = [for i in 1 .. 3 do yield getMarkovSentence (getRandomItem startWords).Key]
``````

Notice that if you increase the size of the word pairings, e.g. `getWordPairs 4`, the generated material starts to resemble the original text more closely; this is because each state of your Markov map is more specific. Using a word pair size of 4 or 5 will often regenerate sentences from the source text verbatim. Using a value of 2 will generate much more random-looking text, because each state consists of only one word. This will also depend on the amount and variety of source text used to build your Markov model.