Text generation made functional

Lorem Ipsum with F#

Introduction

On one project that involved generating test data, I thought that, instead of using random strings, it would be a good idea to write a tiny library with a couple of useful helper functions to generate what is known in the printing and typesetting industry as Lorem Ipsum. Because I was mainly interested in generating the text and not in its layout, I would like to share with you a functional way of implementing this task.

Approach

There are various ways to approach dummy text generation. I will show you a naive implementation that is just enough to get the job done. Lorem Ipsum is not simply random text; it has roots in a piece of classical Latin literature from 45 BC. We take the most cited passage and use it as the foundation for our dummy text generation. Here is the quote:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Implementation

You can find the source code for this post on my GitHub account.

Let's jump directly into the code. First, define the Lorem Ipsum text as a constant:

[<Literal>]
let loremipsum = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

Next, let's define a couple of types that we use for setting up generation configuration:

type CapitalizeMode =
    | OnlyFirst
    | All
    | Lowercase

type WordCount = 
    | Arbitrary of min:int * max:int
    | Exact of int

type LoremIpsumOption = {
    Capitalize: CapitalizeMode
    Count: WordCount
}

This is pretty self-explanatory: it defines two discriminated unions, one for the capitalization mode of the words and one for the number of words we would like to generate, which can be either exact or arbitrary within a range. The record combines both DUs and serves as the input for the word generation function getWords, which is defined as:

let getWords option =
    let rec getWordsUtil acc n = 
        match n with
        | 0 -> acc
        | _ -> getWordsUtil (getRandomLoremIpsum()::acc) (n - 1)

    let firstToUpper str =
        match str with
        | [] -> []
        | x::xs -> (upper x) :: xs

    match (option.Capitalize, option.Count) with
    | (OnlyFirst, Arbitrary (min, max)) -> 
        getWordsUtil [] (getRandom (min, max)) |> firstToUpper
    | (All, Arbitrary (min, max)) -> 
        getWordsUtil [] (getRandom (min, max)) |> List.map upper
    | (Lowercase, Arbitrary (min, max)) -> 
        getWordsUtil [] (getRandom (min, max))
    | (OnlyFirst, Exact count) -> 
        getWordsUtil [] count |> firstToUpper        
    | (All, Exact count) -> 
        getWordsUtil [] count |> List.map upper
    | (Lowercase, Exact count) -> getWordsUtil [] count

Let's break this function down. First comes the nested function getWordsUtil, which is recursive and therefore defined with the rec keyword. Besides n, it takes an accumulator parameter acc, which is a sign of tail recursion. In a tail-recursive function the recursive call is the last operation performed, so the compiler can turn it into a loop, which avoids the stack overflow that the classic non-tail-recursive counterpart risks on large inputs. The initial accumulator in our case is just an empty list, so an invocation could look like this: getWordsUtil [] 10 if we need to generate 10 words.

Internally, it builds up a list of words while reducing n to zero, which you can see in this line: getWordsUtil (getRandomLoremIpsum()::acc) (n - 1). Let's assume for now that getRandomLoremIpsum() returns some random word. In getRandomLoremIpsum()::acc, :: is the syntax for prepending a new value to the list, which is our accumulator. When n hits 0, the function returns the accumulator.

The second helper function is firstToUpper, which capitalizes the first word in the list. In the last expression of the getWords function, we destructure the option argument with pattern matching and build the arguments for getWordsUtil. Depending on the required number of words, we either use getRandom (min, max) to pick a count in the provided range or use the exact value from count.
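To see the accumulator pattern in isolation, here is a minimal, self-contained sketch. The names buildList and nextItem are made up for illustration and are not part of the library; nextItem simply stands in for getRandomLoremIpsum.

```fsharp
// Hypothetical helper, not part of the library: builds a list of n items
// by calling nextItem n times, using the same accumulator pattern as getWordsUtil.
let buildList (nextItem: unit -> 'a) (n: int) : 'a list =
    let rec loop acc remaining =
        if remaining <= 0 then acc                       // done: return what was collected
        else loop (nextItem () :: acc) (remaining - 1)   // tail call: nothing left to do afterwards
    loop [] n

// A large count works because the recursive call is in tail position,
// so no stack frames pile up.
let big = buildList (fun () -> "lorem") 1000000
printfn "%d" big.Length   // prints 1000000
```

One design difference from getWordsUtil: the sketch checks remaining <= 0 instead of matching on 0 alone, so a negative count returns an empty list right away.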

We have used getRandomLoremIpsum to generate a random word. Here is the definition of that function:

let private getRandomLoremIpsum =
    let rnd = Random()
    fun () -> loremIpsumWords.[rnd.Next(loremIpsumWords.Length)] |> lower

Note that getRandomLoremIpsum is a value that holds a function: the Random instance is created only once, and the returned function, when invoked, picks one of the Lorem Ipsum words from the dictionary and makes it lowercase.
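The way the closure keeps a single Random instance alive can be checked with a small sketch. Everything here is an assumption made for demonstration: sampleWords is a stand-in for loremIpsumWords, pickWord mirrors the shape of getRandomLoremIpsum, and created exists only to count how often the setup code runs.

```fsharp
open System

// Stand-in for loremIpsumWords; the real words come from splitting the quote.
let sampleWords = [| "lorem"; "ipsum"; "dolor"; "sit"; "amet" |]

// Counts how many Random instances get created (demonstration only).
let created = ref 0

// Same shape as getRandomLoremIpsum: the code before `fun` runs once,
// when pickWord is initialized, and the closure reuses that one Random.
let pickWord =
    created.Value <- created.Value + 1
    let rnd = Random()
    fun () -> sampleWords.[rnd.Next(sampleWords.Length)]

let picked = List.init 5 (fun _ -> pickWord ())
printfn "%A (Random instances created: %d)" picked created.Value
```

Had pickWord been written as let pickWord () = ..., the body would run on every call and a new Random would be created each time.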

So now we have a working word generator. It would be nice if we could combine words in sentences and paragraphs.

That said, we can build additional functionality on top of getWords by composing the pieces together. This is what the getSentence function could look like:

let getSentence = 
    fun () -> getWords { Capitalize = OnlyFirst; Count = Arbitrary (5, 20) } |> String.concat " "

Followed by getSentences:

let getSentences n = 
    if n <= 0 then failwith $"{nameof(n)} should be a positive number, more than zero."
    seq { for _ in 1..n -> getSentence () }
    |> String.concat ". "

Now we could build paragraphs from what we have already:

let getParagraphs n = 
    if n <= 0 then failwith $"{nameof(n)} should be a positive number, more than zero."
    let rnd = Random()
    seq { for _ in 1..n -> getSentences (rnd.Next (3,6)) }
    |> String.concat ("." + Environment.NewLine)

Composition in action! In the code, I chose the simplest option: throwing an exception when the argument is out of bounds. This is not the idiomatic functional way of handling errors, because the signature of the function is:

int -> string

It expects an int and returns a string. Nothing in the signature says that the function could fail for any reason, which does not fit well with expression-style function invocation. A better way would be to return an option or result type, where Some or Ok denotes success and None or Error denotes failure. An even better option would be to disallow invalid values at the type level by introducing something like PositiveInteger. The expressiveness and power of F# make this easy to do, but that's a slightly more advanced topic.
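For the curious, here is one way both ideas could look. This is only a sketch under assumptions, not code from the library: the PositiveInteger module, getSentencesSafe and fakeSentence (a stand-in for getSentence) are hypothetical names.

```fsharp
// Hypothetical type: an int that is guaranteed to be greater than zero.
// The private constructor forces callers to go through PositiveInteger.create.
type PositiveInteger = private PositiveInteger of int

module PositiveInteger =
    // Returns Ok for n > 0 and an Error message otherwise.
    let create (n: int) : Result<PositiveInteger, string> =
        if n > 0 then Ok (PositiveInteger n)
        else Error $"Expected a positive number, got {n}."

    // Unwraps the underlying int.
    let value (PositiveInteger n) = n

// Stand-in for getSentence so the sketch is self-contained.
let fakeSentence () = "Lorem ipsum dolor sit amet"

// No validation needed here: an invalid count cannot be constructed.
let getSentencesSafe (count: PositiveInteger) : string =
    List.init (PositiveInteger.value count) (fun _ -> fakeSentence ())
    |> String.concat ". "

// The possibility of failure is now visible in the type: Result<string, string>.
let result = PositiveInteger.create 2 |> Result.map getSentencesSafe
printfn "%A" result
```

The validation happens once, at the boundary, and every function that accepts a PositiveInteger can rely on it without throwing.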

Finally, we are done! Let's check it works:

getWords { Capitalize = Lowercase; Count = Exact 10 }

["duis"; "ex"; "proident"; "non"; "lorem"; "dolore"; "consequat"; "velit"; "deserunt"; "aute"]

getWords { Capitalize = OnlyFirst; Count = Arbitrary (5, 20) }

["Occaecat"; "veniam"; "nostrud"; "dolore"; "cupidatat"; "magna"; "in"; "consequat"; "consectetur"; "nisi"]

getWords { Capitalize = All; Count = Arbitrary (5, 20) }

["Enim"; "Ex"; "Enim"; "Incididunt"; "Elit"; "Aliquip"; "Pariatur"; "Ea"; "Do"; "Irure"; "Sint"; "Labore"; "Elit"; "Consectetur"; "Reprehenderit"]

getSentence ()

"Ut in laboris mollit occaecat nulla cillum in adipiscing deserunt consectetur"

getSentences 5

"Ut laborum consectetur proident sunt ut tempor anim deserunt laboris tempor sit in sunt excepteur et. Eiusmod veniam fugiat ea proident excepteur magna sed anim. Ut ex nostrud laborum ullamco ex. Pariatur nulla eu eu tempor aliquip eu do deserunt nostrud amet ut dolore cupidatat. Dolor eiusmod consectetur eu consequat sunt"

getParagraphs 2

"Culpa labore deserunt velit laboris anim ut aliquip aliqua. Adipiscing elit eiusmod enim adipiscing mollit ea quis reprehenderit et in officia nulla nulla exercitation ullamco eu. Nisi laboris ea est exercitation dolore esse et minim magna sunt commodo do deserunt ut dolore et. Consectetur id ea ut do velit dolore exercitation mollit. Commodo sed proident ex sunt mollit consectetur enim voluptate cillum in. Nostrud ut consectetur consectetur cupidatat laboris ex velit in ea aliqua et ullamco aliquip excepteur duis occaecat id aliqua. Ex consectetur ut aute deserunt pariatur incididunt velit consectetur dolor consequat duis mollit anim ullamco ad eiusmod sit sint. Ea et ut ad in sed sunt laboris fugiat qui deserunt deserunt dolore ut elit"

Testing

Confidence in working code is good, but we need proof. I will show only a few examples here; you will find the rest in the GitHub repository.

[<Fact>]
let ``getWords with capitalize set to Lowercase should return all words in lowercase`` () =
    let words = getWords { Capitalize = Lowercase; Count = Exact 5 }
    Assert.True(words |> List.forall(fun w -> Char.IsLower(w.[0])))

[<Fact>]
let ``getWords with capitalize set to OnlyFirst should return only first word capitalized`` () =
    let words = getWords { Capitalize = OnlyFirst; Count = Exact 5 }
    // Only the word at index 0 may start with an upper-case letter
    words |> List.iteri(
        fun i w -> 
            match i with 
            | 0 -> Assert.True(Char.IsUpper(w.[0]))
            | _ -> Assert.True(Char.IsLower(w.[0])))

[<Theory>]
[<InlineData(2,8)>]
[<InlineData(3,5)>]
[<InlineData(5,10)>]
[<InlineData(8,8)>]
let ``getWords with Arbitrary amount should return amount of words in the range`` (min, max) =
    let words = getWords { Capitalize = Lowercase; Count = Arbitrary (min,max) }
    Assert.InRange(words.Length, min, max)

More fun fun

In the code, you may have noticed various utility functions that I didn't include here to keep the post clean and short. But I would like to highlight one interesting function, memoize:

let private memoize (f: 'a -> 'b) =
    // Storage for the calculated result
    let cache = new Dictionary<_, _>()
    // Check if it's in the cache or not
    (fun x ->
        match cache.TryGetValue(x) with
        | true, cachedValue -> cachedValue
        | _ -> 
            let result = f x
            cache.Add(x, result)
            result)

From Wikipedia:

Memoization is an optimization technique used to speed up programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again.

We can use it to avoid splitting the loremipsum string into separate words again on each getRandomLoremIpsum invocation:

let loremIpsumWords = 
    loremipsum
    |> memoize (fun x -> split x)

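The memoize function is generic and works with any input type that can serve as a dictionary key. In our case the input is the whole Lorem Ipsum string and the output is a list of words. Note that loremIpsumWords is itself a value, so it is evaluated only once either way; memoization pays off most when the memoized function is bound to a name and called repeatedly with the same arguments.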
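To watch the cache at work, here is a small sketch that counts how often the underlying function really runs. The memoize definition repeats the one above (minus private, so it runs as a script); splitWords is a hypothetical stand-in for the split helper, and calls and splitMemoized are names invented for the demonstration.

```fsharp
open System
open System.Collections.Generic

// Same memoize as in the post, repeated so the sketch is self-contained.
let memoize (f: 'a -> 'b) =
    let cache = new Dictionary<_, _>()
    (fun x ->
        match cache.TryGetValue(x) with
        | true, cachedValue -> cachedValue
        | _ ->
            let result = f x
            cache.Add(x, result)
            result)

// Counts how often the underlying function runs (demonstration only).
let calls = ref 0

// Hypothetical splitter standing in for the post's split helper.
let splitWords (text: string) =
    calls.Value <- calls.Value + 1
    text.Split([| ' '; ','; '.' |], StringSplitOptions.RemoveEmptyEntries)
    |> List.ofArray

// Bind the memoized function to a name, then call it repeatedly.
let splitMemoized = memoize splitWords

let first = splitMemoized "Lorem ipsum dolor sit amet, consectetur"
let second = splitMemoized "Lorem ipsum dolor sit amet, consectetur"
printfn "%A, underlying calls: %d" first calls.Value
```

The second call with the same string is served from the dictionary, so splitWords runs only once.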

Summary

We have applied functional programming and F# to the very practical task of generating text data. Of course, there are plenty of existing solutions, including FsCheck and especially its generators, but it is always nice to try it yourself and learn from it.