# The joys and perils of beating C with Haskell: productionizing wc

Last time
we’ve looked at implementing a toy `wc`

-like program
and we’ve also compared its performance against the full-blown Unix `wc`

.
The results were quite interesting:
our implementation managed to beat `wc`

by a factor of 5.
Of course, that’s quite an unfair comparison:
our implementation is hardcoded to count just the bytes, lines and words.
`wc`

, on the other hand, has command-line options to select specific statistics,
it supports some additional ones like maximum line length,
it treats Unicode spaces properly (in an Unicode-aware locale, of course),
and so on.
In other words, it’s better to consider what we’ve done last time
as a proof-of-concept showing that it’s possible to achieve (and overcome)
C-like performance on this task, even if with all those concessions.

Today we’ll look at ways of productionizing the toy program from the previous post. Our primary goal here is allowing the user to select various statistics, computing just what the user has selected to compute. We’ll try to do this in a modular and composable way, striving to isolate each statistic into its own unit of some sorts.

Indeed, if we look at the C version — well, personally I wouldn’t call that as a prime example of readable and maintainable code, as different statistics are computed in a single big 370-lines-long function. This is something we’ll try to avoid here.

Moreover, we’ll try to express that certain statistics like byte count or lines count
can be computed more efficiently if we don’t have to look at each byte,
while other statistics like word count or max line length just *need* to look at each byte one by one
(unless one does some clever and non-trivial broadword programming or SIMD-enabled things,
which is beyond the scope of this post).
For instance, byte count can be computed in `O(1)`

if we know we’re reading from a file —
we can just take the file size and call it a day!

In addition to that, we will, among other things:

- implement more statistics with ease, enjoying local reasoning;
- throw up some tests, enjoying local reasoning once more;
- try out some kinda-dependently-typed techniques, successfully obtaining working code but failing spectacularly on the performance side of things;
- play around with Template Haskell;
- marvel at the (un)predictability and (un)reproducibility of the resulting code performance.

So, here’s the code from the previous post that we begin with:

```
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}
module Data.WordCount where
import qualified Data.ByteString.Lazy as BS
data State = State
{ bs :: Int
, ws :: Int
, ls :: Int
, wasSpace :: Int
}
wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws + 1 - wasSpace, ls)
where
State { .. } = BS.foldl' go (State 0 0 0 1) s
go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
where
isSp | c == 32 || c - 9 <= 4 = 1
| otherwise = 0
addLine | c == 10 = 1
| otherwise = 0
addWord = (1 - wasSpace) * isSp
{-# INLINE wc #-}
```

We want to break it down into separate functions computing bytes, words and lines count respectively. How can we do this?

## Composable left folds

First of all, note that what we’re doing is a fold —
hey, even the name of the function `BS.foldl'`

tells us so!

Some time ago I’ve come across foldl —
a library for "composable, streaming, and efficient Left folds".
This sounds like exactly what we need!
Moreover, luckily for us, it even has a
module
for folding over `ByteString`

s.
In particular, it already has functions for 2 of the 3 statistics we’ve implemented:
bytes count is equal to the input string length which is `length`

in said module,
while `count`

allows counting the lines (via `count 10`

).
Looks like we just need to write a fold to count the words, and we’re golden!

So here is the said fold:

```
{-# LANGUAGE Strict #-}
import qualified Control.Foldl as L
import qualified Data.ByteString as BS
data WordState = WordState { ws :: Int, wasSpace :: Int }
wordsCount :: L.Fold BS.ByteString Int
wordsCount = L.Fold (BS.foldl' go) (WordState 0 1) (\WordState { .. } -> ws + 1 - wasSpace)
where
go WordState { .. } c = WordState (ws + addWord) isSp
where
isSp | c == 32 || c - 9 <= 4 = 1
| otherwise = 0
addWord = (1 - wasSpace) * isSp
```

Now we can express counting bytes, lines and words simultaneously as

```
import qualified Control.Foldl.ByteString as BL
main :: IO ()
main = do
[path] <- getArgs
contents <- unsafeMMapFile path
let res = BL.fold ((,,) <$> BL.length <*> BL.count 10 <*> wordsCount) (BSL.fromStrict contents) :: (Int, Int, Int)
print res
```

Composable indeed! But what is the run time?
If I benchmark it as described in the previous post (running on a 1.8 gigabytes file residing in `tmpfs`

),
I get about 2.5 seconds on my machine.
Note it’s a different machine from the one I did the benchmarks for the previous post —
it’s my development machine with a different CPU and more sources of noise (IDE, browser, etc),
but it’ll do to get a rough idea of the performance.

Ok, so 2.5 seconds. That’s almost twice as bad as it was before.

What if we count just the length and words?

```
let res = BL.fold ((,) <$> BL.length <*> wordsCount) contents :: (Int, Int)
print res
```

1.55 seconds. Hmm. What about just the lines count?

```
let res = BL.fold (BL.count 10) contents :: Int
print res
```

1.05 seconds.

Dang. It’s almost additive. But it *shouldn’t* be additive.
The lines count (boiling down to counting `'\n'`

) should be dwarved
by the way more complicated logic of words counting, but it isn’t.

Something strange’s going on here.
Let’s get our hands dirty and look at what the `foldl`

library does.

In general,
`foldl`

feeds a chunk of input to each of the folds composed `Applicative`

ly.
`foldl`

folds over lazy `ByteString`

s,
and a chunk here is the strict `ByteString`

that the lazy one consists of.
In this particular case, that’s about 256 KiB worth of data,
which definitely does not fit L1 cache, so we end up paying quite some CPU cycles to wait
for L2 → L1 cache and even L1 cache → register transfers
(since otherwise the data we’re working with is readily available).

Sure, we could have just decreased the size of the chunk to 16-32 KiB so that it fits L1,
but that’s not as *interesting*.

And, what’s worse, the compiler seemingly isn’t able to elide unnecessary computations:
the run time of `BL.fold ((,) <$> wordsCount <*> wordsCount) contents`

is *twice* as much as
the run time of `BL.fold wordsCount contents`

.

Anyway, that approach is not gonna fly.

Moreover, it’s not entirely obvious how to compose the folds “on demand”, say, based on the program command-line options.

So let’s write…

## Our very own folds

Before we actually get to coding, let’s think a little about the design. Remember that ultimately we need to compose the folds according to what the user has selected. How it might even look like?

Let’s say we have base, “atomic” folds (that is, statistics) `f1`

, `f2`

, `f3`

at our disposal,
and the user requests the composition of `f1`

and `f3`

.
I wouldn’t really expect the compiler to be smart enough to see that code like

```
-- options is a list of booleans per individual statistic
options <- parseCliOptions
-- theFold is the resulting composition
let theFold = foldl' f (zip options [f1, f2, f3]) emptyFold
where
f acc (True, stat) = acc `compose` stat
f acc (False, _) = acc
```

implies that the compiler needs to specialize all the compositions of folds that might arise.

The only way to be sure the compiler has all the relevant information is to lift this to the type level — Haskell ultimately gets rid of the types during compile-time, so any information conveyed via them must be usable by the compiler.

### Types to the rescue

Given that, how do we represent a fold? One type-friendly way of doing this is via a type class.

And, before we get to that, let’s just enumerate the type system extensions we’ll need to get them out of the way:

```
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE MultiParamTypeClasses, UndecidableInstances, FlexibleInstances #-}
{-# LANGUAGE TypeFamilyDependencies, FunctionalDependencies, PolyKinds, DataKinds, GADTs, TypeOperators #-}
{-# LANGUAGE ScopedTypeVariables #-}
```

So what constitutes a fold in our case, which is some statistic over the input? A fold-computed statistic is defined by:

- the initial state,
- the step function updating the state according to an input character,
- the function to extract the final result from the state,
- the state and the result types.

Let’s write this down:

```
class Statistic s res st | res -> s, st -> s
, s -> res, s -> st
where
initState :: st
extractState :: st -> res
step :: st -> Word8 -> st
```

here `s`

denotes the statistics identifier
(it will be a `DataKinds`

-thing later on),
and we use functional dependencies to denote that
the statistic uniquely determines the result type and the state type,
and vice versa.
In particular, this allows us to avoid
spelling out all of the class type variables in the methods signatures,
and it suffices to mention just the ones that we actually need.
Otherwise we’d need something like

```
initState :: proxy1 s -> proxy2 res -> st
extractState :: proxy s -> st -> res
step :: proxy1 s -> proxy2 res -> st -> Word8 -> st
```

Ugh. Ugly. Good we don’t need to spell out as much.

### Chunked statistics

Anyway, this one looks like a good start, but something’s missing.
Let’s think how we might express the bytes count statistic, for instance.
If the user has requested just this statistic,
then there’s no point in folding over the input and incrementing a counter on each step.
Instead, we can just take the length of the whole input chunk (as in `BS.length`

)
and call it a day.
That’s O(1) on the input length.

On the other hand, the word counting or, even worse, max line length counting statistic does not have an easily expressible similarly “chunked” computation mode. They necessarily need to look at each byte of the input (modulo some SIMD implementations that we don’t consider here — but they still are O(n) on the input).

Moreover, we’ll later want to combine different statistics so that if both stats support chunked mode of computation, then we still do chunked, but otherwise we degrade to byte-level.

So how do we express that some statistics support both “chunked” computations as well as byte-level ones,
while others only support the latter? GADTs to the rescue!
We introduce an enumeration to denote the sort, or *computation mode* of the statistic,
and we also introduce a GADT wrapping the functions that make sense for that sort.
Or, in code:

```
data StatCompTyOf = Chunked | ByteOnly
data StatComputation st compTy where
ChunkedComputation :: (st -> Word8 -> st)
-> (st -> BS.ByteString -> st)
-> StatComputation st 'Chunked
ByteOnlyComputation :: (st -> Word8 -> st)
-> StatComputation st 'ByteOnly
```

Here (and in the subsequent text) `BS`

is the strict `ByteString`

module.

We also modify our `Statistic`

class to account for that
(note the new `comp`

type parameter):

```
class Statistic s res st comp | res -> s, st -> s
, s -> res, s -> st, s -> comp
where
initState :: st
extractState :: st -> res
computation :: StatComputation st comp
```

Here functional dependencies help again.
And, in general, it’s sufficient to know any of `s`

, `res`

or `st`

to infer the rest of the variables.

One other thing worth noting is that we could generalize this approach and have, for instance, three types of computations: our current byte-only and chunked ones plus SIMD-accelerated that are looking at 16-32 bytes at the same time. I’m keeping just two types for simplicity, though.

### Stats implementation

What statistics do we have? I choose to implement

- byte count,
- (UTF-8) characters count,
- word count,
- max line length,
- lines count.

Or in code:

`data Statistics = Bytes | Chars | Words | MaxLL | Lines deriving (Eq, Ord)`

This should be more than enough to illustrate the approach.

The base statistics are just instances of the `Statistic`

class.
And, since most statistics will have just an integer as their state,
let’s introduce a convenience typed wrapper for the state and result types:

`newtype Tagged a = Tagged Word64 deriving (Eq, Show, Num)`

here the `a`

is just a dummy parameter to distinguish, say,
`Tagged 'Bytes`

from `Tagged 'Chars`

.

Given these definitions, let’s start with the most straightforward one: the bytes count. In code:

```
instance Statistic 'Bytes (Tagged 'Bytes) (Tagged 'Bytes) 'Chunked where
initState = 0
extractState = id
computation = ChunkedComputation (\st _ -> st + 1) (\st str -> st + fromIntegral (BS.length str))
```

This is probably quite self-documenting:

- We say that
`Bytes`

denotes a statistic having`Tagged 'Bytes`

both as state type and result type and supporting chunked computations. - The initial state (that is, the byte count) is
`0`

. - The extraction function doesn’t do anything fancy and just returns the accumulated count.
- The computation is necessarily a
`ChunkedComputation`

, since we said`'Chunked`

on the first line. The step function ignores the input and just increments the counter, while the chunk function adds the length of the input to the counter.

Easy-peasy so far.

The rest of the statistics are implemented similarly. The implementations are boring, so I’m hiding them under a spoiler, but feel free to## check them out

Counting lines is similarly straightforward, and this statistic also supports both byte-only and chunked computations:

```
instance Statistic 'Lines (Tagged 'Lines) (Tagged 'Lines) 'Chunked where
initState = 0
extractState = id
computation = ChunkedComputation (\st c -> st + if c == 10 then 1 else 0) (\st str -> st + fromIntegral (BS.count 10 str))
```

What about counting words? This one supports just the byte-only computation, and we largely borrow the implementation from the last post:

```
data WordsState = WordsState { ws :: Word64, wasSpace :: Word64 }
instance Statistic 'Words (Tagged 'Words) WordsState 'ByteOnly where
initState = WordsState 0 1
extractState WordsState { .. } = Tagged (ws + 1 - wasSpace)
computation = ByteOnlyComputation step
where
step WordsState { .. } c = WordsState (ws + (1 - wasSpace) * isSp) isSp
where
isSp | c == 32 || c - 9 <= 4 = 1
| otherwise = 0
```

This is a good example of a statistic with a non-trivial extraction function.

Cool, we’ve managed to port the stats that we already had. What about the new ones: the UTF-8 chars count and the max line length?

The first one is a bit more involved, but all the complexity is about the right bit mangling:

```
instance Statistic 'Chars (Tagged 'Chars) (Tagged 'Chars) 'ByteOnly where
initState = 0
extractState = id
computation = ByteOnlyComputation $ \cnt c -> cnt + 1 - fromIntegral ( ((c .&. 0b10000000) `shiftR` 7)
.&. (1 - ((c .&. 0b01000000) `shiftR` 6))
)
```

Here we rely on the following property of the UTF-8 encoding:
each character has one and only one byte that doesn’t follow the `10xxxxxx`

bit pattern.
In other words, we don’t have to properly decode UTF-8 just for the sake of characters count.

What about the max line length? Here all the complexity is about properly accounting the non-printable characters and mimicking tabulation behaviour:

```
instance Statistic 'MaxLL (Tagged 'MaxLL) MaxLLState 'ByteOnly where
initState = MaxLLState 0 0
extractState MaxLLState { .. } = Tagged $ max maxLen curLen
computation = ByteOnlyComputation step
where
step MaxLLState { .. } 9 = MaxLLState maxLen $ curLen + 8 - (curLen `rem` 8)
step MaxLLState { .. } 8 = MaxLLState maxLen $ max 0 (curLen - 1)
step MaxLLState { .. } c | c == 10
|| c == 12
|| c == 13 = MaxLLState (max maxLen curLen) 0
| c < 32 = MaxLLState maxLen curLen
step MaxLLState { .. } _ = MaxLLState maxLen (curLen + 1)
```

`wc`

doesn’t take into account.
Looks a bit messy, but what you’re gonna do?
Having implemented all the base statistics, we now turn to the fun part: how do we combine them together?

### Combining stats

If `a`

is a statistic and `b`

is a statistic,
then the pair of `(a, b)`

is also a statistic.
So let’s start with the type for a pair of statistics:

```
infixr 5 :::
data a ::: b = a ::: b deriving (Show)
```

We could have just as well used the standard Prelude pair (as in `(,)`

),
but this way we don’t rely on the compiler eliding laziness (since tuples are lazy),
and a custom type for statistics also makes the type-promoted code a bit more readable IMO.

Let’s now express how the statistics are combined.

First, what about the combination of chunked vs byte-only statistics? If both statistics support chunked computations, then the result can also do chunked stats, otherwise it should degrade to byte-only computation. The type-level expression of this idea would be

```
type family CombineCompTy a b where
CombineCompTy 'Chunked 'Chunked = 'Chunked
CombineCompTy _ _ = 'ByteOnly
```

How would an instance of `Statistic`

look like for a pair of statistics?
We might write something like this:

```
instance (Statistic sa resa sta compa, Statistic sb resb stb compb)
=> Statistic (sa '::: sb) (resa ::: resb) (sta ::: stb) (CombineCompTy compa compb) where
initState = initState ::: initState
extractState (a ::: b) = extractState a ::: extractState b
computation =
case (computation :: StatComputation sta compa, computation :: StatComputation stb compb) of
(ByteOnlyComputation a, ChunkedComputation b _) -> ByteOnlyComputation $ combine a b
(ChunkedComputation a _, ByteOnlyComputation b) -> ByteOnlyComputation $ combine a b
(ByteOnlyComputation a, ByteOnlyComputation b) -> ByteOnlyComputation $ combine a b
(ChunkedComputation stepA chunkA, ChunkedComputation stepB chunkB)
-> ChunkedComputation (combine stepA stepB) (combine chunkA chunkB)
where
combine fa fb = \(a ::: b) w -> fa a w ::: fb b w
```

So if `sa`

is a statistic with the result type `resa`

,
state type `sta`

and computation mode `compa`

,
and similarly for `sb`

/ `resb`

/ `stb`

/ `compb`

,
then the pair `sa ::: sb`

is a statistic with the result type being a pair `resa ::: resb`

,
state type being `sta ::: stb`

and the computation mode — you guessed it — `CombineCompTy compa compb`

.

Note the (ticked, promoted) `:::`

-as-a-constructor vs (unticked, unpromoted) `:::`

-as-a-type
in the instance head.
`sa`

and `sb`

are *terms* promoted to the type level,
so we use the (promoted) *term* constructor near them,
and the rest are the usual *types* (living in `Type`

),
so we use the (unpromoted) *type* constructor there.

All is good, except… We can’t write this instance.
Having a type family in the instance head in current Haskell is illegal.
The easy way to fix this is to introduce a fresh type variable `comp`

and
add the constraint that it’s equal to the result of the type family application:

```
instance (Statistic sa resa sta compa,
Statistic sb resb stb compb,
comp ~ CombineCompTy compa compb)
=> Statistic (sa '::: sb) (resa ::: resb) (sta ::: stb) comp where
```

The rest of the instance definition remains the same.

Looking at the actual term-level stuff, the first two methods are easy and straightforward:

- Initial state of a pair of statistics is a pair of the corresponding initial states.
- Extracting state of a pair of statistics is a pair of the extracted states of the individual statistics.

Note we don’t need a single type annotation here!

The third method for building the statistics computation for the pair
is somewhat more boilerplate-ish,
but it largely follows the definition of the `CombineCompTy`

type
(and the type checker ensures that it indeed does follow that definition).
Namely, if both computations support chunked stats,
the resulting statistic also supports those (that’s the last clause).
Otherwise we fall back to the byte-only computation.

The major part of the boilerplate comes from the need to
explicitly pattern-match on at least one of the computers.
Why?
The type checker needs to ensure that the right-hand side
of each `case`

branch has the right type.

What’s the expected right-hand side type?
According to the class definition, it’s `StatComputation st comp`

,
where, according to the instance definition, `comp ~ CombineCompTy compa compb`

.
So it *depends* on the particular `compa`

and `compb`

,
and in order to reduce the `CombineCompTy`

type family,
the type checker needs to know whether both `compa`

and `compb`

are
`Chunked`

or not.

How does the type checker know the value of `compa`

or `compb`

?
In general it doesn’t — unless we pattern-match on the values of the corresponding `computation`

s.
Then the GADT machinery would tell us — by the definition of `StatComputation`

,
if the value is constructed with the `ChunkedComputation`

constructor,
then the corresponding type variable is necessarily `Chunked`

,
otherwise, if it’s the `ByteOnlyComputation`

constructor,
then the value is necessarily `ByteOnly`

.

Note that if we wrote `CombineCompTy`

without wildcard patterns,
enumerating all four possibilities, then the type checker would have had to know
the values of *both* `compa`

and `compb`

.

One other thing worth noting is that this representation of combined stats
allows having duplicates:
`'Words '::: 'Words`

represents a combined statistic of two word counts.
This would probably be frowned upon by ⊤rue type safety zealots,
but it is good enough for our purposes, and, moreover,
this particular feature will come in handy quite soon.

## Using stats

Great, we wrote a bunch of code. How do we actually use it?

Suppose we’re given an type satisfying the `Statistic`

class
and a `ByteString`

to compute statistics over,
We first unwrap the `computation`

of said instance.
If it’s a `ChunkedComputation`

, we run it on the whole input string.
Otherwise it’s a `ByteOnlyComputation`

,
and we `BS.foldl'`

over the input byte-by-byte.

Or, in code:

```
wc :: forall s res st comp. Statistic s res st comp => BS.ByteString -> res
wc s = extractState $! runCompute computation
where
runCompute :: StatComputation st comp -> st
runCompute (ByteOnlyComputation step) = BS.foldl' step initState s
runCompute (ChunkedComputation _ chunker) = chunker initState s
```

Fundeps save the day again here,
since the type checker is able to figure out all of the class arguments
just by looking at `res`

.
On the other hand, unfortunately, the type checker seemingly isn’t able
to infer the type of `runCompute`

, so it has to be specified explicitly.
Thus the variables `st`

and `comp`

in its signature
have to be the same ones as in the top-level `wc`

type,
hence the `forall`

keyword along with the `ScopedTypeVariables`

extension.

So, having this `wc`

function, we select the statistics we want to compute
either by an explicit (return) type annotation:

`let result = wc someBS :: Tagged 'Words ::: Tagged 'Lines`

or using the `TypeApplications`

extension
and explicitly giving the value of the `s`

type variable in its signature:

`let result = wc @('Words '::: 'Lines) someBS`

Both invocations are equivalent, but the latter is a bit more appealing to me, as, IMO, it communicates the intent a bit better.

### Performance, preliminarily

So, do all our efforts really pay off, or did we just do all this type classes mumbo-jumbo in vain?

Let’s measure how long does `wc @'Words`

take, using the same benchmarking strategy as the last time.
The best running time is 1.51 seconds — a bit more than counting bytes, words and lines in the unmodularized version.

How good is the compiler at eliding unnecessary computations with a representation like this?
Let’s measure `wc @('Words '::: 'Words)`

to find out!

But here the weirdness begins.
I expected to see something that’s maybe a bit slower,
ideally something that runs in pretty much the same time, but…
It actually runs *faster*, finishing in 1.34 seconds.
And `wc @('Words '::: 'Words '::: 'Words)`

takes about 1.30 seconds.
Adding more `Words`

doesn’t help further, though.

There’s something even weirder.
I’ve discussed this with some folks on the `#haskell`

freenode channel
and, in particular, I’ve asked if they were able to reproduce this behaviour.
They weren’t.
The results were pretty consistent and independent on the number of duplicates of the same computation.

I don’t have a good explanation for this. I spent some time looking at GHC Core — to no avail. If it was just the performance improvement from “more work to be done”, then I’d probably be able to propose a few hypotheses about the behaviour of the inliner, or the specializer, or something along those lines. But, given that it doesn’t reproduce on at least some of the others’ machines… I don’t know. And I can’t say I like this.

Anyway, enough for the ramblings, let’s proceed with our quick and dirty performance measurements.
What about all the three statistics we had previously?
Let’s measure `wc @('Bytes '::: 'Words '::: 'Lines)`

!

The run time of that one is 1.53 seconds. It’s a bit worse than 1.45 seconds we had previously with all three statistics hardcoded into a single tight loop, but, in my opinion, this is still a good result. So we’ve empirically proven that our approach stands a chance. Let’s see what benefits we can reap due to this modularizaton.

## Testing

Oh, testing this is a pleasure! Local reasoning enables us to test each individual statistic in isolation and still be confident all of them work correctly in any combination.

For instance, we can check that, for any string,
the result of the `Words`

statistic on that string
is equal to the length of the list returned by the standard `words`

function
applied to the same string, ignoring all spaces at codepoints above `127`

.
In fact, we can postulate this is the *defining property* of the `Words`

statistic,
but that’s more of a question of philosophy and attitude.

In code, we’ll just have some QuickCheck-style properties written for both ASCII and UTF-8 inputs:

```
import qualified Data.ByteString.Char8 as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as T
-- and some testing framework imports
import Data.WordCount
wrapUnicode :: UnicodeString -> (BS.ByteString, T.Text)
wrapUnicode ustr = (T.encodeUtf8 txt, txt)
where
txt = T.pack $ getUnicodeString ustr
replaceNonAsciiSpaces :: Char -> Char
replaceNonAsciiSpaces ch | ch >= chr 127 && isSpace ch = '_'
| otherwise = ch
main :: IO ()
main = hspec $ parallel $ modifyMaxSuccess (const 10000) $ modifyMaxSize (const 1000) $ do
describe "ASCII support" $ do
it "Counts bytes correctly" $ property $
\(getASCIIString -> str) -> wc @'Bytes (BS.pack str) `shouldBe` genericLength str
it "Counts chars correctly" $ property $
\(getASCIIString -> str) -> wc @'Chars (BS.pack str) `shouldBe` genericLength str
it "Counts words correctly" $ property $
\(getASCIIString -> str) -> wc @'Words (BS.pack str) `shouldBe` genericLength (words str)
it "Counts lines correctly" $ property $
\(getASCIIString -> str) -> wc @'Lines (BS.pack str) `shouldBe` genericLength (filter (== '\n') str)
describe "UTF8 support" $ do
it "Counts bytes correctly" $ property $
\(wrapUnicode -> (bs, _)) -> wc @'Bytes bs `shouldBe` fromIntegral (BS.length bs)
it "Counts chars correctly" $ property $
\(wrapUnicode -> (bs, txt)) -> wc @'Chars bs `shouldBe` fromIntegral (T.length txt)
it "Counts words correctly" $ property $
\(wrapUnicode -> (bs, txt)) -> wc @'Words bs `shouldBe` genericLength (T.words $ T.map replaceNonAsciiSpaces txt)
it "Counts lines correctly" $ property $
\(wrapUnicode -> (bs, txt)) -> wc @'Lines bs `shouldBe` fromIntegral (T.count "\n" txt)
```

And that’s about it for the tests!

A couple of things worth mentioning:

- We don’t check any properties for the max line length statistic since it’s… non-trivial.
- We do some
`Data.Text`

-related string handling, but that’s just probably the easiest way to get a`ByteString`

containing UTF-8 bytes. - The tests also execute reasonably fast: it takes about 3-5 seconds to get through 10000 samples having max string length of 1000 characters on my machine for each property.

Anyway, the reader is now invited to try implementing something similar for the GNU Coreutils C version.

## Handling user options

We’ll use the optparse-applicative library to do command-line options parsing. We introduce a type for the command-line options along with a parser for it:

```
data Options = Options
{ countBytes :: Bool
, countChars :: Bool
, countLines :: Bool
, countMaxLineLength :: Bool
, countWords :: Bool
, files :: [FilePath]
}
options :: Parser Options
options = Options
<$> switch (long "bytes" <> short 'c' <> help "print the byte counts")
<*> switch (long "chars" <> short 'm' <> help "print the character counts")
<*> switch (long "lines" <> short 'l' <> help "print the newline counts")
<*> switch (long "max-line-length" <> short 'L' <> help "print the maximum display width")
<*> switch (long "words" <> short 'w' <> help "print the word counts")
<*> some (argument str (metavar "FILES..."))
```

We start our `main`

with actually running that parser
and mapping the options onto the values of the `Statistics`

type we had previously,
counting bytes, words and lines if the user hasn’t selected anything explicitly:

```
main :: IO ()
main = do
Options { .. } <- execParser $ info (options <**> helper) (fullDesc <> progDesc "Print newline, word, and byte counts for each file")
let selectedStats = map snd $ filter fst [ (countBytes, Bytes), (countChars, Chars)
, (countWords, Words), (countMaxLineLength, MaxLL)
, (countLines, Lines)
]
let stats | null selectedStats = [Bytes, Words, Lines]
| otherwise = selectedStats
```

All of this can likely be done way more elegantly, but this is the most straightforward approach
that gets the boring parts out of the way, giving us the `stats`

list to work with.

Given that list, what do we do?
We need to convert it to a type that we could feed to `wc`

.
So, in other words, we have a term, and we need to treat it as a type.
Sounds exactly like dependent types!

### Almost dependent types

This approach is doomed to fail performance-wise, but it’s still worth giving it a shot to see on a real example how we can simulate something dependently type-ish in modern Haskell (and just how bad will it be). Let’s go ahead and build the solution step by step, preferably avoiding things like singletons to get a better feel of the underlying type system mechanics.

We don’t know what statistics we’ll compute at run time, so we’re wrapping it in an existential type:

```
data SomeStats where
MkSomeStats :: Statistic s res st comp => proxy s -> SomeStats
```

Here `proxy s`

is basically a witness of a specific `Statistic`

instance.
Its sole purpose is to ensure we have a type around to get a hold on.

Given an value of this existential type, how do we actually use it? We might try something like

```
wc' :: SomeStats -> BS.ByteString -> ?
wc' (MkSomeStats (_ :: proxy s)) input = wc @s input
```

but… What type do we write instead of the `?`

?
What’s the return type of this function?
It’s the `res`

type from the corresponding `Statistic`

instance,
but we don’t know what exactly it is!

What if we modify our existential to keep track of it as well, like this?

```
data SomeStats where
MkSomeStats :: Statistic s res st comp => proxy1 s -> proxy2 res -> SomeStats
```

and pattern-match like

```
wc' :: SomeStats -> BS.ByteString -> res
wc' (MkSomeStats (_ :: proxy1 s) (_ :: proxy2 res)) input = wc @s input
```

This obviously won’t work even syntactically: `res`

is not in scope in the type signature.
And even if it somehow was magically syntactically correct,
then `res`

would escape the existential — something a reasonable type system should prevent.

What do we do?

Let’s take a step back.
We indeed don’t know the exact return type of `wc`

,
but the insight is that we don’t really need to!
We only care about, say, printing it to the user.
So, in other words, we only care if it’s able to, say, `show`

this result.
Let’s just write this down!

```
data SomeStats where
MkSomeStats :: (Statistic s res st comp, Show res) => proxy s -> SomeStats
```

Now we can write our `wc'`

as

```
wc' :: SomeStats -> BS.ByteString -> String
wc' (MkSomeStats (_ :: proxy s)) input = show $ wc @s input
```

and this type checks.

Now, how do we actually convert our input of `stats`

to `SomeStats`

?

First, we need to be able to promote individual statistics (members of the `stats`

list):

```
promoteStat :: Statistics -> SomeStats
promoteStat Bytes = MkSomeStats (Proxy :: Proxy 'Bytes)
promoteStat Chars = MkSomeStats (Proxy :: Proxy 'Chars)
promoteStat Words = MkSomeStats (Proxy :: Proxy 'Words)
promoteStat MaxLL = MkSomeStats (Proxy :: Proxy 'MaxLL)
promoteStat Lines = MkSomeStats (Proxy :: Proxy 'Lines)
```

It’s a bit ugly that we have to explicitly enumerate all the values of `Statistics`

,
but this is the limitation of the Haskell type system.
So much for the ersatz-dependent types:
even though, say, the *term* `Bytes`

and the (promoted) *type* `'Bytes`

look the same, they are very different entities to the type checker
with no a-priori relationship between them,
and we need to establish that relationship manually by the means of a function like `promoteStat`

.

Anyway, given this `promoteStat`

we can write a function to promote a whole list of stats:

```
promoteStats :: [Statistics] -> SomeStats
promoteStats [s] = promoteStat s
promoteStats (s:ss) =
case (promoteStat s, promoteStats ss) of
(MkSomeStats (_ :: proxy1 st), MkSomeStats (_ :: proxy2 sst))
-> MkSomeStats (Proxy :: Proxy (st '::: sst))
```

Promoting a single element list boils down to using `promoteStat`

.

Promoting a list with two or more elements is more interesting and might seem a bit magic.
We use `promoteStat`

and `promoteStats`

to promote the head and the tail of the list respectively.
Then we pattern-match on the results of these promotions,
and we bind `st`

to the (unknown) type which is the result of promoting the head of the list,
and we bind `sst`

to the (unknown) type which is the result of promoting the tail.

We don’t really know anything about those types except they implement `Statistic`

(because that’s the constraint in the existential type).
But if they do implement `Statistic`

,
then `st ::: sst`

also implements `Statistic`

—
that’s precisely the gist of the combining instance we’ve written above!
Moreover, we also know that `rest`

and `resst`

(that is, some imaginary unnamed type variables corresponding to the statistics computations result types)
implement `Show`

.
So we can derive that `rest ::: resst`

also implements `Show`

,
and this is precisely the result type of `st ::: sst`

.
So, all in all, `MkSomeStats (Proxy :: Proxy (st '::: sst))`

is perfectly well-typed.

Indeed, it might seem a bit marvelous that the type checker is able to figure all this out!

Also, this is a partial function (we don’t handle `[]`

), but let’s keep the exposition simple.
It’d be a bit more involved with `NonEmpty`

, blurring the focus somewhat.

Anyway, given this function, the usage is pretty straightforward:

```
main :: IO ()
main = do
-- obtaining `stats` as before
forM_ files $ \path -> do
contents <- unsafeMMapFile path
putStrLn $ wc' (promoteStats stats) contents
```

#### Performance, wonders

How well does this perform? Let’s run again to find out!

Running the lines calculation looks promising, taking 1.05 seconds to complete.
That’s about as fast as doing `BS.count 10 contents`

.

But that’s a chunked stat that takes the whole input at once in this case. What about something that should go over the input byte-by-byte, like the word counter? That’s gonna be… 14 seconds, as opposed to circa 1.5 we had previously.

Gosh, **14 seconds**.

And it allocates like crazy: if we run the program with `+RTS -sstderr`

, we’ll see

` 74,873,139,008 bytes allocated in the heap`

I haven’t systematically measured the heap allocations for the previous versions,
but it always was under 1 megabyte.
Well, at least this version is still O(1) on memory —
most of those allocations just immediately die in the nursery
(`60,512 bytes maximum residency`

).

What if we count words and lines? **27 seconds**, 120 gigs allocated in the heap.

What about words, lines and bytes? Can you guess?

If you expected something around 42 seconds, please pat yourself on the back —
it runs in 41 seconds and allocates 194 gigs.
Maximum residency as per RTS is still about 60 kilobytes, though —
the allocated bytes count is the *total* memory ever allocated,
it is *not* the peak memory consumed.

Why is it so bad?

Well, when there is a function like

`wc' (MkSomeStats (_ :: proxy s)) input = show $ wc @s input`

then the *compiler* can’t possibly know what `computation`

is associated with the type variable `s`

—
it’s only really known at *run* time, once the user makes their choice.
Thus the compiler cannot inline the specific `computation`

for that `s`

—
in fact, it has no choice but to compile this down to explicit
`Statistic`

class methods dictionary passing without any inlining at all.

So, what happens here?

The`wc'`

function receives a pointer to the `computation`

function
wrapped inside the `SomeStats`

existential
(along with the pointers to the rest of `Statistic`

methods),
and passes it on to `wc`

, which calls that function through the pointer for every input byte.
No inlining, no related optimizations, no tight loops, but a function call on each iteration.
So that’s about 1.8 billion of calls for our input data, which is surely gonna be slow!
And the caller is responsible for packing the right function pointers into the existential.

For instance, when the `stats`

list consists of a single element,
so the

`promoteStats [s] = promoteStat s`

equation kicks in, and `s`

turns out to be, for instance, `Words`

,
`promoteStat`

gets evaluated according to the

`promoteStat Words = MkSomeStats (Proxy :: Proxy 'Words)`

equation.
So it wraps the pointers to the methods
of the `Statistic`

implementation for `Words`

in the `SomeStats`

existential.

The other `promoteStats`

equation,

```
promoteStats (s:ss) =
case (promoteStat s, promoteStats ss) of
(MkSomeStats (_ :: proxy1 st), MkSomeStats (_ :: proxy2 sst)) -> MkSomeStats (Proxy :: Proxy (st '::: sst))
```

is a bit more involved:
for example, the pointer for `computation`

would be built roughly as follows.
The `case`

on the right-hand side of the equation
extracts the pointers to the `computation`

implementations
from the existentials returned by `promoteStat`

and the recursive call to `promoteStats`

and, to simplify it,
passes them on to the `computation`

of the implementation
of the `Statistic`

class for the “inductive” `sa ::: sb`

case.
That last function just invokes those earlier pointers one after the other.

So, if the options list consists of two stats, we’re paying the overhead of about 13-14 seconds twice, and the total run time should be about 28 seconds — just as we’ve observed. Shall we desire all five statistics, the run time would be around 65-70 seconds.

Understanding how existentials (and type erasure in general) work leads us to expect this approach to have terrible performance.

### The ugly way

But all this helps us to narrow the possible solutions space down.
We need to make sure the compiler sees the exact (monomorphized) type of `wc`

at each invocation point.
Ideally, we’d have something like

```
main = do
-- ..
case stats of
[Words] -> print $ wc @'Words contents
[Bytes] -> print $ wc @'Bytes contents
[Lines] -> print $ wc @'Lines contents
[Words, Bytes] -> print $ wc @('Words '::: 'Bytes) contents
[Lines, Bytes] -> print $ wc @('Lines '::: 'Bytes) contents
-- ...
```

and so on.
But if we have just five statistics we might compute, then basic combinatorics shows that
we’ll have 2^{5} - 1 = 31 `case`

branches.
Obviously, that’s not gonna work.

That is, unless the compiler does the job for us.
For this, we’ll need to resort to some Template Haskell.
We’ll write a (meta-)function `dispatch`

that will be used as follows:

```
contents <- unsafeMMapFile path
putStrLn $ $(dispatch 'wc 'contents) stats
```

Here the `$(dispatch 'wc 'contents)`

splice creates a function
doing a `case`

analysis of the `stats`

argument
using the generated list of alternatives
and invoking the `wc`

function as needed.

This is a pretty technical exercise in Template Haskell, so I’m not describing the thought process behind it. Here’s the result of that process:

```
dispatch :: Name -> Name -> Q Exp
dispatch fun bs = reify ''Statistics >>= \case
TyConI (DataD _ _ _ _ cons _) -> do
let consNames = [ name | NormalC name _ <- cons ]
let powerset = filterM (const [True, False]) consNames
let matches = buildMatch fun bs <$> filter (not . null) powerset
fallbackMatch <- (\body -> Match WildP (NormalB body) []) <$> [e| error "Unexpected input" |]
pure $ LamCaseE $ matches <> [fallbackMatch]
_ -> fail "unsupported type"
buildMatch :: Name -> Name -> [Name] -> Match
buildMatch fun bs consNames = Match (ListP $ (`ConP` []) <$> consNames) (NormalB $ VarE 'show `AppE` (wcCall `AppE` VarE bs)) []
where
wcCall = VarE fun `AppTypeE` foldr1 f (PromotedT <$> consNames)
f accTy promotedTy = PromotedT '(:::) `AppT` accTy `AppT` promotedTy
```

Long story short, we figure out all the constructors of the `Statistics`

type,
and for each non-empty subset of the constructor set — that is, for each non-empty element in `powerset`

—
we build a separate `case`

branch using the helper `buildMatch`

function.
Then we wrap all those alternatives into a single lambda case
using the `{-# LANGUAGE LambdaCase #-}`

extension.

The only other thing that makes me cringe a little (besides the need for TH)
is that the generated function expects the input `stats`

list to be sorted.
We can discharge this expectation either by carefully choosing the order of the elements in this list:

```
let selectedStats = map snd $ filter fst [ (countBytes, Bytes), (countChars, Chars)
, (countWords, Words), (countMaxLineLength, MaxLL)
, (countLines, Lines)
]
```

where they are already given in the order induced by the compiler-generated `Ord`

instance,
or by explicitly sorting the `stats`

list.
Both approaches work, and the performance penalty of sorting the `stats`

would obviously be negligible.

So, how would this function be used in context? Let’s put all of the pieces together!

```
main :: IO ()
main = do
Options { .. } <- execParser $ info (options <**> helper) (fullDesc <> progDesc "Print newline, word, and byte counts for each file")
let selectedStats = map snd $ filter fst [(countBytes, Bytes), (countChars, Chars), (countWords, Words), (countMaxLineLength, MaxLL), (countLines, Lines)]
let stats | null selectedStats = [Bytes, Words, Lines]
| otherwise = selectedStats
forM_ files $ \path -> do
contents <- unsafeMMapFile path
putStrLn $ $(dispatch 'wc 'contents) stats
```

#### Performance, redux

How well does this approach perform?
I’m not going to benchmark all 31 combinations, but just some random subset of those.
Measuring is done as before: each test is run 5 times on a sample file of 1.8 gigabytes,
and the minimum userspace time is taken.
GNU Coreutils’ `wc`

is run with environment variables `LC_ALL=C LANG=C`

unless otherwise noted.

Here are the results:

Words | Bytes | Lines | Characters | Max line length | Haskell `wc` , s |
Coreutils `wc` , s |
---|---|---|---|---|---|---|

✓ | 0.00 ¹ | 0.00 ¹ | ||||

✓ | 1.54 | 12.5 | ||||

✓ | ✓ | 1.20 | 12.5 | |||

✓ | 1.06 / 0.24 ² | 0.26 | ||||

✓ | ✓ | ✓ | 1.52 | 12.5 | ||

✓ | 1.42 | 8.45 ³ | ||||

✓ | 2.21 | 12.5 | ||||

✓ | ✓ | 2.92 | 12.5 |

A few observations and footnotes:

- It still boggles my mind that counting bytes
*and*words is faster than just counting words. - Looks like the C version always computes some basic set of statistics (words, bytes, lines, max line length) if either words or max line length option is enabled. I’m not surprised.
- ¹ Yes,
`time`

literally shows`0.00user`

for both programs counting bytes. - ² The first Haskell number (1.06 seconds) for the bytes and lines test is with the upstream
`bytestring`

. The second number is with a patched`bytestring`

, where the`count`

method is optimized. Ironically,`count`

is implemented in C, and, ironically again, the C compiler is unable to optimize it reasonably well, and even a pure Haskell implementation would be faster, but all this irony would make a good subject for another post. The patch boils down to replacing that method with a bunch of~~Assembly~~intrinsics. With that patch, we’re basically comparing the speed of two implementations in a hand-written Assembly (or intrinsics, or whatever). No wonder it’s very close. - ³ Here
`wc`

is run with an UTF-8 locale, as otherwise it figures out that the characters count is equal to the bytes count and runs in constant time, while the goal is to compare UTF-8 character counting.

## Nits

Our little program can handle user options,
it supports all of the statistics that `wc`

supports,
it can handle multiple files, but what is missing?

### Parallelism

This is easy: we can just replace `forM_`

with `forConcurrently_`

from the
async
library in our main loop, resulting in:

```
main :: IO ()
main = do
Options { .. } <- execParser $ info (options <**> helper) (fullDesc <> progDesc "Print newline, word, and byte counts for each file")
let selectedStats = map snd $ filter fst [(countBytes, Bytes), (countChars, Chars), (countWords, Words), (countMaxLineLength, MaxLL), (countLines, Lines)]
let stats | null selectedStats = [Bytes, Words, Lines]
| otherwise = selectedStats
forConcurrently_ files $ \path -> do
contents <- unsafeMMapFile path
putStrLn $ $(dispatch 'wc 'contents) stats
```

This will handle as many files simultaneously as there are cores —
or, rather, the value of the `-N`

RTS option, which
is equal to the number of cores by default for our compiled program.
Alternatively, we could add another option, say, `-j`

for jobs count.

What’s the overhead of this parallelism?

If we count, say, just words and bytes of a single file, it’ll run in about 1.22 seconds — almost as the sequential version, although the variance of the run time is higher in this case.

If we now feed our program with 6 copies of our test file (my CPU has 6 physical cores), it runs in 1.47 seconds — but with the variance even higher in this case. My guess is that it is already effectively bottlenecked on memory bandwidth at this point.

### Pretty-printing

Now if we run our program, the result is presented as

`Tagged 123 ::: (Tagged 456 ::: Tagged 789)`

That’s not really friendly. Let’s add some pretty-printing!
The easiest way is to add a method to the `Statistic`

class:

```
class Statistic s res st comp | res -> s, st -> s
, s -> res, s -> st, s -> comp where
-- ...
prettyPrint :: res -> String
```

The implementation for base statistics is simple, for instance:

```
instance Statistic 'Bytes (Tagged 'Bytes) (Tagged 'Bytes) 'Chunked where
-- ...
prettyPrint (Tagged n) = show n <> " bytes"
```

The one for the combined statistics is just a bit more typing:

` prettyPrint (a ::: b) = prettyPrint a <> "\n" <> prettyPrint b`

We also change our `buildMatch`

helper to use `prettyPrint`

:

`buildMatch fun bs consNames = Match (ListP $ (`ConP` []) <$> consNames) (NormalB $ VarE 'prettyPrint `AppE` (wcCall `AppE` VarE bs)) []`

And that’s it!

### More input types

Our implementation unconditionally interprets input paths as files and `mmap`

s them.
This is not always going to work:
for instance, it’s impossible to `mmap`

a `/proc/fd`

file descriptor
in a case like `hwc <(cat foo | grep bar)`

.

One way to remedy this is to check if the input is a regular file or a symbolic link —
those can be `mmap`

’ed, and we already know how to handle them.

We choose to be safe and just unconditionally avoid `mmap`

for everything else.
Any other input gets read into a *lazy* `ByteString`

so that the space complexity is still O(1).
A lazy `ByteString`

is basically a list of strict chunks,
so we can just fold over those in case of chunked computations:

```
import qualified Data.ByteString.Lazy as BSL
wcLazy :: forall s res st comp. Statistic s res st comp => BSL.ByteString -> res
wcLazy s = extractState $! runCompute computation
where
runCompute :: StatComputation st comp -> st
runCompute (ByteOnlyComputation step) = BSL.foldl' step initState s
runCompute (ChunkedComputation _ chunker) = BSL.foldlChunks chunker initState s
```

Alternatively,
we could have introduced a class of the input types and overload `wc`

on that one,
but the structure of the problem implies we’ll only going to have either strict or lazy
`ByteString`

s, so this (more straightforward) approach works too,
and it results in more or less equal amount of boilerplate.

Having this function, we modify our `main`

as follows:

```
forConcurrently_ files $ \path -> do
stat <- getFileStatus path
if isRegularFile stat || isSymbolicLink stat
then countStrict stats $ unsafeMMapFile path
else countLazy stats $ BSL.readFile path
```

where we’ve added a couple of tiny helper functions:

```
where
countStrict stats act = do
contents <- act
putStrLn $ $(dispatch 'wc 'contents) stats
countLazy stats act = do
contents <- act
putStrLn $ $(dispatch 'wcLazy 'contents) stats
```

And that’s it!

### Handling stdin

One other thing that Coreutils `wc`

does and our implementation doesn’t
is computing statistics over `stdin`

.

The fix is quite simple, just adding a check to the end of our `main`

:

```
main = do
-- ... as before ...
when (null files) $ countLazy stats BSL.getContents
```

If the `files`

list is empty,
then the `forM_`

/`forConcurrently_`

fold doesn’t do anything,
so we shouldn’t even guard it.

Now we can do something like this:

`cat testfile.txt | /usr/bin/time hwc-exe -cw`

What’s the performance hit?
The above command runs in 1.40 seconds — that’s quite a jump from 1.22 seconds we had previously,
but looks like it’s the overhead of avoiding `mmap`

and using lazy `ByteString`

s.

### Compile times and code bloat

I’m really nitpicking here, but let’s mention this too for the sake of some objectivity.

What’s the overhead of all this metaprogramming nonsense? We can measure either compilation times or the code size. So let’s do both!

If we go the “dependently-typed” route, which we can consider a baseline in this case,
`stack build`

takes about 7.9 seconds for the whole project,
and the resulting binary size is 2.24 megabytes (after stripping).

The Template Haskell approach takes exactly three times as much to compile — 23 seconds. The code size change isn’t that big, though: it’s increased by just about 4.3%, resulting in a binary of 2.34 megabytes.

Although… It isn’t worth it doing `stack build`

during normal development workflow:
`stack build --fast`

might be much faster.

Again, the “dependently-typed” implementation takes about 5.6 seconds to compile with `--fast`

,
and the TH one takes… 7.8 seconds.
Not bad. I can buy that.

And, since this is a very important factor for some, let’s compare this to C. As expected, Haskell loses a lot here. I won’t even benchmark the compilation time of the C version — I bet it’s less than 0.1 seconds in release mode. Regarding code bloat, the resulting binary size is 51 kilobytes on my machine or 47 kilobytes in Ubuntu packages.

Anyway, I think that’s about it for all the minor things we could have done.

And this is already turns out to be quite a lengthy post, way more than I originally had in mind, so let’s jump to

## Conclusions

So, what have we achieved?
In a couple of words, we’ve managed to productionize our initial `wc`

prototype
to support user options (and more statistics!)
in such a way that it doesn’t compute what the user hasn’t requested,
all while ensuring the performance overhead of this modularization is negligible.
In a sense, this implementation is closer to the favourite principle of the C and C++ programmers —
“don’t pay for what you don’t use” — than the C version.

Or, in more detail, we’ve:

- split our initial
`wc`

prototype into a set of small, testable and reusable pieces — one per each individual statistic; - learned how to easily compose those pieces together;
- written some tests for the individual statistics while being certain any combination of statistics is computed correctly by construction;
- tried computing just the statistics chosen by the user at runtime via some techniques inspired by dependent types, and watched them fail hilariously (much to our expectation);
- managed to overcome the performance issues via some Template Haskell;
- verified the result is very close performance-wise to our initial implementation, so all this modularity does not have a significant overhead;
- closed a few other gaps our implementation had, ranging from parallel input processing to handling standard input.

I think that’s enough for today.

Oh, and by the way, the source code of the resulting program (modulo a couple of nits) is also available.

And there are also a couple of things worth mentioning:

- The dependently typed approach isn’t necessarily doomed performance-wise. We chose to feed each byte to each handler, but we could instead feed a L1-cache-sized chunk of data to each handler. Then, for instance, since on my machine L1d cache size is 32 KiB and the input file is 1.8 gigs, the overhead would have been about 0.4 ms, or 0.04% of the characteristic run time of 1 second, which is totally reasonable.
- Our implementation still isn’t completely equivalent to
`wc`

, since, for instance, the char count only supports ASCII and UTF-8 encodings. Although I haven’t seen a machine with a different encoding in maybe 18 years after I played around with a KOI8-R-based distro, support for other locales encodings can certainly be useful. Or, as an another example, this implementation counts words slightly differently than`wc`

, which could also be an important difference in some applications.

The nice part about this approach, though, is that it’s super easy and super cheap to add more statistics, implementing different algorithms, without affecting the existing ones. Modularity FTW!

## Edit history

*Apr 05*: expanded the conclusion and fixed a few typos in the text where `Statistics`

occurred instead of `Statistic`

.