You may have missed it, but I've created a corpus which is much larger, cleaner and more varied.
You can download the corpus here.
FWIW, last year in September I compiled an English corpus and a code corpus. I posted some analysis to Den's site before it disappeared, along with my breathless prose.
Anyway, Arno's efforts motivated me to finally write it all up and do some other analysis, which is published here, along with some useful files. Get the 1.0.1 and -101 versions.
During the process I wrote Shakespeare's Monkey and Shakespeare's Coder ... little programs that use the English and code bigram frequencies (all 97 characters on ANSI) to produce "English" or "code" that things like KLA think are perfect English or code, even though we can see they're not.
But they work great in KLA, and certainly better than the likes of Alice. They also effectively solve "how to test typing code", since the code samples are multi-language.
They should also be suitable fodder for analysis engines working with bigrams, since they are nothing but chained bigrams.
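For anyone wondering what "chained bigrams" looks like in practice, here is a rough Python sketch of the general idea. It is not the actual Shakespeare's Monkey source; the function names `bigram_counts` and `generate` are made up for illustration, and it counts bigrams from a sample string rather than reading the frequency files in the zip.

```python
import random
from collections import Counter, defaultdict


def bigram_counts(text):
    """Count how often each character is followed by each other character."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts


def generate(counts, length, seed=None):
    """Chain bigrams: pick each next character in proportion to how often
    it followed the current character in the source counts."""
    if length <= 0 or not counts:
        return ""
    rng = random.Random(seed)
    current = rng.choice(sorted(counts))  # arbitrary starting character
    out = [current]
    while len(out) < length:
        followers = counts.get(current)
        if not followers:
            # Dead end (character never seen with a follower): restart.
            current = rng.choice(sorted(counts))
        else:
            chars = list(followers)
            weights = [followers[c] for c in chars]
            current = rng.choices(chars, weights=weights, k=1)[0]
        out.append(current)
    return "".join(out)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog "
    print(generate(bigram_counts(sample), 80, seed=1))
```

The output is gibberish to a human, but every adjacent pair of characters occurs in the source, in roughly the source proportions over a long enough run, which is all a bigram-based analyzer looks at.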
There's a whole bunch of "useful" spreadsheets in the zip file. Enjoy. :-)
I would not recommend them for cryptanalysis, but they are good for keyboard layouts.