Letter frequency analysis

Started by stevep99
3 Replies:

stevep99
Member

Reputation: 118
From: UK
Registered: 14-Apr-2014
Posts: 980

29-Feb-2020 17:37:10#1 Letter frequency analysis

I have been analyzing character frequencies from several different sources, to see how much they vary from the "standard" set of books I've been using. The answer: more than I was expecting.
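For anyone who wants to try this kind of comparison themselves, here is a minimal sketch, not the method used for the results in this thread. It counts every non-whitespace character in two texts and compares the resulting tables using total variation distance, which is just one reasonable choice of measure. The function names and the two sample strings are made up for illustration.

```python
from collections import Counter


def char_frequencies(text):
    """Return the relative frequency of each non-whitespace character."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def total_variation(freq_a, freq_b):
    """Half the sum of absolute differences: 0 = identical, 1 = no overlap."""
    chars = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in chars)


# Hypothetical sample texts standing in for two much larger corpora.
prose = "The quick brown fox jumps over the lazy dog."
code = "my_var = some_func(my_arg, other_arg)"

prose_freq = char_frequencies(prose)
code_freq = char_frequencies(code)

print("distance:", total_variation(prose_freq, code_freq))
print("underscore share in code:", code_freq.get("_", 0.0))
```

With real corpora you would read the text from files instead of short strings; the comparison step stays the same.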

You can see the results on this page.

These frequency tables can also be used in my layout analyzer; the links to do this are at the bottom of the analyzer main page.

Last edited by stevep99 (29-Feb-2020 17:37:48)

Using Colemak-DH with Seniply.

DreymaR
Member

Reputation: 220
From: Viken, Norway
Registered: 13-Dec-2006
Posts: 5,401

02-Mar-2020 10:13:19#2 Re: Letter frequency analysis

Oh wow, holy underscore Batman! (ʘ_ʘ;)

*** Learn Colemak in 2–5 steps with Tarmak! ***
*** Check out my Big Bag of Keyboard Tricks for Win/Linux/TMK... ***

Shai
Administrator

Reputation: 37
Registered: 11-Dec-2005
Posts: 423

03-Apr-2020 21:56:42#3 Re: Letter frequency analysis

For this kind of thing, you want a large and diverse corpus.

I recommend "English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU", which has statistics based on a corpus of trillions of characters. The source is just books, but a very wide variety of them, and the n-gram dataset can be downloaded for free.

I've added a list of corpora resources on the Design page.

stevep99
Member

Reputation: 118
From: UK
Registered: 14-Apr-2014
Posts: 980

04-Apr-2020 11:03:21#4 Re: Letter frequency analysis

Yeah, Peter Norvig's analysis is really useful, and the corpus he uses is huge. I have tried using his data before. The one drawback, though, is that he focuses on letters only and leaves out other symbols such as punctuation, which is not ideal for our purposes.
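To give a rough idea of what a letters-only table leaves out, here is a small sketch that measures the share of non-letter characters in a text. The function name and the sample strings are hypothetical, and digits are counted as non-letters here, which is a choice for illustration rather than anything taken from Norvig's data.

```python
def symbol_share(text):
    """Fraction of non-whitespace characters that are not letters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    non_letters = sum(1 for ch in visible if not ch.isalpha())
    return non_letters / len(visible)


# Hypothetical samples: plain prose versus a line of source code.
print(symbol_share("It was a dark and stormy night."))
print(symbol_share("result = my_func(arg_one, arg_two)"))
```

Whatever fraction this reports for your own text is the part of the typing load that a letters-only frequency table cannot account for.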
