    Letter frequency analysis

    • Started by stevep99
    • 3 Replies:
    • Reputation: 117
    • From: UK
    • Registered: 14-Apr-2014
    • Posts: 978

    I have been analyzing character frequencies from several different sources to see how much they vary from the "standard" set of books I've been using. The answer: more than I was expecting.

    You can see the results on this page.

    These frequency tables can also be used in my layout analyzer; the links to do this are at the bottom of the analyzer's main page.
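
    For anyone who wants to try this kind of comparison on their own text, here is a minimal sketch in Python. It is not the code behind the tables mentioned above: the function names `char_frequencies` and `largest_differences` and the two sample strings are made up for illustration, and skipping whitespace and folding case are assumptions that a real analysis might not share.

```python
from collections import Counter


def char_frequencies(text):
    """Return each non-whitespace character's share of the text (0..1)."""
    # Assumption: case is folded and whitespace is ignored.
    counts = Counter(ch for ch in text.lower() if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def largest_differences(freq_a, freq_b, top=5):
    """List the characters whose share differs most between two tables."""
    chars = set(freq_a) | set(freq_b)
    diffs = {ch: freq_a.get(ch, 0.0) - freq_b.get(ch, 0.0) for ch in chars}
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)[:top]


# Two tiny made-up samples standing in for real corpora.
prose = "The quick brown fox jumps over the lazy dog."
code = "snake_case_name = other_name + my_list[0]"

# Positive values mean the character is more common in the code sample.
for ch, diff in largest_differences(char_frequencies(code), char_frequencies(prose)):
    print(f"{ch!r}: {diff:+.3f}")
```

    With real corpora in place of the two sample strings, the characters at the top of the list are the ones a layout tuned for one source would misjudge most for the other.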

    Last edited by stevep99 (29-Feb-2020 17:37:48)

    Using Colemak-DH with Seniply.

    • Reputation: 214
    • From: Viken, Norway
    • Registered: 13-Dec-2006
    • Posts: 5,361

    Oh wow, holy underscore Batman!  (ʘ_ʘ;)

    *** Learn Colemak in 2–5 steps with Tarmak! ***
    *** Check out my Big Bag of Keyboard Tricks for Win/Linux/TMK... ***

    • Shai
    • Administrator
    • Reputation: 36
    • Registered: 11-Dec-2005
    • Posts: 423

    For this kind of thing, you want a large and diverse corpus.

    I recommend "English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU", which has statistics based on a corpus of trillions of characters. The source is just books, but a very wide variety of them, and the n-gram dataset is downloadable for free.

    I've added a list of corpora resources on the Design page.

    • Reputation: 117
    • From: UK
    • Registered: 14-Apr-2014
    • Posts: 978

    Yeah, Peter Norvig's analysis is really useful, and the corpus he is using is huge. I have tried using his data before. The one drawback, though, is that he is focussed on letters only, not other symbols such as punctuation characters, which is not ideal for our purposes.
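
    To get a rough feel for how much a letters-only table leaves out, something like the following Python sketch could be used. The helper name `symbol_share` and the sample string are hypothetical, and treating digits as non-letters and skipping whitespace are assumptions made only for this illustration.

```python
from collections import Counter


def symbol_share(text):
    """Return (share of non-letter characters, Counter of those characters).

    Assumption: whitespace is skipped; digits and punctuation both
    count as non-letters.
    """
    visible = [ch for ch in text if not ch.isspace()]
    symbols = Counter(ch for ch in visible if not ch.isalpha())
    share = sum(symbols.values()) / len(visible) if visible else 0.0
    return share, symbols


sample = "if (x_1 != y) { return f(x_1, y); }"  # made-up snippet
share, symbols = symbol_share(sample)
print(f"non-letter share: {share:.1%}")
print(symbols.most_common(3))
```

    Everything counted in `symbols` is invisible to a letters-only frequency table, which matters most for sources such as code or chat logs.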

    Using Colemak-DH with Seniply.
