    English Letter Freq. Counts by Peter Norvig (researcher at Google)

    • Started by Chris
    • 6 Replies:
    • Reputation: 0
    • Registered: 08-Nov-2012
    • Posts: 2

    An interesting read for those of you looking at designing your own custom layouts, or if you just like big data analysis:

    English Letter Frequency Counts: Mayzner Revisited by Peter Norvig (a Director of Research at Google Inc.), published in January 2013

    --Chris

    Topics: bigrams, n-grams, letter frequency, word counts/lengths

    • Reputation: 214
    • From: Viken, Norway
    • Registered: 13-Dec-2006
    • Posts: 5,361

    Indeed, that reference is both thorough and well presented.

    *** Learn Colemak in 2–5 steps with Tarmak! ***
    *** Check out my Big Bag of Keyboard Tricks for Win/Linux/TMK... ***

    • Shai
    • Administrator
    • Reputation: 36
    • Registered: 11-Dec-2005
    • Posts: 423

    This refutes a common criticism of Colemak that H deserves to be in the home position because 'the' is the most common word in English or because of "ETAOIN SHRDLU".

    • Reputation: 7
    • Registered: 21-Apr-2010
    • Posts: 818

    Oooh, the n-gram table would make for a good workout.

    --
    Physicians deafen our ears with the Honorificabilitudinitatibus of their heavenly Panacaea, their sovereign Guiacum.

    • Reputation: 117
    • From: UK
    • Registered: 14-Apr-2014
    • Posts: 978
    Shai said:

    This refutes a common criticism of Colemak that H deserves to be in the home position because 'the' is the most common word in English or because of "ETAOIN SHRDLU"

    Not such good news for Workman then!

    The order I have generated, from the same list of out-of-copyright books the carpalx guy uses, is this (including symbols):

    E T A O I N H S R D L U M W C F Y G , P B . V K ' ; ? J X Q : Z

    Peter Norvig's frequency table looks interesting, as it seems to be based on such a wide selection of material. I would be interested in trying it out, but the problem is that it doesn't include common non-alphabetic symbols. I would at least want to simulate the main punctuation keys, e.g. comma, dot, semicolon/colon.
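
    For anyone wanting to reproduce an ordering like the one above on their own corpus, here is a minimal Python sketch. It is not the script used for this post: the function name `frequency_order` and the choice of punctuation in `SYMBOLS` are assumptions made for illustration.

```python
from collections import Counter

# Punctuation to count alongside letters. This particular set is an
# assumption for illustration; adjust it to the keys you want to simulate.
SYMBOLS = set(",.';?:")

def frequency_order(text):
    """Return letters and selected symbols, most frequent first."""
    counts = Counter(
        ch for ch in text.upper()
        if (ch.isascii() and ch.isalpha()) or ch in SYMBOLS
    )
    # most_common() sorts by descending count; ties keep first-seen order.
    return [ch for ch, _ in counts.most_common()]
```

    Calling `" ".join(frequency_order(corpus_text))` on the full text of a corpus would give a single line in the same style as the ordering above.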

    Last edited by stevep99 (30-Jul-2015 14:14:26)

    Using Colemak-DH with Seniply.

    • Reputation: 214
    • From: Viken, Norway
    • Registered: 13-Dec-2006
    • Posts: 5,361

    The home position alone is no guarantee for a good THE experience anyway. The TH and HE bigrams are important of course, but this can be solved in various ways. I really like these bi- and trigrams in Colemak (especially with the DH-mod).

    • Reputation: 117
    • From: UK
    • Registered: 14-Apr-2014
    • Posts: 978

    Indeed, regardless of whether R or H is slightly more frequent, they are pretty close and only one of them can get a place in the top 8. I instinctively feel Colemak's choice of having R on the home row is correct. But the main thing is, whichever one loses out needs to get a next-best slot, which of course is the point of Mod DH.

    I started looking at the Peter Norvig bigrams data in a bit more detail, comparing it to the data I have been using previously (which I'll call the carpalx data). Interestingly, the ranking of same-finger bigrams by frequency is also noticeably different, although the overall totals are pretty much the same. I have excluded punctuation bigrams because the Norvig data doesn't have them.

    With Norvig bigram frequencies, I find the most frequent Colemak same-finger bigrams are:
    SC 0.1547%
    UE 0.1475%
    PT 0.1058%
    NL 0.0638%
    NK 0.0516%
    KN 0.0514%
    EU 0.0312%
    DG 0.0310%
    WR 0.0308%
    YI 0.0288%

    Whereas the carpalx data gives:
    KN 0.1124%
    UE 0.1110%
    SC 0.1025%
    NK 0.0941%
    NL 0.0840%
    PT 0.0727%
    LK 0.0397%
    YI 0.0393%
    LM 0.0366%
    WR 0.0329%
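
    A rough Python sketch of how same-finger bigram lists like the two above could be computed. It assumes standard touch-typing finger assignments on Colemak (letter keys only) and a dict of bigram counts as input; the names `COLEMAK_FINGERS` and `same_finger_bigrams` are hypothetical. Repeated letters such as LL are skipped, which seems consistent with the lists above but is also an assumption.

```python
# Colemak letter keys grouped by finger, assuming standard touch-typing
# fingering (the post does not state which assignment was used).
COLEMAK_FINGERS = {
    "L-pinky": "QAZ",
    "L-ring": "WRX",
    "L-middle": "FSC",
    "L-index": "PGTDVB",
    "R-index": "JLHNKM",
    "R-middle": "UE",
    "R-ring": "YI",
    "R-pinky": "O",
}
KEY_TO_FINGER = {
    key: finger for finger, keys in COLEMAK_FINGERS.items() for key in keys
}

def same_finger_bigrams(bigram_counts):
    """Return (bigram, percent of all bigrams) pairs, most frequent first."""
    total = sum(bigram_counts.values())
    rows = []
    for bigram, count in bigram_counts.items():
        first, second = bigram[0].upper(), bigram[1].upper()
        if first == second:
            continue  # repeated letter: same key, not counted here
        finger = KEY_TO_FINGER.get(first)
        if finger is not None and finger == KEY_TO_FINGER.get(second):
            rows.append((bigram, 100.0 * count / total))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

    The percentages are relative to the sum of all bigram counts passed in, so the input should be the full bigram table rather than a pre-filtered subset.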

    The hand balance is also a little different (letter keys only):
    Norvig: L: 48.34%      R: 51.66%
    carpalx: L: 46.96%      R: 53.04%
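
    The hand balance can be sketched in the same way. This assumes the usual left/right split of the Colemak letter keys and takes a dict of single-letter counts; `hand_balance` is a hypothetical name, not a function from either data source.

```python
# Colemak letter keys by hand (letters only, usual left/right split assumed).
LEFT_HAND = set("QWFPGARSTDZXCVB")
RIGHT_HAND = set("JLUYHNEIOKM")

def hand_balance(letter_counts):
    """Return (left %, right %) of letter keystrokes for Colemak."""
    left = sum(n for ch, n in letter_counts.items() if ch.upper() in LEFT_HAND)
    right = sum(n for ch, n in letter_counts.items() if ch.upper() in RIGHT_HAND)
    total = left + right
    return 100.0 * left / total, 100.0 * right / total
```

    Anything that is not a letter is ignored, which matches the "letter keys only" restriction above.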

    Not sure there is any major conclusion to be drawn, but thought I'd post the info anyway.

    Last edited by stevep99 (31-Jul-2015 14:20:21)
