    Key frequency analysis, any broader DB than project Gutenberg?

    • Started by 白い熊
    • 13 Replies:
    • Reputation: 0
    • Registered: 15-Sep-2013
    • Posts: 4

    I'm doing letter frequency analysis for the purpose of improving on Colemak and developing my own keyboard layout.

    So far I've downloaded the Project Gutenberg April 2010 DVD and run an analysis on it. It contains about 30,000 books with 2.7 billion characters - a good sample.

    However, it has issues: older texts and more complex sentences, e.g. overuse of the comma as opposed to the period, and so on. Also, I haven't found any newer/bigger Project Gutenberg DVD versions.

    Is anyone aware of a more modern text / modern typing database for download / analysis? Maybe not even that broad, but with modern text / punctuation...
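For anyone wanting to reproduce this kind of count on a different corpus, here is a minimal sketch in Python. It assumes the corpus is a directory tree of plain-text files; the `read_corpus` helper, the `*.txt` pattern, and the sample strings are illustrative assumptions, not the tooling used for the numbers above.

```python
from collections import Counter
from pathlib import Path


def char_frequencies(texts):
    """Count every character across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    return counts


def read_corpus(root, pattern="*.txt"):
    """Yield the contents of each matching file under root (assumed layout)."""
    for path in sorted(Path(root).rglob(pattern)):
        # errors="replace" keeps going on files with mixed encodings
        yield path.read_text(encoding="utf-8", errors="replace")


def relative_frequencies(counts):
    """Convert raw counts to fractions of the total, most common first."""
    total = sum(counts.values())
    return [(ch, n / total) for ch, n in counts.most_common()]


if __name__ == "__main__":
    # Tiny in-memory sample; swap in read_corpus("some/dir") for real files.
    sample = ["The quick brown fox.", "Jumps over, the lazy dog."]
    for ch, share in relative_frequencies(char_frequencies(sample))[:5]:
        print(repr(ch), f"{share:.3f}")
```

Because the counter is updated one file at a time, memory use stays flat even on a multi-gigabyte corpus.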

    Offline
    • 0
    • Reputation: 0
    • Registered: 04-Apr-2013
    • Posts: 538

    mdrn typng? wot r u smkn dude

    More seriously, you don't need 2.7 billion characters of data - though I suppose it doesn't hurt.  Just get a reasonably wide selection of more-modern free ebooks (the GNU Emacs Manual, perhaps?) and your analysis would probably be nearly as good.  You can also crawl through several popular forums or reddit for genuine examples of "modern typing"; some of them have debating sections whose quality is probably as good as any book's.

    If you're interested in programmers' key analysis, that's a bit harder, since you can't exactly paste the source code and expect an accurate answer.  To try to collect at least function frequency data, I've been using a custom keyfreq.el - an ideal solution would record all the "function-strokes" in order rather than just aggregating them.

    Offline
    • 0
    • Reputation: 4
    • Registered: 08-Dec-2010
    • Posts: 656

    There are so many recent bestsellers in rapidshare and torrents. Hopefully you can convert them to text for your own non-profit use.

    As for libraries, they will not give you their contents even if they have them, for copyright reasons.

    Last edited by Tony_VN (16-Sep-2013 12:11:54)
    Offline
    • 0
    • Reputation: 1
    • From: Tampa, FL, USA
    • Registered: 24-Aug-2012
    • Posts: 24

    Which corpus you use will make a difference.  The question is what purpose your layout is for.  Is it for writers?  Because they use different vocabulary than most people, and that will show in the result.

    If you're looking to improve the layout to suit your own typing style, you're best off using a real sample of your own writing.  I did the same thing.  Using Python and an IMAP library, I connected to Gmail and downloaded the last 10 years of my sent emails.  After stripping out quotations of other people's emails and headers, I was left with a good sample of my own writing.  You may want to do the same.  Unfortunately I didn't save the code to do the download since I was doing it in a live Python interpreter session.
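Since the original session was not saved, the following is only a rough reconstruction of the idea using Python's standard `imaplib` and `email` modules, not the code that was actually run. The mailbox name `"[Gmail]/Sent Mail"`, the host you pass in, and the quote-stripping rules in `strip_quoted` are all assumptions that may need adjusting for your account and mail client.

```python
import email
import imaplib


def strip_quoted(body):
    """Drop quoted lines, 'On ... wrote:' attributions, and the signature."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if stripped == "--":          # conventional signature delimiter "-- "
            break
        if stripped.startswith(">"):  # text quoted from an earlier message
            continue
        if stripped.startswith("On ") and stripped.endswith("wrote:"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


def plain_text_parts(msg):
    """Yield decoded text/plain parts of a message, skipping attachments."""
    for part in msg.walk():
        if part.get_content_type() != "text/plain" or part.get_filename():
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            yield payload.decode(charset, errors="replace")
        except LookupError:           # unknown charset label in the header
            yield payload.decode("utf-8", errors="replace")


def fetch_sent_bodies(host, user, password, mailbox='"[Gmail]/Sent Mail"'):
    """Yield cleaned bodies of every message in the mailbox (name is assumed)."""
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(user, password)
        conn.select(mailbox, readonly=True)   # read-only: flags are untouched
        _status, data = conn.search(None, "ALL")
        for num in data[0].split():
            _status, parts = conn.fetch(num, "(RFC822)")
            msg = email.message_from_bytes(parts[0][1])
            for text in plain_text_parts(msg):
                yield strip_quoted(text)
    finally:
        conn.logout()
```

The cleaned bodies can then be fed straight into whatever frequency counter you use. The quote stripping is deliberately crude, so spot-check a few messages before trusting the totals.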

    If you're looking for someone who's already done letter frequency breakdowns for various purposes, including programming, mtgap is the best source I've found.  You may need to wander around a bit in there to find what you want.

    What metrics have you chosen as your basis for improvement?

    Minimak - Better typing without losing QWERTY
    http://www.minimak.org/

    Offline
    • 0
    • Reputation: 0
    • Registered: 15-Sep-2013
    • Posts: 4

    Yes indeed, a coding keyboard would look much different from a writer's keyboard. So you can't really have a unified keyboard. I'm designing a keyboard for general modern typing: communication, writing, short messages - that's why the word `modern'.

    Where I would like to `improve' over Colemak is a number of areas:

    - Primary finger/hand usage
    Your strongest - most coordinated fingers are the index and middle fingers. For right-handed typists, the right-hand fingers are also much more coordinated. Thus, even on the home row, these should be preferred.

    - Homerow
    My analysis basically confirms the traditionally expected: etaoin shrdlu, with an insignificantly higher frequency of `n' over `i' in Project Gutenberg. So the 10 homerow keys should be: etaoinshrd. The question is location of these keys with respect to the prior point, which I think can be improved.

    - Extended characters / punctuation
    It is apparent that some extended keys must be in a better position. Namely, Return (Enter) has a frequency ranked right after `u', putting it in 13th place. With `Shift' it gets even more interesting. Based on capital letter frequency, if there were one Shift key it would rank higher than Return, actually higher than `u'. Therefore an interesting idea: put `Return' where `t' is in QWERTY and `Shift' where `y' is, and have them be accessed not in pure touch-typing mode, but with both hands. Seems to make sense.

    This also applies to backspace.

    - Numbers
    1 and 0 have a much higher frequency than other numbers, so they should be in the middle, not towards the edges.

    Alas, extended characters are where the modern typing sample issue arises. Project Gutenberg gives a 60% higher frequency of comma than period, which is apparently determined by archaic sentence structure. This might be the case for other characters too. That's why I'd like a good modern sample.
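To make the Return, Shift, and comma/period figures concrete, here is a small Python sketch of one way they could be measured. It assumes every newline is one Return press and every capital letter costs exactly one Shift press (no Caps Lock, no held Shift across several capitals); those are simplifications, and the function names are illustrative rather than taken from the analysis described above.

```python
from collections import Counter


def keypress_counts(text):
    """Count physical key presses, folding capitals into Shift + letter."""
    presses = Counter()
    for ch in text:
        if ch == "\n":
            presses["Return"] += 1
        elif ch.isalpha() and ch.isupper():
            presses["Shift"] += 1      # assumption: one Shift press per capital
            presses[ch.lower()] += 1
        else:
            presses[ch] += 1
    return presses


def rank_of(presses, key):
    """1-based rank of key among letters, Return and Shift only."""
    tracked = {k: n for k, n in presses.items()
               if k in ("Return", "Shift") or (len(k) == 1 and k.isalpha())}
    ordered = sorted(tracked, key=tracked.get, reverse=True)
    return ordered.index(key) + 1


def comma_period_ratio(presses):
    """Commas typed per period; raises ZeroDivisionError without periods."""
    return presses[","] / presses["."]


if __name__ == "__main__":
    sample = "Hi, there.\nOk, so, yes.\n"
    presses = keypress_counts(sample)
    print("Return rank:", rank_of(presses, "Return"))
    print("comma/period:", comma_period_ratio(presses))
```

A ratio of 1.6 from `comma_period_ratio` would correspond to the "60% higher" comma frequency mentioned for Project Gutenberg; running the same function on a modern sample gives a direct comparison.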

    I'd be very interested to hear thoughts / discuss at length my reasoning here.

    Offline
    • 0
    • Reputation: 0
    • Registered: 04-Apr-2013
    • Posts: 538
    白い熊 said:

    Yes indeed, a coding keyboard would look much different from a writer's keyboard. So you can't really have a unified keyboard.

    Well, the problem is that learning two keyboards is a lot harder than learning one - and one is already rare.

    白い熊 said:

    `Shift' where `y' is

    This could lead to stretching problems (try capitalizing something on the right-side bottom row) unless "sticky keys" were used or there was a left-handed shift. 

    It seems like an interesting place to put Return (my old position was AltGr+t before it was replaced with ESC due to vim) but I sort of doubt it deserves a home-row position.  After all, it only normally gets used at most once per sentence.  I've also found that, with the wide layout (where your right fingers are home-rowed at QWERTY kl;'), shifting one key to enter enter was not a big deal - an inner shift might be considered as well.

    Having a "holistically designed" layout, rather than a modular one like the usual colemak base + modifier layer, would be interesting.

    Offline
    • 0
    • Reputation: 1
    • From: Tampa, FL, USA
    • Registered: 24-Aug-2012
    • Posts: 24

    I'm going to suggest that some of your points bear some research/thinking.  These are not intended as criticisms of your thinking.

    白い熊 said:

    - Primary finger/hand usage
    Your strongest - most coordinated fingers are the index and middle fingers.

    This is an assertion which may be true, and certainly fits my intuition.  However, I've stopped short of claiming that I know this is true for a fact.  I have not been able to find any research to confirm/reject the hypothesis that the index and middle are stronger or more coordinated than the ring finger.  Pinky is pretty obviously not.  The only physiological fact I've been able to determine is that the middle and ring fingers share a tendon at the wrist.  While I'm not in disagreement with you, it would be good to have the claim backed up by something more than anecdotal evidence.

    This ends up being possibly important because of another, provable effect, that of finger rolls.  It's easier for your fingers to activate from pinky to index than the other way around.  When it comes to the frequently used letters, there's the possibility that a ring finger placement might give more opportunities for rolls to occur in the proper direction.  This depends on whether the digraph frequency has the frequently used letter in front of the alternate letters more frequently than not.  If the ring finger is just as good as the middle finger or even just slightly worse when it comes to strength (coordination shouldn't matter as much since presumably we're talking about a home position), then it may make sense to place the frequently used key on the ring finger.
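One way to check the digraph condition described above is to count, for a candidate letter, how often it comes before versus after the letters that would sit on the fingers inward of it. The Python sketch below does that; the choice of `partners` is a hypothetical input you would derive from your own draft layout, and pairs are only counted within words.

```python
from collections import Counter


def bigram_counts(text):
    """Count adjacent letter pairs within words, case-folded."""
    counts = Counter()
    prev = None
    for ch in text.lower():
        if ch.isalpha():
            if prev is not None:
                counts[prev + ch] += 1
            prev = ch
        else:
            prev = None   # a space or punctuation mark breaks the pair
    return counts


def lead_trail(counts, letter, partners):
    """Return (times letter comes first, times it comes second) with partners."""
    lead = sum(counts[letter + p] for p in partners)
    trail = sum(counts[p + letter] for p in partners)
    return lead, trail


if __name__ == "__main__":
    counts = bigram_counts("the then other")
    # Does 't' tend to precede or follow 'h' in this tiny sample?
    print(lead_trail(counts, "t", "h"))
```

If `lead` is clearly larger than `trail` for the letters on the same hand, an outer-finger placement would produce mostly inward rolls; if not, the roll argument for that letter is weak.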

    白い熊 said:

    For right-handed typists, the right-hand fingers are also much more coordinated. Thus, even on the home row, these should be preferred.

    I doubt this is true.  Again, I've got no evidence.  However, I don't notice any perceptible difference in my left versus right-hand typing.  In any case, hand alternation is a more important metric than putting all the frequently used keys on the right hand.  Once you've grouped according to good hand alternation though, you can choose to put the group which happens to be more frequent on the right hand if you desired.
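Hand alternation is easy to measure once a layout assigns each key to a hand. The following Python sketch computes the fraction of consecutive key pairs that switch hands; the QWERTY split shown is just a familiar example, and treating spaces and unmapped characters as a reset is an assumption of this sketch, not an established convention.

```python
QWERTY_LEFT = "qwertasdfgzxcvb"
QWERTY_RIGHT = "yuiophjklnm"


def alternation_rate(text, left_keys, right_keys):
    """Fraction of consecutive key pairs typed by opposite hands."""
    hands = {k: "L" for k in left_keys}
    hands.update({k: "R" for k in right_keys})
    pairs = switches = 0
    prev = None
    for ch in text.lower():
        hand = hands.get(ch)
        if hand is None:      # space, punctuation, unmapped key: reset
            prev = None
            continue
        if prev is not None:
            pairs += 1
            switches += hand != prev
        prev = hand
    return switches / pairs if pairs else 0.0


if __name__ == "__main__":
    print(alternation_rate("the was", QWERTY_LEFT, QWERTY_RIGHT))
```

Comparing this rate between two candidate groupings of the frequent letters shows which one alternates better before deciding which group goes on the right hand.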

    白い熊 said:

    - Homerow
    My analysis basically confirms the traditionally expected: etaoin shrdlu, with an insignificantly higher frequency of `n' over `i' in Project Gutenberg. So the 10 homerow keys should be: etaoinshrd. The question is location of these keys with respect to the prior point, which I think can be improved.

    I think it's a mistake to focus on the home row without taking into account the natural flexion of the fingers.  You should take a look at the Workman layout's rationale if you haven't already.  The top row keys above the index and ring fingers are very close to the usability of the home row keys (in fact, I've seen suggestions that a better home "row" would have your index and ring rest on those keys rather than the traditional home).  If you've seen carPalx's model, which is strongly based on the home row, know that he thinks this may be a weakness of the model.

    That's all I've got for you.  Just some food for thought, I'm sure you're on a good track for your needs.

    Offline
    • 0
    • Reputation: 214
    • From: Viken, Norway
    • Registered: 13-Dec-2006
    • Posts: 5,362

    To me, the needs of the coder are different from the needs of the general typist. Navigation/editing is much more important than pure typing to a coder, so further improvements like an Extend layer are much more important than key placements. I hardly ever spew out letters in an even flow when coding – it's always a few words at a time and then movement/editing (or thinking).

    *** Learn Colemak in 2–5 steps with Tarmak! ***
    *** Check out my Big Bag of Keyboard Tricks for Win/Linux/TMK... ***

    Online
    • 0
    • Reputation: -1
    • Registered: 09-Apr-2012
    • Posts: 7

    @lilleyt
    That was a great post, and I do agree that we can't take intuition and even so-called expert research for granted. For example, many keyboard experts say that the bottom row should be avoided. That is, they have a metric that automatically groups all bottom row keys as the least desirable, so the least frequent letters would be put there. However, when I came up with my own effort model and developed a layout based on it, even putting two frequent letters on the bottom row, I found that the layout is just as efficient as other popular layouts, and in some areas exceeds them.

    Here is the effort model that I came up with based on personal experience:
    [Image: keyboard_effort.png]

    Lower values mean less effort. Actually the keys should be columnar, but I could only find a staggered template image. So that's why the values are symmetric. Nevertheless, as you can see from this chart, two keys on the bottom row actually use less effort than all the keys on the top row and even two keys on the home row. They are the index keys on the bottom row. This defies what many experts have claimed. To test this, I put two frequent letters, H and D, on these keys. I quote from my results using an online layout test tool (my layout is called Balanced):

    One thing worth mentioning is row jumping. It is almost a mantra among keyboard enthusiasts that the top row is preferable to the bottom row, and that placing less frequent keys on the bottom row will avoid row jumping. Yet the [results] here clearly bust this myth. On the Balanced layout, two of the most common letters, H and D, are intentionally placed on the bottom row. One would expect that row jumping would thus be a big problem for Balanced. However, this is not the case. Row jumping on the Balanced layout happens a mere 0.3 percentage points more often than on Dvorak and Colemak: if you type 1000 characters, row jumping happens 12 times instead of 9.

    There are other findings from this chart that may seem similarly unintuitive, such as the two side index keys on the home row having higher effort than some of the keys on the top and bottom rows. That's because the index finger is better at bending inward toward your body than stretching side to side.
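As an illustration of how an effort chart like this can be turned into numbers, here is a Python sketch that scores a layout by average effort per key press and counts row jumps. The `EFFORT` grid is made up for the example (it only mimics the pattern described above: cheap bottom-row index keys, expensive side index keys) and is not the chart from the image. The row-jump definition used here, a direct top-to-bottom or bottom-to-top change between consecutive mapped keys, is also an assumption; the online tool may define it differently.

```python
# Hypothetical effort weights, NOT the values from keyboard_effort.png.
# Rows are top, home, bottom; columns run across the ten main keys.
EFFORT = [
    [4.0, 2.5, 2.5, 3.0, 4.5, 4.5, 3.0, 2.5, 2.5, 4.0],
    [1.5, 1.0, 1.0, 1.0, 3.0, 3.0, 1.0, 1.0, 1.0, 1.5],
    [4.0, 4.0, 3.0, 2.0, 5.0, 5.0, 2.0, 3.0, 4.0, 4.0],
]

QWERTY = ("qwertyuiop", "asdfghjkl;", "zxcvbnm,./")


def key_positions(rows):
    """Map each character to its (row, column) in a 3x10 layout."""
    return {ch: (r, c) for r, row in enumerate(rows) for c, ch in enumerate(row)}


def effort_per_key(text, rows, effort=EFFORT):
    """Average effort weight per mapped key press in text."""
    pos = key_positions(rows)
    hits = [pos[ch] for ch in text.lower() if ch in pos]
    if not hits:
        return 0.0
    return sum(effort[r][c] for r, c in hits) / len(hits)


def row_jumps_per_1000(text, rows):
    """Top<->bottom changes between consecutive mapped keys, per 1000 keys."""
    pos = key_positions(rows)
    # Unmapped characters such as spaces are skipped, so pairs span them.
    hit_rows = [pos[ch][0] for ch in text.lower() if ch in pos]
    if not hit_rows:
        return 0.0
    jumps = sum(1 for a, b in zip(hit_rows, hit_rows[1:]) if abs(a - b) == 2)
    return 1000 * jumps / len(hit_rows)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    print(effort_per_key(sample, QWERTY))
    print(row_jumps_per_1000(sample, QWERTY))
```

Passing a different tuple of three ten-character rows scores another layout against the same text, which is the comparison that matters; the absolute numbers depend entirely on the chosen weights.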

    Offline
    • 0
    • Reputation: 214
    • From: Viken, Norway
    • Registered: 13-Dec-2006
    • Posts: 5,362

    Maybe it'd look clearer if you multiplied all your weightings by 4 and reported them as integers? The calculations would be the same of course.
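A quick Python sketch of why the scaling is harmless: multiplying every weight by the same factor multiplies every layout's total by that factor, so comparisons between layouts come out the same. The weights and key frequencies below are invented for the demonstration.

```python
def scaled_weights(weights, factor=4):
    """Multiply every weight by factor and return integers; fail if inexact."""
    scaled = []
    for w in weights:
        value = w * factor
        if value != int(value):
            raise ValueError(f"{w} is not a multiple of 1/{factor}")
        scaled.append(int(value))
    return scaled


def total_effort(frequencies, weights):
    """Frequency-weighted effort: sum of count x weight per key."""
    return sum(f * w for f, w in zip(frequencies, weights))


if __name__ == "__main__":
    weights = [0.75, 1.0, 0.5, 2.0]          # made-up per-key weights
    layout_a = [10, 5, 3, 1]                 # made-up key press counts
    layout_b = [5, 10, 1, 3]
    ints = scaled_weights(weights)
    print(total_effort(layout_a, weights), total_effort(layout_b, weights))
    print(total_effort(layout_a, ints), total_effort(layout_b, ints))
```

The second line of output is exactly four times the first, so whichever layout scores lower with the fractional weights also scores lower with the integer ones.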

    Those weightings look very decent to me – but then again I use a mod to preserve the left hand wrist angle and if I didn't then I'd feel that the left-hand bottom row keys were more awkward. The standard row stagger is good for the right hand but bad for the left hand.

    [Edit: Sorry, I didn't see the part where you state that you use a columnar layout. That obviously changes things for you. I'll let this point stand for general discussion purposes.]

    To me, the upper row keys WF/UY are easy. In the resting position, my hands hover on the upper edge of the home row keys so there's practically no distance to these upper row keys for the long and agile index and middle fingers. If it were me, I'd use no more than 0.75/1.0 penalties for those keys – possibly even 0.5/0.75. This is a point where individuality comes up: Resting position and finger lengths make those keys easy for me but maybe not for everyone!

    Also, my right-hand pinky doesn't mind the upper-row key all that much; I'd probably end up with a 2.0 there. Again, I feel that the stagger works for it there. Probably doesn't mean a lot though.

    The beauty of Colemak in my view is that it actually caters to nearly all of these experiences without moving too many keys around. That's quite an accomplishment.

    Last edited by DreymaR (23-Sep-2013 09:45:40)

    Online
    • 0
    • Reputation: 0
    • Registered: 15-Sep-2013
    • Posts: 4
    lalop said:
    白い熊 said:

    `Shift' where `y' is

    This could lead to stretching problems (try capitalizing something on the right-side bottom row) unless "sticky keys" were used or there was a left-handed shift.

    Yeah, true. What I mean, though, is to break the touch-typing rules and have both the left and right index fingers hit the shift key, depending on which side of the keyboard the shifted key is on. There is a medium stretch when hitting it with the left index finger, but it seems feasible to me...

    It seems like an interesting place to put Return (my old position was AltGr+t before it was replaced with ESC due to vim) but I sort of doubt it deserves a home-row position.  After all, it only normally gets used at most once per sentence.

    Well, exactly because of this I was interested in the extended character analysis. And it turns out that only 12 letters get used more often than the Return key, and most other letters much less often. So it seems to me Return deserves a comfortable position - not a pinky one. Currently I'm actually leaning towards 'V'.

    I've also found that, with the wide layout (where your right fingers are home-rowed at QWERTY kl;'), shifting one key to enter enter was not a big deal - an inner shift might be considered as well.

    Yes, my thoughts too, however the only key available then is the double-quote key, as the regular Return is way too far. And I'm just thinking that since Return is such a heavyweight, it should be operated by the strongest finger, not the weakest.

    Having a "holistically designed" layout, rather than a modular one like the usual colemak base + modifier layer, would be interesting.

    I'm currently experimenting with a third version of such a layout and am already typing on it. Once I settle on what seems a good solution, I'll follow up here to have it subjected to criticism...

    Offline
    • 0
    • Reputation: 0
    • Registered: 15-Sep-2013
    • Posts: 4
    lilleyt said:

    This is an assertion which may be true, and certainly fits my intuition.  However, I've stopped short of claiming that I know this is true for a fact.  I have not been able to find any research to confirm/reject the hypothesis that the index and middle are stronger or more coordinated than the ring finger.  Pinky is pretty obviously not.  The only physiological fact I've been able to determine is that the middle and ring fingers share a tendon at the wrist.  While I'm not in disagreement with you, it would be good to have the claim backed up by something more than anecdotal evidence.

    Yeah, I agree with you completely. I've searched for research on this topic, but haven't been able to locate any. So basically these are statements based on, let's say, educated belief - educated by the fact that I'm a very proficient typist with 25 years of intense typing.

    However, most of this reasoning I'm deducing by basically feel - meaning I postulate a thought that seems generally plausible and then spend a couple of hours typing and focusing on the thought.

    So, I'm strongly convinced that the index and middle fingers are much stronger and more coordinated than, of course, the pinky, but also the ring finger.

    This ends up being possibly important because of another, provable effect, that of finger rolls.  It's easier for your fingers to activate from pinky to index than the other way around.

    I think so too, and here again, it's only based on feel. But I think most typists will readily agree with this point. Once you try inward rolls vs outward ones, you feel the difference right away.

    When it comes to the frequently used letters, there's the possibility that a ring finger placement might give more opportunities for rolls to occur in the proper direction.  This depends on whether the digraph frequency has the frequently used letter in front of the alternate letters more frequently than not.

    I have given this a lot of thought, but when you look at digram and trigram frequencies in the English language, for instance at http://www.cryptograms.org/letter-frequencies.php
    it doesn't seem to support this. The only digram that seems frequent enough is `th', with the possible trigram `the', but everything else seems so low in frequency that it doesn't seem to warrant heavy roll consideration.
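To run the same check against your own corpus rather than a published table, a short Python sketch like the following reports the most common letter n-grams as a share of all n-grams of that length. Counting only within words and folding case are choices made for this sketch; a published table may count differently.

```python
from collections import Counter


def ngram_shares(text, n, top=5):
    """Top letter n-grams within words as (ngram, share of all n-grams)."""
    counts = Counter()
    word = []
    for ch in text.lower() + " ":       # trailing space flushes the last word
        if ch.isalpha():
            word.append(ch)
            continue
        for i in range(len(word) - n + 1):
            counts["".join(word[i:i + n])] += 1
        word = []
    total = sum(counts.values())
    return [(g, c / total) for g, c in counts.most_common(top)]


if __name__ == "__main__":
    sample = "the then other weather"
    print(ngram_shares(sample, 2))
    print(ngram_shares(sample, 3))
```

Looking at how quickly the shares fall off after the first entry gives a feel for whether rolls beyond the top digram are worth optimizing for.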

    白い熊 said:

    For right-handed typists, the right-hand fingers are also much more coordinated. Thus, even on the home row, these should be preferred.

    I doubt this is true.  Again, I've got no evidence.  However, I don't notice any perceptible difference in my left versus right-hand typing.

    I agree with you with regard to typing, but I would attribute it to mental paths developed by touch-typing practice. But again, a scientifically unsubstantiated feel would lead me to say right is generally better than left, though I wouldn't go overboard on this. Basically, when considering equivalent placements on either side of the keyboard, the higher-frequency letter should go on the right and the next in line on the left.

    In any case, hand alternation is a more important metric than putting all the frequently used keys on the right hand.  Once you've grouped according to good hand alternation though, you can choose to put the group which happens to be more frequent on the right hand if you desired.

    This seems to me to be the hardest part of an effective layout design. It may even warrant some alternation frequency analysis; I need to consider this with my test layout.

    I think it's a mistake to focus on the home row without taking into account the natural flexion of the fingers.  You should take a look at the Workman layout's rationale if you haven't already.  The top row keys above the index and ring fingers are very close to the usability of the home row keys (in fact, I've seen suggestions that a better home "row" would have your index and ring rest on those keys rather than the traditional home).

    I agree with you, these keys are very operable, much more so, it seems at times, than keys on the extreme ends of the home row. I've developed a purely numerical ranking of key operability, which I'm testing with the key layout, and it favors these keys, but maybe still not enough. I need to do more thinking on this...

    Thanks a lot for the criticism, and hope there's more...

    Offline
    • 0
    • Reputation: 4
    • Registered: 08-Dec-2010
    • Posts: 656

    Different weight models bring different layouts. Even for one fixed model there would be several thousand layouts that fit the bill.

    Although with all good rollings and key combinations, as a Colemak user I am only 98.7856235534% satisfied with Colemak, but somehow that's all right.

    Last edited by Tony_VN (24-Sep-2013 11:34:09)
    Offline
    • 0
    • Reputation: 1
    • From: Tampa, FL, USA
    • Registered: 24-Aug-2012
    • Posts: 24

    Sure thing, I'm glad you took my comments in the constructive spirit in which they were intended.  Sounds like you've got a really good grasp of what you're trying to do.

    Cheers.

    Offline
    • 0