
    Analysis of Colemak & Mod-DH for some European languages

    • Started by stevep99
    • 10 Replies:
    • Reputation: 117
    • From: UK
    • Registered: 14-Apr-2014
    • Posts: 978

    I recently added a new ability to my analyzer: loading a different set of monogram and bigram frequencies. This makes it possible to analyze layouts for different languages, in which the letter and bigram frequencies may differ somewhat from English. You can try this for yourself on my updated layout analyzer, but the results are also summarized here:

    Note: to generate these, I used frequency tables for the language in question, but the layout tested is just the standard Colemak(DH)/Dvorak/Qwerty layout, without the special characters or modifications (QWERTZ/AZERTY/etc) which some of these languages use. So it's not as comprehensive an analysis as it might be, but hopefully still a useful indicator.
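
    For readers wondering what the sf-bigrams column measures, here is a minimal sketch of a same-finger bigram rate in Python. It is not the analyzer's actual code: the column-to-finger assignment, the function names and the toy bigram counts are all assumptions made up for illustration, and doubled letters are assumed not to count as same-finger bigrams.

```python
# Sketch only: the finger assignment and bigram counts below are illustrative
# assumptions, not data or code from the analyzer discussed in this thread.

# Finger index for each of the 10 columns of a row (0-3 left hand, 4-7 right hand).
# The two inner columns on each side are both typed by an index finger.
FINGER_BY_COLUMN = [0, 1, 2, 3, 3, 4, 4, 5, 6, 7]

QWERTY = ["qwertyuiop", "asdfghjkl;", "zxcvbnm,./"]
COLEMAK = ["qwfpgjluy;", "arstdhneio", "zxcvbkm,./"]


def finger_map(rows):
    """Map every character of a 3x10 layout to the finger that types it."""
    return {ch: FINGER_BY_COLUMN[col] for row in rows for col, ch in enumerate(row)}


def sfb_rate(rows, bigram_counts):
    """Share of bigram occurrences whose two (different) keys use the same finger."""
    fingers = finger_map(rows)
    total = same = 0
    for bigram, count in bigram_counts.items():
        first, second = bigram
        if first not in fingers or second not in fingers:
            continue  # skip characters the layout does not contain
        total += count
        if first != second and fingers[first] == fingers[second]:
            same += count
    return same / total if total else 0.0


# Made-up counts, just to exercise the function.
toy_counts = {"ed": 10, "th": 30, "un": 5, "ll": 5}
print(f"qwerty:  {sfb_rate(QWERTY, toy_counts):.1%}")
print(f"colemak: {sfb_rate(COLEMAK, toy_counts):.1%}")
```

    Feeding in a real per-language bigram table instead of the toy counts would give numbers of the same kind as the sf-bigrams column, though not necessarily the same values, since the real analyzer may treat fingers and repeats differently.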

    Danish

    Layout      sf-bigrams  score
    colemak_dh    4.27%     1.714
    colemak       4.27%     1.750
    dvorak        4.97%     2.015
    qwerty       11.83%     2.348

    German

    Layout      sf-bigrams  score
    colemak_dh    4.98%     1.695  
    colemak       4.98%     1.759  
    dvorak        3.49%     1.900  
    qwerty        9.93%     2.385  

    French

    Layout      sf-bigrams  score
    colemak_dh    3.60%     1.630  
    colemak       3.60%     1.644  
    dvorak        3.56%     1.894  
    qwerty        9.25%     2.348  

    Spanish

    Layout      sf-bigrams  score
    colemak_dh    3.25%     1.655  
    colemak       3.25%     1.683  
    dvorak        2.48%     1.889  
    qwerty        9.65%     2.311  

    Polish

    Layout      sf-bigrams  score
    colemak_dh    3.80%     1.929  
    colemak       3.80%     1.948  
    dvorak        6.82%     2.174  
    qwerty        9.43%     2.474  

    Swedish

    Layout      sf-bigrams   score
    colemak_dh    4.13%      1.708  
    colemak       4.13%      1.730  
    dvorak        4.31%      1.983  
    qwerty       10.15%      2.316  

    On this quick test, it's interesting to note that:

    - Qwerty is by far the worst in every language :P
    - The language gaining least from Colemak is Polish. French and German gain the most - presumably because these languages are "closer" to English.
    - There is a noticeable same-finger penalty in all languages, but at least in Colemak it's always less severe than Qwerty.
    - Colemak beats Dvorak on score in each language, although interestingly, Dvorak has a lower same-finger rate for German, French and especially Spanish!
    - Mod-DH still comes out the best (of the 4 layouts tested) in each language (yay!)

    Those who know some of these languages might be able to interpret the results better than I can.

    Last edited by stevep99 (10-Nov-2017 15:19:16)

    Using Colemak-DH with Seniply.

    Offline
    • 1
    • Reputation: 214
    • From: Viken, Norway
    • Registered: 13-Dec-2006
    • Posts: 5,364

    Nice work, Steve!

    *** Learn Colemak in 2–5 steps with Tarmak! ***
    *** Check out my Big Bag of Keyboard Tricks for Win/Linux/TMK... ***

    Offline
    • 0
    • Reputation: 0
    • Registered: 11-Apr-2022
    • Posts: 2

    Spanish is very CV (single consonant + single vowel) based, so the low same-finger rate in Dvorak makes sense, since the vowels are all on the left hand. I imagine that languages such as Japanese would give similar results if they used the Latin alphabet. There is a Spanish version of Dvorak that switches the R and H, which might yield a slightly different score. This is an old post, but interesting. I was trying to find the single best layout for both English and Spanish, since I write a lot in both.

    Offline
    • 0
    • Reputation: 2
    • Registered: 04-Sep-2022
    • Posts: 14
    stevep99 said:

    I recently added a new ability to my analyzer to load in a different set of monogram and bigram frequencies. This makes it possible to analyze layouts for different languages, in which the letter and bigram frequencies may differ somewhat from English. You can try this for yourself on my updated layout analyzer, but the results are also summarized here:

    Note: to generate these, I used frequency tables for the language in question, but the layout tested is just the standard Colemak(DH)/Dvorak/Qwerty layout, without the special characters or modifications (QWERTZ/AZERTY/etc) which some of these languages use. So it's not as comprehensive an analysis as it might be, but hopefully still a useful indicator.

    Thanks for that update of the analyzer, Steve. I was looking for an alternative layout for English, German and Dutch and a bit of French. Colemak is optimized for English only. For us in Europe a more robust layout would be great. There are some full-blown optimized layouts like Bone, Koy, AdNW which can handle several languages pretty well (mostly optimized for English and German). But those layouts change the complete keyboard.

    I like Colemak's idea of reaching an 80 or 90 % optimization while staying as close as possible to qwerty, especially taking into account that one can never optimize to 100%, because the use case always differs somewhat from person to person. Even a single person will handle different tasks, and for programming a different layout might be better than for writing in a native language. Also keeping zxcv and qw, as well as having the option to learn in steps, is a plus IMO.

    While trying to create a layout with goals similar to Colemak's (and partly also Minimak's), I came up with the following layout, which is pretty robust for several languages.

    1  2  3  4  5  6  7  8  9  0  -  =
    q  w  d  f  y  k  l  o  u  p  [  ]
    a  r  t  s  g  ;  n  e  i  h  '
    z  x  c  v  b  j  m  ,  .  /

    The performance for English is a tad behind Colemak (0.4 percentage points higher SFB), but not by much. The layout also solves the problem some people have with the H in the middle position on vanilla Colemak.

    Here are the results of the analyzer (rounded to two significant figures, because I think a finer resolution would suggest a certainty the results do not have, bearing in mind that it is unclear how to weight the different parameters to best match human perception):

    Language  sf-bigrams     score     left/right
    English      2.1%         1.8        45 55
    French       2.6%         1.7        45 55
    Spanish      2.6%         1.7        47 53
    German       2.9%         1.7        46 54
    Danish       3.7%         1.8        47 53
    Finnish      3.7%         1.8        40 60
    Swedish      3.7%         1.8        50 50
    Polish       4.6%         2.0        50 50
     

    Almost all languages score better with this layout than with Colemak(-DH). The exception is Polish. I find the trade-off for English worthwhile and guess there are many people who could benefit from such a layout. I personally write roughly 40 or 50 % in English and the rest in German and Dutch.

    The layout can be adapted easily without sacrificing much performance, for example by swapping z and y (like on German keyboards). Using the German umlauts in their normal place would also be possible without a problem (ö would then move to the qwerty-h place).

    If I counted correctly there are 15 keys changed (11 change fingers and 2 change hands). That's a tiny bit less than Colemak. Also most keys which change stay relatively close to their qwerty location.
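
    The key count can be checked with a short script. This is only a sketch: it assumes the standard touch-typing column-to-finger assignment (both inner columns on the index finger), counts letters only (so the moved `;` key is ignored), and the helper names are made up for illustration.

```python
# Sketch only: "finger" is assumed to follow the standard touch-typing columns.
import string

QWERTY = ["qwertyuiop", "asdfghjkl;", "zxcvbnm,./"]
MINIMAX = ["qwdfykloup", "artsg;neih", "zxcvbjm,./"]  # layout from this post

# Finger index per column; columns 0-4 are the left hand, 5-9 the right hand.
FINGER_BY_COLUMN = [0, 1, 2, 3, 3, 4, 4, 5, 6, 7]


def positions(rows):
    """Map each character to its (row, column) position."""
    return {ch: (r, c) for r, row in enumerate(rows) for c, ch in enumerate(row)}


def compare(old_rows, new_rows):
    """Return the letters that moved, changed finger, and changed hand."""
    old, new = positions(old_rows), positions(new_rows)
    moved = [ch for ch in string.ascii_lowercase if old[ch] != new[ch]]
    new_finger = [
        ch for ch in moved
        if FINGER_BY_COLUMN[old[ch][1]] != FINGER_BY_COLUMN[new[ch][1]]
    ]
    new_hand = [ch for ch in moved if (old[ch][1] < 5) != (new[ch][1] < 5)]
    return moved, new_finger, new_hand


moved, new_finger, new_hand = compare(QWERTY, MINIMAX)
print(len(moved), "letters moved:", "".join(moved))
print(len(new_finger), "changed finger:", "".join(new_finger))
print(len(new_hand), "changed hand:", "".join(new_hand))
```

    Under these assumptions the script reports 15 moved letters, 11 of them on a different finger and 2 (e and y) on the other hand.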

    I will very likely start using that layout myself. Maybe I'll call it MiniMax - getting the most out of it with a relatively minimal amount of effort. Also giving a nod to Minimak :-)

    Oh, how does the analyzer handle special letters like the umlauts (öäü)? Are those letters skipped, taken into account when one adds them to the layout, or counted as the letters oau? The latter would not make sense unless one writes them with a dead key, which is not the case for German, but is for Dutch... ;-)

    Could you add Dutch to the analyzer as well? There are word lists available online for free and I can point you to one or send you the data.

    Best, Peter


    Edit

    The analyzer from Arne Babenhauserheide https://dariogoetz.github.io/keyboard_layout_optimizer/ uses a pretty detailed scoring system that tries to cover as many important aspects as possible. This makes it especially sensitive to the "weighting question". Nonetheless I think it can give an additional view on how different layouts perform.

    The results for some layouts are as follows, where the calculated costs (lower is better) are:


    Typical English, News.

    Layout    Cost
    ------------------
    qwertz    524
    
    Neo 2.0   419
    
    MiniMax   377
    Colemak   376
    
    Koy       336
    Bone      335

    German (mixed text)

    Layout    Cost
    ------------------
    qwertz    552
    
    Colemak   382
    Neo 2.0   377
    MiniMax   375
    
    Koy       316
    Bone      292

    It is noticeable that in that model the layouts which do not have any strong restrictions, like staying close to qwerty, are the best. The question is how much that difference is worth in daily usage and how much higher the price for learning is -- as well as whether those layouts are as robust across different use cases as the others (or even better!?)

    The Colemak / MiniMax group needs roughly 70 % of the qwerty effort, while the full-blown group (Bone, Koy) needs only about 55 % of it in that model for German. For English the difference is smaller: about 70 % versus 65 % of the qwerty effort. So for English I doubt that the difference is that noticeable; for German it might be!?
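
    The percentages above are just each cost divided by the baseline cost from the two tables. A small sketch of that arithmetic, using the numbers posted here; the function name is made up, and the baseline row in the tables is labelled qwertz:

```python
# Costs copied from the two tables in this post (lower is better).
COSTS = {
    "English": {"qwertz": 524, "Neo 2.0": 419, "MiniMax": 377,
                "Colemak": 376, "Koy": 336, "Bone": 335},
    "German": {"qwertz": 552, "Colemak": 382, "Neo 2.0": 377,
               "MiniMax": 375, "Koy": 316, "Bone": 292},
}


def relative_costs(costs, baseline="qwertz"):
    """Express every layout's cost as a fraction of the baseline layout's cost."""
    base = costs[baseline]
    return {layout: cost / base for layout, cost in costs.items()}


for language, costs in COSTS.items():
    print(language)
    for layout, ratio in relative_costs(costs).items():
        print(f"  {layout:8s} {ratio:.0%}")
```

    This only restates the model's output in relative terms; it says nothing about whether the model's weighting matches perceived effort.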

    On the other hand, assuming that the calculated costs correlate well with the perceived effort of typing a text, one could also argue that the Colemak / MiniMax layouts give similar and much lower effort than qwerty for both English and German. For German the qwerty effort is especially high, so the gain is already larger, yet interestingly German could be optimized even further than English!? I doubt that the calculation reflects reality there, because English uses fewer letters and should, on an absolute scale, be less costly to type than German. So it makes sense that the effort for German on qwerty is higher on an absolute scale, but even after optimization the cost for German text should remain higher than for English. So the model is -- with the current weighting factors -- not fully plausible to me.

    Counting SFBs and finger-travel effort is a good start for sure, but misses important parameters. So IMO we see that a model is only as good as our knowledge of the topic, which in most cases is not as good as one might first think.

    Btw this relates to many other 'hot topics' as well where we are told that the models would tell some "truth"! I won't go deeper into that off-topic, but maybe someone should think about that more often or deeper!?

    Last edited by rpnfan (04-Sep-2022 20:56:42)
    Offline
    • 0
    • Reputation: 214
    • From: Viken, Norway
    • Registered: 13-Dec-2006
    • Posts: 5,364

    That is a very tricky issue, as nearly nobody will have the same usage pattern. For most of us, English constitutes a largish percentage of our everyday typing, but what we do beyond that is highly individual. So hardly anybody needs a layout that works well for five different languages, and if people are like me they type so much English anyway that they'll want a layout that functions nearly optimally for English and okay for the other language(s). Colemak fits that bill.

    For some "second" languages it feels less nice. I know of Polish, which is to be expected since it's a Slavic language and as such quite different from the romance-germanic language group that English sits comfortably within. Dutch and Portuguese have also been mentioned from time to time. The big issue for Dutch seems to be the much higher frequency of J which is put in the corner by Colemak, a justifiable decision for English, of course. In my Dutch locale variant I try to remedy the issue by adding a mapping for Ĳ (and/or its digraph counterpart IJ). My own native language Norwegian has some minor issues such as a prevalence of the KJ bigram, but I don't mind as I just alt-finger KJ quite easily.

    The quest to find a layout that fits more languages more nicely than Colemak without sacrificing anything major for English is interesting. Somehow, I'm not sure it'll be a great success though? When it comes to what's best for a multilingual typist who's willing to go the extra mile and use something even less standard than Colemak, I suspect the optimal answer will be to use a layout generator based on a sample of their most used text. Someone who uses English and German doesn't need a layout that's good for French, etc etc. Those who are willing to leave the highway in search of min-maxing, may be most happy with a completely individualized solution. With today's tools that isn't very hard and I guess the tools will get better at it too?

    Also, while you may feel that a 0.4% higher SFB rate than Colemak isn't much, I know of many who think the opposite: that their layout should have a lower SFB% than Colemak, if anything. The Alt Keyboard Layout people are getting increasingly hungry for optimization of not only SFBs but other measures like skipgrams and different movement patterns like redirects.

    Some tools like Semilin's Genkey program aim to let people generate layouts based on their own preferences and abilities. These tools aren't user-friendly yet, but that will likely change.

    Last edited by DreymaR (06-Sep-2022 14:15:43)

    *** Learn Colemak in 2–5 steps with Tarmak! ***
    *** Check out my Big Bag of Keyboard Tricks for Win/Linux/TMK... ***

    Offline
    • 0
    • Reputation: 2
    • Registered: 04-Sep-2022
    • Posts: 14

    Hi DreymaR, thanks for the answer.

    I agree that it is a bit tricky to optimize for several languages at the same time. But it is surely possible _and_ it is of course needed IMO. Many people will write a significant amount in their native language and then in English as a second language. This is true for many Germans, and surely for many others in Europe and world-wide. Sometimes more languages will be thrown into the mix. Think of Belgium, where a large part of the population also speaks at least two languages, often three or more. The same goes for Switzerland. So I think it is a serious disadvantage that Colemak is not very robust in handling other languages reasonably well. Sure, when English is the main language and one writes only a bit in other languages, the lack of performance for non-English text might still be o.k. Sure, getting SFB down from 10% to about 5% or 4% is already an improvement.

    But I think there is a place and even a need for a more language-robust layout which should perform pretty well in English, but also not much worse in one or a few selected languages. I tried to rearrange some keys in Colemak, but you do not get very far with that. So I think a different base layout is needed. Hence my attempt to come up with an alternative layout.

    Of course SFBs are not everything, but surely a relevant part of a good or bad layout. Let's just have a quick first look at the SFBs of my sample layout compared to Colemak.

    Colemak has 1.7 for English, but 5.0 for German. So when German is as important or even more important you'll end up with an average SFB of 3.4, while with my example layout you'd get (2.1 + 2.9) / 2 = 2.5 on average. Also the languages are much closer (Delta SFB 0.8 in contrast to Delta SFB 3.3 for Colemak), which I think is something worthwhile to achieve.
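
    The same comparison can be made for any language mix by weighting each language's SFB rate by how much one types in it. A minimal sketch, using the SFB figures quoted in this post; the function names and the example weights are assumptions for illustration:

```python
# SFB rates in percent, as quoted in this post.
SFB = {
    "Colemak": {"English": 1.7, "German": 5.0},
    "MiniMax": {"English": 2.1, "German": 2.9},
}


def weighted_sfb(sfb_by_language, weights):
    """Average SFB rate, weighted by the share of typing done in each language."""
    total = sum(weights.values())
    return sum(sfb_by_language[lang] * w for lang, w in weights.items()) / total


def sfb_spread(sfb_by_language):
    """Gap between the best and the worst language for one layout."""
    values = sfb_by_language.values()
    return max(values) - min(values)


equal_use = {"English": 1, "German": 1}     # both languages typed equally often
mostly_english = {"English": 3, "German": 1}  # hypothetical 75 % English typist

for name, sfb in SFB.items():
    print(
        f"{name}: equal use {weighted_sfb(sfb, equal_use):.2f}, "
        f"mostly English {weighted_sfb(sfb, mostly_english):.2f}, "
        f"spread {sfb_spread(sfb):.1f}"
    )
```

    Changing the weights shows how quickly the comparison shifts for someone who types mostly English, which is the case DreymaR describes.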

    I had a deeper look at my example layout and looked not only at the SFB number, but also at whether those are bad SFBs (bottom-row to top-row jumps and / or on a weak finger). Again I tried to optimize for English, German and Dutch (that Spanish and French score pretty well is a nice addition here, but was not my personal focus). By rearranging a few more keys (I think I'm at 19 max now -- so similar to Colemak-DH) I got a layout which is improved in English _and_ in German. Now the difference for English is even negligible, with roughly the same SFB count. :-)

    Assuming that it's fine to post new layout ideas in a Colemak forum I'll do so. What's the best place for that topic (New layout suggestion - optimized for English and multiple languages)? BTW, the layout does not look too different from Colemak. That is to be expected, because I made the same restrictions (keep ZXCV, try not to move too many keys at all and largely avoid changing hands).

    I do not think that everybody should or would need to get out and try to create his own layout from scratch. Especially when one wants to stick to the mentioned restrictions one will not get a totally optimized layout anyways. Also not to forget that any optimization is totally dependent on the model, the parameters, assumptions and the training data. So in real life a layout will never be 100% -- it can not be. Even one person uses the computer for different tasks and would need a different optimization for those.

    Last but not least: in my experience it is not worth trying to squeeze the last few percent of optimization out of a super-duper layout when one has not taken the much more important steps first:

    * Ergonomic keyboard
    * Better solution for shortcuts, Enter, Shift (thus likely changing to a thumb keyboard, of which there are currently not many with a really good thumb cluster; Advantage and Model 01 / 100 are likely the most interesting).
    * Navigation and editing possible from the home-row (arrow keys, Ins, Del, Backspace, Home, End...)


    And after all, optimizations should be as robust as possible IMO and still work reasonably on the "damn" laptop keyboards. This is an area where Neo 2 (the established German alternative layout) fails totally, because it relies on too many keys which are not available on all keyboards. Where they are available, they are often not in the same position, and therefore not easy to learn to touch-type, or out of reach from the home-row position anyway (AltGr is such a key, which I think is not a good way to access any characters used more often than once a month).

    @Steve: It would be great if you could add more languages to the analyzer. This would make it easier to check (slight) optimizations of a layout for a target language. I would also love to see more bigrams listed than just the top 5. :-)

    Offline
    • 0
    • Reputation: 117
    • From: UK
    • Registered: 14-Apr-2014
    • Posts: 978

    To add more languages, what I need is the frequencies of each individual letter, plus those of every possible bigram pair, for each supported language.
    Kind of like these, but they only have a limited number of languages, and they don't consider spaces since they are interested in cryptographic applications rather than typing. If you know of a good source of data I could add them.
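
    To make the requirement concrete, here is a rough sketch of how such tables could be derived from any plain text sample, with spaces kept as typed characters. It is only an illustration, not the analyzer's input format: the function name and the sample sentence are made up, and how punctuation or case should be treated is left open.

```python
from collections import Counter


def ngram_frequencies(text, n):
    """Relative frequency of every n-character sequence, spaces included."""
    # Lower-case and collapse runs of whitespace to a single space.
    normalized = " ".join(text.lower().split())
    counts = Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()} if total else {}


sample = "the quick brown fox jumps over the lazy dog"
letters = ngram_frequencies(sample, 1)  # monogram table, includes " "
bigrams = ngram_frequencies(sample, 2)  # bigram table, includes e.g. "e "

# Show the five most frequent bigrams of the sample.
print(sorted(bigrams.items(), key=lambda item: -item[1])[:5])
```

    A real table would of course need a much larger and more representative corpus than one sentence.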

    Last edited by stevep99 (10-Sep-2022 17:20:54)

    Using Colemak-DH with Seniply.

    Offline
    • 0
    • Reputation: 2
    • Registered: 04-Sep-2022
    • Posts: 14

    Hi Steve,

    I found a source with that information for many languages

    https://www.sttmedia.com/wordcreator-frequencies
    https://www.sttmedia.com/characterfrequencies
    https://www.sttmedia.com/syllablefrequencies

    Dutch is there and many other languages!

    I e-mailed the author and you can use those, as long as you name the source and include a link to the homepage. It would be great if you could add more languages to your analyzer.

    The program which was used to create those is even available there as well. :-)

    Offline
    • 0
    • Reputation: 2
    • Registered: 04-Sep-2022
    • Posts: 14

    This might already be known here, but it was new to me that one can find corpora of many languages at:

    https://wortschatz.uni-leipzig.de/en/download

    With the help of the adnw optimizer (http://509.ch/opt.7z) and / or with the above mentioned programs one can extract bigrams, trigrams and so on. :)

    Offline
    • 0
    • Reputation: 2
    • Registered: 04-Sep-2022
    • Posts: 14
    stevep99 said:

    To add more languages, what I need is the frequencies of each individual letter, plus those of every possible bigram pair, for each supported language.

    Hi Steve, can you use the bigram data which I posted the links to? It would be great to add more languages to the analyzer in that way. :-)

    Offline
    • 0
    • Reputation: 117
    • From: UK
    • Registered: 14-Apr-2014
    • Posts: 978

    Sorry, must have previously missed these comments. The sttmedia one looks interesting - It's not totally clear where they got their corpus from, but they do have letter and bigram frequencies which is what I'd need. They list the most common ones on their website but I didn't see an obvious way to just download the data.

    The Leipzig one also looks promising. It's a bit clearer how they generated their results and you can download the raw data. However, as you pointed out, it's in a different format so would need a bit of processing. I haven't looked at the adnw optimizer before; alternatively, it probably wouldn't be too hard to write a script or something. I'll probably have a closer look at this one in the next few days.
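
    Such a script might look roughly like the sketch below. It assumes the downloaded sentence files contain one sentence per line in the form `id<TAB>sentence`; I have not verified this against an actual download, so treat the format, the function names and the sample lines as assumptions.

```python
from collections import Counter


def sentences(lines):
    """Yield the sentence part of 'id<TAB>sentence' lines (assumed format)."""
    for line in lines:
        _, sep, sentence = line.rstrip("\n").partition("\t")
        if sep and sentence:  # ignore lines without a tab or without text
            yield sentence.lower()


def count_bigrams(lines):
    """Count character bigrams over all sentences, one line at a time."""
    counts = Counter()
    for sentence in sentences(lines):
        padded = f" {sentence} "  # treat sentence boundaries as spaces
        counts.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return counts


# Made-up lines standing in for a real corpus file.
sample_lines = ["1\tDas ist ein Test.\n", "2\tNoch ein Satz.\n", "malformed line\n"]
print(count_bigrams(sample_lines).most_common(5))
```

    Because `count_bigrams` only iterates over lines, an open file object (for example from `open(path, encoding="utf-8")`) can be passed in directly, so a large corpus never has to be loaded into memory at once.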

    Last edited by stevep99 (31-Dec-2022 14:34:10)

    Using Colemak-DH with Seniply.

    Offline
    • 0