FORMATS/IM-tut.txt

   1 /* Copyright (C) 2007
   2      National Institute of Advanced Industrial Science and Technology (AIST)
   3      Registration Number H15PRO112
   4    See the end for copying conditions.  */
   5
   6 /***
   7
   8 @page mdbTutorialIM Tutorial of input method
   9
  10 @section im-struct Structure of an input method file
  11
  12 An input method is defined in a *.mim file with this format.
  13
  14 @verbatim
  15 (input-method LANG NAME)
  16
  17 (description (_ "DESCRIPTION"))
  18
  19 (title "TITLE-STRING")
  20
  21 (map
  22   (MAP-NAME
  23     (KEYSEQ MAP-ACTION MAP-ACTION ...)        <- rule
  24     (KEYSEQ MAP-ACTION MAP-ACTION ...)        <- rule
  25     ...)
  26   (MAP-NAME
  27     (KEYSEQ MAP-ACTION MAP-ACTION ...)        <- rule
  28     (KEYSEQ MAP-ACTION MAP-ACTION ...)        <- rule
  29     ...)
  30   ...)
  31
  32 (state
  33   (STATE-NAME
  34     (MAP-NAME BRANCH-ACTION BRANCH-ACTION ...)   <- branch
  35     ...)
  36   (STATE-NAME
  37     (MAP-NAME BRANCH-ACTION BRANCH-ACTION ...)   <- branch
  38     ...)
  39   ...)
  40 @endverbatim
  41 Lowercase letters and parentheses are literals, so they must be
  42 written as they are.  Uppercase letters represent arbitrary strings.
  43
  44 KEYSEQ specifies a sequence of keys in this format:
  45 @verbatim
  46   (SYMBOLIC-KEY SYMBOLIC-KEY ...)
  47 @endverbatim
  48 where SYMBOLIC-KEY is the keysym value returned by the xev command.
  49 For instance
  50 @verbatim
  51   (n i)
  52 @endverbatim
  53 represents a key sequence of \<n\> and \<i\>.
  54 If all SYMBOLIC-KEYs are ASCII characters, you can use the short form
  55 @verbatim
  56   "ni"
  57 @endverbatim
  58 instead.  Consult @ref mdbIM for non-ASCII characters.
  59
  60 Each MAP-ACTION and BRANCH-ACTION is an action written in this format:
  61 @verbatim
  62   (ACTION ARG ARG ...)
  63 @endverbatim
  64 The most common action is <tt>insert</tt>, which is written like this:
  65 @verbatim
  66   (insert "TEXT")
  67 @endverbatim
  68 But as it is very frequently used, you can use the short form
  69 @verbatim
  70   "TEXT"
  71 @endverbatim
  72 If <tt>"TEXT"</tt> contains only one character "C", you can write it as
  73 @verbatim
  74   (insert ?C)
  75 @endverbatim
  76 or even shorter as
  77 @verbatim
  78   ?C
  79 @endverbatim
  80 So the shortest notation for an action of inserting "a" is
  81 @verbatim
  82   ?a
  83 @endverbatim
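
     Putting these short forms together: as an illustrative sketch (these
     rules are written for this explanation and are not taken from any
     shipped input method), the four rules below should all be equivalent.
     Each one inserts "A" when \<a\> is typed.

     @verbatim
       ((a) (insert "A"))   ; KEYSEQ as a list of symbolic keys, full insert action
       ("a" (insert "A"))   ; KEYSEQ in the short string form
       ("a" "A")            ; insert action in the short string form
       ("a" ?A)             ; single character in the shortest form
     @endverbatim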
  84
  85 @section im-upcase Simple example of capslock
  86
  87 Here is a simple example of an input method that works as CapsLock.
  88
  89 @verbatim
  90 (input-method en capslock)
  91 (description (_ "Upcase all lowercase letters"))
  92 (title "a->A")
  93 (map
  94   (toupper ("a" "A") ("b" "B") ("c" "C") ("d" "D") ("e" "E")
  95            ("f" "F") ("g" "G") ("h" "H") ("i" "I") ("j" "J")
  96            ("k" "K") ("l" "L") ("m" "M") ("n" "N") ("o" "O")
  97            ("p" "P") ("q" "Q") ("r" "R") ("s" "S") ("t" "T")
  98            ("u" "U") ("v" "V") ("w" "W") ("x" "X") ("y" "Y")
  99            ("z" "Z")))
 100 (state
 101   (init (toupper)))
 102 @endverbatim
 103
 104 When this input method is activated, it is in the initial condition of
 105 the first state (in this case, the only state <tt>init</tt>).  In the
 106 initial condition, no key is being processed and no action is
 107 suspended.  When the input method receives a key event \<a\>, it
 108 searches branches in the current state for a rule that matches \<a\>
 109 and finds one in the map <tt>toupper</tt>.  Then it executes MAP-ACTIONs
 110 (in this case, just inserting "A" in the preedit buffer).  After all
 111 MAP-ACTIONs have been executed, the input method shifts to the initial
 112 condition of the current state.
 113
 114 The shift to <em>the initial condition of the first state</em> has a special
 115 meaning; it commits all characters in the preedit buffer then clears
 116 the preedit buffer.
 117
 118 As a result, "A" is given to the application program.
 119
 120 When a key event does not match any rule in the current state,
 121 that event is unhandled and given back to the application program.
 122
 123 Turkish users may want to extend the above example for "İ" (U+0130:
 124 LATIN CAPITAL LETTER I WITH DOT ABOVE).  Assigning the key sequence
 125 \<i\> \<i\> to that character seems convenient, so they would add
 126 this rule to <tt>toupper</tt>.
 127
 128 @verbatim
 129     ("ii" "İ")
 130 @endverbatim
 131
 132 However, we already have the following rule:
 133
 134 @verbatim
 135     ("i" "I")
 136 @endverbatim
 137
 138 What will happen when a key event \<i\> is sent to the input method?
 139
 140 No problem.  When the input method receives \<i\>, it inserts "I" in the
 141 preedit buffer.  It knows that there is another rule that may
 142 match the additional key event \<i\>.  So, after inserting "I", it
 143 suspends the normal behavior of shifting to the initial condition, and
 144 waits for another key.  Thus, the user sees "I" with underline, which
 145 indicates it is not yet committed.
 146
 147 When the input method receives the next \<i\>, it cancels the effects
 148 done by the rule for the previous "i" (in this case, the preedit buffer is
 149 cleared), and executes MAP-ACTIONs of the rule for "ii".  So, "İ" is
 150 inserted in the preedit buffer.  This time, as there are no other rules
 151 that could match an additional key, it shifts to the initial condition
 152 of the current state, which commits "İ".
 153
 154 Then, what will happen when the next key event is \<a\> instead of \<i\>?
 155
 156 No problem, either.
 157
 158 The input method knows that there are no rules that match the \<i\> \<a\> key
 159 sequence.  So, when it receives the next \<a\>, it executes the
 160 suspended behavior (i.e. shifting to the initial condition), which
 161 commits "I".  Then the input method tries to handle \<a\> in
 162 the current state, which commits "A".
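
     The two cases above can be summarized as the following trace.  It is
     only an illustration derived from the description in the text, not
     output captured from any tool.  "preedit" is the buffer content after
     the last key of the row has been handled.

     @verbatim
       keys typed so far    preedit            committed text
       <i>                  I  (underlined)    (none yet)
       <i> <i>              (empty)            İ
       <i> <a>              (empty)            IA
     @endverbatim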
 163
 164 So far, we have explained MAP-ACTION, but not
 165 BRANCH-ACTION.  The format of BRANCH-ACTION is the same as that of MAP-ACTION.
 166 It is executed only after a matching rule has been determined and the
 167 corresponding MAP-ACTIONs have been executed.  A typical use of
 168 BRANCH-ACTION is to shift to a different state.
 169
 170 To see this effect, let us modify the current input method to upcase only
 171 word-initial letters (i.e. to capitalize).  For that purpose,
 172 we modify the <tt>init</tt> state as follows:
 173
 174 @verbatim
 175   (init
 176     (toupper (shift non-upcase)))
 177 @endverbatim
 178
 179 Here <tt>(shift non-upcase)</tt> is an action to shift to the new state
 180 <tt>non-upcase</tt>, which has two branches as below:
 181
 182 @verbatim
 183   (non-upcase
 184     (lower)
 185     (nil (shift init)))
 186 @endverbatim
 187
 188 The first branch is simple.  We can define the new map <tt>lower</tt> as
 189 follows to insert lowercase letters as they are.
 190
 191 @verbatim
 192 (map
 193   ...
 194   (lower ("a" "a") ("b" "b") ("c" "c") ("d" "d") ("e" "e")
 195          ("f" "f") ("g" "g") ("h" "h") ("i" "i") ("j" "j")
 196          ("k" "k") ("l" "l") ("m" "m") ("n" "n") ("o" "o")
 197          ("p" "p") ("q" "q") ("r" "r") ("s" "s") ("t" "t")
 198          ("u" "u") ("v" "v") ("w" "w") ("x" "x") ("y" "y")
 199          ("z" "z")))
 200 @endverbatim
 201
 202 The second branch has a special meaning.  The map name <tt>nil</tt>
 203 matches any key event that does not match any rule in the other maps of
 204 the current state.  In addition, it does not consume the key event.  Here
 205 is the full code of the new input method (its <tt>lower</tt> branch also
 206 gains a <tt>commit</tt> action); an explanation of how it works follows.
 207
 208 @verbatim
 209 (input-method en titlecase)
 210 (description (_ "Titlecase letters"))
 211 (title "abc->Abc")
 212 (map
 213   (toupper ("a" "A") ("b" "B") ("c" "C") ("d" "D") ("e" "E")
 214            ("f" "F") ("g" "G") ("h" "H") ("i" "I") ("j" "J")
 215            ("k" "K") ("l" "L") ("m" "M") ("n" "N") ("o" "O")
 216            ("p" "P") ("q" "Q") ("r" "R") ("s" "S") ("t" "T")
 217            ("u" "U") ("v" "V") ("w" "W") ("x" "X") ("y" "Y")
 218            ("z" "Z") ("ii" "İ"))
 219   (lower ("a" "a") ("b" "b") ("c" "c") ("d" "d") ("e" "e")
 220          ("f" "f") ("g" "g") ("h" "h") ("i" "i") ("j" "j")
 221          ("k" "k") ("l" "l") ("m" "m") ("n" "n") ("o" "o")
 222          ("p" "p") ("q" "q") ("r" "r") ("s" "s") ("t" "t")
 223          ("u" "u") ("v" "v") ("w" "w") ("x" "x") ("y" "y")
 224          ("z" "z")))
 225 (state
 226   (init
 227     (toupper (shift non-upcase)))
 228   (non-upcase
 229     (lower (commit))
 230     (nil (shift init))))
 231 @endverbatim
 232
 233 Let's see what happens when the user types the key sequence \<a\> \<b\> \< \>.
 234 Upon \<a\>, "A" is inserted into the buffer and the state shifts to <tt>non-upcase</tt>.
 235 So, the next \<b\> is handled in the <tt>non-upcase</tt> state.
 236 As it matches a
 237 rule in the map <tt>lower</tt>, "b" is inserted in the preedit buffer and characters in the
 238 buffer ("Ab")
 239 are committed explicitly by the "commit" command in BRANCH-ACTION.  After
 240 that, the input method is still in the <tt>non-upcase</tt> state.  So the next \< \>
 241 is also handled in <tt>non-upcase</tt>.  This time, no rule in this state
 242 matches it.  Thus the branch <tt>(nil (shift init))</tt> is selected and the
 243 state is shifted to <tt>init</tt>.  Please note that \< \> is not yet
 244 handled because the map <tt>nil</tt> does not consume any key event.
 245 So, the input method tries to handle it in the <tt>init</tt> state.  Again no
 246 rule matches it.  Therefore, that event is given back to the application
 247 program, which usually inserts a space for that.
 248
 249 When you type "a quick blown fox" with this input method, you get "A
 250 Quick Blown Fox".  OK, you find a typo in "blown", which should be
 251 "brown".  To correct it, you probably move the cursor after "l" and type
 252 \<BackSpace\> and \<r\>.  However, if the current input method is still
 253 active, a capital "R" is inserted.  This is not the behavior we
 254 want.
 255
 256 @section im-surrounding-text Example of utilizing surrounding text support
 257
 258 To make the input method work well also in such a case, we must use
 259 "surrounding text support".  It is a way to check characters around
 260 the inputting spot and delete them if necessary.  Note that
 261 this facility is available only with Gtk+ applications and Qt
 262 applications.  You cannot use it with applications that use XIM
 263 to communicate with an input method.
 264
 265 Before explaining how to utilize "surrounding text support", you must
 266 understand how to use variables, arithmetic comparisons, and
 267 conditional actions.
 268
 269 First, any symbol (except for several reserved ones) used as ARG
 270 of an action is treated as a variable.  For instance, the commands
 271
 272 @verbatim
 273   (set X 32) (insert X)
 274 @endverbatim
 275
 276 set the variable <tt>X</tt> to integer value 32, then insert a character
 277 whose Unicode character code is 32 (i.e. SPACE).
 278
 279 The second argument of the <tt>set</tt> action can be an expression of this form:
 280
 281 @verbatim
 282   (OPERATOR ARG1 [ARG2])
 283 @endverbatim
 284
 285 Both ARG1 and ARG2 can be an expression.  So,
 286
 287 @verbatim
 288   (set X (+ (* Y 32) Z))
 289 @endverbatim
 290
 291 sets <tt>X</tt> to the value of <tt>Y * 32 + Z</tt>.
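
     As a small worked example (the variable name Y is arbitrary), a
     character literal such as ?A can be used wherever an integer is
     expected, because it stands for that character's code:

     @verbatim
       (set Y (+ ?A 1))   ; ?A is 65, so Y becomes 66
       (insert Y)         ; inserts the character whose code is 66, i.e. "B"
     @endverbatim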
 292
 293 We have the following arithmetic/bitwise OPERATORs (require two arguments):
 294
 295 @verbatim
 296   + - * / & |
 297 @endverbatim
 298
 299 these relational OPERATORs (require two arguments):
 300
 301 @verbatim
 302   = <= >= < >
 303 @endverbatim
 304
 305 and this logical OPERATOR (requires one argument):
 306
 307 @verbatim
 308   !
 309 @endverbatim
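
     As a sketch of how these operators combine (the variable names X, D,
     and N are arbitrary), the actions below set D to nonzero when X holds
     the code of an ASCII digit, and N to nonzero otherwise.  This assumes
     that a relational operator yields 1 when it holds and 0 when it does
     not, which is what lets the bitwise operator act as a logical AND.

     @verbatim
       (set D (& (>= X ?0) (<= X ?9)))   ; nonzero if X is in "0".."9"
       (set N (! D))                     ; nonzero if X is not a digit
     @endverbatim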
 310
 311 For surrounding text support, we have these reserved variables:
 312
 313 @verbatim
 314   @-0, @-N, @+N (N is a positive integer)
 315 @endverbatim
 316
 317 Their values are predefined as below and cannot be altered.
 318
 319 <ul>
 320 <li> <tt>@-0</tt>
 321
 322 -1 if surrounding text is supported, -2 if not.
 323
 324 <li> <tt>@-N</tt>
 325
 326 The Nth previous character in the preedit buffer.  If there are only M
 327 (M<N) previous characters in it, the value is the (N-M)th previous
 328 character from the inputting spot.
 329
 330 <li> <tt>@+N</tt>
 331
 332 The Nth following character in the preedit buffer.  If there are only M
 333 (M<N) following characters in it, the value is the (N-M)th following
 334 character from the inputting spot.
 335
 336 </ul>
 337
 338 So, provided that you have this context:
 339
 340 @verbatim
 341   ABC|def|GHI
 342 @endverbatim
 343
 344 ("def" is in the preedit buffer, two "|"s indicate borders between the
 345 preedit buffer and the surrounding text) and your current position in
 346 the preedit buffer is between "d" and "e", you get these values:
 347
 348 @verbatim
 349   @-3 -- ?B
 350   @-2 -- ?C
 351   @-1 -- ?d
 352   @+1 -- ?e
 353   @+2 -- ?f
 354   @+3 -- ?G
 355 @endverbatim
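
     As a further worked example derived from the same rules, suppose the
     context is unchanged but the current position is at the end of the
     preedit buffer (after "f").  All three preedit characters then precede
     the position and none follows it, so the values should be:

     @verbatim
       @-1 -- ?f
       @-3 -- ?d
       @-4 -- ?C
       @+1 -- ?G
       @+2 -- ?H
     @endverbatim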
 356
 357 Next, you have to understand the conditional action of this form:
 358
 359 @verbatim
 360   (cond
 361     (EXPR1 ACTION ACTION ...)
 362     (EXPR2 ACTION ACTION ...)
 363     ...)
 364 @endverbatim
 365
 366 where EXPRn are expressions.  When an input method executes this
 367 action, it evaluates EXPRn one by one, starting from the first branch.
 368 At the first EXPRn whose value is nonzero, the corresponding actions
 369 are executed and the remaining branches are skipped.
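
     A minimal sketch of a conditional action follows.  The variable X and
     the inserted strings are made up for illustration.

     @verbatim
       (cond
         ((< X 10)  (insert "small"))    ; taken when X is less than 10
         ((< X 100) (insert "medium"))   ; taken when X is 10..99
         (1         (insert "large")))   ; constant nonzero: taken otherwise
     @endverbatim

     Assuming that evaluation stops at the first branch whose expression is
     nonzero, exactly one of the three strings is inserted.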
 370
 371 Now you are ready to write a new version of the input method "Titlecase".
 372
 373 @verbatim
 374 (input-method en titlecase2)
 375 (description (_ "Titlecase letters"))
 376 (title "abc->Abc")
 377 (map
 378   (toupper ("a" "A") ("b" "B") ("c" "C") ("d" "D") ("e" "E")
 379            ("f" "F") ("g" "G") ("h" "H") ("i" "I") ("j" "J")
 380            ("k" "K") ("l" "L") ("m" "M") ("n" "N") ("o" "O")
 381            ("p" "P") ("q" "Q") ("r" "R") ("s" "S") ("t" "T")
 382            ("u" "U") ("v" "V") ("w" "W") ("x" "X") ("y" "Y")
 383            ("z" "Z") ("ii" "İ")))
 384 (state
 385   (init
 386     (toupper
 387
 388      ;; Now we have exactly one uppercase character in the preedit
 389      ;; buffer.  So, "@-2" is the character just before the inputting
 390      ;; spot.
 391
 392      (cond ((| (& (>= @-2 ?A) (<= @-2 ?Z))
 393                (& (>= @-2 ?a) (<= @-2 ?z))
 394                (= @-2 ?İ))
 395
 396             ;; If the character before the inputting spot is A..Z,
 397             ;; a..z, or İ, remember the only character in the preedit
 398             ;; buffer in the variable X and delete it.
 399
 400             (set X @-1) (delete @-)
 401
 402             ;; Then insert the lowercase version of X.
 403
 404             (cond ((= X ?İ) "i")
 405                   (1 (set X (+ X 32)) (insert X))))))))
 406 @endverbatim
 407
 408 The above example contains the new action <tt>delete</tt>.  So, it is time
 409 to explain more about the preedit buffer.  The preedit buffer is a
 410 temporary place to store a sequence of characters.  In this buffer,
 411 the input method keeps a position called the "current position".  The
 412 current position exists between two characters, at the beginning of
 413 the buffer, or at the end of the buffer.  The <tt>insert</tt> action inserts
 414 characters before the current position.  For instance, when your
 415 preedit buffer contains "ab.c" ("." indicates the current position),
 416
 417 @verbatim
 418   (insert "xyz")
 419 @endverbatim
 420
 421 changes the buffer to "abxyz.c".
 422
 423 There are several predefined variables that represent a specific position in the
 424 preedit buffer.  They are:
 425
 426 <ul>
 427 <li> <tt>@@<, @@=, @@></tt>
 428
 429 The first, current, and last positions.
 430
 431 <li> <tt>@@-, @@+</tt>
 432
 433 The previous and the next positions.
 434 </ul>
 435
 436 The format of the <tt>delete</tt> action is this:
 437
 438 @verbatim
 439   (delete POS)
 440 @endverbatim
 441
 442 where POS is a predefined positional variable.
 443 The above action deletes the characters between POS and
 444 the current position.  So, <tt>(delete @-)</tt> deletes one character before
 445 the current position.  Other examples of <tt>delete</tt> include the following:
 446
 447 @verbatim
 448   (delete @+)  ; delete the next character
 449   (delete @<)  ; delete all the preceding characters in the buffer
 450   (delete @>)  ; delete all the following characters in the buffer
 451 @endverbatim
 452
 453 You can change the current position using the <tt>move</tt> action as below:
 454
 455 @verbatim
 456   (move @-)  ; move the current position to the position before the
 457              ; previous character
 458   (move @<)  ; move to the first position
 459 @endverbatim
 460
 461 Other positional variables work similarly.
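
     To illustrate, here is what each action would do when applied on its
     own to the buffer "ab.c" from the earlier example ("." again marks the
     current position).  The results are derived from the descriptions
     above rather than captured from a running input method.

     @verbatim
       (delete @-)  ; "ab.c" -> "a.c"
       (delete @+)  ; "ab.c" -> "ab."
       (delete @<)  ; "ab.c" -> ".c"
       (delete @>)  ; "ab.c" -> "ab."
       (move @-)    ; "ab.c" -> "a.bc"
       (move @+)    ; "ab.c" -> "abc."
       (move @<)    ; "ab.c" -> ".abc"
       (move @>)    ; "ab.c" -> "abc."
     @endverbatim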
 462
 463 Let's see how our new example works.  Whatever the key event is, the
 464 input method is in its only state, <tt>init</tt>.  Since a lowercase letter
 465 key is first handled by MAP-ACTIONs, every such key is changed into the
 466 corresponding uppercase letter and put into the preedit buffer.  Now this
 467 character can be accessed with <tt>@-1</tt>.
 468
 469 How can we tell whether the new character should be lowercase or
 470 uppercase?  We can do so by checking the character before it, i.e.
 471 <tt>@-2</tt>.  BRANCH-ACTIONs in the <tt>init</tt> state do the job.
 472
 473 They first check whether the character <tt>@-2</tt> is between A and Z,
 474 between a and z, or İ, using the conditional below.
 475
 476 @verbatim
 477      (cond ((| (& (>= @-2 ?A) (<= @-2 ?Z))
 478                (& (>= @-2 ?a) (<= @-2 ?z))
 479                (= @-2 ?İ))
 480 @endverbatim
 481
 482 If not, nothing special needs to be done.  If so, our new key should
 483 be changed back into lowercase.  Since the uppercase character is
 484 already in the preedit buffer, we retrieve and remember it in the
 485 variable <tt>X</tt> by
 486
 487 @verbatim
 488     (set X @-1)
 489 @endverbatim
 490
 491 and then delete that character by
 492
 493 @verbatim
 494     (delete @-)
 495 @endverbatim
 496
 497 Lastly we re-insert the character in its lowercase form.  The
 498 problem here is that "İ" must be changed into "i", so we need another
 499 conditional.  The first branch
 500
 501 @verbatim
 502     ((= X ?İ) "i")
 503 @endverbatim
 504
 505 means that "if the character remembered in X is 'İ', 'i' is inserted".
 506
 507 The second branch
 508
 509 @verbatim
 510     (1 (set X (+ X 32)) (insert X))
 511 @endverbatim
 512
 513 starts with "1", which is always resolved into nonzero, so this branch
 514 is a catchall.  Actions in this branch increase <tt>X</tt> by 32, then
 515 insert <tt>X</tt>.  In other words, they change A...Z into a...z
 516 respectively and insert the resulting lowercase character into the
 517 preedit buffer.  As the input method reaches the end of the
 518 BRANCH-ACTIONs, the character is committed.
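
     The arithmetic behind the two branches can be checked by hand.  Each
     uppercase letter A..Z has a code exactly 32 below that of its lowercase
     counterpart, but the same offset does not work for "İ":

     @verbatim
       ?A =  65 (U+0041),   65 + 32 =  97 (U+0061) = ?a
       ?Z =  90 (U+005A),   90 + 32 = 122 (U+007A) = ?z
       ?İ = 304 (U+0130),  304 + 32 = 336 (U+0150) = ?Ő, not ?i (U+0069)
     @endverbatim

     This is why the first branch handles "İ" separately.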
 519
 520 This new input method always checks the character before the current
 521 position, so "A Quick Blown Fox" will be successfully fixed to "A
 522 Quick Brown Fox" by the key sequence \<BackSpace\> \<r\>.
 523
 524
 525 */
 526
 527 /*
 528 Copyright (C) 2007
 529   National Institute of Advanced Industrial Science and Technology (AIST)
 530   Registration Number H15PRO112
 531
 532 This file is part of the m17n database; a sub-part of the m17n
 533 library.
 534
 535 The m17n library is free software; you can redistribute it and/or
 536 modify it under the terms of the GNU Lesser General Public License
 537 as published by the Free Software Foundation; either version 2.1 of
 538 the License, or (at your option) any later version.
 539
 540 The m17n library is distributed in the hope that it will be useful,
 541 but WITHOUT ANY WARRANTY; without even the implied warranty of
 542 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 543 Lesser General Public License for more details.
 544
 545 You should have received a copy of the GNU Lesser General Public
 546 License along with the m17n library; if not, write to the Free
 547 Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
 548 Boston, MA 02110-1301, USA.
 549 */