FORMATS/IM-tut.txt

   1 /* Copyright (C) 2007
   2      National Institute of Advanced Industrial Science and Technology (AIST)
   3      Registration Number H15PRO112
   4    See the end for copying conditions.  */
   5
   6 /***
   7
   8 @htmlonly
   9 <style type="text/css">
  10 <!--
  11 .red { color:red }
  12 -->
     </style>
  13 @endhtmlonly
  14
  15 @page mdbTutorialIM Tutorial of input method
  16
  17 @section im-struct Structure of an input method file
  18
  19 An input method is defined in a *.mim file with this format.
  20
  21 @verbatim
  22 (input-method LANG NAME)
  23
  24 (description (_ "DESCRIPTION"))
  25
  26 (title "TITLE-STRING")
  27
  28 (map
  29   (MAP-NAME
  30     (KEYSEQ MAP-ACTION MAP-ACTION ...)        <- rule
  31     (KEYSEQ MAP-ACTION MAP-ACTION ...)        <- rule
  32     ...)
  33   (MAP-NAME
  34     (KEYSEQ MAP-ACTION MAP-ACTION ...)        <- rule
  35     (KEYSEQ MAP-ACTION MAP-ACTION ...)        <- rule
  36     ...)
  37   ...)
  38
  39 (state
  40   (STATE-NAME
  41     (MAP-NAME BRANCH-ACTION BRANCH-ACTION ...)   <- branch
  42     ...)
  43   (STATE-NAME
  44     (MAP-NAME BRANCH-ACTION BRANCH-ACTION ...)   <- branch
  45     ...)
  46   ...)
  47 @endverbatim
  48 Lowercase letters and parentheses are literals, so they must be
  49 written as they are.  Uppercase letters represent arbitrary strings.
  50
  51 KEYSEQ specifies a sequence of keys in this format:
  52 @verbatim
  53   (SYMBOLIC-KEY SYMBOLIC-KEY ...)
  54 @endverbatim
  55 where SYMBOLIC-KEY is the keysym name reported by the xev command.
  56 For instance
  57 @verbatim
  58   (n i)
  59 @endverbatim
  60 represents a key sequence of <<n>> and <<i>>.
  61 If all SYMBOLIC-KEYs are ASCII characters, you can use the short form
  62 @verbatim
  63   "ni"
  64 @endverbatim
  65 instead.  Consult #mdbIM for non-ASCII characters.
  66
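     As a sketch of how a KEYSEQ appears inside a rule, the two rules
     below spell the same key sequence in the long and the short form.
     The inserted text "N" is a placeholder chosen for this illustration,
     and a real map would contain only one of the two rules.

     @verbatim
       ((n i) "N")    ; KEYSEQ written as a list of SYMBOLIC-KEYs
       ("ni" "N")     ; the same KEYSEQ in the short form
     @endverbatim
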
  67 Both MAP-ACTION and BRANCH-ACTION are actions written in this format:
  68 @verbatim
  69   (ACTION ARG ARG ...)
  70 @endverbatim
  71 The most common action is [[insert]], which is written like this:
  72 @verbatim
  73   (insert "TEXT")
  74 @endverbatim
  75 But as it is very frequently used, you can use the short form
  76 @verbatim
  77   "TEXT"
  78 @endverbatim
  79 If [["TEXT"]] contains only one character "C", you can write it as
  80 @verbatim
  81   (insert ?C)
  82 @endverbatim
  83 or even shorter as
  84 @verbatim
  85   ?C
  86 @endverbatim
  87 So the shortest notation for an action of inserting "a" is
  88 @verbatim
  89   ?a
  90 @endverbatim
  91
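     To put these notations side by side, the following four rules are
     alternative spellings of one rule that inserts "A" when <<a>> is
     typed.  This is only a sketch for comparison; a real map would
     contain just one of them.

     @verbatim
       ("a" (insert "A"))   ; full form
       ("a" "A")            ; short form of (insert "A")
       ("a" (insert ?A))    ; single character written as ?C
       ("a" ?A)             ; shortest form
     @endverbatim
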
  92 @section im-upcase Simple example of capslock
  93
  94 Here is a simple example of an input method that works as CapsLock.
  95
  96 @verbatim
  97 (input-method en capslock)
  98 (description (_ "Upcase all lowercase letters"))
  99 (title "a->A")
 100 (map
 101   (toupper ("a" "A") ("b" "B") ("c" "C") ("d" "D") ("e" "E")
 102            ("f" "F") ("g" "G") ("h" "H") ("i" "I") ("j" "J")
 103            ("k" "K") ("l" "L") ("m" "M") ("n" "N") ("o" "O")
 104            ("p" "P") ("q" "Q") ("r" "R") ("s" "S") ("t" "T")
 105            ("u" "U") ("v" "V") ("w" "W") ("x" "X") ("y" "Y")
 106            ("z" "Z")))
 107 (state
 108   (init (toupper)))
 109 @endverbatim
 110
 111 When this input method is activated, it is in the initial condition of
 112 the first state (in this case, the only state [[init]]).  In the
 113 initial condition, no key is being processed and no action is
 114 suspended.  When the input method receives a key event <<a>>, it
 115 searches branches in the current state for a rule that matches <<a>>
 116 and finds one in the map [[toupper]].  Then it executes MAP-ACTIONs
 117 (in this case, just inserting "A" in the preedit buffer).  After all
 118 MAP-ACTIONs have been executed, the input method shifts to the initial
 119 condition of the current state.
 120
 121 The shift to <em>the initial condition of the first state</em> has a special
 122 meaning; it commits all characters in the preedit buffer then clears
 123 the preedit buffer.
 124
 125 As a result, "A" is given to the application program.
 126
 127 When a key event does not match any rule in the current state,
 128 that event is unhandled and given back to the application program.
 129
 130 Turkish users may want to extend the above example for "İ" (U+0130:
 131 LATIN CAPITAL LETTER I WITH DOT ABOVE).  Assigning the key sequence
 132 <<i>> <<i>> to that character seems convenient.  So, they would add
 133 this rule to [[toupper]].
 134
 135 @verbatim
 136     ("ii" "İ")
 137 @endverbatim
 138
 139 However, we already have the following rule:
 140
 141 @verbatim
 142     ("i" "I")
 143 @endverbatim
 144
 145 What will happen when a key event <<i>> is sent to the input method?
 146
 147 No problem.  When the input method receives <<i>>, it inserts "I" in the
 148 preedit buffer.  It knows that there is another rule that may
 149 match the additional key event <<i>>.  So, after inserting "I", it
 150 suspends the normal behavior of shifting to the initial condition, and
 151 waits for another key.  Thus, the user sees "I" with underline, which
 152 indicates it is not yet committed.
 153
 154 When the input method receives the next <<i>>, it cancels the effects
 155 of the rule for the previous "i" (in this case, the preedit buffer is
 156 cleared), and executes the MAP-ACTIONs of the rule for "ii".  So, "İ" is
 157 inserted in the preedit buffer.  This time, as there are no other rules
 158 that could match an additional key, it shifts to the initial condition
 159 of the current state, which commits "İ".
 160
 161 Then, what will happen when the next key event is <<a>> instead of <<i>>?
 162
 163 No problem, either.
 164
 165 The input method knows that there are no rules that match the <<i>> <<a>> key
 166 sequence.  So, when it receives the next <<a>>, it executes the
 167 suspended behavior (i.e. shifting to the initial condition), which
 168 commits "I".  Then the input method tries to handle <<a>> in
 169 the current state, which commits "A".
 170
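     The two cases above can be summarized as comments on the two rules
     involved.  This is a sketch that restates the explanation in this
     section; it is not output produced by the library.

     @verbatim
       ("i" "I")     ; <<i>> alone: "I" is shown underlined and the
                     ; shift to the initial condition is suspended
       ("ii" "İ")    ; <<i>> <<i>>: "I" is withdrawn, "İ" is inserted
                     ; and then committed
       ;; <<i>> <<a>>: no rule matches, so the suspended shift runs
       ;; and commits "I"; <<a>> is then handled afresh and commits "A"
     @endverbatim
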
 171 So far, we have explained MAP-ACTION, but not
 172 BRANCH-ACTION.  The format of BRANCH-ACTION is the same as that of MAP-ACTION.
 173 It is executed only after a matching rule has been determined and the
 174 corresponding MAP-ACTIONs have been executed.  A typical use of
 175 BRANCH-ACTION is to shift to a different state.
 176
 177 To see this effect, let us modify the current input method to upcase only
 178 word-initial letters (i.e. to capitalize).  For that purpose,
 179 we modify the [[init]] state as follows:
 180
 181 @verbatim
 182   (init
 183     (toupper (shift non-upcase)))
 184 @endverbatim
 185
 186 Here [[(shift non-upcase)]] is an action to shift to the new state
 187 [[non-upcase]], which has two branches as below:
 188
 189 @verbatim
 190   (non-upcase
 191     (lower (commit))
 192     (nil (shift init)))
 193 @endverbatim
 194
 195 The first branch is simple.  We can define the new map [[lower]] as
 196 follows to insert lowercase letters as they are.
 197
 198 @verbatim
 199 (map
 200   ...
 201   (lower ("a" "a") ("b" "b") ("c" "c") ("d" "d") ("e" "e")
 202          ("f" "f") ("g" "g") ("h" "h") ("i" "i") ("j" "j")
 203          ("k" "k") ("l" "l") ("m" "m") ("n" "n") ("o" "o")
 204          ("p" "p") ("q" "q") ("r" "r") ("s" "s") ("t" "t")
 205          ("u" "u") ("v" "v") ("w" "w") ("x" "x") ("y" "y")
 206          ("z" "z")))
 207 @endverbatim
 208
 209 The second branch has a special meaning.  The map name [[nil]] means
 210 that it matches any key event that does not match any rule in the
 211 other maps in the current state.  In addition, it does not
 212 consume any key event.  We will show the full code of the new input
 213 method before explaining how it works.
 214
 215 @verbatim
 216 (input-method en titlecase)
 217 (description (_ "Titlecase letters"))
 218 (title "abc->Abc")
 219 (map
 220   (toupper ("a" "A") ("b" "B") ("c" "C") ("d" "D") ("e" "E")
 221            ("f" "F") ("g" "G") ("h" "H") ("i" "I") ("j" "J")
 222            ("k" "K") ("l" "L") ("m" "M") ("n" "N") ("o" "O")
 223            ("p" "P") ("q" "Q") ("r" "R") ("s" "S") ("t" "T")
 224            ("u" "U") ("v" "V") ("w" "W") ("x" "X") ("y" "Y")
 225            ("z" "Z") ("ii" "İ"))
 226   (lower ("a" "a") ("b" "b") ("c" "c") ("d" "d") ("e" "e")
 227          ("f" "f") ("g" "g") ("h" "h") ("i" "i") ("j" "j")
 228          ("k" "k") ("l" "l") ("m" "m") ("n" "n") ("o" "o")
 229          ("p" "p") ("q" "q") ("r" "r") ("s" "s") ("t" "t")
 230          ("u" "u") ("v" "v") ("w" "w") ("x" "x") ("y" "y")
 231          ("z" "z")))
 232 (state
 233   (init
 234     (toupper (shift non-upcase)))
 235   (non-upcase
 236     (lower (commit))
 237     (nil (shift init))))
 238 @endverbatim
 239
 240 Let's see what happens when the user types the key sequence <<a>> <<b>> << >>.
 241 Upon <<a>>, "A" is committed and the state shifts to [[non-upcase]].
 242 So, the next <<b>> is handled in the [[non-upcase]] state.
 243 As it matches a
 244 rule in the map [[lower]], "b" is inserted in the preedit buffer and it
 245 is committed explicitly by the "commit" command in BRANCH-ACTION.  After
 246 that, the input method is still in the [[non-upcase]] state.  So the next << >>
 247 is also handled in [[non-upcase]].  This time, no rule in this state
 248 matches it.  Thus the branch [[(nil (shift init))]] is selected and the
 249 state is shifted to [[init]].  Please note that << >> is not yet
 250 handled because the map [[nil]] does not consume any key event.
 251 So, the input method tries to handle it in the [[init]] state.  Again no
 252 rule matches it.  Therefore, that event is given back to the application
 253 program, which usually inserts a space.
 254
 255 When you type "a quick blown fox" with this input method, you get "A
 256 Quick Blown Fox".  Now suppose you notice the typo in "blown", which should be
 257 "brown".  To correct it, you would probably move the cursor after "l" and type
 258 <<BackSpace>> and <<r>>.  However, if the current input method is still
 259 active, a capital "R" is inserted.  This is not the desired
 260 behavior.
 261
 262 @section im-surrounding-text Example of utilizing surrounding text support
 263
 264 To make the input method work well also in such a case, we must use
 265 "surrounding text support".  It is a way to check characters around
 266 the inputting spot and delete them if necessary.  Note that
 267 this facility is available only with Gtk+ applications and Qt
 268 applications.  You cannot use it with applications that use XIM
 269 to communicate with an input method.
 270
 271 Before explaining how to utilize "surrounding text support", you must
 272 understand how to use variables, arithmetic comparisons, and
 273 conditional actions.
 274
 275 First, any symbol (except for several reserved ones) used as ARG
 276 of an action is treated as a variable.  For instance, the commands
 277
 278 @verbatim
 279   (set X 32) (insert X)
 280 @endverbatim
 281
 282 set the variable [[X]] to integer value 32, then insert a character
 283 whose Unicode character code is 32 (i.e. SPACE).
 284
 285 The second argument of the [[set]] action can be an expression of this form:
 286
 287 @verbatim
 288   (OPERATOR ARG1 [ARG2])
 289 @endverbatim
 290
 291 Both ARG1 and ARG2 can themselves be expressions.  So,
 292
 293 @verbatim
 294   (set X (+ (* Y 32) Z))
 295 @endverbatim
 296
 297 sets [[X]] to the value of [[Y * 32 + Z]].
 298
 299 We have the following arithmetic/bitwise OPERATORs (require two arguments):
 300
 301 @verbatim
 302   + - * / & |
 303 @endverbatim
 304
 305 these relational OPERATORs (require two arguments):
 306
 307 @verbatim
 308   = <= >= < >
 309 @endverbatim
 310
 311 and this logical OPERATOR (requires one argument):
 312
 313 @verbatim
 314   !
 315 @endverbatim
 316
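     As a small sketch of how these expressions combine, the actions
     below compute a few values.  The variable names X, Y, and Z are
     arbitrary, and the comments assume, as in the examples later in
     this tutorial, that [[?C]] stands for the character code of C and
     that a comparison that holds yields a nonzero value.

     @verbatim
       (set X ?A)                        ; X is 65, the code of "A"
       (set X (+ X 32))                  ; X is 97, the code of "a"
       (set Y (& (>= X ?a) (<= X ?z)))   ; Y is nonzero because X is in a..z
       (set Z (! Y))                     ; Z is zero
       (insert X)                        ; inserts "a"
     @endverbatim
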
 317 For surrounding text support, we have these reserved variables:
 318
 319 @verbatim
 320   @-0, @-N, @+N (N is a positive integer)
 321 @endverbatim
 322
 323 Their values are predefined as described below and cannot be altered.
 324
 325 <ul>
 326 <li> [[@-0]]
 327
 328 -1 if surrounding text is supported, -2 if not.
 329
 330 <li> [[@-N]]
 331
 332 The Nth previous character in the preedit buffer.  If there are only M
 333 (M<N) previous characters in it, the value is the (N-M)th previous
 334 character from the inputting spot.
 335
 336 <li> [[@+N]]
 337
 338 The Nth following character in the preedit buffer.  If there are only M
 339 (M<N) following characters in it, the value is the (N-M)th following
 340 character from the inputting spot.
 341
 342 </ul>
 343
 344 So, provided that you have this context:
 345
 346 @verbatim
 347   ABC|def|GHI
 348 @endverbatim
 349
 350 ("def" is in the preedit buffer, two "|"s indicate borders between the
 351 preedit buffer and the surrounding text) and your current position in
 352 the preedit buffer is between "d" and "e", you get these values:
 353
 354 @verbatim
 355   @-3 -- ?B
 356   @-2 -- ?C
 357   @-1 -- ?d
 358   @+1 -- ?e
 359   @+2 -- ?f
 360   @+3 -- ?G
 361 @endverbatim
 362
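     For comparison, here is the same context worked out for a different
     current position, at the end of the preedit buffer (just after "f").
     The values follow from the definitions above: three characters
     precede the current position in the preedit buffer and none follow
     it.

     @verbatim
       @-1 -- ?f
       @-2 -- ?e
       @-3 -- ?d
       @-4 -- ?C
       @+1 -- ?G
       @+2 -- ?H
     @endverbatim
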
 363 Next, you have to understand the conditional action of this form:
 364
 365 @verbatim
 366   (cond
 367     (EXPR1 ACTION ACTION ...)
 368     (EXPR2 ACTION ACTION ...)
 369     ...)
 370 @endverbatim
 371
 372 where EXPRn are expressions.  When an input method executes this
 373 action, it resolves the values of EXPRn one by one from the first branch.
 374 As soon as an EXPRn resolves to a nonzero value, the corresponding
 375 actions are executed and the remaining branches are skipped.
 376
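     A minimal sketch of a conditional follows.  The variable X is
     hypothetical and is assumed to have been set by an earlier [[set]]
     action.  The sketch also assumes that only the first branch whose
     expression is nonzero is executed, which is what a catchall branch
     starting with 1 relies on.

     @verbatim
       (cond
         ((< X 0) "negative")    ; chosen when X is below zero
         ((= X 0) "zero")        ; chosen when X is exactly zero
         (1 "positive"))         ; catchall for every other value
     @endverbatim
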
 377 Now you are ready to write a new version of the input method "Titlecase".
 378
 379 @verbatim
 380 (input-method en titlecase2)
 381 (description (_ "Titlecase letters"))
 382 (title "abc->Abc")
 383 (map
 384   (toupper ("a" "A") ("b" "B") ("c" "C") ("d" "D") ("e" "E")
 385            ("f" "F") ("g" "G") ("h" "H") ("i" "I") ("j" "J")
 386            ("k" "K") ("l" "L") ("m" "M") ("n" "N") ("o" "O")
 387            ("p" "P") ("q" "Q") ("r" "R") ("s" "S") ("t" "T")
 388            ("u" "U") ("v" "V") ("w" "W") ("x" "X") ("y" "Y")
 389            ("z" "Z") ("ii" "İ")))
 390 (state
 391   (init
 392     (toupper
 393
 394      ;; Now we have exactly one uppercase character in the preedit
 395      ;; buffer.  So, "@-2" is the character just before the inputting
 396      ;; spot.
 397
 398      (cond ((| (& (>= @-2 ?A) (<= @-2 ?Z))
 399                (& (>= @-2 ?a) (<= @-2 ?z))
 400                (= @-2 ?İ))
 401
 402             ;; If the character before the inputting spot is A..Z,
 403             ;; a..z, or İ, remember the only character in the preedit
 404             ;; buffer in the variable X and delete it.
 405
 406             (set X @-1) (delete @-)
 407
 408             ;; Then insert the lowercase version of X.
 409
 410             (cond ((= X ?İ) "i")
 411                   (1 (set X (+ X 32)) (insert X))))))))
 412 @endverbatim
 413
 414 The above example contains the new action [[delete]].  So, it is time
 415 to explain more about the preedit buffer.  The preedit buffer is a
 416 temporary place to store a sequence of characters.  In this buffer,
 417 the input method keeps a position called the "current position".  The
 418 current position exists between two characters, at the beginning of
 419 the buffer, or at the end of the buffer.  The [[insert]] action inserts
 420 characters before the current position.  For instance, when your
 421 preedit buffer contains "ab.c" ("." indicates the current position),
 422
 423 @verbatim
 424   (insert "xyz")
 425 @endverbatim
 426
 427 changes the buffer to "abxyz.c".
 428
 429 There are several predefined variables that represent specific positions in the
 430 preedit buffer.  They are:
 431
 432 <ul>
 433 <li> [[@<, @=, @>]]
 434
 435 The first, current, and last positions.
 436
 437 <li> [[@-, @+]]
 438
 439 The previous and the next positions.
 440 </ul>
 441
 442 The format of the [[delete]] action is this:
 443
 444 @verbatim
 445   (delete POS)
 446 @endverbatim
 447
 448 where POS is a predefined positional variable.
 449 The above action deletes the characters between POS and
 450 the current position.  So, [[(delete @-)]] deletes one character before
 451 the current position.  Other examples of [[delete]] include the following:
 452
 453 @verbatim
 454   (delete @+)  ; delete the next character
 455   (delete @<)  ; delete all the preceding characters in the buffer
 456   (delete @>)  ; delete all the following characters in the buffer
 457 @endverbatim
 458
 459 You can change the current position using the [[move]] action as below:
 460
 461 @verbatim
 462   (move @-)  ; move the current position to the position before the
 463              ; previous character
 464   (move @<)  ; move to the first position
 465 @endverbatim
 466
 467 Other positional variables work similarly.
 468
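     The actions described above can be traced on the buffer "ab.c" used
     earlier ("." again marks the current position).  This is a sketch
     derived from the descriptions in this section; each comment shows
     the expected buffer after the action on that line.

     @verbatim
       (insert "xyz")   ; "abxyz.c"
       (delete @-)      ; "abxy.c"
       (move @<)        ; ".abxyc"
       (delete @+)      ; ".bxyc"
       (move @>)        ; "bxyc."
       (delete @<)      ; "."  (the buffer is now empty)
     @endverbatim
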
 469 Let's see how our new example works.  Whatever the key event is, the
 470 input method is in its only state, [[init]].  Since a lowercase letter
 471 key is first handled by MAP-ACTIONs, every such key is changed into the
 472 corresponding uppercase letter and put into the preedit buffer.  Now this character
 473 can be accessed with [[@-1]].
 474
 475 How can we tell whether the new character should be lowercase or
 476 uppercase?  We can do so by checking the character before it, i.e.
 477 [[@-2]].  BRANCH-ACTIONs in the [[init]] state do the job.
 478
 479 They first check whether the character [[@-2]] is between A and Z, between
 480 a and z, or İ by the conditional below.
 481
 482 @verbatim
 483      (cond ((| (& (>= @-2 ?A) (<= @-2 ?Z))
 484                (& (>= @-2 ?a) (<= @-2 ?z))
 485                (= @-2 ?İ))
 486 @endverbatim
 487
 488 If not, nothing special needs to be done.  If so, our new key should
 489 be changed back into lowercase.  Since the uppercase character is
 490 already in the preedit buffer, we retrieve and remember it in the
 491 variable [[X]] by
 492
 493 @verbatim
 494     (set X @-1)
 495 @endverbatim
 496
 497 and then delete that character by
 498
 499 @verbatim
 500     (delete @-)
 501 @endverbatim
 502
 503 Lastly we re-insert the character in its lowercase form.  The
 504 problem here is that "İ" must be changed into "i", so we need another
 505 conditional.  The first branch
 506
 507 @verbatim
 508     ((= X ?İ) "i")
 509 @endverbatim
 510
 511 means that "if the character remembered in X is 'İ', 'i' is inserted".
 512
 513 The second branch
 514
 515 @verbatim
 516     (1 (set X (+ X 32)) (insert X))
 517 @endverbatim
 518
 519 starts with "1", which is always resolved into nonzero, so this branch
 520 is a catchall.  Actions in this branch increase [[X]] by 32, then
 521 insert [[X]].  In other words, they change A...Z into a...z
 522 respectively and insert the resulting lowercase character into the
 523 preedit buffer.  As the input method reaches the end of the
 524 BRANCH-ACTIONs, the character is committed.
 525
 526 This new input method always checks the character before the current
 527 position, so "A Quick Blown Fox" will be successfully fixed to "A
 528 Quick Brown Fox" by the key sequence <<BackSpace>> <<r>>.
 529
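     The correction can be followed step by step on the BRANCH-ACTIONs.
     The sketch below assumes that <<BackSpace>> has already removed "l",
     so the surrounding text is "A Quick B" before and "own Fox" after
     the inputting spot, and that the user then types <<r>>.  The
     character codes in the comments are ASCII values.

     @verbatim
       ;; MAP-ACTION of ("r" "R") puts "R" in the preedit buffer, so
       ;; @-1 is ?R (82) and @-2 is ?B (66).  As ?B is within A..Z,
       ;; the first branch of the outer cond is taken.
       (set X @-1)         ; X is 82
       (delete @-)         ; the preedit buffer is empty again
       (set X (+ X 32))    ; X is 114, i.e. ?r
       (insert X)          ; "r" is inserted and then committed
     @endverbatim
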
 530
 531 */
 532
 533 /*
 534 Copyright (C) 2007
 535   National Institute of Advanced Industrial Science and Technology (AIST)
 536   Registration Number H15PRO112
 537
 538 This file is part of the m17n database; a sub-part of the m17n
 539 library.
 540
 541 The m17n library is free software; you can redistribute it and/or
 542 modify it under the terms of the GNU Lesser General Public License
 543 as published by the Free Software Foundation; either version 2.1 of
 544 the License, or (at your option) any later version.
 545
 546 The m17n library is distributed in the hope that it will be useful,
 547 but WITHOUT ANY WARRANTY; without even the implied warranty of
 548 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 549 Lesser General Public License for more details.
 550
 551 You should have received a copy of the GNU Lesser General Public
 552 License along with the m17n library; if not, write to the Free
 553 Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
 554 Boston, MA 02110-1301, USA.
 555 */