/* -*- coding: utf-8; -*- */
/*** @page m17nDBTutorial Tutorial for writing the m17n database

This section contains tutorials for writing various database files of
the m17n database.

<ul>
<li> @ref mdbTutorialIM "TutorialIM" -- Tutorial of input method
</ul>
*/
/***

@section mdbTutorialIM Tutorial of input method

@subsection im-struct Structure of an input method file

An input method is defined in a *.mim file with this format.

@verbatim
(input-method LANG NAME)

(description (_ "DESCRIPTION"))

(title "TITLE-STRING")

(map
  (MAP-NAME
    (KEYSEQ MAP-ACTION MAP-ACTION ...)        <- rule
    (KEYSEQ MAP-ACTION MAP-ACTION ...)        <- rule
    ...)
  (MAP-NAME
    (KEYSEQ MAP-ACTION MAP-ACTION ...)        <- rule
    (KEYSEQ MAP-ACTION MAP-ACTION ...)        <- rule
    ...)
  ...)

(state
  (STATE-NAME
    (MAP-NAME BRANCH-ACTION BRANCH-ACTION ...)   <- branch
    ...)
  (STATE-NAME
    (MAP-NAME BRANCH-ACTION BRANCH-ACTION ...)   <- branch
    ...)
  ...)
@endverbatim
Lowercase words and parentheses are literals, so they must be
written exactly as shown.  Uppercase words are placeholders for
arbitrary strings.

KEYSEQ specifies a sequence of keys in this format:
@verbatim
  (SYMBOLIC-KEY SYMBOLIC-KEY ...)
@endverbatim
where SYMBOLIC-KEY is the keysym name reported by the xev command.
For instance
@verbatim
  (n i)
@endverbatim
represents a key sequence of \<n\> and \<i\>.
If all SYMBOLIC-KEYs are ASCII characters, you can use the short form
@verbatim
  "ni"
@endverbatim
instead.  Consult #mdbIM for non-ASCII characters.

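As a small illustration of the two notations, the rules below write the
same key sequence in the long and in the short form, and use the long
form for a key that is not an ASCII character.  The inserted texts "X"
and "Y" are made up for this sketch, and it assumes that xev reports
the keysym BackSpace for the backspace key.

@verbatim
  ;; These two rules are equivalent; a map would contain only one of them.
  ((n i) "X")
  ("ni" "X")

  ;; <n> followed by the BackSpace key; this needs the long form.
  ((n BackSpace) "Y")
@endverbatim
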
Each MAP-ACTION and BRANCH-ACTION is an action of this format:
@verbatim
  (ACTION ARG ARG ...)
@endverbatim
The most common action is <span class="fragment">insert</span>, which is written like this:
@verbatim
  (insert "TEXT")
@endverbatim
But as it is very frequently used, you can use the short form
@verbatim
  "TEXT"
@endverbatim
If <span class="fragment">"TEXT"</span> contains only one character "C", you can write it as
@verbatim
  (insert ?C)
@endverbatim
or even shorter as
@verbatim
  ?C
@endverbatim
So the shortest notation for an action of inserting "a" is
@verbatim
  ?a
@endverbatim
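
Put together, the notations above give four equivalent ways to write
one rule.  The sketch below uses a rule that maps the key \<a\> to "A";
a real map would contain only one of these lines.

@verbatim
  ("a" (insert "A"))   ; full form
  ("a" "A")            ; short form for a text
  ("a" (insert ?A))    ; full form with a single character
  ("a" ?A)             ; shortest form
@endverbatim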

@subsection im-upcase Simple example of capslock

Here is a simple example of an input method that works as CapsLock.

@verbatim
(input-method en capslock)
(description (_ "Upcase all lowercase letters"))
(title "a->A")
(map
  (toupper ("a" "A") ("b" "B") ("c" "C") ("d" "D") ("e" "E")
           ("f" "F") ("g" "G") ("h" "H") ("i" "I") ("j" "J")
           ("k" "K") ("l" "L") ("m" "M") ("n" "N") ("o" "O")
           ("p" "P") ("q" "Q") ("r" "R") ("s" "S") ("t" "T")
           ("u" "U") ("v" "V") ("w" "W") ("x" "X") ("y" "Y")
           ("z" "Z")))
(state
  (init (toupper)))
@endverbatim

When this input method is activated, it is in the initial condition of
the first state (in this case, the only state <span class="fragment">init</span>).  In the
initial condition, no key is being processed and no action is
suspended.  When the input method receives a key event \<a\>, it
searches the branches in the current state for a rule that matches \<a\>
and finds one in the map <span class="fragment">toupper</span>.  Then it executes the MAP-ACTIONs
of that rule (in this case, just inserting "A" in the preedit buffer).
After all MAP-ACTIONs have been executed, the input method shifts to the
initial condition of the current state.

The shift to <em>the initial condition of the first state</em> has a special
meaning; it commits all characters in the preedit buffer and then clears
the preedit buffer.

As a result, "A" is given to the application program.

When a key event does not match any rule in the current state,
that event is unhandled and given back to the application program.

Turkish users may want to extend the above example for "İ" (U+0130:
LATIN CAPITAL LETTER I WITH DOT ABOVE).  Assigning the key sequence
\<i\> \<i\> to that character seems convenient, so they would add this
rule to <span class="fragment">toupper</span>.

@verbatim
    ("ii" "İ")
@endverbatim

However, we already have the following rule:

@verbatim
    ("i" "I")
@endverbatim

What will happen when a key event \<i\> is sent to the input method?

No problem.  When the input method receives \<i\>, it inserts "I" in the
preedit buffer.  It knows that there is another rule that may
match the additional key event \<i\>.  So, after inserting "I", it
suspends the normal behavior of shifting to the initial condition, and
waits for another key.  Thus, the user sees "I" with an underline, which
indicates that it is not yet committed.

When the input method receives the next \<i\>, it cancels the effects
of the rule for the previous "i" (in this case, the preedit buffer is
cleared), and executes the MAP-ACTIONs of the rule for "ii".  So, "İ" is
inserted in the preedit buffer.  This time, as there are no other rules
that match an additional key, it shifts to the initial condition
of the current state, which commits "İ".

Then, what will happen when the next key event is \<a\> instead of \<i\>?

No problem either.

The input method knows that there are no rules that match the \<i\> \<a\> key
sequence.  So, when it receives the next \<a\>, it executes the
suspended behavior (i.e. shifting to the initial condition), which
commits "I".  Then the input method tries to handle \<a\> in
the current state, which commits "A".

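The two cases above can be summarized as a trace.  The table below only
restates the description for the capslock input method extended with
the "ii" rule; its column layout is made up for this illustration, and
"-" marks a step that happens without a new key.

@verbatim
  key   preedit   committed   what happens
  <i>   I                     rule "i" matches; "ii" may still match, so wait
  <i>   İ                     effect of "i" cancelled, rule "ii" executed
  -               İ           nothing more can match; the shift commits

  <i>   I                     rule "i" matches; wait as before
  <a>             I           no rule matches <i> <a>; the suspended shift runs
  -     A                     <a> is handled again and rule "a" matches
  -               A           the shift commits
@endverbatim
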
So far, we have explained MAP-ACTION, but not
BRANCH-ACTION.  The format of BRANCH-ACTION is the same as that of MAP-ACTION.
It is executed only after a matching rule has been determined and the
corresponding MAP-ACTIONs have been executed.  A typical use of
BRANCH-ACTION is to shift to a different state.

To see this effect, let us modify the current input method to upcase only
word-initial letters (i.e. to capitalize).  For that purpose,
we modify the "init" state as follows:

@verbatim
  (init
    (toupper (shift non-upcase)))
@endverbatim

Here <span class="fragment">(shift non-upcase)</span> is an action to shift to the new state
<span class="fragment">non-upcase</span>, which has two branches as below:

@verbatim
  (non-upcase
    (lower)
    (nil (shift init)))
@endverbatim

The first branch is simple.  We can define the new map <span class="fragment">lower</span> as
follows to insert lowercase letters as they are.

@verbatim
(map
  ...
  (lower ("a" "a") ("b" "b") ("c" "c") ("d" "d") ("e" "e")
         ("f" "f") ("g" "g") ("h" "h") ("i" "i") ("j" "j")
         ("k" "k") ("l" "l") ("m" "m") ("n" "n") ("o" "o")
         ("p" "p") ("q" "q") ("r" "r") ("s" "s") ("t" "t")
         ("u" "u") ("v" "v") ("w" "w") ("x" "x") ("y" "y")
         ("z" "z")))
@endverbatim

The second branch has a special meaning.  The map name <span class="fragment">nil</span> means
that it matches any key event that does not match any rule in the
other maps in the current state.  In addition, it does not
consume any key event.  We will show the full code of the new input
method before explaining how it works.

@verbatim
(input-method en titlecase)
(description (_ "Titlecase letters"))
(title "abc->Abc")
(map
  (toupper ("a" "A") ("b" "B") ("c" "C") ("d" "D") ("e" "E")
           ("f" "F") ("g" "G") ("h" "H") ("i" "I") ("j" "J")
           ("k" "K") ("l" "L") ("m" "M") ("n" "N") ("o" "O")
           ("p" "P") ("q" "Q") ("r" "R") ("s" "S") ("t" "T")
           ("u" "U") ("v" "V") ("w" "W") ("x" "X") ("y" "Y")
           ("z" "Z") ("ii" "İ"))
  (lower ("a" "a") ("b" "b") ("c" "c") ("d" "d") ("e" "e")
         ("f" "f") ("g" "g") ("h" "h") ("i" "i") ("j" "j")
         ("k" "k") ("l" "l") ("m" "m") ("n" "n") ("o" "o")
         ("p" "p") ("q" "q") ("r" "r") ("s" "s") ("t" "t")
         ("u" "u") ("v" "v") ("w" "w") ("x" "x") ("y" "y")
         ("z" "z")))
(state
  (init
    (toupper (shift non-upcase)))
  (non-upcase
    (lower (commit))
    (nil (shift init))))
@endverbatim

Let's see what happens when the user types the key sequence \<a\> \<b\> \< \>.
Upon \<a\>, "A" is committed and the state shifts to <span class="fragment">non-upcase</span>.
So, the next \<b\> is handled in the <span class="fragment">non-upcase</span> state.
As it matches a rule in the map <span class="fragment">lower</span>, "b" is inserted in the
preedit buffer and is committed explicitly by the "commit" command in
BRANCH-ACTION.  After that, the input method is still in the
<span class="fragment">non-upcase</span> state.  So the next \< \>
is also handled in <span class="fragment">non-upcase</span>.  This time, no rule in this state
matches it.  Thus the branch <span class="fragment">(nil (shift init))</span> is selected and the
state is shifted to <span class="fragment">init</span>.  Please note that \< \> is not yet
handled because the map <span class="fragment">nil</span> does not consume any key event.
So, the input method tries to handle it in the <span class="fragment">init</span> state.  Again no
rule matches it.  Therefore, that event is given back to the application
program, which usually inserts a space for it.

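The walk-through above can be condensed into a trace.  The table below
only restates that description for the titlecase input method; the
column layout is made up for this illustration.

@verbatim
  key   state        matching branch   result
  <a>   init         toupper           "A"; (shift non-upcase)
  <b>   non-upcase   lower             "b"; (commit); stay in non-upcase
  < >   non-upcase   nil               key not consumed; (shift init)
  < >   init         (none)            key given back to the application
@endverbatim
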
When you type "a quick blown fox" with this input method, you get "A
Quick Blown Fox".  OK, you find a typo in "blown", which should be
"brown".  To correct it, you probably move the cursor after "l" and type
\<BackSpace\> and \<r\>.  However, if the current input method is still
active, a capital "R" is inserted.  This is not very sophisticated
behavior.

@subsection im-surrounding-text Example of utilizing surrounding text support

To make the input method work well also in such a case, we must use
"surrounding text support".  It is a way to check characters around
the inputting spot and delete them if necessary.  Note that
this facility is available only with Gtk+ applications and Qt
applications.  You cannot use it with applications that use XIM
to communicate with an input method.

Before we explain how to utilize "surrounding text support", you need to
understand how to use variables, arithmetic and relational expressions,
and conditional actions.

First, any symbol (except for several reserved ones) used as ARG
of an action is treated as a variable.  For instance, the commands

@verbatim
  (set X 32) (insert X)
@endverbatim

set the variable <span class="fragment">X</span> to the integer value 32, then insert the character
whose Unicode character code is 32 (i.e. SPACE).

The second argument of the <span class="fragment">set</span> action can be an expression of this form:

@verbatim
  (OPERATOR ARG1 [ARG2])
@endverbatim

Both ARG1 and ARG2 can themselves be expressions.  So,

@verbatim
  (set X (+ (* Y 32) Z))
@endverbatim

sets <span class="fragment">X</span> to the value of <span class="fragment">Y * 32 + Z</span>.

We have the following arithmetic/bitwise OPERATORs (require two arguments):

@verbatim
  + - * / & |
@endverbatim

these relational OPERATORs (require two arguments):

@verbatim
  = <= >= < >
@endverbatim

and this logical OPERATOR (requires one argument):

@verbatim
  !
@endverbatim
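
A short sketch may help to see how these operators combine.  It assumes
a variable C that already holds the code of an ASCII digit character;
the variable names C and N are made up for this illustration, and ?0 is
the character notation used elsewhere in this tutorial.

@verbatim
  ;; N = numeric value of the digit character in C (for ?7, N becomes 7)
  (set N (- C ?0))

  ;; N = N * 10 + 5
  (set N (+ (* N 10) 5))

  ;; N = N & 255, i.e. keep only the low eight bits of N
  (set N (& N 255))
@endverbatim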

For surrounding text support, we have these predefined variables:

@verbatim
  @-0, @-N, @+N (N is a positive integer)
@endverbatim

Their values are defined as below and cannot be altered.

<ul>
<li> <span class="fragment">@-0</span>

-1 if surrounding text is supported, -2 if not.

<li> <span class="fragment">@-N</span>

The Nth previous character in the preedit buffer.  If there are only M
(M<N) previous characters in it, the value is the (N-M)th previous
character from the inputting spot, i.e. a character of the surrounding
text.

<li> <span class="fragment">@+N</span>

The Nth following character in the preedit buffer.  If there are only M
(M<N) following characters in it, the value is the (N-M)th following
character from the inputting spot, i.e. a character of the surrounding
text.

</ul>

So, provided that you have this context:

@verbatim
  ABC|def|GHI
@endverbatim

("def" is in the preedit buffer, two "|"s indicate borders between the
preedit buffer and the surrounding text) and your current position in
the preedit buffer is between "d" and "e", you get these values:

@verbatim
  @-3 -- ?B
  @-2 -- ?C
  @-1 -- ?d
  @+1 -- ?e
  @+2 -- ?f
  @+3 -- ?G
@endverbatim

Next, you have to understand the conditional action of this form:

@verbatim
  (cond
    (EXPR1 ACTION ACTION ...)
    (EXPR2 ACTION ACTION ...)
    ...)
@endverbatim

where EXPRn are expressions.  When an input method executes this
action, it resolves the values of EXPRn one by one from the first branch.
As soon as one EXPRn resolves to a nonzero value, the corresponding
actions are executed and the remaining branches are skipped.

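As a minimal sketch of such a conditional, the action below inserts the
uppercase form of a lowercase ASCII letter held in a variable, and
inserts "*" for anything else.  The variable name X and the "*"
fallback are made up for this illustration.

@verbatim
  (cond
    ((< X ?a) "*")                      ; X is below "a"
    ((> X ?z) "*")                      ; X is above "z"
    (1 (set X (- X 32)) (insert X)))    ; X is a..z: insert A..Z
@endverbatim
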
Now you are ready to write a new version of the input method "Titlecase".

@verbatim
(input-method en titlecase2)
(description (_ "Titlecase letters"))
(title "abc->Abc")
(map
  (toupper ("a" "A") ("b" "B") ("c" "C") ("d" "D") ("e" "E")
           ("f" "F") ("g" "G") ("h" "H") ("i" "I") ("j" "J")
           ("k" "K") ("l" "L") ("m" "M") ("n" "N") ("o" "O")
           ("p" "P") ("q" "Q") ("r" "R") ("s" "S") ("t" "T")
           ("u" "U") ("v" "V") ("w" "W") ("x" "X") ("y" "Y")
           ("z" "Z") ("ii" "İ")))
(state
  (init
    (toupper

     ;; Now we have exactly one uppercase character in the preedit
     ;; buffer.  So, "@-2" is the character just before the inputting
     ;; spot.

     (cond ((| (& (>= @-2 ?A) (<= @-2 ?Z))
               (& (>= @-2 ?a) (<= @-2 ?z))
               (= @-2 ?İ))

            ;; If the character before the inputting spot is A..Z,
            ;; a..z, or İ, remember the only character in the preedit
            ;; buffer in the variable X and delete it.

            (set X @-1) (delete @-)

            ;; Then insert the lowercase version of X.

            (cond ((= X ?İ) "i")
                  (1 (set X (+ X 32)) (insert X))))))))
@endverbatim

The above example contains the new action <span class="fragment">delete</span>.  So, it is time
to explain more about the preedit buffer.  The preedit buffer is a
temporary place to store a sequence of characters.  In this buffer,
the input method keeps a position called the "current position".  The
current position exists between two characters, at the beginning of
the buffer, or at the end of the buffer.  The <span class="fragment">insert</span> action inserts
characters before the current position.  For instance, when your
preedit buffer contains "ab.c" ("." indicates the current position),

@verbatim
  (insert "xyz")
@endverbatim

changes the buffer to "abxyz.c".

There are several predefined variables that represent a specific position in the
preedit buffer.  They are:

<ul>
<li> <span class="fragment">@@<, @=, @@></span>

The first, current, and last positions.

<li> <span class="fragment">@-, @+</span>

The previous and the next positions.
</ul>

The format of the <span class="fragment">delete</span> action is this:

@verbatim
  (delete POS)
@endverbatim

where POS is a predefined positional variable.
The above action deletes the characters between POS and
the current position.  So, <span class="fragment">(delete @-)</span> deletes one character before
the current position.  Other examples of <span class="fragment">delete</span> include the following:

@verbatim
  (delete @+)  ; delete the next character
  (delete @<)  ; delete all the preceding characters in the buffer
  (delete @>)  ; delete all the following characters in the buffer
@endverbatim

You can change the current position using the <span class="fragment">move</span> action as below:

@verbatim
  (move @-)  ; move the current position to the position before the
             ; previous character
  (move @<)  ; move to the first position
@endverbatim

Other positional variables work similarly.

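To see how these actions interact, the sketch below follows a preedit
buffer that initially contains "ab.c" ("." again marks the current
position).  The comments show the buffer after each action, derived
from the descriptions above; the starting buffer and the inserted text
are made up for this illustration.

@verbatim
  (move @<)        ; .abc     moved to the first position
  (insert "xy")    ; xy.abc   text goes in before the current position
  (delete @+)      ; xy.bc    the next character "a" is deleted
  (move @>)        ; xybc.    moved to the last position
  (delete @-)      ; xyb.     the previous character "c" is deleted
  (delete @<)      ; .        all the preceding characters are deleted
@endverbatim
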
Let's see how our new example works.  Whatever the key event is, the
input method is in its only state, <span class="fragment">init</span>.  Since an event of a lowercase
letter key is first handled by MAP-ACTIONs, every such key is changed into the
corresponding uppercase letter and put into the preedit buffer.  Now this character
can be accessed with <span class="fragment">@-1</span>.

How can we tell whether the new character should be lowercase or
uppercase?  We can do so by checking the character before it, i.e.
<span class="fragment">@-2</span>.  BRANCH-ACTIONs in the <span class="fragment">init</span> state do the job.

They first check whether the character <span class="fragment">@-2</span> is between A and Z, between
a and z, or İ by the conditional below.

@verbatim
     (cond ((| (& (>= @-2 ?A) (<= @-2 ?Z))
               (& (>= @-2 ?a) (<= @-2 ?z))
               (= @-2 ?İ))
@endverbatim

If not, nothing special needs to be done.  If so, our new key should
be changed back into lowercase.  Since the uppercase character is
already in the preedit buffer, we retrieve and remember it in the
variable <span class="fragment">X</span> by

@verbatim
    (set X @-1)
@endverbatim

and then delete that character by

@verbatim
    (delete @-)
@endverbatim

Lastly we re-insert the character in its lowercase form.  The
problem here is that "İ" must be changed into "i", which the
arithmetic used for A...Z cannot do, so we need another
conditional.  The first branch

@verbatim
    ((= X ?İ) "i")
@endverbatim

means that "if the character remembered in X is 'İ', 'i' is inserted".

The second branch

@verbatim
    (1 (set X (+ X 32)) (insert X))
@endverbatim

starts with "1", which is always resolved into nonzero, so this branch
is a catchall.  Actions in this branch increase <span class="fragment">X</span> by 32, then
insert <span class="fragment">X</span>.  In other words, they change A...Z into a...z
respectively and insert the resulting lowercase character into the
preedit buffer.  As the input method reaches the end of the
BRANCH-ACTIONs, the character is committed.

This new input method always checks the character before the current
position, so "A Quick Blown Fox" will be successfully fixed to "A
Quick Brown Fox" by the key sequence \<BackSpace\> \<r\>.

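As noted at the beginning of this subsection, surrounding text support
is not available with every application.  One possible way to guard the
check is to test <span class="fragment">@-0</span> first, as sketched below for the
<span class="fragment">init</span> state.  The sketch assumes that a negative integer such as -1 can
be written literally in an expression; when surrounding text is not
supported, it simply leaves the uppercase character as it is.

@verbatim
  (init
    (toupper
      (cond ((= @-0 -1)
             ;; Surrounding text is supported, so check @-2 as before.
             (cond ((| (& (>= @-2 ?A) (<= @-2 ?Z))
                       (& (>= @-2 ?a) (<= @-2 ?z))
                       (= @-2 ?İ))
                    (set X @-1) (delete @-)
                    (cond ((= X ?İ) "i")
                          (1 (set X (+ X 32)) (insert X)))))))))
@endverbatim
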
*/