1 /* -*- coding: utf-8; -*- */
2 /*** @page m17nDBTutorial Tutorial for writing the m17n database
4 This section contains tutorials for writing various database files of
8 <li> @ref mdbTutorialIM "TutorialIM" -- Tutorial of input method
13 @section mdbTutorialIM Tutorial of input method
15 @subsection im-struct Structure of an input method file
17 An input method is defined in a *.mim file with this format.
20 (input-method LANG NAME)
22 (description (_ "DESCRIPTION"))
24 (title "TITLE-STRING")
28 (KEYSEQ MAP-ACTION MAP-ACTION ...) <- rule
29 (KEYSEQ MAP-ACTION MAP-ACTION ...) <- rule
32 (KEYSEQ MAP-ACTION MAP-ACTION ...) <- rule
33 (KEYSEQ MAP-ACTION MAP-ACTION ...) <- rule
39 (MAP-NAME BRANCH-ACTION BRANCH-ACTION ...) <- branch
42 (MAP-NAME BRANCH-ACTION BRANCH-ACTION ...) <- branch
46 Lowercase letters and parentheses are literals, so they must be
47 written as they are. Uppercase letters represent arbitrary strings.
49 KEYSEQ specifies a sequence of keys in this format:
51 (SYMBOLIC-KEY SYMBOLIC-KEY ...)
53 where SYMBOLIC-KEY is the keysym value returned by the xev command.
58 represents a key sequence of \<n\> and \<i\>.
59 If all SYMBOLIC-KEYs are ASCII characters, you can use the short form
63 instead. Consult #mdbIM for non-ASCII characters.
65 Both MAP-ACTION and BRANCH-ACTION are a sequence of actions of this format:
69 The most common action is <span class="fragment">insert</span>, which is written as this:
73 But as it is very frequently used, you can use the short form
77 If <span class="fragment">"TEXT"</span> contains only one character "C", you can write it as
85 So the shortest notation for an action of inserting "a" is
90 @subsection im-upcase Simple example of capslock
92 Here is a simple example of an input method that works as CapsLock.
95 (input-method en capslock)
96 (description (_ "Upcase all lowercase letters"))
99 (toupper ("a" "A") ("b" "B") ("c" "C") ("d" "D") ("e" "E")
100 ("f" "F") ("g" "G") ("h" "H") ("i" "I") ("j" "J")
101 ("k" "K") ("l" "L") ("m" "M") ("n" "N") ("o" "O")
102 ("p" "P") ("q" "Q") ("r" "R") ("s" "S") ("t" "T")
103 ("u" "U") ("v" "V") ("w" "W") ("x" "X") ("y" "Y")
109 When this input method is activated, it is in the initial condition of
110 the first state (in this case, the only state <span class="fragment">init</span>). In the
111 initial condition, no key is being processed and no action is
112 suspended. When the input method receives a key event \<a\>, it
113 searches branches in the current state for a rule that matches \<a\>
114 and finds one in the map <span class="fragment">toupper</span>. Then it executes MAP-ACTIONs
115 (in this case, just inserting "A" in the preedit buffer). After all
116 MAP-ACTIONs have been executed, the input method shifts to the initial
117 condition of the current state.
119 The shift to <em>the initial condition of the first state</em> has a special
120 meaning; it commits all characters in the preedit buffer then clears
123 As a result, "A" is given to the application program.
125 When a key event does not match with any rule in the current state,
126 that event is unhandled and given back to the application program.
128 Turkish users may want to extend the above example for "İ" (U+0130:
129 LATIN CAPITAL LETTER I WITH DOT ABOVE). It seems that assigning the
130 key sequence \<i\> \<i\> for that character is convenient. So, they
131 will add this rule to <span class="fragment">toupper</span>.
137 However, we already have the following rule:
143 What will happen when a key event \<i\> is sent to the input method?
145 No problem. When the input method receives \<i\>, it inserts "I" in the
146 preedit buffer. It knows that there is another rule that may
147 match the additional key event \<i\>. So, after inserting "I", it
148 suspends the normal behavior of shifting to the initial condition, and
149 waits for another key. Thus, the user sees "I" with underline, which
150 indicates it is not yet committed.
152 When the input method receives the next \<i\>, it cancels the effects
153 of the rule for the previous "i" (in this case, the preedit buffer is
154 cleared), and executes the MAP-ACTIONs of the rule for "ii". So, "İ" is
155 inserted in the preedit buffer. This time, as there are no other rules
156 that match an additional key, it shifts to the initial condition
157 of the current state, which commits "İ".
159 Then, what will happen when the next key event is \<a\> instead of \<i\>?
163 The input method knows that there are no rules that match the \<i\> \<a\> key
164 sequence. So, when it receives the next \<a\>, it executes the
165 suspended behavior (i.e. shifting to the initial condition), which
166 commits "I". Then the input method tries to handle \<a\> in
167 the current state, which commits "A".
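The handling of \<i\>, \<i\> \<i\>, and \<i\> \<a\> described above can be modeled outside of m17n. The following Python sketch is not the m17n implementation; the names `RULES` and `run_keys` are hypothetical, and the model covers only a single state whose rules insert text. It shows how a match that could still grow is kept uncommitted until the next key decides.

```python
# A minimal model of one state with prefix-sharing rules (assumption: this
# mirrors the behavior described in the tutorial, not the m17n source code).
RULES = {"a": "A", "i": "I", "ii": "\u0130"}  # "\u0130" is "İ"


def run_keys(rules, keys):
    """Return (committed, preedit) after feeding `keys` one by one."""
    committed = []   # strings handed to the application, in order
    pending = ""     # keys of the sequence matched so far
    preedit = ""     # uncommitted text (shown underlined to the user)
    queue = list(keys)
    while queue:
        key = queue.pop(0)
        candidate = pending + key
        # Is there a longer rule that starts with the candidate sequence?
        can_grow = any(s != candidate and s.startswith(candidate) for s in rules)
        if candidate in rules:
            # The longer match replaces the effect of the shorter one.
            preedit = rules[candidate]
            pending = candidate
            if not can_grow:
                # No other rule can match: shift to the initial condition.
                committed.append(preedit)
                pending = preedit = ""
        elif can_grow:
            # Not covered by the tutorial; this model simply keeps waiting.
            pending = candidate
        elif pending:
            # Run the suspended shift (commit), then handle the key afresh.
            if preedit:
                committed.append(preedit)
            pending = preedit = ""
            queue.insert(0, key)
        else:
            # No rule matches: the event goes back to the application.
            committed.append(("unhandled", key))
    return committed, preedit
```

With these rules, a lone \<i\> leaves "I" in the preedit part of the result, while \<i\> \<a\> yields the two commits "I" and "A".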
169 So far, we have explained MAP-ACTION, but not
170 BRANCH-ACTION. The format of BRANCH-ACTION is the same as that of MAP-ACTION.
171 It is executed only after a matching rule has been determined and the
172 corresponding MAP-ACTIONs have been executed. A typical use of
173 BRANCH-ACTION is to shift to a different state.
175 To see this effect, let us modify the current input method to upcase only
176 word-initial letters (i.e. to capitalize). For that purpose,
177 we modify the "init" state as this:
181 (toupper (shift non-upcase)))
184 Here <span class="fragment">(shift non-upcase)</span> is an action to shift to the new state
185 <span class="fragment">non-upcase</span>, which has two branches as below:
193 The first branch is simple. We can define the new map <span class="fragment">lower</span> as the
194 following to insert lowercase letters as they are.
199 (lower ("a" "a") ("b" "b") ("c" "c") ("d" "d") ("e" "e")
200 ("f" "f") ("g" "g") ("h" "h") ("i" "i") ("j" "j")
201 ("k" "k") ("l" "l") ("m" "m") ("n" "n") ("o" "o")
202 ("p" "p") ("q" "q") ("r" "r") ("s" "s") ("t" "t")
203 ("u" "u") ("v" "v") ("w" "w") ("x" "x") ("y" "y")
207 The second branch has a special meaning. The map name <span class="fragment">nil</span> means
208 that it matches with any key event that does not match any rules in the
209 other maps in the current state. In addition, it does not
210 consume any key event. We will show the full code of the new input
211 method before explaining how it works.
214 (input-method en titlecase)
215 (description (_ "Titlecase letters"))
218 (toupper ("a" "A") ("b" "B") ("c" "C") ("d" "D") ("e" "E")
219 ("f" "F") ("g" "G") ("h" "H") ("i" "I") ("j" "J")
220 ("k" "K") ("l" "L") ("m" "M") ("n" "N") ("o" "O")
221 ("p" "P") ("q" "Q") ("r" "R") ("s" "S") ("t" "T")
222 ("u" "U") ("v" "V") ("w" "W") ("x" "X") ("y" "Y")
223 ("z" "Z") ("ii" "İ"))
224 (lower ("a" "a") ("b" "b") ("c" "c") ("d" "d") ("e" "e")
225 ("f" "f") ("g" "g") ("h" "h") ("i" "i") ("j" "j")
226 ("k" "k") ("l" "l") ("m" "m") ("n" "n") ("o" "o")
227 ("p" "p") ("q" "q") ("r" "r") ("s" "s") ("t" "t")
228 ("u" "u") ("v" "v") ("w" "w") ("x" "x") ("y" "y")
232 (toupper (shift non-upcase)))
238 Let's see what happens when the user types the key sequence \<a\> \<b\> \< \>.
239 Upon \<a\>, "A" is committed and the state shifts to <span class="fragment">non-upcase</span>.
240 So, the next \<b\> is handled in the <span class="fragment">non-upcase</span> state.
242 rule in the map <span class="fragment">lower</span>, "b" is inserted in the preedit buffer and it
243 is committed explicitly by the "commit" command in BRANCH-ACTION. After
244 that, the input method is still in the <span class="fragment">non-upcase</span> state. So the next \< \>
245 is also handled in <span class="fragment">non-upcase</span>. This time, no rule in this state
246 matches it. Thus the branch <span class="fragment">(nil (shift init))</span> is selected and the
247 state is shifted to <span class="fragment">init</span>. Please note that \< \> is not yet
248 handled because the map <span class="fragment">nil</span> does not consume any key event.
249 So, the input method tries to handle it in the <span class="fragment">init</span> state. Again no
250 rule matches it. Therefore, that event is given back to the application
251 program, which usually inserts a space for that.
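The walk-through above can be replayed with a small state-machine model. This is a hedged Python sketch, not m17n code: `STATES` and `titlecase` are hypothetical names, the "ii" rule is left out so that every rule is a single key, and `None` stands in for the map name nil, which is placed last so that it is reached only when no other map matches.

```python
import string

# Single-key maps only (assumption: the "ii" rule is omitted for brevity).
TOUPPER = {c: c.upper() for c in string.ascii_lowercase}
LOWER = {c: c for c in string.ascii_lowercase}

# state name -> ordered branches of (map, state to shift to).
# A map of None models `nil`: it matches any leftover key without consuming it.
STATES = {
    "init": [(TOUPPER, "non-upcase")],
    "non-upcase": [(LOWER, "non-upcase"), (None, "init")],
}


def titlecase(keys):
    """Return the text the application ends up with for the given keys."""
    state = "init"
    out = []
    for key in keys:
        while True:
            consumed = matched = False
            for keymap, next_state in STATES[state]:
                if keymap is None:
                    # nil branch: shift state, but the key is still unhandled.
                    state = next_state
                    matched = True
                    break
                if key in keymap:
                    out.append(keymap[key])  # insert, then commit
                    state = next_state
                    consumed = matched = True
                    break
            if consumed:
                break
            if not matched:
                # Unhandled event: the application inserts the key itself.
                out.append(key)
                break
            # Otherwise nil matched: retry the same key in the new state.
    return "".join(out)
```

Feeding the keys of "ab cd" walks through init, non-upcase, back to init on the space, and produces "Ab Cd".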
253 When you type "a quick blown fox" with this input method, you get "A
254 Quick Blown Fox". OK, you find a typo in "blown", which should be
255 "brown". To correct it, you probably move the cursor after "l" and type
256 \<BackSpace\> and \<r\>. However, if the current input method is still
257 active, a capital "R" is inserted. It is not a sophisticated
260 @subsection im-surrounding-text Example of utilizing surrounding text support
262 To make the input method work well also in such a case, we must use
263 "surrounding text support". It is a way to check characters around
264 the inputting spot and delete them if necessary. Note that
265 this facility is available only with Gtk+ applications and Qt
266 applications. You cannot use it with applications that use XIM
267 to communicate with an input method.
269 Before explaining how to utilize "surrounding text support", you must
270 understand how to use variables, arithmetic comparisons, and
273 First, any symbol (except for several reserved ones) used as ARG
274 of an action is treated as a variable. For instance, the commands
277 (set X 32) (insert X)
280 set the variable <span class="fragment">X</span> to integer value 32, then insert a character
281 whose Unicode character code is 32 (i.e. SPACE).
283 The second argument of the <span class="fragment">set</span> action can be an expression of this form:
286 (OPERAND ARG1 [ARG2])
289 Both ARG1 and ARG2 can be an expression. So,
292 (set X (+ (* Y 32) Z))
295 sets <span class="fragment">X</span> to the value of <span class="fragment">Y * 32 + Z</span>.
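To make the nesting rule concrete, here is a small Python sketch of how such prefix expressions could be evaluated. It is an illustration only: `evaluate` and `do_set` are hypothetical helpers, expressions are written as Python tuples, and only the two operators used in the example above are modeled.

```python
import operator

# Only the operators that appear in the example above are modeled.
OPS = {"+": operator.add, "*": operator.mul}


def evaluate(expr, variables):
    """Evaluate an integer, a variable name, or a nested (OP, ARG1, ARG2) tuple."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return variables[expr]
    op, *args = expr
    return OPS[op](*(evaluate(arg, variables) for arg in args))


def do_set(name, expr, variables):
    """Model of the set action: store the value of expr under name."""
    variables[name] = evaluate(expr, variables)


# (set X (+ (* Y 32) Z)) with Y = 2 and Z = 1 (hypothetical values)
env = {"Y": 2, "Z": 1}
do_set("X", ("+", ("*", "Y", 32), "Z"), env)
```

With these hypothetical values, X becomes 2 * 32 + 1 = 65, so inserting X would insert "A" (U+0041).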
297 We have the following arithmetic/bitwise OPERANDs (require two arguments):
303 these relational OPERANDs (require two arguments):
309 and this logical OPERAND (requires one argument):
315 For surrounding text support, we have these reserved variables:
318 @-0, @-N, @+N (N is a positive integer)
321 Their values are predefined as below and cannot be altered.
324 <li> <span class="fragment">@-0</span>
326 -1 if surrounding text is supported, -2 if not.
328 <li> <span class="fragment">@-N</span>
330 The Nth previous character in the preedit buffer. If there are only M
331 (M<N) previous characters in it, the value is the (N-M)th previous
332 character from the inputting spot.
334 <li> <span class="fragment">@+N</span>
336 The Nth following character in the preedit buffer. If there are only M
337 (M<N) following characters in it, the value is the (N-M)th following
338 character from the inputting spot.
342 So, provided that you have this context:
348 ("def" is in the preedit buffer, two "|"s indicate borders between the
349 preedit buffer and the surrounding text) and your current position in
350 the preedit buffer is between "d" and "e", you get these values:
361 Next, you have to understand the conditional action of this form:
365 (EXPR1 ACTION ACTION ...)
366 (EXPR2 ACTION ACTION ...)
370 where EXPRn are expressions. When an input method executes this
371 action, it evaluates EXPRn one by one from the first branch.
372 If the value of an EXPRn is nonzero, the corresponding actions are
373 executed and the remaining branches are skipped.
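The branch selection can be sketched in Python as follows. This is a model under stated assumptions, not m17n code: `run_cond` and `lower_of` are hypothetical names, expressions and actions are represented as Python callables, and the model stops at the first branch whose expression is nonzero, which is what the catch-all branch used later in this tutorial relies on.

```python
def run_cond(branches, variables):
    """Run the actions of the first branch whose expression is nonzero.

    `branches` is a list of (expr, actions); expr and every action are
    callables that receive the variable dictionary.
    Returns True if some branch was executed.
    """
    for expr, actions in branches:
        if expr(variables) != 0:
            for action in actions:
                action(variables)
            return True   # remaining branches are skipped
    return False


def lower_of(code):
    """Two-branch example: special-case U+0130, otherwise add 32."""
    env = {"X": code, "out": []}
    run_cond(
        [
            # First branch: X is "İ" (U+0130), so insert "i".
            (lambda v: int(v["X"] == 0x130),
             [lambda v: v["out"].append("i")]),
            # Catch-all branch: the constant 1 is always nonzero.
            (lambda v: 1,
             [lambda v: v.update(X=v["X"] + 32),
              lambda v: v["out"].append(chr(v["X"]))]),
        ],
        env,
    )
    return "".join(env["out"])
```

Because the first matching branch wins, `lower_of(0x130)` inserts only "i" and never reaches the catch-all branch.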
375 Now you are ready to write a new version of the input method "Titlecase".
378 (input-method en titlecase2)
379 (description (_ "Titlecase letters"))
382 (toupper ("a" "A") ("b" "B") ("c" "C") ("d" "D") ("e" "E")
383 ("f" "F") ("g" "G") ("h" "H") ("i" "I") ("j" "J")
384 ("k" "K") ("l" "L") ("m" "M") ("n" "N") ("o" "O")
385 ("p" "P") ("q" "Q") ("r" "R") ("s" "S") ("t" "T")
386 ("u" "U") ("v" "V") ("w" "W") ("x" "X") ("y" "Y")
387 ("z" "Z") ("ii" "İ")))
392 ;; Now we have exactly one uppercase character in the preedit
393 ;; buffer. So, "@-2" is the character just before the inputting
396 (cond ((| (& (>= @-2 ?A) (<= @-2 ?Z))
397 (& (>= @-2 ?a) (<= @-2 ?z))
400 ;; If the character before the inputting spot is A..Z,
401 ;; a..z, or İ, remember the only character in the preedit
402 ;; buffer in the variable X and delete it.
404 (set X @-1) (delete @-)
406 ;; Then insert the lowercase version of X.
409 (1 (set X (+ X 32)) (insert X))))))))
412 The above example contains the new action <span class="fragment">delete</span>. So, it is time
413 to explain more about the preedit buffer. The preedit buffer is a
414 temporary place to store a sequence of characters. In this buffer,
415 the input method keeps a position called the "current position". The
416 current position exists between two characters, at the beginning of
417 the buffer, or at the end of the buffer. The <span class="fragment">insert</span> action inserts
418 characters before the current position. For instance, when your
419 preedit buffer contains "ab.c" ("." indicates the current position),
425 changes the buffer to "abxyz.c".
427 There are several predefined variables that represent a specific position in the
428 preedit buffer. They are:
431 <li> <span class="fragment">@@<, @=, @@></span>
433 The first, current, and last positions.
435 <li> <span class="fragment">@-, @+</span>
437 The previous and the next positions.
440 The format of the <span class="fragment">delete</span> action is this:
446 where POS is a predefined positional variable.
447 The above action deletes the characters between POS and
448 the current position. So, <span class="fragment">(delete @-)</span> deletes one character before
449 the current position. Other examples of <span class="fragment">delete</span> include the following:
452 (delete @+) ; delete the next character
453 (delete @<) ; delete all the preceding characters in the buffer
454 (delete @>) ; delete all the following characters in the buffer
457 You can change the current position using the <span class="fragment">move</span> action as below:
460 (move @-) ; move the current position to the position before the
462 (move @<) ; move to the first position
465 Other positional variables work similarly.
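The preedit buffer and its current position can be pictured with a small Python class. This is only a sketch: the class `Preedit` is hypothetical, positions are clamped to the ends of the buffer (an assumption; it ignores what happens with surrounding text), and "." marks the current position when the buffer is printed.

```python
class Preedit:
    """Toy model of a preedit buffer with a current position."""

    def __init__(self, text="", pos=None):
        self.text = text
        # By default the current position is at the end of the buffer.
        self.pos = len(text) if pos is None else pos

    def _resolve(self, where):
        """Translate a positional variable into a buffer index."""
        return {
            "@<": 0,                                   # first position
            "@=": self.pos,                            # current position
            "@>": len(self.text),                      # last position
            "@-": max(self.pos - 1, 0),                # previous position
            "@+": min(self.pos + 1, len(self.text)),   # next position
        }[where]

    def insert(self, s):
        """Insert s before the current position."""
        self.text = self.text[:self.pos] + s + self.text[self.pos:]
        self.pos += len(s)

    def delete(self, where):
        """Delete the characters between `where` and the current position."""
        low, high = sorted((self._resolve(where), self.pos))
        self.text = self.text[:low] + self.text[high:]
        self.pos = low

    def move(self, where):
        """Move the current position to `where`."""
        self.pos = self._resolve(where)

    def __str__(self):
        return self.text[:self.pos] + "." + self.text[self.pos:]


buf = Preedit("abc", 2)   # "ab.c"
buf.insert("xyz")         # "abxyz.c"
```

Starting from "ab.c", inserting "xyz" gives "abxyz.c", which matches the result stated earlier; `buf.delete("@-")` would then give "abxy.c".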
467 Let's see how our new example works. Whatever the key event is, the
468 input method is in its only state, <span class="fragment">init</span>. Since an event of a lowercase letter
469 key is first handled by MAP-ACTIONs, every such key is changed into the
470 corresponding uppercase letter and put into the preedit buffer. Now this character
471 can be accessed with <span class="fragment">@-1</span>.
473 How can we tell whether the new character should be lowercase or
474 uppercase? We can do so by checking the character before it, i.e.
475 <span class="fragment">@-2</span>. BRANCH-ACTIONs in the <span class="fragment">init</span> state do the job.
477 It first checks if the character <span class="fragment">@-2</span> is between A and Z, between
478 a and z, or İ by the conditional below.
481 (cond ((| (& (>= @-2 ?A) (<= @-2 ?Z))
482 (& (>= @-2 ?a) (<= @-2 ?z))
486 If not, nothing special needs to be done. If so, our new key should
487 be changed back into lowercase. Since the uppercase character is
488 already in the preedit buffer, we retrieve and remember it in the
489 variable <span class="fragment">X</span> by
495 and then delete that character by
501 Lastly we re-insert the character in its lowercase form. The
502 problem here is that "İ" must be changed into "i", so we need another
503 conditional. The first branch
509 means that "if the character remembered in X is 'İ', 'i' is inserted".
514 (1 (set X (+ X 32)) (insert X))
517 starts with "1", which is always resolved into nonzero, so this branch
518 is a catchall. Actions in this branch increase <span class="fragment">X</span> by 32, then
519 insert <span class="fragment">X</span>. In other words, they change A...Z into a...z
520 respectively and insert the resulting lowercase character into the
521 preedit buffer. As the input method reaches the end of the
522 BRANCH-ACTIONs, the character is committed.
524 This new input method always checks the character before the current
525 position, so "A Quick Blown Fox" will be successfully fixed to "A
526 Quick Brown Fox" by the key sequence \<BackSpace\> \<r\>.
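The effect of checking the preceding character can be tried out with a short Python model. It is a sketch under several assumptions and not the m17n implementation: `is_letter`, `type_key`, and `backspace` are hypothetical helpers, the application text and cursor are modeled as a string and an index, the character before the cursor plays the role of @-2, and the "ii" rule is omitted.

```python
def is_letter(ch):
    """True for A..Z, a..z, or U+0130, like the conditional on @-2."""
    return ch is not None and (
        "A" <= ch <= "Z" or "a" <= ch <= "z" or ch == "\u0130"
    )


def type_key(text, cursor, key):
    """Type one key at `cursor`; return the new (text, cursor)."""
    if not ("a" <= key <= "z"):
        # No rule matches: the application inserts the key itself.
        return text[:cursor] + key + text[cursor:], cursor + 1
    upper = key.upper()   # MAP-ACTION: the preedit holds one uppercase letter
    before = text[cursor - 1] if cursor > 0 else None   # role of @-2
    if is_letter(before):
        # Model of (set X @-1) (delete @-) followed by the catch-all branch.
        out = chr(ord(upper) + 32)
    else:
        out = upper
    return text[:cursor] + out + text[cursor:], cursor + 1


def backspace(text, cursor):
    """Delete the character before the cursor, as the application would."""
    if cursor == 0:
        return text, cursor
    return text[:cursor - 1] + text[cursor:], cursor - 1


# Fix the typo: put the cursor after "l", press BackSpace, then type r.
text = "A Quick Blown Fox"
cursor = text.index("l") + 1
text, cursor = backspace(text, cursor)
text, cursor = type_key(text, cursor, "r")
```

After these steps `text` holds "A Quick Brown Fox", because the character before the cursor at the time \<r\> is typed is "B".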