[m17n.org] [ Kyoto University, Institute for Research in Humanities, Documentation and Information Center for Chinese Studies ]

CHISE project

Last modified: Fri Sep 27 00:30:59 JST 2002

About the CHISE Project

The CHISE (CHaracter Information Service Environment) project aims to collect information about the characters in the scripts of the world and to organize it into a knowledge base. A new character processing environment based on this architecture is currently under development.

News

2002-12-07 (Sat) A special session on the CHISE Project will be held at the 5th meeting of the Japan Association for East Asian Text Processing (JAET) at Hanazono University, Kyoto.
2002-08-21 XEmacs UTF-2000 0.19 (Koriyama) has been released.

Development of a character processing architecture based on a character knowledge base

XEmacs UTF-2000

It is now possible to load character attributes from an external database on demand ("lazy loading"). On 32-bit Intel processor architectures, the size of the executable file thus shrinks from the 30 MB required by the traditional build to about 15 MB. This version can be downloaded as XEmacs UTF-2000 0.19 (Koriyama). In addition, there is a UTF-2000 branch of the XEmacs tree at cvs.m17n.org in /cvs/root, which can be accessed by anonymous CVS.
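The idea of lazy loading can be illustrated with a minimal Python sketch. It is not the XEmacs UTF-2000 implementation: the table name `char_attribute`, the attribute name `total-strokes`, and the use of SQLite as the external database are all assumptions made for the example. Attributes are read from the database only on first access and then kept in memory.

```python
import sqlite3


class LazyCharAttributes:
    """Fetch character attributes from an external database on first use."""

    def __init__(self, connection):
        self._conn = connection
        self._cache = {}      # attributes already loaded into memory
        self.db_reads = 0     # how often the database was actually queried

    def get(self, char, attribute):
        key = (ord(char), attribute)
        if key not in self._cache:
            # Not in memory yet: load it from the external database now.
            self.db_reads += 1
            row = self._conn.execute(
                "SELECT value FROM char_attribute"
                " WHERE code_point = ? AND name = ?",
                key,
            ).fetchone()
            self._cache[key] = row[0] if row else None
        return self._cache[key]


def make_demo_db():
    """Build a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE char_attribute ("
        " code_point INTEGER, name TEXT, value TEXT,"
        " PRIMARY KEY (code_point, name))"
    )
    conn.executemany(
        "INSERT INTO char_attribute VALUES (?, ?, ?)",
        [(0x4E00, "total-strokes", "1"), (0x4E8C, "total-strokes", "2")],
    )
    return conn


if __name__ == "__main__":
    attrs = LazyCharAttributes(make_demo_db())
    print(attrs.get("一", "total-strokes"))
    print(attrs.get("一", "total-strokes"))  # served from the cache
    print(attrs.db_reads)
```

Because nothing is loaded until it is requested, the attribute data does not have to be built into the program itself, which is the effect described above for the executable size.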

A Topic Maps-based approach to a character database

In 2001, a prototype of a Topic Map engine was developed based on Zope. This proved less than ideal for the purpose, so the focus for this year is to port the engine to a relational database backend. Development currently continues with PostgreSQL. It is planned to enable Topic Map editing within XEmacs UTF-2000, and also to support multiple other clients.
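How topics and associations might be stored in relational tables can be sketched as follows. This is only an illustration, not the schema of the CHISE engine: the table and column names are hypothetical, and SQLite (from the Python standard library) stands in for PostgreSQL so that the example runs without a server.

```python
import sqlite3

# Hypothetical minimal schema: topics, typed associations, and the roles
# that topics play in each association.
SCHEMA = """
CREATE TABLE topic (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE association (
    id   INTEGER PRIMARY KEY,
    type TEXT NOT NULL
);
CREATE TABLE association_role (
    association_id INTEGER NOT NULL REFERENCES association(id),
    topic_id       INTEGER NOT NULL REFERENCES topic(id),
    role           TEXT NOT NULL
);
"""


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_topic(conn, name):
    return conn.execute("INSERT INTO topic (name) VALUES (?)", (name,)).lastrowid


def associate(conn, assoc_type, members):
    """Create one association; `members` maps role name -> topic id."""
    assoc_id = conn.execute(
        "INSERT INTO association (type) VALUES (?)", (assoc_type,)
    ).lastrowid
    conn.executemany(
        "INSERT INTO association_role VALUES (?, ?, ?)",
        [(assoc_id, topic_id, role) for role, topic_id in members.items()],
    )
    return assoc_id


def related_topics(conn, topic_name, assoc_type):
    """Names of topics sharing an association of the given type."""
    rows = conn.execute(
        """
        SELECT other.name
        FROM topic AS me
        JOIN association_role AS mine   ON mine.topic_id = me.id
        JOIN association      AS a      ON a.id = mine.association_id
        JOIN association_role AS theirs ON theirs.association_id = a.id
                                       AND theirs.topic_id != me.id
        JOIN topic            AS other  ON other.id = theirs.topic_id
        WHERE me.name = ? AND a.type = ?
        ORDER BY other.name
        """,
        (topic_name, assoc_type),
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    conn = open_store()
    ming, sun, moon = (add_topic(conn, c) for c in "明日月")
    associate(conn, "has-component", {"whole": ming, "part": sun})
    associate(conn, "has-component", {"whole": ming, "part": moon})
    print(related_topics(conn, "明", "has-component"))
```

Keeping roles in their own table lets an association have any number of members, so several clients can query and edit the same data through ordinary SQL.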

Database of features of characters

Database of the component structure of Chinese Characters

Based on the Ideographic Description Sequences (IDS) defined in ISO/IEC 10646-1:2000 and Unicode, we are now developing a database that expresses the structure of Chinese characters using this syntax. At the moment, we are using the characters in the Unicode tables as a reference. The basic CJK Unified Ideographs as well as Extension A and Extension B, together more than 70,000 characters, are currently covered.
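To make the syntax concrete, here is a minimal Python sketch that parses an IDS into a tree. It assumes the twelve Ideographic Description Characters U+2FF0 to U+2FFB, of which U+2FF2 and U+2FF3 take three operands and the rest take two. The names `Node`, `parse_ids` and `components` are hypothetical and not part of any CHISE tool.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Ideographic Description Characters: U+2FF2 and U+2FF3 are ternary,
# the other operators in U+2FF0..U+2FFB are binary.
TERNARY = {"\u2FF2", "\u2FF3"}
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY


@dataclass(frozen=True)
class Node:
    operator: str
    parts: Tuple["Tree", ...]


Tree = Union[str, Node]  # a leaf is a single component character


def _parse(ids, pos):
    if pos >= len(ids):
        raise ValueError("truncated IDS: operand missing")
    ch = ids[pos]
    if ch in TERNARY:
        arity = 3
    elif ch in BINARY:
        arity = 2
    else:
        return ch, pos + 1  # plain component character
    pos += 1
    parts = []
    for _ in range(arity):
        part, pos = _parse(ids, pos)
        parts.append(part)
    return Node(ch, tuple(parts)), pos


def parse_ids(ids):
    """Parse a prefix-notation IDS string into a tree."""
    tree, end = _parse(ids, 0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree


def components(tree):
    """Leaf components of a parsed IDS, in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    found = []
    for part in tree.parts:
        found.extend(components(part))
    return found


if __name__ == "__main__":
    print(parse_ids("⿰日月"))                 # structure of 明
    print(components(parse_ids("⿰氵⿰古月")))  # 湖 with 胡 decomposed further
```

Since the operators are written in prefix notation with a fixed number of operands, a simple recursive descent is enough, and nested sequences need no brackets.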

Table of the component structure database

The following tables are currently available via anonymous CVS from cvs.m17n.org at /cvs/chise as module ids:

IDS-UCS-Basic.txt
CJK Unified Ideographs (U+4E00 〜 U+9FA5) of ISO/IEC 10646-1:2000
IDS-UCS-Ext-A.txt
CJK Unified Ideographs Extension A (U+3400 〜 U+4DB5, U+FA1F and U+FA23) of ISO/IEC 10646-1:2000
IDS-UCS-Compat.txt
CJK Compatibility Ideographs (U+F900 〜 U+FA2D, except U+FA1F and U+FA23) of ISO/IEC 10646-1:2000
IDS-UCS-Ext-B-1.txt
CJK Unified Ideographs Extension B [part 1] (U-00020000 〜 U-00021FFF) of ISO/IEC 10646-2:2001
IDS-UCS-Ext-B-2.txt
CJK Unified Ideographs Extension B [part 2] (U-00022000 〜 U-00023FFF) of ISO/IEC 10646-2:2001
IDS-UCS-Ext-B-3.txt
CJK Unified Ideographs Extension B [part 3] (U-00024000 〜 U-00025FFF) of ISO/IEC 10646-2:2001
IDS-UCS-Ext-B-4.txt
CJK Unified Ideographs Extension B [part 4] (U-00026000 〜 U-00027FFF) of ISO/IEC 10646-2:2001
IDS-UCS-Ext-B-5.txt
CJK Unified Ideographs Extension B [part 5] (U-00028000 〜 U-00029FFF) of ISO/IEC 10646-2:2001
IDS-UCS-Ext-B-6.txt
CJK Unified Ideographs Extension B [part 6] (U-0002A000 〜 U-0002A6D6) of ISO/IEC 10646-2:2001
IDS-UCS-Compat-Supplement.txt
CJK Compatibility Ideographs Supplement (U-0002F800 〜 U-0002FA1D) of ISO/IEC 10646-2:2001
IDS-Daikanwa-01.txt
Morohashi: Daikanwa Jiten, Volume 1
IDS-Daikanwa-02.txt
Morohashi: Daikanwa Jiten, Volume 2
IDS-Daikanwa-03.txt
Morohashi: Daikanwa Jiten, Volume 3
IDS-Daikanwa-04.txt
Morohashi: Daikanwa Jiten, Volume 4
IDS-Daikanwa-05.txt
Morohashi: Daikanwa Jiten, Volume 5
IDS-Daikanwa-06.txt
Morohashi: Daikanwa Jiten, Volume 6
IDS-Daikanwa-07.txt
Morohashi: Daikanwa Jiten, Volume 7
IDS-Daikanwa-08.txt
Morohashi: Daikanwa Jiten, Volume 8
IDS-Daikanwa-09.txt
Morohashi: Daikanwa Jiten, Volume 9
IDS-Daikanwa-10.txt
Morohashi: Daikanwa Jiten, Volume 10
IDS-Daikanwa-11.txt
Morohashi: Daikanwa Jiten, Volume 11
IDS-Daikanwa-12.txt
Morohashi: Daikanwa Jiten, Volume 12
IDS-Daikanwa-dx.txt
Morohashi: Daikanwa Jiten, Additions
IDS-Daikanwa-ho.txt
Morohashi: Daikanwa Jiten, Appendix
IDS-CBETA.txt
Characters encountered by the Chinese Buddhist Electronic Text Association (CBETA)
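The code point ranges listed above can be turned into a small lookup that tells which UCS table should contain a given character. This is only a sketch built from the ranges on this page: the function name `ids_file_for` is hypothetical, and the Daikanwa and CBETA tables are left out because they are not organized by UCS code point.

```python
# (first code point, last code point, file name), as listed above.
UCS_RANGES = [
    (0x4E00, 0x9FA5, "IDS-UCS-Basic.txt"),
    (0x3400, 0x4DB5, "IDS-UCS-Ext-A.txt"),
    (0xF900, 0xFA2D, "IDS-UCS-Compat.txt"),
    (0x20000, 0x21FFF, "IDS-UCS-Ext-B-1.txt"),
    (0x22000, 0x23FFF, "IDS-UCS-Ext-B-2.txt"),
    (0x24000, 0x25FFF, "IDS-UCS-Ext-B-3.txt"),
    (0x26000, 0x27FFF, "IDS-UCS-Ext-B-4.txt"),
    (0x28000, 0x29FFF, "IDS-UCS-Ext-B-5.txt"),
    (0x2A000, 0x2A6D6, "IDS-UCS-Ext-B-6.txt"),
    (0x2F800, 0x2FA1D, "IDS-UCS-Compat-Supplement.txt"),
]

# U+FA1F and U+FA23 lie inside the compatibility block but are listed
# with Extension A, so they are checked before the ranges.
EXT_A_EXTRAS = {0xFA1F, 0xFA23}


def ids_file_for(char):
    """Return the name of the UCS table covering `char`, or None."""
    cp = ord(char)
    if cp in EXT_A_EXTRAS:
        return "IDS-UCS-Ext-A.txt"
    for first, last, filename in UCS_RANGES:
        if first <= cp <= last:
            return filename
    return None


if __name__ == "__main__":
    for ch in ("一", chr(0x3400), chr(0xFA1F), chr(0x20000)):
        print(f"U+{ord(ch):04X} -> {ids_file_for(ch)}")
```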

Koichi KAMICHI ( Forum for development of on-the-fly generation of Kanji Fonts ) Analytic tool for Kanji Fonts (in Japanese)

Integration and Composition of Character Glyphs and Styles

Information about character glyphs and styles is collected in the character database. Together with the other knowledge about a character in the database, this makes it possible to build a system that uses the component structure information to assemble the glyph for a character from its components, according to contextual requirements. Such a system excludes mismatches caused by erroneous association or insufficient contextual information, and makes it easy to display and print character forms that have not been encoded and for which no fonts exist.

Forum for development of on-the-fly generation of Kanji Fonts

Mathematical analysis and visualization of character knowledge

Yoshi Fujiwara, Yasuhiro Suzuki, Tomohiko Morioka, “Network of Words”, Artificial Life and Robotics 2002 (Presentation material)
Model for the relation of Kanji characters that share a component

Image 1

Image 2
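The notion of characters related through a shared component can be illustrated with a small Python sketch. It is not the model used in the work cited above, only one simple way to build such a network: two characters are linked when at least one component occurs in both. The function name and the sample decompositions are chosen for the example.

```python
from collections import defaultdict
from itertools import combinations


def shared_component_graph(decompositions):
    """Link every pair of characters that share at least one component.

    `decompositions` maps a character to its components. The result maps
    a sorted pair of characters to the set of components they share.
    """
    # Invert the mapping: component -> characters containing it.
    by_component = defaultdict(set)
    for char, parts in decompositions.items():
        for part in parts:
            by_component[part].add(char)

    # Every pair of characters under the same component gets an edge.
    edges = defaultdict(set)
    for part, chars in by_component.items():
        for a, b in combinations(sorted(chars), 2):
            edges[(a, b)].add(part)
    return dict(edges)


if __name__ == "__main__":
    sample = {
        "明": ["日", "月"],
        "胡": ["古", "月"],
        "時": ["日", "寺"],
    }
    for pair, shared in shared_component_graph(sample).items():
        print(pair, shared)
```

The resulting edge list can be handed to any graph layout or analysis tool for visualization.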

Mailing List

Discussions about the CHISE Project take place on the CHISE-{ja|en} mailing lists.

Anybody who would like to take part in the discussion and development of the CHISE Project, or who has ideas, questions about the implementation, or wishes for new features, is welcome to join the English list, the Japanese list, or both.

To subscribe to a CHISE mailing list, send a message to one of the following addresses:

For Japanese: chise-ja-ctl@m17n.org
For English: chise-en-ctl@m17n.org

with the line

subscribe Your Name

in the body of the message. You will then receive a confirmation message with the line

confirm PASSWORD Your Name

You will have to reply to this message to become a member.
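The subscription request described above can be composed with Python's standard `email` package, as in the following sketch. It only builds the message and does not send it; the function name and the sender address are placeholders, and no Subject header is set because the instructions above do not mention one.

```python
from email.message import EmailMessage

# Control addresses as given above.
CONTROL_ADDRESSES = {
    "ja": "chise-ja-ctl@m17n.org",
    "en": "chise-en-ctl@m17n.org",
}


def subscription_request(language, sender, full_name):
    """Build (but do not send) a subscribe request for one list."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = CONTROL_ADDRESSES[language]
    # The body consists of the single line "subscribe Your Name".
    msg.set_content(f"subscribe {full_name}\n")
    return msg


if __name__ == "__main__":
    # "jane@example.org" is a placeholder sender address.
    print(subscription_request("en", "jane@example.org", "Jane Doe"))
```

The message can then be sent with any mail client or library. The confirmation step still has to be completed by replying to the message received from the list server.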

Papers and Presentations

About XEmacs UTF-2000
About mathematical analysis of Character Information
Other
- “Model and Implementation of a Next Generation Multilingual Processing System” (in Japanese. October 1999)
- WITTERN, Christian, “Non-system characters in XML documents”, in: The Frontier of Asian Information Processing [Seminar Series of the National Documentation and Information Centers in Humanities] No. 10, November 2000
- MORIOKA Tomohiko, “The UTF-2000 Project”, in: Kanji and Information, No.2, March 2001 (in Japanese)
- MORIOKA Tomohiko, “CHISE project - beyond the UTF-2000”, m17n2001: the Fifth International Symposium on Multilingual Information Processing and Open Source Software.
- MORIOKA Tomohiko, “A Short Introduction to UTF-2000 Project”, the First TEI Character Set Issues Working Group (October 2001, University of California, Berkeley, USA).
- WITTERN, Christian, “What is Digitisation?”, in: Kanji and Information, No.3, October 2001 (in Japanese).
- MORO, Shigeki, “The meaning of 'beyond character codes'”, in: Kanji and Information, No.3, October 2001 (in Japanese).
- WITTERN, Christian, “Some thoughts on the digitization of Kanji”, Information Technology and the Humanities [Seminar Series of the National Documentation and Information Centers in Humanities] No. 11, November 2001.
- KAMICHI, Koichi, “Building KAGE (Kanji-font Automatic Generating Engine): The Next Generation of Kanji Processing beyond the Character Code Model” in Journal of Japan Association for East Asian Text Processing (JAET) No. 3, October 2002 (in Japanese).
- MORO, Shigeki, “Software Review: CHISE Project,” in Journal of Japan Association for East Asian Text Processing (JAET) No. 3, October 2002 (in Japanese).

History

This project was supported by the Exploratory Software Project (未踏ソフトウェア創造事業), 2001.

[ Documentation and Information Center for Chinese Studies (DICCS), Institute for Research in the Humanities, Kyoto University ] [ m17n.org (the Organization for Multilingualization) (National Institute of Advanced Industrial Science and Technology) ]
[ Hanazono University ] [ National Institute of Advanced Industrial Science and Technology ] [ Dept. of Bioinformatics, Medical Research Institute, Tokyo Medical and Dental University ]

Last modified: Fri Nov 8 14:53:35 JST 2002.