[m17n.org] [ Kyoto University, Institute for Research in Humanities, Documentation and Information Center for Chinese Studies ]

DICCS CHISE project

Last modified: Fri Sep 27 00:30:59 JST 2002


About the CHISE Project

The CHISE (CHaracter Information Service Environment) project aims to collect information about the characters in the scripts of the world and organize it into a knowledge base. A new character processing environment based on this architecture is currently under development.

News


Development of a character processing architecture based on a character knowledge base

XEmacs UTF-2000

It is now possible to load character attributes from an external database on demand ("lazy loading"). On 32-bit Intel processor architectures, this shrinks the executable file from the 30 MB required by the traditional build to about 15 MB. This version can be downloaded as XEmacs UTF-2000 0.19 (Koriyama). In addition, there is a UTF-2000 branch of the XEmacs tree at cvs.m17n.org in /cvs/root, which can be accessed by anonymous CVS.
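The on-demand loading idea can be illustrated with a minimal Python sketch. This is not the XEmacs UTF-2000 implementation: the class name `LazyCharAttributes`, the `loader` callback, and the sample data are all hypothetical. It only shows the general pattern, in which an attribute table is fetched from external storage on first access and cached afterwards rather than being built into the executable.

```python
from typing import Callable, Dict, Optional


class LazyCharAttributes:
    """Load per-attribute tables from an external store only when first used."""

    def __init__(self, loader: Callable[[str], Dict[int, str]]) -> None:
        # `loader` is a hypothetical callback that reads one attribute table
        # (code point -> value) from an external database.
        self._loader = loader
        self._cache: Dict[str, Dict[int, str]] = {}
        self.load_count = 0  # number of tables actually loaded so far

    def get(self, code_point: int, attribute: str) -> Optional[str]:
        if attribute not in self._cache:
            # First access to this attribute: load it now and remember it.
            self._cache[attribute] = self._loader(attribute)
            self.load_count += 1
        return self._cache[attribute].get(code_point)


# Stand-in for an external database; the contents are illustrative only.
FAKE_DATABASE = {
    "name": {0x4E00: "CJK UNIFIED IDEOGRAPH-4E00"},
    "total-strokes": {0x4E00: "1"},
}


def fake_loader(attribute: str) -> Dict[int, str]:
    return dict(FAKE_DATABASE.get(attribute, {}))


attrs = LazyCharAttributes(fake_loader)
print(attrs.load_count)                    # no table loaded at start-up
print(attrs.get(0x4E00, "total-strokes"))  # first access triggers one load
print(attrs.get(0x4E00, "total-strokes"))  # second access uses the cache
print(attrs.load_count)
```

The point of the pattern is that start-up cost and image size depend only on the attributes actually used in a session, not on the size of the whole database.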

A Topic Maps based approach to a character database

In 2001, a prototype Topic Map engine was developed on top of Zope. Zope proved less than ideal for this purpose, so the focus for this year is to port the engine to a relational database backend. Development currently continues with PostgreSQL. We plan to enable Topic Map editing within XEmacs UTF-2000, and also to support other clients in addition to it.

Database of features of characters

Database of the component structure of Chinese Characters

Based on the Ideographic Description Sequence (IDS) syntax defined in ISO/IEC 10646-1:2000 and Unicode, we are now developing a database that expresses the structure of Chinese characters. At the moment, we are using the characters in the Unicode tables as a reference. The basic CJK Unified Ideographs, together with Extension A and Extension B, are currently covered: more than 70,000 characters in total.
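As an illustration of the syntax, the following Python sketch parses an IDS string into a tree and lists its leaf components. It assumes the twelve Ideographic Description Characters U+2FF0 to U+2FFB, of which U+2FF2 and U+2FF3 take three operands and the rest take two. The names `Node`, `parse_ids`, and `components` are hypothetical and are not part of any CHISE tool.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Ideographic Description Characters U+2FF0..U+2FFB and their operand counts.
# U+2FF2 and U+2FF3 take three operands; the others take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3
IDC_ARITY["\u2FF3"] = 3


@dataclass
class Node:
    operator: str            # one Ideographic Description Character
    operands: List["Tree"]   # sub-trees or plain component characters


Tree = Union[str, Node]


def _parse(ids: str, pos: int) -> Tuple[Tree, int]:
    """Parse one description starting at `pos`; return it and the next position."""
    if pos >= len(ids):
        raise ValueError("IDS ended before all operands were supplied")
    ch = ids[pos]
    pos += 1
    if ch not in IDC_ARITY:
        return ch, pos  # a plain component character
    operands = []
    for _ in range(IDC_ARITY[ch]):
        operand, pos = _parse(ids, pos)
        operands.append(operand)
    return Node(ch, operands), pos


def parse_ids(ids: str) -> Tree:
    tree, pos = _parse(ids, 0)
    if pos != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree


def components(tree: Tree) -> List[str]:
    """Return the leaf components of a parsed IDS, left to right."""
    if isinstance(tree, str):
        return [tree]
    result: List[str] = []
    for operand in tree.operands:
        result.extend(components(operand))
    return result


# 萌 described as 艹 above (日 to the left of 月)
print(components(parse_ids("⿱艹⿰日月")))  # → ['艹', '日', '月']
```

Because the syntax is prefix notation with fixed operand counts, a single recursive pass is enough and no bracketing is needed in the data.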


Tables of the component structure database

The following tables are currently available via anonymous CVS from cvs.m17n.org at /cvs/chise as module ids:

IDS-UCS-Basic.txt
CJK Unified Ideographs (U+4E00 〜 U+9FA5) of ISO/IEC 10646-1:2000
IDS-UCS-Ext-A.txt
CJK Unified Ideographs Extension A (U+3400 〜 U+4DB5, U+FA1F and U+FA23) of ISO/IEC 10646-1:2000
IDS-UCS-Compat.txt
CJK Compatibility Ideographs (U+F900 〜 U+FA2D, except U+FA1F and U+FA23) of ISO/IEC 10646-1:2000
IDS-UCS-Ext-B-1.txt
CJK Unified Ideographs Extension B [part 1] (U-00020000 〜 U-00021FFF) of ISO/IEC 10646-2:2001
IDS-UCS-Ext-B-2.txt
CJK Unified Ideographs Extension B [part 2] (U-00022000 〜 U-00023FFF) of ISO/IEC 10646-2:2001
IDS-UCS-Ext-B-3.txt
CJK Unified Ideographs Extension B [part 3] (U-00024000 〜 U-00025FFF) of ISO/IEC 10646-2:2001
IDS-UCS-Ext-B-4.txt
CJK Unified Ideographs Extension B [part 4] (U-00026000 〜 U-00027FFF) of ISO/IEC 10646-2:2001
IDS-UCS-Ext-B-5.txt
CJK Unified Ideographs Extension B [part 5] (U-00028000 〜 U-00029FFF) of ISO/IEC 10646-2:2001
IDS-UCS-Ext-B-6.txt
CJK Unified Ideographs Extension B [part 6] (U-0002A000 〜 U-0002A6D6) of ISO/IEC 10646-2:2001
IDS-UCS-Compat-Supplement.txt
CJK Compatibility Ideographs Supplement (U-0002F800 〜 U-0002FA1D) of ISO/IEC 10646-2:2001
IDS-Daikanwa-01.txt
Morohashi: Daikanwa Jiten, Volume 1
IDS-Daikanwa-02.txt
Morohashi: Daikanwa Jiten, Volume 2
IDS-Daikanwa-03.txt
Morohashi: Daikanwa Jiten, Volume 3
IDS-Daikanwa-04.txt
Morohashi: Daikanwa Jiten, Volume 4
IDS-Daikanwa-05.txt
Morohashi: Daikanwa Jiten, Volume 5
IDS-Daikanwa-06.txt
Morohashi: Daikanwa Jiten, Volume 6
IDS-Daikanwa-07.txt
Morohashi: Daikanwa Jiten, Volume 7
IDS-Daikanwa-08.txt
Morohashi: Daikanwa Jiten, Volume 8
IDS-Daikanwa-09.txt
Morohashi: Daikanwa Jiten, Volume 9
IDS-Daikanwa-10.txt
Morohashi: Daikanwa Jiten, Volume 10
IDS-Daikanwa-11.txt
Morohashi: Daikanwa Jiten, Volume 11
IDS-Daikanwa-12.txt
Morohashi: Daikanwa Jiten, Volume 12
IDS-Daikanwa-dx.txt
Morohashi: Daikanwa Jiten, Additions
IDS-Daikanwa-ho.txt
Morohashi: Daikanwa Jiten, Appendix
IDS-CBETA.txt
Characters encountered by the Chinese Buddhist Electronic Text Association (CBETA)
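The code point ranges in the list above can be turned into a small lookup, sketched here in Python. The function name `ids_file_for` is hypothetical; the ranges and file names are copied from the list. The Daikanwa and CBETA files are not organized by UCS code point, so they are left out.

```python
from typing import List, Optional, Tuple

# (first, last, file name), copied from the list of tables.
_RANGES: List[Tuple[int, int, str]] = [
    (0x4E00, 0x9FA5, "IDS-UCS-Basic.txt"),
    (0x3400, 0x4DB5, "IDS-UCS-Ext-A.txt"),
    (0xF900, 0xFA2D, "IDS-UCS-Compat.txt"),
    (0x2A000, 0x2A6D6, "IDS-UCS-Ext-B-6.txt"),
    (0x2F800, 0x2FA1D, "IDS-UCS-Compat-Supplement.txt"),
]
# Extension B parts 1 to 5 each cover a block of 0x2000 code points.
_RANGES += [
    (0x20000 + i * 0x2000, 0x21FFF + i * 0x2000, f"IDS-UCS-Ext-B-{i + 1}.txt")
    for i in range(5)
]


def ids_file_for(code_point: int) -> Optional[str]:
    """Return the table expected to contain `code_point`, or None."""
    # U+FA1F and U+FA23 are listed with Extension A rather than with
    # the compatibility ideographs.
    if code_point in (0xFA1F, 0xFA23):
        return "IDS-UCS-Ext-A.txt"
    for first, last, name in _RANGES:
        if first <= code_point <= last:
            return name
    return None


print(ids_file_for(ord("明")))   # → IDS-UCS-Basic.txt
print(ids_file_for(0x22000))     # → IDS-UCS-Ext-B-2.txt
print(ids_file_for(0xFA1F))      # → IDS-UCS-Ext-A.txt
```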

Integration and Composition of Character Glyphs and Styles

The character database also collects information about character glyphs and styles. Combined with the other knowledge about a character in the database, this makes it possible to build a system that uses the component structure information to assemble the glyph for a character from its components, according to contextual requirements. Such a system rules out mismatches caused by erroneous association or insufficient contextual information, and makes it easy to display and print character forms that have not been codified and for which no fonts exist.

Mathematical analysis and visualization of character knowledge


Mailing List

Discussion about the CHISE Project takes place on the CHISE-{ja|en} mailing lists.

Anyone who would like to take part in the discussion and development of the CHISE Project, or who has ideas, questions about the implementation, or wishes for new features, is welcome to join the English list, the Japanese list, or both.

To join a CHISE mailing list, send a message to the corresponding address:

For Japanese:
chise-ja-ctl@m17n.org
For English:
chise-en-ctl@m17n.org
with the line
subscribe Your Name
in the body of the message. You will then receive a confirmation message containing the line
confirm PASSWORD Your Name
You must reply to this message to become a member.

Papers and Presentations

History

This project was supported by the Exploratory Software Project (未踏ソフトウェア創造事業) in 2001.




[ Documentation and Information Center for Chinese Studies (DICCS), Institute for Research in the Humanities, Kyoto University ] [ m17n.org (the Organization for Multilingualization) (National Institute of Advanced Industrial Science and Technology) ]
[ Hanazono University ] [ National Institute of Advanced Industrial Science and Technology ] [ Dept. of Bioinformatics, Medical Research Institute, Tokyo Medical and Dental University ]


Last modified: Fri Nov 8 14:53:35 JST 2002