CJKV Information Processing 2nd ed.

stoolpigeon writes "At the end of last year, I moved from an IT shop focused on supporting the US side of our business to a department that supports our operations outside the US. This was the first time I had worked in an international context, and I regularly found myself running into long-held assumptions that were no longer true. My first project was implementing a third-party, web-based HR system for medium-sized offices. I kept missing important issues because I had such a narrow approach to the problem space. Sure, I've built applications and databases that supported Unicode, but I had never actually implemented anything with them other than the same types of systems I'd built in the past with ASCII. But a large portion of the world's population is in Asia, and ASCII is certainly not going to cut it there. Fortunately, a new edition of Ken Lunde's classic CJKV Information Processing has become available, and it has really opened my eyes." Keep reading for the rest of JR's review.
CJKV Information Processing 2nd ed.
author: Ken Lunde
pages: 898
publisher: O'Reilly Media, Inc.
rating: 10/10
reviewer: JR Peck
ISBN: 978-0-596-51447-1
summary: Chinese, Japanese, Korean and Vietnamese computing.
CJKV Information Processing has a long history that goes back to the 1980s. It began as a simple text document, JAPAN.INF, available via FTP on a number of servers. This document was excerpted, refined and published in 1993 as Lunde's first book, Understanding Japanese Information Processing. Shortly afterwards JAPAN.INF became CJK.INF, and the foundation for the first edition of CJKV Information Processing was born. The first edition was published in 1999, and it is safe to say that a number of important things have changed over the last 10 years. In the preface, Lunde names four major developments that prompted this second edition: the emergence of Unicode, OpenType and the Portable Document Format (PDF) as preferred tools, and the maturing of the web to the point where it uses Unicode and deals with a wider range of languages and their character sets.

Lunde sets out not to create an exhaustive reference on the languages themselves, but rather an exhaustive guide to the considerations that come into play when processing CJKV information. As Lunde states, "...this book focuses heavily on how CJKV text is handled on computer systems in a very platform-independent way..." Given the complexity of the topic, the breadth of the work and the degree to which it is independent of any specific technology, aside from a heavy bias toward Unicode, are extremely impressive. A glance over the table of contents shows just how true this is. Chapter 9, Information Processing Techniques, has sections touching on C/C++, Java, Perl, Python, Ruby, Tcl and others. These are brief, with most examples in Java, but the fact that they are all directly addressed shows a great awareness of the options out there. The sections that deal with operating system issues have the same breadth. Chapter 10, OSes, Text Editors, and Word Processors, doesn't just hit the top Mac and Windows items. It looks at FreeBSD, Linux, Mac OS X, MS Vista, MS-DOS, Plan 9, OpenSolaris, Unix and more. There are also sections for what Lunde calls hybrid environments, such as Boot Camp, CrossOver Mac, Gnome, KDE, VMware Fusion, Wine and the X Window System. Interestingly, the word processor section covers AbiWord and KWord but not [...]. The point stands: if you are looking to support CJKV, this book will probably cover your platform and give you, at the very least, a starting point with your chosen tool set.

That said, an extremely specific implementation is not what Lunde is out to offer. This is the very opposite of a "cookbook" approach. That also makes the book extremely useful to anyone dealing with internationalization, globalization or localization issues, regardless of character set or language. Lunde teaches the underlying principles of how writing systems and scripts work. He then moves to how computer systems deal with these various writing systems and scripts. The focus is always on CJKV, but the principles hold true in any setting. This continues to be the case as Lunde talks about character sets, encoding, code conversion and a host of other issues that surround handling characters. Typography is included, as well as input and output methods. In each case Lunde covers the basics and points out areas of concern and places where exceptions may cause issues. The author is nothing if not thorough in this regard. His knowledge of the problem space is at times downright staggering. Lunde also touches on dictionaries as well as publishing in print and on the web.

The first three chapters set the table for the rest of the book with an overview of the issues that will be addressed, information on the history and usage of the writing systems and scripts covered, and the character set standards that exist. This was a fascinating glimpse into CJKV languages, and into how other languages are dealt with as well. I think there is a lot here that would be extremely informative to a person who wants to learn more about CJKV, even if they are not a developer who will be working with one of the languages. That's only the first quarter of the book, so I don't know that it would be worth it from just that perspective, but it is definitely a nice benefit of Lunde's approach.

The style is very readable, but I wouldn't just hand this to someone who didn't have some familiarity with text processing issues on computer systems. While there is no requirement to know or understand one of the CJKV languages, understanding how computer systems process data and information is important. I did not know anything about CJKV languages prior to reading the book and have learned quite a bit. What I learned was not limited to the CJKV arena. The experience was very similar to studying ancient Greek in school: in learning Greek, I learned much more about English grammar than I had ever picked up before. Reading CJKV Information Processing, I learned quite a bit more about the issues involved in things like character encoding and typography for every language, not just these four. But in dealing with CJKV specifically, I've found that Lunde's work is indispensable. It is not just my go-to reference, it's essentially my only reference. If any other works do come my way, this is the standard against which they will be judged.

There are thirteen indexes, including a nice glossary. Nine of them are character sets, which were printed out in the longer first edition. In this second edition, there is a note on each, with a URL pointing to a PDF with the information. It seemed odd, but each URL gets its own page. This means there are nine pages with nothing but the title of the index and a URL. Fortunately they are all in the same directory, which can be reached directly from the book's page at the O'Reilly site. It seems it would have made sense to just list them all on a single page, but maybe it was necessary for some reason. It's a minute flaw in what is a great book.



  • QUE? (Score:2, Funny)

    by Em Emalb ( 452530 )
  • CJKV is.... (Score:5, Informative)

    by ForexCoder ( 1208982 ) on Wednesday July 08, 2009 @02:13PM (#28625793)
    CJK is a collective term for Chinese, Japanese, and Korean, which constitute the main East Asian languages. The term is used in the field of software and communications internationalization.
    The term CJKV means CJK plus Vietnamese, which in the past used Hán tự (Chinese characters) and Chữ Nôm prior to adopting Quốc Ngữ.
    • uh (Score:1, Insightful)

      by Anonymous Coward

      why modded troll?

    • by Anonymous Coward

      Interesting slant.

    • Yeah, I was wondering about the characters on the cover, behind the CJKV.

      I was familiar with the first three, the kanji used to represent China (the character for middle), Japan (the character for the sun) and Korea (the character for... Korea), but I didn't realize the last one was the character for Vietnam. It normally means to wake or cause. FWIW, the old name for Vietnam in Japanese seems to be "etsunan", which I guess is pretty close phonetically.
      • ...but I didn't realize the last one was the character for Vietnam. It normally means to wake or cause.

        Doh, nope! :) Actually, it's not the character for "wake" or "cause", i.e. okiru or okosu, but rather the character for "exceed" or "pass through", as in the Japanese words koeru or kosu.

        FWIW, the old name for Vietnam in Japanese seems to be "etsunan", which I guess is pretty close phonetically.

        The etsu part in Japanese is pronounced yuè in Mandarin Chinese. The "u" is kinda pinched in pro

        • Yeah, of course you're right. Is it a feeble excuse to say that I'm used to reading with okurigana? :-(
          • Well, FWIW, the koeru kanji looks not too far from the okiru kanji; they both have the same bushu or radical, the bit going down the left and extending across underneath, which happens to be one of the larger bushu too. :) And, for that matter, there are two kanji used for koeru / kosu, one with the on reading (i.e., the reading(s) generally used in compounds and that came originally from Chinese) of etsu, and the other read as chô, as in chô kawaii!

            So no worries, hey, it's Japanese. Whee!


    • Thanks!

      And to think someone thought that CHKV was a programming language :)

  • by WillAdams ( 45638 ) on Wednesday July 08, 2009 @02:19PM (#28625883) Homepage

    [The one-URL-per-page layout] is likely a limitation of the use of FrameMaker to compose the document and an unwillingness to set up new styles to put them together (unfortunately O'Reilly hasn't used TeX for a title since _Making TeX Work_), and it was probably let stand since they needed a particular page count to come out to even signatures anyway.


  • When I was working on my JavaCC book I bought Jukka Korpela's Unicode Explained and it was *extremely* helpful. After reading it I actually felt comfortable using various tools to convert from one encoding to another, discussing multibyte character sets, and so forth. It helped me write the Unicode chapter in my book with some confidence. It was the first time I had used vi to enter Unicode characters... fun times.

    That said, it sounds like "CJKV Information Processing" covers some of the same ground. Has anyone read both?

    • by Anonymous Coward

      It's going to have a big overlap, but the additional, crucially important material with CJKV processing is the non-Unicode encoding systems that have been used for those scripts, and the input methods that are used to enter the scripts into the computer. A general-purpose Unicode book will not go into a lot of depth about either of these topics.

    • by KlaymenDK ( 713149 ) on Wednesday July 08, 2009 @02:44PM (#28626333) Journal

      "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" is also a very good -- but very much shorter -- introduction to Unicode.

      I frequently send this to people that I need to work with who don't "get" it.

      • by bcrowell ( 177657 ) on Wednesday July 08, 2009 @06:11PM (#28629145) Homepage

        Nice article -- thanks for providing the link! I liked this: "There Ain't No Such Thing As Plain Text. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly."

        This is not a hard problem to solve in the case of email and web pages, which can have encoding given in headers. (If you validate your page using the w3c validator, it will warn you if you didn't supply an encoding.) It's also not an insanely hard problem for strings in memory; the encoding can be either set by your encoding convention or handled behind the scenes by your language (as in perl).
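To illustrate the header case, here is a minimal sketch in Python (chosen only for brevity; it is not a language used elsewhere in this thread). The message bytes are made up for the example. It reads the declared charset from a Content-Type header with the standard library's email package and uses it to decode the body.

```python
from email import message_from_bytes

# Hypothetical raw message: the header declares how the body bytes are encoded.
raw = (
    b"Content-Type: text/plain; charset=ISO-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Sm\xf6rg\xe5sbord\r\n"
)

msg = message_from_bytes(raw)
declared_charset = msg.get_content_charset()   # charset parameter, lower-cased
body_bytes = msg.get_payload(decode=True)      # body as raw bytes
body_text = body_bytes.decode(declared_charset).strip()
```

The point is only that the encoding travels with the data. A bare file on disk has no equivalent slot for it.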

        What really sucks is files. For instance, I wrote this extremely simple terminal-based personal calendar program in perl, and it's actually attracted a decent number of users. It's internationalized in 11 languages. Well, one day a user sends me an email complaining that the program is giving him mysterious error messages. He sends me his calendar file, which is a plain text file with some Swedish in it. I run the program on my machine with his calendar file, and it works fine. I can't reproduce the bug. We go through a few rounds of confused communication before I finally realize that he must have had the file encoded in Latin-1 on his end, whereas my program is documented as requiring utf-8. So now my program has to include the following cruft:

        sub file_is_valid_utf8 {
          my $f = shift;
          # Read raw bytes so that no decoding layer touches the data.
          open(F,"<:raw",$f) or return 0;
          local $/;   # slurp mode: read the whole file in one go
          my $x=<F>;
          close F;
          return is_valid_utf8($x);
        }

        # What's passed to this routine has to be a stream of bytes, not a utf8 string in which the characters are complete utf8 characters.
        # That's why you typically want to call file_is_valid_utf8 rather than calling this directly.
        sub is_valid_utf8 {
          my $x = shift;
          # utf8::decode works in place, so decode a copy; it returns false on malformed input.
          return utf8::decode(my $dummy = $x);
        }

        Yech. It requires reading the file twice, and it's not even 100% reliable.

        This is the kind of situation where the Unix philosophy, based on plain text files and little programs that read and write them, really runs into a problem. With hindsight, it would have been really, really helpful if Unix filesystems could have included just a smidgen more metadata, enough to specify the character encoding.
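The same check can be done in one pass by keeping the bytes in memory and decoding them once. The following is a rough Python sketch, not the calendar program's actual code. The function name and the Latin-1 fallback are assumptions made for illustration.

```python
def read_text_guessing_encoding(path, fallback="latin-1"):
    """Read a file once and decode it as UTF-8, else with the fallback encoding.

    Returns (text, encoding_used). The fallback is an assumption: Latin-1 can
    decode any byte sequence, so it never fails, but it may still be wrong.
    """
    with open(path, "rb") as handle:
        data = handle.read()            # single pass over the file, raw bytes
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(fallback), fallback
```

This is still a heuristic. A byte sequence that happens to be valid UTF-8 is accepted as UTF-8 even if the author meant something else, which is why metadata stored alongside the file would be the more robust fix.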

        • You could also always open the file for reading and writing with the utf8 encoding; that way it wouldn't matter what the user sets up for their environment.
        • Yech. It requires reading the file twice, and it's not even 100% reliable.

          AFAIK it's not possible to do it in a 100% reliable fashion, but there are technical solutions where the file doesn't need to be read twice. Java, despite all of its flaws, handles this sort of thing pretty well, so I'll use that as an example.

          In Java, there is a distinction between byte-based and character-based I/O. InputStream and OutputStream are byte-based I/O classes; Reader and Writer are character-based. Then you have clas

        • Re: (Score:3, Interesting)

          by spitzak ( 4019 )

          What you are encountering is a typical moron implementation of UTF-8.

          For some reason otherwise intelligent programmers lose their minds when presented with UTF-8. They act as though the program will crash instantly if they ever make a pointer that points at the middle of a character, or if they fail to correctly count the "characters" in a string and dare to use an offset or number of bytes. I am not really certain what causes these diseases but being exposed to decades of character==byte ASCII programming s

        • What really sucks is files.

          Indeed. Which is why Bush hid the facts.

        • Re: (Score:3, Interesting)

          by david.given ( 6740 )

          I liked this: "There Ain't No Such Thing As Plain Text. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly."

          My Unicode mantra is:

          "You can't do random access on strings. No, not even if you turn it into UCS-2. Or UCS-4. Yes, Java is lying to you."

          This is because a Unicode printable thing can span multiple bytes and multiple code points. You can't find the nth character in a string, firstly because Unicode doesn't really have such a concept as a character, and secondly because you don't know where it is. This Java code:

          char c = s.charAt(4);

          ...doesn't do what people think it does --- it returns the 4th UTF-

          • You can't do random access on strings. No, not even if you turn it into UCS-2. Or UCS-4. Yes, Java is lying to you.

            It's been interesting reading different people's replies to my post. One thing I've noticed is that each of us is talking about the language he's most familiar with. I was writing about a situation I encountered with perl. You're talking about java. Other people are talking about C.

            Your comment applies to java but not to perl. In perl, you really can do random access on strings. All the int

            • Actually, I do it mostly in C --- I picked Java for that example because it has a really simple example of getting it wrong.

              And when you say Perl supports random access of Unicode strings, are you sure it's not just giving you random access to an array of Unicode code points --- which is also wrong? Remember that a single Unicode glyph can be made up of an arbitrary number of code points.

              Even in European languages, trying to split a string between the combining accent code point and the base character c

              • And when you say Perl supports random access of Unicode strings, are you sure it's not just giving you random access to an array of Unicode code points --- which is also wrong? Remember that a single Unicode glyph can be made up of an arbitrary number of code points.

                Interesting point. Some documentation: man perlunicode, man perluniintro, Unicode::Normalize. I spent some time studying these, and concluded that I didn't understand enough to answer your question :-)
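The three levels discussed in this sub-thread (code units, code points, and what a reader sees as one character) can be told apart in a few lines. This is a hedged Python sketch. The `split_clusters` helper is hypothetical and deliberately simplistic: it only attaches combining marks to the preceding base character, whereas full grapheme segmentation under Unicode's text segmentation rules covers many more cases.

```python
import unicodedata

decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT: two code points
composed = unicodedata.normalize("NFC", decomposed)   # one code point, U+00E9

clef = "\U0001D11E"        # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane
utf16_units = len(clef.encode("utf-16-le")) // 2      # number of 16-bit code units

def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters
```

Indexing by code point into `decomposed` can land on the accent alone, and indexing by UTF-16 code unit into `clef` can land on half of a surrogate pair. Normalizing to NFC hides the first problem for this particular string, but not every base-plus-mark sequence has a precomposed form.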

      • by ld a,b ( 1207022 )

        If you work with them it is easier; hopefully you can get them fired, or at least coerced into doing it right.

        With free software programmed by volunteers it is even worse. Many such volunteers are great coders but they come from ASCII countries and as such don't "get" why tail should perform worse than it used to, or why they should care about character width instead of strlen, or why they should update an algorithm they borrowed from K&R 30 years ago.

        Truth is, with UTF-8 while you lose the c

    • by slarrg ( 931336 )
      Gee, my methods were different than most: I married a Ukrainian woman. Having a wife who knows several languages, each with different 8-bit encodings, using computers in your house on a daily basis makes you appreciate Unicode in a hurry.
      • > Gee, my methods were different than most: I married a Ukrainian woman.

        Hehe, yeah, actually, my wife is Romanian, so all my JavaCC Unicode examples involve s-with-cedilla and stuff like that :-) Buna ziua!

  • by jholder ( 22001 ) on Wednesday July 08, 2009 @02:37PM (#28626207) Homepage Journal
    I used the first ed years ago, and sure enough, Unicode, OTF, and PDF dominate my world now. The only thing complicated enough to need additional exposition would be Arabic, which not only combines RTL and LTR text (as Hebrew does) but also has to be shaped contextually.
    • by brusk ( 135896 )
      Actually Arabic (and Persian, using almost the same alphabet) isn't the only such case; there are lots of complicated issues with S Asian and SE Asian scripts (not to mention Mongolian, which like Arabic has initial, medial and final forms--but is, properly, written vertically).
  • ...nearly every week there will be a new O'Reilly book on something you've never heard of.

    • by radtea ( 464814 )

      It's well known that CJKV is more like QPZA than ASDF, although TYRX process is probably better documented than either.

      Recent developments in RWRI technology have seen a lot of uptake by the IRWR community, leading some to believe that ASDF is on its way out entirely.

      That's a completely clear and informative SUMMARY of the issue, right?

  • Fonts and encoding (Score:4, Interesting)

    by jbolden ( 176878 ) on Wednesday July 08, 2009 @04:25PM (#28627825) Homepage

    I own the first edition of CJKV but I find Fonts and encodings to be far more useful. Obviously if you are working heavily in any of these languages the 2nd best book is worth having, but I'd say that F&E feels like a systematic treatment while CJKV feels like 1000 pages of web articles on the topic.

    • by jholder ( 22001 )
      This looks really good, I'd definitely have gotten this book if it existed back when I had to learn all this stuff starting in 1997-2000.
  • you have to support Turkish! where simple things like
    ... path="C:\Program Files"
    ... path=path.toUpper
    will cause PathDoesNotExistException.

    You need to go through the whole code base and remove any case-changes that happen with the letter "i" or letter "I".
    Because Turkish is one of the very few alphabets where the uppercase version of 7-bit "i" needs 8 bits: it is the dotted capital İ, not "I". See: undotted i.
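The mismatch can be shown without a Turkish locale. In this Python sketch, `turkish_upper` and `turkish_lower` are hand-rolled, hypothetical helpers that apply the Turkish i/I pairing explicitly. Python's own `str.upper()` is locale-independent, so it does not reproduce the bug by itself; the helpers stand in for what a locale-sensitive case conversion would produce.

```python
DOTTED_CAPITAL_I = "\u0130"   # İ, the Turkish upper case of "i"
DOTLESS_SMALL_I = "\u0131"    # ı, the Turkish lower case of "I"

def turkish_upper(text):
    """Upper-case text with Turkish rules: i -> İ, everything else as usual."""
    return text.replace("i", DOTTED_CAPITAL_I).upper()

def turkish_lower(text):
    """Lower-case text with Turkish rules: I -> ı and İ -> i."""
    return (text.replace("I", DOTLESS_SMALL_I)
                .replace(DOTTED_CAPITAL_I, "i")
                .lower())

path = "C:\\Program Files"
default_upper = path.upper()          # locale-independent conversion
turkish = turkish_upper(path)         # what Turkish casing rules would give
```

The two results differ in a single letter, which is enough to make a case-insensitive path comparison fail. A common way out is to do case conversions on identifiers and paths with a fixed, locale-independent rule and reserve locale-sensitive casing for text shown to users.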
    • Re: (Score:3, Insightful)

      by corsec67 ( 627446 )

      Changing the case of a path SHOULD cause it to refer to a different path.

      Here is 5 cents, go buy yourself a better computer.

      • by idji ( 984038 )
        If you are in the *nix world. In Windows, not so. It has nothing to do with a new computer. Turkey is largely an HP + Microsoft world in government and large business.
