TRichView Unicode support

Post by **gados** » Thu Aug 07, 2008 9:09 am

Hi Sergey,
regarding support of Chinese characters I have a question and possibly a few suggestions.

Because I had to investigate a problem with our application when running under a DBCS default locale, I took a closer look at the TRichView source code. I understand that TRichView does not support DBCS character sets and that this behaviour is documented. This is OK, although I did not know it when the decision to use TRichView in our product was made.
Problems reported regarding Chinese character support in TRichView include:
- Word wrap behaviour (a double-byte character must not be split)
- Cursor movement (two cursor key presses are necessary to move over one Chinese character)
- Painting of the selection may start inside a double-byte pair (and result in the display of bogus characters)
OK, I think you get the idea. And I fully understand why TRichView does not support DBCS.
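
To illustrate the kind of character boundary check that is missing, here is a minimal sketch in Python (not TRichView code). It assumes code page 950 (Traditional Chinese) and assumes that every byte in the range 0x81..0xFE is a lead byte starting a two-byte character; the names `dbcs_boundaries` and `safe_split` are hypothetical.

```python
from typing import List

# Assumption: lead-byte range for code page 950. Other DBCS code pages
# use different ranges, so this constant is illustrative only.
LEAD_BYTE_MIN = 0x81
LEAD_BYTE_MAX = 0xFE


def dbcs_boundaries(data: bytes) -> List[int]:
    """Return the byte offsets where a character starts, plus len(data)."""
    boundaries = []
    i = 0
    while i < len(data):
        boundaries.append(i)
        is_lead = LEAD_BYTE_MIN <= data[i] <= LEAD_BYTE_MAX
        # A lead byte at the very end of the buffer is treated as one byte.
        i += 2 if is_lead and i + 1 < len(data) else 1
    boundaries.append(len(data))
    return boundaries


def safe_split(data: bytes, limit: int) -> int:
    """Largest character boundary <= limit, so a wrap never splits a pair."""
    return max(b for b in dbcs_boundaries(data) if b <= limit)


# Two Chinese characters (U+4E2D, U+6587) around an ASCII letter.
sample = "\u4e2dA\u6587".encode("cp950")
print(dbcs_boundaries(sample))  # [0, 2, 3, 5]
print(safe_split(sample, 4))    # 3, not 4: offset 4 is inside a pair
```

Word wrap, cursor movement and selection painting would all have to consult such a boundary list instead of treating every byte as a character.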

But I am now faced with the dilemma of either switching over to Unicode support or implementing the missing character boundary checks.

At this time, I am leaning towards Unicode, but I could not yet find any hint in the source code that TRichView handles UTF-16 correctly. From what I could see, the RVU_Length function simply counts every 2 bytes as one character. Therefore I think that Unicode surrogate pairs (characters outside of the BMP, the Basic Multilingual Plane or Plane 0) would have the same problem as DBCS character sets. To my knowledge, such characters are used at least on Traditional Chinese systems. For these too, word wrap, cursor movement etc. would have to know the actual character boundary. Actually, I think that a fix for this would be very similar to what would be needed to support DBCS (in both cases a character consists of either one or two units: bytes in DBCS, 2-byte code units in UTF-16).
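
The difference between counting 2-byte units and counting actual characters can be sketched as follows. This is an illustration in Python, not TRichView code; all function names are hypothetical, and U+20000 is used only because it is the first code point of plane 2.

```python
from typing import List


def utf16_units(text: str) -> List[int]:
    """Encode text as UTF-16 and return its 16-bit code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def next_boundary(units: List[int], pos: int) -> int:
    """Index of the character boundary after pos (pos must be a boundary)."""
    if (pos + 1 < len(units)
            and is_high_surrogate(units[pos])
            and is_low_surrogate(units[pos + 1])):
        return pos + 2  # step over the whole surrogate pair
    return min(pos + 1, len(units))


def count_code_points(units: List[int]) -> int:
    """Count characters by walking from boundary to boundary."""
    count, pos = 0, 0
    while pos < len(units):
        pos = next_boundary(units, pos)
        count += 1
    return count


units = utf16_units("A\U00020000B")
print(len(units))                # 4 code units (what a 2-byte count sees)
print(count_code_points(units))  # 3 characters
```

A cursor that advances with something like `next_boundary` moves over the pair in one step, whereas advancing by one code unit would stop between the two surrogates.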

Could you please confirm whether Unicode code points outside of the BMP are supported by TRichView?
If they are supported, I will switch to Unicode text in my items. Otherwise I would have to implement the necessary changes anyway and could also do it for DBCS, which would, hopefully, not have an impact on single-byte encodings.

I would appreciate your thoughts on this issue.

Best regards,
Gunnar
GERMANY

Post by **Sergey Tkachenko** » Thu Aug 07, 2008 5:18 pm

DBCS was not implemented because all necessary functionality can be implemented using Unicode. While text is Unicode internally, you can still load and save DBCS text files.
I highly recommend using Unicode instead of trying to implement DBCS in TRichView. If you find problems with Unicode processing, let me know and I'll try to fix them.

Surrogate pairs were not tested, but checks for several Unicode characters comprising a single glyph are performed, so I believe it will work on surrogates.
By the way, as far as I know, neither Chinese nor any other living language uses surrogates; only dead languages (such as Aztec or ancient Egyptian) do.

Post by **gados** » Fri Aug 08, 2008 7:32 am

Sergey,

Thank you very much for your quick response. I will switch to Unicode. I think I saw your instructions on how to do this somewhere in this forum.

As far as I can tell from looking at the Unicode 5.0 standard (or any version after 3.0), a large number of Chinese characters are contained in plane 2 of the standard. Some of those are part of HKSCS 2001. It seems, though, that support for the HKSCS supplementary character set is implemented in Windows (pre-Vista) by mapping to BMP code points. Take a look at http://en.wikipedia.org/wiki/HKSCS, if you like.
Starting with Vista, HKSCS-2004 support is built into Windows, and this seems to require correct support of UTF-16, as "ISO/IEC 10646:2003 with Amendment 1" defines a mapping which has Unicode code points in plane 2. (http://www.ogcio.gov.hk/ccli/eng/hkscs/ ... g5-iso.txt)
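
Why a plane 2 code point requires a surrogate pair can be shown with the standard UTF-16 arithmetic. The following Python sketch uses hypothetical helper names and takes U+20000, the first code point of plane 2, as its example (it is not claimed to be an HKSCS character).

```python
from typing import Tuple


def plane_of(code_point: int) -> int:
    """Unicode plane number: 0 is the BMP, 2 holds supplementary ideographs."""
    return code_point >> 16


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits of the offset
    return high, low


cp = 0x20000
high, low = to_surrogate_pair(cp)
print(plane_of(cp), hex(high), hex(low))  # 2 0xd840 0xdc00
```

Any code point in plane 2 therefore occupies two 16-bit code units in a UTF-16 string, which is why text handling that assumes one unit per character is affected.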

My concerns about surrogate pairs (which to my understanding require a correct interpretation of UTF-16) stem from the fact that one of our customers is located in Hong Kong.

Kind regards,
Gunnar