SaveHTMLEx and character encoding

Post by martindholmes »

Hi there,

If rvsoUTF8 is included in the save options for SaveHTMLEx, the result is saved as UTF-8. But if it's not included, what format is saved? Is it UTF-16, for Unicode documents, or something else?

Cheers,
Martin

Post by Sergey Tkachenko »

No, it is saved in a non-Unicode (ANSI) encoding.

In this mode, Unicode text is saved as numeric character codes (&#NNNN;), and non-Unicode text is saved as it is. So, if different lines of text in the same document have different languages (Charset), the resulting HTML cannot be read correctly.

In non-UTF-8 HTML, TRichView saves an encoding identifier (in a <meta> tag) based on RVStyle.TextStyles[0].Charset. If it is DEFAULT_CHARSET, this <meta> tag is not saved, and browsers will autodetect the encoding.
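
To illustrate why mixed charsets break in this mode, here is a minimal Python sketch (not TRichView code). It assumes that RUSSIAN_CHARSET and ANSI_CHARSET correspond to the Windows-1251 and Windows-1252 code pages; the helper name view_as is hypothetical.

```python
def view_as(html_bytes: bytes, page_charset: str) -> str:
    """Decode the whole page with one code page, as a browser would."""
    return html_bytes.decode(page_charset)


# Two ANSI items with different charsets, written to the file byte for byte.
russian_item = "Привет".encode("cp1251")  # assumed stand-in for RUSSIAN_CHARSET
french_item = "café".encode("cp1252")     # assumed stand-in for ANSI_CHARSET
page = russian_item + b" " + french_item

print(view_as(page, "cp1251"))  # Russian is readable, the French accent is not
print(view_as(page, "cp1252"))  # French is readable, the Russian is not
```

Whichever single charset the <meta> tag declares, one of the two items is decoded with the wrong code page.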

Post by martindholmes »

I'm assuming, then, that there are three categories of text here:

ASCII (0 - 127), same in ANSI and UTF-8
128 - 255 (ANSI, saved differently)
256+ (Unicode, saved as numeric escapes)

Am I right? I've been a bit thrown in one of my projects by the fact that the TRVXML component seems to save accented Latin characters in one way, and characters above 255 in another (numeric escapes, I think). Does SaveHTMLEx make the same distinction?

Cheers,
Martin

Post by Sergey Tkachenko »

Yes, saving characters 0..127 does not depend on the encoding.
For Unicode text, characters starting from 128 (not from 256!) are saved as numeric escapes.
For non-Unicode text, characters starting from 128 are saved as they are. This may cause problems when viewing such HTML in browsers if the document contains multilanguage text.
All of the above applies to HTML saving. I do not remember exactly how RVXML behaves.
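
The rule for Unicode text can be sketched in a few lines of Python. This only models the behavior described above and is not TRichView's implementation; the function name escape_unicode_item is hypothetical, and escaping of HTML-special characters such as < and & is deliberately left out.

```python
def escape_unicode_item(text: str) -> str:
    """Keep code points 0..127 as they are; write the rest as &#NNNN;."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


# é is U+00E9 (233): below 256, but still written as a numeric escape.
print(escape_unicode_item("café"))
print(escape_unicode_item("Привет"))
```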

If your documents are multilanguage, I highly recommend saving them as UTF-8.
Pro: no problems with encodings; all text items, both Unicode and ANSI, are saved properly.
Contra: larger size for some languages. For example, 2 bytes per Russian character in UTF-8, instead of 1 byte in RUSSIAN_CHARSET. But if Russian text is represented in Unicode items in your document, UTF-8 is still much more efficient than saving numeric escapes.
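
The size trade-off can be checked with a short Python sketch. It assumes Windows-1251 as the code page behind RUSSIAN_CHARSET, and the helper name size_report is hypothetical; it only counts bytes for the text itself, not for the surrounding HTML markup.

```python
def size_report(text: str, ansi_codepage: str) -> dict:
    """Byte counts for the same text in three possible HTML representations."""
    # Numeric escapes for every code point from 128 up, as for Unicode items.
    escaped = "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)
    return {
        "ansi": len(text.encode(ansi_codepage)),   # single-byte code page
        "utf8": len(text.encode("utf-8")),         # UTF-8 output
        "escapes": len(escaped.encode("ascii")),   # &#NNNN; in a non-UTF-8 file
    }


print(size_report("Привет", "cp1251"))
```

For pure ASCII text all three counts are the same, so the difference only matters for characters from 128 up.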