SaveHTMLEx and character encoding

Posted: Mon Jun 09, 2008 12:31 pm
by martindholmes
Hi there,

If rvsoUTF8 is included in the save options for SaveHTMLEx, the result is saved as UTF-8. But if it's not included, which encoding is used? Is it UTF-16 for Unicode documents, or something else?

Cheers,
Martin

Posted: Mon Jun 09, 2008 2:16 pm
by Sergey Tkachenko
No, it is saved in non-Unicode (ANSI) encoding.

In this mode, Unicode text is saved using numeric character codes (&#NNNN;), and non-Unicode text is saved as it is. So, if different lines of text in the same document use different languages (Charset), the resulting HTML cannot be read correctly.
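
A rough illustration of why mixed charsets break in this mode. This is a Python sketch, not TRichView code, and the code pages (cp1251 for Russian, cp1252 for Western European) are assumptions chosen only for the example. Two ANSI text items written as raw bytes end up in one file, but a browser can apply only one charset to the whole page:

```python
# Hypothetical example: two non-Unicode (ANSI) text items with different charsets.
russian_item = "Привет".encode("cp1251")  # assumed RUSSIAN_CHARSET code page
french_item = "café".encode("cp1252")     # assumed Western European code page

# In ANSI mode the bytes are written as they are, side by side in one file.
page_bytes = russian_item + b" " + french_item

# A browser decodes the whole page with a single charset, so one item is garbled.
read_as_cp1251 = page_bytes.decode("cp1251")
read_as_cp1252 = page_bytes.decode("cp1252")

print(read_as_cp1251)  # Russian is fine, the French accent is wrong
print(read_as_cp1252)  # French is fine, the Russian is unreadable
```

Whichever charset is picked, one of the two items is decoded with the wrong table.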

In non-UTF-8 HTML, TRichView saves an encoding identifier (in a <meta> tag) based on RVStyle.TextStyles[0].Charset. If it is DEFAULT_CHARSET, this <meta> is not saved, and browsers will autodetect the encoding.

Posted: Mon Jun 09, 2008 8:59 pm
by martindholmes
I'm assuming, then, that there are three categories of text here:

ASCII (0 - 127), same in ANSI and UTF-8
128 - 255 (ANSI, saved differently)
256 + (Unicode, saved as numeric escapes)

Am I right? I've been a bit thrown in one of my projects by the fact that the TRVXML component seems to save accented Latin characters in one way, and characters above 255 in another (numeric escapes, I think). Does SaveHTMLEx make the same distinction?

Cheers,
Martin

Posted: Tue Jun 10, 2008 8:12 am
by Sergey Tkachenko
Yes, saving characters 0..127 does not depend on encoding.
For Unicode text, characters starting from 128 (not from 256!) are saved as numeric escapes.
For non-Unicode text, characters starting from 128 are saved as they are. This may cause problems when viewing such HTML in browsers if the document contains multilanguage text.
All of the above is for HTML saving. I do not remember exactly how it works for RVXML.
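
The rules above can be sketched roughly as follows. This is a Python illustration of the described behaviour, not the TRichView implementation. The function names and the code page argument are hypothetical, and escaping of &, < and > is left out:

```python
def save_unicode_item(text: str) -> bytes:
    """Unicode item: 0..127 kept as is, 128 and above become &#NNNN; escapes."""
    # "xmlcharrefreplace" is Python's built-in handler that writes &#NNNN;
    # for every character the target encoding (here ASCII) cannot represent.
    return text.encode("ascii", "xmlcharrefreplace")


def save_ansi_item(text: str, codepage: str) -> bytes:
    """Non-Unicode item: bytes 128..255 are written unchanged in the item's code page."""
    return text.encode(codepage)


unicode_result = save_unicode_item("café")         # é is U+00E9 = 233, so it is escaped
ansi_result = save_ansi_item("café", "cp1252")     # é stays the single byte 0xE9
ascii_result = save_unicode_item("plain text")     # nothing above 127, nothing changes

print(unicode_result, ansi_result, ascii_result)
```

Note that the accented Latin letter is already escaped in the Unicode case, which matches the "from 128, not from 256" point.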

If your documents are multilanguage, I highly recommend saving them as UTF-8.
Pro: no problems with encodings; all text items, both Unicode and ANSI, are saved properly.
Contra: larger size for some languages. For example, 2 bytes per Russian character, instead of 1 byte in RUSSIAN_CHARSET. But if Russian text is represented in Unicode items in your document, it is still much more efficient than saving numeric escapes.
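
For a rough sense of the size trade-off, here is a small Python check (not TRichView code) that measures one sample Cyrillic letter in each form. cp1251 is assumed as the code page behind RUSSIAN_CHARSET:

```python
letter = "Я"  # U+042F, a sample Russian letter

size_ansi = len(letter.encode("cp1251"))                        # single-byte code page
size_utf8 = len(letter.encode("utf-8"))                         # UTF-8 encoding
size_escape = len(letter.encode("ascii", "xmlcharrefreplace"))  # numeric escape &#1071;

print(size_ansi, size_utf8, size_escape)
```

The numeric escape is by far the largest of the three, so UTF-8 costs more than a single-byte charset but much less than escapes.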