C++Talk.NET Forum Index C++Talk.NET
C++ language newsgroups

wchar_t,UTF-8, UTF-16, UTF-32 convertability
Steven T. Hatton
Posted: Thu Mar 09, 2006 5:06 pm

It took me some doing to understand exactly what this meant:
<quote url="http://xml.apache.org/xerces-c/build-misc.html#XMLChInfo">
XMLCh should be defined to be a type suitable for holding a utf-16 encoded
(16 bit) value, usually an unsigned short.

All XML data is handled within Xerces-C++ as strings of XMLCh characters.
Regardless of the size of the type chosen, the data stored in variables of
type XMLCh will always be utf-16 encoded values.

Unlike XMLCh, the encoding of wchar_t is platform dependent. Sometimes it is
utf-16 (AIX, Windows), sometimes ucs-4 (Solaris, Linux), sometimes it is
not based on Unicode at all (HP/UX, AS/400, system 390).

Some earlier releases of Xerces-C++ defined XMLCh to be the same type as
wchar_t on most platforms, with the goal of making it possible to pass
XMLCh strings to library or system functions that were expecting wchar_t
parameters. This approach has been abandoned because of

* Portability problems with any code that assumes that the types of XMLCh
and wchar_t are compatible
* Excessive memory usage, especially in the DOM, on platforms with 32 bit
wchar_t.
* utf-16 encoded XMLCh is not always compatible with ucs-4 encoded wchar_t
on Solaris and Linux. The problem occurs with Unicode characters with
values greater than 64k; in ucs-4 the value is stored as a single 32 bit
quantity. With utf-16, the value will be stored as a "surrogate pair" of
two 16 bit values. Even with XMLCh equated to wchar_t, xerces will still
create the utf-16 encoded surrogate pairs, which are illegal in ucs-4
encoded wchar_t strings.
</quote>
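
To make the last bullet of the quote concrete, here is a minimal sketch of the mismatch. It assumes a C++11 compiler, since the `u""`/`U""` literals and the `char16_t`/`char32_t` types postdate this thread; the helper names are made up for illustration and are not part of Xerces-C++.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+1D11E (MUSICAL SYMBOL G CLEF) lies above 0xFFFF, so it needs
// two 16-bit code units in UTF-16 but only one 32-bit unit in UTF-32.
inline std::u16string clef_utf16() { return u"\U0001D11E"; }
inline std::u32string clef_utf32() { return U"\U0001D11E"; }

// Number of storage units each encoding uses for the same character.
inline std::size_t utf16_units() { return clef_utf16().size(); }
inline std::size_t utf32_units() { return clef_utf32().size(); }
```

Copying the two UTF-16 units element by element into a 32-bit wchar_t string would leave two surrogate values where UCS-4 expects the single value 0x1D11E, which is the incompatibility the documentation describes.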

I really don't know the status of the Standard Library progress vis-a-vis
Unicode support. The evidence of the above example demonstrates that there
is a shortcoming in the current ISO/IEC 14882:2003. It is my opinion that
the C++ Standard should specify that the implementation support UTF-8,
UTF-16, and UTF-32 encoding for all locales supported by the
implementation. There should be a character type and corresponding string
class for each of these encodings. Furthermore, the string classes should
support conversion from one encoding to another.

I will stop short of saying there should be an assignment operator defined
so that these strings could be mutually assigned.
--
NOUN:1. Money or property bequeathed to another by will. 2. Something handed
down from an ancestor or a predecessor or from the past: a legacy of
religious freedom. ETYMOLOGY: MidE legacie, office of a deputy, from OF,
from ML legatia, from L legare, to depute, bequeath. www.bartleby.com/61/

---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html ]
Alberto Ganesh Barbati
Posted: Fri Mar 10, 2006 12:06 am

Steven T. Hatton wrote:
Quote:

I really don't know the status of the Standard Library progress vis-a-vis
Unicode support. The evidence of the above example demonstrates that there
is a shortcoming in the current ISO/IEC 14882:2003. It is my opinion that
the C++ Standard should specify that the implementation support UTF-8,
UTF-16, and UTF-32 encoding for all locales supported by the
implementation. There should be a character type and corresponding string
class for each of these encodings. Furthermore, the string classes should
support conversion from one encoding to another.

I will stop short of saying there should be an assignment operator defined
so that these strings could be mutually assigned.

The "Call for proposals" for the upcoming Library Extension TR2 (which
you can find at
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1810.html)
explicitly says:

<quote>
The committee especially welcomes proposals in the following areas:

* Unicode
* XML and HTML
* Networking
* Usability for novices and occasional programmers
</quote>

so the issues you are concerned about have already been acknowledged as
relevant by the committee. If you have a formal proposal (not just a
wish list!), then you are very welcome!

Ganesh

Pete Becker
Posted: Fri Mar 10, 2006 8:06 am

Steven T. Hatton wrote:

[quoting some documentation:]

Quote:

Unlike XMLCh, the encoding of wchar_t is platform dependent. Sometimes it is
utf-16 (AIX, Windows), sometimes ucs-4 (Solaris, Linux), sometimes it is
not based on Unicode at all (HP/UX, AS/400, system 390).


This is nonsense. wchar_t does not have an encoding. It's simply storage
for fixed-width wide characters. How you interpret values stored in a
wchar_t is up to you. Note that if its size is 16 bits it is still not
suitable for UTF-16, which is a variable-width encoding. If wchar_t is
32 bits wide it certainly can be used for Unicode. You have to have a
locale that supports whatever flavor of fixed-width encoding you're
interested in using.
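
As an illustration of the point that the width of wchar_t is a property of the implementation, a small sketch can report what a given compiler provides. The function names are hypothetical; only `sizeof`, `CHAR_BIT` and `WCHAR_MAX` come from the standard headers.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t in bits on the current implementation.
// 16 and 32 are the common values, but the standard fixes neither.
inline std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// True when a single wchar_t can hold the largest Unicode code point,
// 0x10FFFF, and so could serve as a fixed-width store for Unicode.
inline bool wchar_holds_any_code_point() {
    return static_cast<unsigned long>(WCHAR_MAX) >= 0x10FFFFul;
}
```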

Quote:
I really don't know the status of the Standard Library progress vis-a-vis
Unicode support. The evidence of the above example demonstrates that there
is a shortcoming in the current ISO/IEC 14882:2003.

Nah, it shows that some people shouldn't be allowed to talk about
character encodings in public. <g>

Quote:
It is my opinion that
the C++ Standard should specify that the implementation support UTF-8,
UTF-16, and UTF-32 encoding for all locales supported by the
implementation. There should be a character type and corresponding string
class for each of these encodings. Furthermore, the string classes should
support conversion from one encoding to another.

First, C and C++ locales handle this sort of conversion. The
basic_string template knows nothing about locales, and rightly so.

The current model is that these are file formats, not string
representations. You translate from whatever file format you have into a
fixed-width representation whose data values are held in a basic_string
object. To write a file, translate from the internal (fixed-width)
representation into the external representation.

Describing how this is done in C might make it a little clearer. You can
read a file encoded in shift-JIS into a char array (a multi-byte
representation), then use, say, mbstowcs or one of its variants, along
with a suitable locale, to transform that text and copy the result into
a wchar_t array (a fixed-width representation). In C++, that translation
would be done with a codecvt locale facet.
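
A minimal sketch of the C approach just described, wrapped in a hypothetical helper named `widen`. It relies on whatever locale is currently installed via `std::setlocale`; in the default "C" locale only plain ASCII is sure to convert, so a Shift-JIS file would additionally need a Shift-JIS locale, whose name is platform-specific and is not assumed here.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Convert a multibyte (external) string into a wide (internal) string
// using the current C locale. Returns an empty string if the bytes are
// not a valid multibyte sequence in that locale.
inline std::wstring widen(const std::string& bytes) {
    // A multibyte string never yields more wide characters than bytes.
    std::vector<wchar_t> buf(bytes.size() + 1);
    std::size_t n = std::mbstowcs(buf.data(), bytes.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::wstring();
    return std::wstring(buf.data(), n);
}
```

The result can then be held in a std::wstring and processed one element at a time, which is the "translate on input, fixed-width inside" model.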

It _might_ be appropriate to have a standard type for managing sequences
of UTF-8-encoded characters (or other variable-width encodings), but
basic_string is not the right base for that. It was designed to work
with fixed-width character representations. Dealing with variable-width
representations requires code that's big and slow. It's rarely worth the
cost.

--

Pete Becker
Roundhouse Consulting, Ltd.

P.J. Plauger
Posted: Fri Mar 10, 2006 8:06 am

"Steven T. Hatton" <hattons (AT) globalsymmetry (DOT) com> wrote in message
news:v-2dnW2-8d9OZJLZRVn-tg (AT) speakeasy (DOT) net...

Quote:
I really don't know the status of the Standard Library progress vis-a-vis
Unicode support. The evidence of the above example demonstrates that there
is a shortcoming in the current ISO/IEC 14882:2003. It is my opinion that
the C++ Standard should specify that the implementation support UTF-8,
UTF-16, and UTF-32 encoding for all locales supported by the
implementation. There should be a character type and corresponding string
class for each of these encodings. Furthermore, the string classes should
support conversion from one encoding to another.

See TR19769, a C Technical Report that adds char16_t and char32_t,
for characters of predictable size. The next version of our library
includes these, using UTF-16 for char16_t, UTF-32 for char32_t, and
conversions to UTF-8 for both. FWIW, these types are also being
proposed for C++0X.

We also provide (as an extension) a converter class that converts
strings between all sorts of encodings, including the above.
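
The Dinkumware converter class itself is not shown in the thread. As a stand-in, here is a hand-written sketch of one of the conversions mentioned, a single UTF-32 code point to UTF-8. `to_utf8` is a hypothetical name, and `char32_t` is used as the C++11 built-in type rather than the TR's typedef.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Surrogates and values above 0x10FFFF are not valid scalar values;
// this sketch returns an empty string for them.
inline std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // lone surrogate
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```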

Quote:
I will stop short of saying there should be an assignment operator defined
so that these strings could be mutually assigned.

Good.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


Andy Little
Posted: Fri Mar 10, 2006 3:27 pm

Alberto Ganesh Barbati wrote:

Quote:
<quote>
The committee especially welcomes proposals in the following areas:

* Unicode
* XML and HTML
* Networking
* Usability for novices and occasional programmers
</quote>

The "Usability for novices and occasional programmers" area would
encompass a GUI, I guess? FWIW, here's a transcript of an email I got
recently from a Windows XP user:

"Hi Andy, Thanks for the utility. When I click on it a box flashes on
the screen and then disappears. Nothing else happens. Please advise."

kind of says it all...

I hope to keep the GUI ball rolling with the intent to write some sort
of standardisation proposal eventually. Meanwhile I'd also be
interested in any papers anyone else has written on the subject of a
standard C++ GUI.
I started a very thin paper in the Boost Vault in the Graphical User
Interface directory. From little acorns maybe oak trees grow and all
that... Anyone who has views please feel free to add stuff there for or
against or get in touch via the boost developers list.

regards
Andy Little

Dietmar Kuehl
Posted: Fri Mar 10, 2006 4:06 pm

Pete Becker wrote:
Quote:
If wchar_t is
32 bits wide it certainly can be used for Unicode. You have to have a
locale that supports whatever flavor of fixed-width encoding you're
interested in using.

Sadly, Unicode is not a fixed-width character representation [anymore]:
due to the presence of "combining characters" you may have characters
which require multiple words to be represented. Thus, I would still
call the internal representation of Unicode characters a "variable
width encoding" although I agree that the internal representation of
characters and character sequences are whatever the implementation
considers reasonable. The effect is that the C++ standard library
indeed does not provide the necessary tools to effectively process
internationalized characters. However, I'm not sure that other
libraries address these issues.
--
<mailto:dietmar_kuehl (AT) yahoo (DOT) com> <http://www.dietmar-kuehl.de/>
<http://www.eai-systems.com> - Efficient Artificial Intelligence

Pete Becker
Posted: Fri Mar 10, 2006 4:55 pm

Dietmar Kuehl wrote:

Quote:

Sadly, Unicode is not a fixed-width character representation [anymore]:
due to the presence of "combining characters" you may have characters
which require multiple words to be represented.

You're talking about glyphs, not characters. A glyph is the thing that
gets shown on the screen; some glyphs in Unicode require as many as five
characters. In terms of searching for text, advancing through text,
etc., however, Unicode is fixed-width. Just ask the Unicode folks. <g>

--

Pete Becker
Roundhouse Consulting, Ltd.

P.J. Plauger
Posted: Fri Mar 10, 2006 6:16 pm

"Dietmar Kuehl" <dietmar_kuehl (AT) yahoo (DOT) com> wrote in message
news:47d54dFf5cf4U1 (AT) individual (DOT) net...

Quote:
Pete Becker wrote:
If wchar_t is
32 bits wide it certainly can be used for Unicode. You have to have a
locale that supports whatever flavor of fixed-width encoding you're
interested in using.

Sadly, Unicode is not a fixed-width character representation [anymore]:
due to the presence of "combining characters" you may have characters
which require multiple words to be represented. Thus, I would still
call the internal representation of Unicode characters a "variable
width encoding" although I agree that the internal representation of
characters and character sequences are whatever the implementation
considers reasonable. The effect is that the C++ standard library
indeed does not provide the necessary tools to effectively process
internationalized characters. However, I'm not sure that other
libraries address these issues.

The difference is that ISO SC22 passed a resolution in the late 1980s
saying that programming languages need not deal with groups of
combining characters as single characters. It's considered acceptable
to require an extra layer of software, above the standard library, to
deal with such groups, much as a word processor has to impose some
idea of words on the character sequences manipulated by standard
library functions. You don't have this option with UTF-8 in a string
of char, or with UTF-16 in a string of wchar_t. And while you can
canonicalize groups of combining characters into single elements,
you can't do the same with UTF-8 or UTF-16 sequences.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


P.J. Plauger
Posted: Fri Mar 10, 2006 7:06 pm

"skaller" <skaller (AT) users (DOT) sourceforge.net> wrote in message
news:pan.2006.03.10.16.40.26.209033 (AT) users (DOT) sourceforge.net...

Quote:
On Fri, 10 Mar 2006 07:30:31 +0000, P.J. Plauger wrote:

See TR19769, a C Technical Report that adds char16_t and char32_t,
for characters of predictable size. The next version of our library
includes these, using UTF-16 for char16_t, UTF-32 for char32_t, and
conversions to UTF-8 for both. FWIW, these types are also being
proposed for C++0X.

What is wrong with uint16_t and uint32_t from inttypes.h of ISO C99
together with a requirement these types be mandatory?

Not much. From TR19769:

: 3 The new typedefs
:
: This Technical Report introduces the following two new typedefs,
: char16_t and char32_t:
:
: typedef T1 char16_t;
: typedef T2 char32_t;
:
: where T1 has the same type as uint_least16_t and T2 has the same
: type as uint_least32_t.
:
: The new typedefs guarantee certain widths for the data types,
: whereas the width of wchar_t is implementation defined. The data
: values are unsigned, while char and wchar_t could take signed
: values.
:
: This Technical Report also introduces the new header:
:
: <uchar.h>
:
: The new typedefs, char16_t and char32_t, are defined in
: <uchar.h>.

The TR also defines literals for these types:

u'x' u"abc"
U'x' U"abc"

and conversion functions (of the mbrto* and *tomb variety)
for converting between byte sequences and these character
sequences. (And there are feature-test macros to ensure
that the encodings are really UTF-16 and UTF-32, to round
out the functionality.)
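
For readers without the TR to hand, here is a rough sketch of what the literals and feature-test macros look like in use. It is written against C++11, which later adopted `char16_t`, `char32_t` and the `u`/`U` prefixes as language features; the function names are hypothetical, and whether `__STDC_UTF_16__` is defined in C++ mode depends on the implementation.

```cpp
#include <cassert>
#include <climits>

// The new character types are at least as wide as their names say.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t holds 16 bits");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t holds 32 bits");

// Reports whether the implementation advertises UTF-16 for char16_t
// through the feature-test macro described in the TR.
inline bool utf16_advertised() {
#ifdef __STDC_UTF_16__
    return true;
#else
    return false;
#endif
}

// u and U literals carry code point values; string literals include
// a terminating zero element of the same type.
inline bool literals_use_code_points() {
    return u'x' == 0x78 && U'x' == 0x78 &&
           sizeof(u"abc") == 4 * sizeof(char16_t) &&
           sizeof(U"abc") == 4 * sizeof(char32_t);
}
```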

I'd say that's not much different than what you're asking
for.

BTW, support for <uchar.h> is in our new library, and EDG
should be shipping a front end that supports the literals
in a matter of weeks.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


skaller
Posted: Fri Mar 10, 2006 7:06 pm

On Fri, 10 Mar 2006 07:30:31 +0000, P.J. Plauger wrote:

Quote:
See TR19769, a C Technical Report that adds char16_t and char32_t,
for characters of predictable size. The next version of our library
includes these, using UTF-16 for char16_t, UTF-32 for char32_t, and
conversions to UTF-8 for both. FWIW, these types are also being
proposed for C++0X.

What is wrong with uint16_t and uint32_t from inttypes.h of ISO C99
together with a requirement these types be mandatory?

--
John Skaller <skaller at users dot sourceforge dot net>
Async P/L, Realtime software consultants
Felix for C/C++ programmers http://felix.sourceforge.net


frege
Posted: Sat Mar 11, 2006 3:33 pm

"P.J. Plauger" wrote:
Quote:
"skaller" <skaller (AT) users (DOT) sourceforge.net> wrote in message
news:pan.2006.03.10.16.40.26.209033 (AT) users (DOT) sourceforge.net...

On Fri, 10 Mar 2006 07:30:31 +0000, P.J. Plauger wrote:

See TR19769, a C Technical Report that adds char16_t and char32_t,
for characters of predictable size. The next version of our library
includes these, using UTF-16 for char16_t, UTF-32 for char32_t, and
conversions to UTF-8 for both. FWIW, these types are also being
proposed for C++0X.

What is wrong with uint16_t and uint32_t from inttypes.h of ISO C99
together with a requirement these types be mandatory?

Not much. From TR19769:

: 3 The new typedefs
:
: This Technical Report introduces the following two new typedefs,
: char16_t and char32_t:
:
: typedef T1 char16_t;
: typedef T2 char32_t;
:
: where T1 has the same type as uint_least16_t and T2 has the same
: type as uint_least32_t.



I'd like the char types to be distinct types, not the same as ints.
For example:

stream >> char16

should not do the same thing as

stream >> int16

(in fact, I'd be tempted to say stream >> char16, or at least stream >>
utf16 shouldn't even work, because you can't guarantee the unicode char
fits into a single utf16 code point. But I'd like stream >> utf32 to
read the 4 bytes directly).
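
A small sketch of why distinct types matter for overloading. It assumes C++11, where `char16_t` did become a distinct built-in type rather than a typedef; the two `kind` overloads are hypothetical stand-ins for the two stream extractors.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>

// With a distinct type, the character and the integer are different
// overload candidates even though they share size and representation.
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value,
              "char16_t is distinct from its underlying integer type");

// Stand-ins for "stream >> char16" and "stream >> int16".
inline std::string kind(char16_t) { return "character"; }
inline std::string kind(std::uint_least16_t) { return "integer"; }
```

With the TR's plain typedefs the two overloads would be the same function, so a stream could not tell a character from a number.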



:
Quote:
: The new typedefs guarantee certain widths for the data types,
: whereas the width of wchar_t is implementation defined. The data
: values are unsigned, while char and wchar_t could take signed
: values.
:
: This Technical Report also introduces the new header:
:
: <uchar.h>
:
: The new typedefs, char16_t and char32_t, are defined in
: <uchar.h>.

The TR also defines literals for these types:

u'x' u"abc"
U'x' U"abc"

and conversion functions (of the mbrto* and *tomb variety)
for converting between byte sequences and these character
sequences. (And there are feature-test macros to ensure
that the encodings are really UTF-16 and UTF-32, to round
out the functionality.)

I'd say that's not much different than what you're asking
for.

BTW, support for <uchar.h> is in our new library, and EDG
should be shipping a front end that supports the literals
in a matter of weeks.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com



P.J. Plauger
Posted: Sat Mar 11, 2006 9:06 pm

"frege" <gottlobfrege (AT) gmail (DOT) com> wrote in message
news:1142042976.850744.202750 (AT) u72g2000cwu (DOT) googlegroups.com...

Quote:
"P.J. Plauger" wrote:
"skaller" <skaller (AT) users (DOT) sourceforge.net> wrote in message
news:pan.2006.03.10.16.40.26.209033 (AT) users (DOT) sourceforge.net...

On Fri, 10 Mar 2006 07:30:31 +0000, P.J. Plauger wrote:

See TR19769, a C Technical Report that adds char16_t and char32_t,
for characters of predictable size. The next version of our library
includes these, using UTF-16 for char16_t, UTF-32 for char32_t, and
conversions to UTF-8 for both. FWIW, these types are also being
proposed for C++0X.

What is wrong with uint16_t and uint32_t from inttypes.h of ISO C99
together with a requirement these types be mandatory?

Not much. From TR19769:

: 3 The new typedefs
:
: This Technical Report introduces the following two new typedefs,
: char16_t and char32_t:
:
: typedef T1 char16_t;
: typedef T2 char32_t;
:
: where T1 has the same type as uint_least16_t and T2 has the same
: type as uint_least32_t.



I'd like the char types to be distinct types, not the same as ints.
For example:

stream >> char16

should not do the same thing as

stream >> int16

And there is a proposal from Sun to do just that in C++.

Quote:
(in fact, I'd be tempted to say stream >> char16, or at least stream >>
utf16 shouldn't even work, because you can't guarantee the unicode char
fits into a single utf16 code point. But I'd like stream >> utf32 to
read the 4 bytes directly).

Those are indeed the sorts of issues that have to be worked out if
the char16/32_t TR is more fully integrated into C++.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


Greg Herlihy
Posted: Sun Mar 12, 2006 5:48 pm

Dietmar Kuehl wrote:
Quote:
Pete Becker wrote:
If wchar_t is
32 bits wide it certainly can be used for Unicode. You have to have a
locale that supports whatever flavor of fixed-width encoding you're
interested in using.

Sadly, Unicode is not a fixed-width character representation [anymore]:
due to the presence of "combining characters" you may have characters
which require multiple words to be represented. Thus, I would still
call the internal representation of Unicode characters a "variable
width encoding" although I agree that the internal representation of
characters and character sequences are whatever the implementation
considers reasonable. The effect is that the C++ standard library
indeed does not provide the necessary tools to effectively process
internationalized characters. However, I'm not sure that other
libraries address these issues.

A "variable width" encoding is one in which individual characters in an
encoded string cannot be unambiguously identified without reference to
one or more other characters in the string (and in most encoding
schemes, the characters preceding the one being examined must be
consulted). Since every character in a UTF-32 string - including
"combining" marks - can be unambiguously identified without reference
to any other character in the string, UTF-32 is a fixed width - and not
a variable width - character encoding.

Granted, the semantic meaning of a character within a string depends
upon its context - but this is true of any character, and not just of
the combining marks. For example, a combining (or "non-advancing")
umlaut, "¨" followed by a "u" certainly could be semantically
equivalent to a ü, just as a "u" followed by an "e" could be the
semantic equivalent of ü in some contexts. But from the string's point
of view the umlaut, "u", "e" are all of the same "width" and each can
be independently identified without reference to any other character in
the string.
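
A short sketch of the two spellings as UTF-32 sequences, assuming C++11 `U""` literals; the function names are made up. Note that in Unicode the combining mark is stored after its base letter, so the decomposed form here is "u" followed by U+0308.

```cpp
#include <cassert>
#include <string>

// Precomposed: U+00FC LATIN SMALL LETTER U WITH DIAERESIS, one element.
inline std::u32string precomposed() { return U"\u00FC"; }

// Decomposed: U+0075 'u' followed by U+0308 COMBINING DIAERESIS,
// two elements, each of which is identifiable on its own.
inline std::u32string decomposed() { return U"u\u0308"; }
```

The two strings display alike but compare unequal element by element, so treating them as the same text needs a normalization layer above the string type.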

More formally, it is possible to divide a fixed-width encoded string
between any two of its "characters" and have the union of the
characters contained in each divided part be exactly the same set of
characters contained in the whole string. And while a UTF-32 string
passes this division test, a UTF-16 string does not.

Since 16 bits are not enough to represent all 1,114,112 possible
Unicode code points - the UTF-16 encoding must use two, 16 bit
characters (called a "surrogate" pair) to represent some code points
(namely those above 0xFFFF). Dividing a UTF-16 encoded string between
the two characters of a surrogate pair would leave the holder of each
divided part unable to recover the code point for the surrogate pair
that had been split. Consequently, a union of the characters contained
in the two, divided parts would be missing one character present in the
undivided string. UTF-16 is therefore a variable - and not a
fixed width - character encoding.
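
The division test can be sketched in a few lines. This assumes C++11 `char16_t` and `std::u16string`; the helper names are hypothetical, and the ranges are the surrogate ranges defined by UTF-16 (0xD800 to 0xDBFF for the first half, 0xDC00 to 0xDFFF for the second).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a high/low surrogate pair into the code point it encodes.
inline char32_t combine(char16_t high, char16_t low) {
    return 0x10000 + ((static_cast<char32_t>(high) - 0xD800) << 10)
                   + (static_cast<char32_t>(low) - 0xDC00);
}

// Count code points in a UTF-16 string. A well-formed pair counts once;
// an unpaired surrogate counts as one (damaged) element.
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++n) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1]))
            ++i;  // skip the low half of a complete pair
    }
    return n;
}
```

Splitting u"a\U0001D11Eb" after its second element leaves a high surrogate at the end of one part and a low surrogate at the start of the other. Neither part can recover U+1D11E on its own, yet each stray half is still recognizable by its range.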

As a variable-width encoding, though, UTF-16 is remarkably robust. Note
for example that when combining the characters found in two parts of
the divided UTF-16 string, no new character that was not present in the
whole string, will ever be added. In other words, even though the
surrogate pair has been divided across two strings, each half is still
recognizable as being half of a surrogate pair. Neither character in a
surrogate pair shares the same value as some other, single character
UTF-16 value, so neither is mistaken for being a character that it is
not. In contrast, most other variable width encodings - Shift-JIS for
example - are far more susceptible to decoding errors - since the value
of a single character could represent one of any number of possible,
valid characters - and only the character's context can resolve the
ambiguity. Errors are therefore harder to detect, since even when
placed in the wrong context, a character sequence could still appear to
be valid.

The question for C++ is to decide what role these Unicode encodings
play in any future, standards support for Unicode. For example, were
C++ to support UTF-32 and UTF-16 encoded strings explicitly, then such
support should take into account the fixed vs. variable width encoding
of each format. In particular a UTF-16 encoded string would treat a
surrogate pair as a single, indivisible character.

But instead of defining Unicode string classes with particular
encodings, perhaps a better way to add Unicode support to C++ would be
to define a single, all-purpose, Unicode string class. This class could
use its own, implementation-defined internal scheme for storing its
characters (and therefore could be optimized for space or speed). But
through its public interface this Unicode string class would present
itself as a sequence of char32_t character values, which can either be
manipulated individually or serve collectively as output to a
std::ostream in any of several, user-specified encodings.
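
One possible shape for such a class, as a deliberately small sketch. Everything here is hypothetical: the name `unicode_string`, the choice of UTF-16 as the hidden storage, and the interface. It performs no validation of its input and omits the stream output side entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stores UTF-16 internally but presents itself as char32_t values.
class unicode_string {
public:
    // Append one code point, splitting it into a surrogate pair if needed.
    void push_back(char32_t cp) {
        if (cp < 0x10000) {
            units_.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;
            units_.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            units_.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }

    // The public view: a sequence of whole code points.
    std::vector<char32_t> code_points() const {
        std::vector<char32_t> out;
        for (std::size_t i = 0; i < units_.size(); ++i) {
            char32_t u = units_[i];
            bool pair = u >= 0xD800 && u <= 0xDBFF && i + 1 < units_.size() &&
                        units_[i + 1] >= 0xDC00 && units_[i + 1] <= 0xDFFF;
            if (pair) {
                u = 0x10000 + ((u - 0xD800) << 10) + (units_[i + 1] - 0xDC00);
                ++i;  // the low half has been consumed
            }
            out.push_back(u);
        }
        return out;
    }

    // Exposed only so the storage cost can be inspected in this sketch.
    std::size_t storage_units() const { return units_.size(); }

private:
    std::u16string units_;  // implementation detail, hidden from callers
};

// Builds "a", U+1D11E, "b" for demonstration.
inline unicode_string make_sample() {
    unicode_string s;
    s.push_back(U'a');
    s.push_back(0x1D11E);
    s.push_back(U'b');
    return s;
}
```

The sample occupies four 16-bit units internally but reads back as three char32_t values, so callers never see a divided surrogate pair.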

Greg


skaller
Posted: Tue Mar 14, 2006 7:06 am

On Sat, 11 Mar 2006 20:42:15 +0000, P.J. Plauger wrote:

Quote:
"frege" <gottlobfrege (AT) gmail (DOT) com> wrote in message
news:1142042976.850744.202750 (AT) u72g2000cwu (DOT) googlegroups.com...

"P.J. Plauger" wrote:
"skaller" <skaller (AT) users (DOT) sourceforge.net> wrote in message
news:pan.2006.03.10.16.40.26.209033 (AT) users (DOT) sourceforge.net...

On Fri, 10 Mar 2006 07:30:31 +0000, P.J. Plauger wrote:

I'd like the char types to be distinct types, not the same as ints.
For example:

stream >> char16

should not do the same thing as

stream >> int16

And there is a proposal from Sun to do just that in C++.

I'm curious how this works. This issue was always a serious wart
for C++. It doesn't seem feasible to solve this problem with
yet more fundamental datatypes.

So the solution has to come with iostream management?

--
John Skaller <skaller at users dot sourceforge dot net>
Async P/L, Realtime software consultants
Felix for C/C++ programmers http://felix.sourceforge.net


skaller
Posted: Tue Mar 14, 2006 4:06 pm

On Fri, 10 Mar 2006 18:17:07 +0000, P.J. Plauger wrote:

Quote:
"skaller" <skaller (AT) users (DOT) sourceforge.net> wrote in message
news:pan.2006.03.10.16.40.26.209033 (AT) users (DOT) sourceforge.net...

On Fri, 10 Mar 2006 07:30:31 +0000, P.J. Plauger wrote:

: This Technical Report introduces the following two new typedefs,
: char16_t and char32_t:
:
: typedef T1 char16_t;
: typedef T2 char32_t;
:
: where T1 has the same type as uint_least16_t and T2 has the same
: type as uint_least32_t.

Ok, so no promise that modular arithmetic has a particular base.
Which seems fair for characters.

Quote:
The TR also defines literals for these types:

u'x' u"abc"
U'x' U"abc"


Any specification as to the meaning of \U and \u escapes
in these kinds of literals? Eg can one be sure

u'x' == u'\uXXXX' == 0xXXXX

where XXXX is the code point for x (which I don't know offhand :)
And for \x escapes what happens?
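
The thread does not say what the TR specifies here. For comparison, the rules C++11 later settled on can be checked at compile time, as in the sketch below; it says nothing about a 2006 implementation of the TR, and `escapes_agree` is a hypothetical name.

```cpp
#include <cassert>

// In C++11 a u or U character literal has the value of its ISO 10646
// code point, so a plain character and a hex constant agree.
static_assert(u'x' == 0x0078, "plain character equals its code point");

// A universal-character-name names a code point directly.
static_assert(u'\u00E9' == 0x00E9, "four-digit form");
static_assert(U'\U0001D11E' == 0x1D11E, "eight-digit form, above 0xFFFF");

// A hex escape sets the value of the single code unit.
static_assert(u'\x78' == u'x', "hex escape gives the code unit value");

inline bool escapes_agree() {
    return u'x' == u'\x78' && U'\u00E9' == 0xE9;
}
```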

[Sorry for picking your brain :]

--
John Skaller <skaller at users dot sourceforge dot net>
Async P/L, Realtime software consultants
Felix for C/C++ programmers http://felix.sourceforge.net


All times are GMT.