C++Talk.NET C++ language newsgroups
YACode Guest
Posted: Wed Oct 27, 2004 12:04 am Post subject: std::string and UNICODE issues
It appears that this class is nearly useless if you want to compile/work
in a multi-lingual environment and/or switch between unicode and non-unicode?
Even std::exception does not recognize wchar_t exceptions?
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Bob Hairgrove Guest
Posted: Wed Oct 27, 2004 2:01 pm Post subject: Re: std::string and UNICODE issues
On 26 Oct 2004 20:04:19 -0400, yetanothercoder (AT) hotmail (DOT) com (YACode)
wrote:
| Quote: | It appears that this class is nearly useless if you want to compile/work
in a multi-lingual environment and/or switch between unicode and non-unicode?
Even std::exception does not recognize wchar_t exceptions?
|
std::exception is ... well, an exception <g>. Most of the time, you
can use std::wstring and do something like this in a header file
included wherever you need it:
#include <string>
#ifdef UNICODE
typedef std::wstring tstring;  // wide build: wchar_t elements
#else
typedef std::string tstring;   // narrow build: char elements
#endif
// same for stringstream, etc.
But you are right about std::exception. There was a thread about it
recently here (subject was: '"what_arg" parameter of the standard
exception class constructors'). You basically have to define your
own exception class if you need wide-string support.
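A minimal sketch of such an exception class, assuming C++11. The name wide_error and the member wide_what() are hypothetical and are not part of the standard library or of anything posted in this thread. Note that copying a std::wstring member can itself throw, which a production class would have to consider.

```cpp
#include <cassert>
#include <exception>
#include <string>

// Hypothetical exception type that carries a wide-character message.
class wide_error : public std::exception {
public:
    explicit wide_error(const std::wstring& message) : message_(message) {}

    // Narrow fallback required by the std::exception interface.
    const char* what() const noexcept override { return "wide_error"; }

    // Wide message for callers that know about this class.
    const std::wstring& wide_what() const noexcept { return message_; }

private:
    std::wstring message_;
};

// Throws a wide_error and returns the wide message recovered in the handler.
std::wstring demo_wide_error() {
    std::wstring seen;
    try {
        throw wide_error(L"file not found");
    } catch (const std::exception& e) {
        // Code that only knows std::exception still gets the narrow text.
        assert(std::string(e.what()) == "wide_error");
        // Code that knows the derived type can recover the wide text.
        if (const wide_error* w = dynamic_cast<const wide_error*>(&e)) {
            seen = w->wide_what();
        }
    }
    return seen;
}
```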
--
Bob Hairgrove
NoSpamPlease (AT) Home (DOT) com
Ulrich Eckhardt Guest
Posted: Wed Oct 27, 2004 2:20 pm Post subject: Re: std::string and UNICODE issues
YACode wrote:
| Quote: | It appears that this class is nearly useless if you want to compile/work
in a multi-lingual environment and/or switch between unicode and
non-unicode?
|
Sure, you can easily store e.g. UTF-8 in it, and even in a multi-lingual
environment, there may be valid reasons to use ASCII.
Anyhow, what do you mean by switching 'between unicode and non-unicode'?
In case you refer to _UNICODE and UNICODE macros used by the win32 API,
those only switch TCHAR between char using codepages and wchar_t using
UTF-16[1]. This is one particular API only, and not a feature of standard
C++. That being said, std::wstring is as suitable for storing Unicode
strings as std::string in that particular environment.
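As a rough illustration of storing UTF-8 in a plain std::string, the sketch below assumes the bytes are already well-formed UTF-8. The function name utf8_code_points is made up for this example. It shows that size() reports bytes, while the number of code points has to be computed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts Unicode code points in a string assumed to hold well-formed UTF-8.
// Continuation bytes (bit pattern 10xxxxxx) never start a code point.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}

void utf8_storage_example() {
    // "gr", u-umlaut (C3 BC), sharp s (C3 9F), "e" as explicit UTF-8 bytes.
    // The literal is split so the 'e' is not read as part of the hex escape.
    const std::string word = "gr\xC3\xBC\xC3\x9F" "e";
    assert(word.size() == 7);             // size() counts bytes
    assert(utf8_code_points(word) == 5);  // five code points
}
```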
| Quote: | Even std::exception does not recognize wchar_t exceptions?
|
The value returned by exception::what() is not intended for the user. At
most, it is intended as a key into a database with meaningful
error messages.
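One way to read the "key into a database" idea is sketched below. The catalogue contents, the key strings and the function name user_message are all assumptions made up for illustration; a real application would load the messages per locale.

```cpp
#include <cassert>
#include <exception>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical message catalogue: maps the narrow key returned by what()
// to a wide, user-facing message.
std::wstring user_message(const std::exception& e) {
    static const std::map<std::string, std::wstring> catalogue = {
        {"file_not_found", L"The file could not be found."},
        {"access_denied", L"You do not have permission to open this file."},
    };
    const auto it = catalogue.find(e.what());
    if (it == catalogue.end()) {
        return L"An unknown error occurred.";
    }
    return it->second;
}

void catalogue_example() {
    assert(user_message(std::runtime_error("file_not_found")) ==
           L"The file could not be found.");
    assert(user_message(std::runtime_error("no_such_key")) ==
           L"An unknown error occurred.");
}
```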
Uli
[1]: AFAIK, MS Windows NT4 only used UCS-2 and MS Windows CE only supports
the wchar_t parts of the win32 API.
--
FAQ: http://parashift.com/c++-faq-lite/
/* bittersweet C++ */
default: break;
Antoun Kanawati Guest
Posted: Wed Oct 27, 2004 2:22 pm Post subject: Re: std::string and UNICODE issues
YACode wrote:
| Quote: | It appears that this class is nearly useless if you want to compile/work
in a multi-lingual environment and/or switch between unicode and non-unicode?
|
Do you mean std::string or std::basic_string?
The first is essentially strings of 1-byte characters. The second
is a template which lets you use any sort of 'character', with any
sort of interpretation.
I will not claim that implementing unicode on top of basic_string
is an easy task, but surely it has been done at least once.
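To make the template point concrete, here is a small sketch. It assumes C++11, which postdates this thread, for char32_t and static_assert; the typedef code_point_string and the function make_sample are hypothetical names. It only shows storage of code points, not Unicode-aware behaviour.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// std::string and std::wstring are just two instantiations of basic_string.
static_assert(std::is_same<std::string, std::basic_string<char>>::value,
              "std::string is basic_string<char>");
static_assert(std::is_same<std::wstring, std::basic_string<wchar_t>>::value,
              "std::wstring is basic_string<wchar_t>");

// A string with 32-bit elements, wide enough for any single code point.
typedef std::basic_string<char32_t> code_point_string;

code_point_string make_sample() {
    code_point_string s;
    s.push_back(U'\u03A9');      // GREEK CAPITAL LETTER OMEGA
    s.push_back(U'\U0001F600');  // a code point above U+FFFF
    return s;
}

void basic_string_example() {
    const code_point_string s = make_sample();
    assert(s.size() == 2);       // one element per code point
    assert(s[1] == 0x1F600);
}
```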
| Quote: | Even std::exception does not recognize wchar_t exceptions?
|
Quite an interesting oversight in an international standard.
--
A. Kanawati
NO.antounk.SPAM (AT) comcast (DOT) net
Vitaly Repin Guest
Posted: Thu Oct 28, 2004 1:48 pm Post subject: Re: std::string and UNICODE issues
Hello!
yetanothercoder (AT) hotmail (DOT) com (YACode) wrote in message news:<d60f26c1.0410261515.148cbb1 (AT) posting (DOT) google.com>...
| Quote: | It appears that this class is nearly useless if you want to compile/work
in a multi-lingual environment and/or switch between unicode and non-unicode?
|
You should use std::wstring instead.
| Quote: | Even std::exception does not recognize wchar_t exceptions?
|
Yes, it doesn't.
Good bye!
--
WBR & WBW, Vitaly
Abhishek Pandey Guest
Posted: Thu Oct 28, 2004 1:51 pm Post subject: Re: std::string and UNICODE issues
"YACode" <yetanothercoder (AT) hotmail (DOT) com> wrote
| Quote: | It appears that this class is nearly useless if you want to compile/work
in a multi-lingual environment and/or switch between unicode and
non-unicode? |
std::string is a typedef for basic_string<char>.
For holding double-byte (wide) characters you can use basic_string<wchar_t>,
which the standard library already provides as std::wstring.
But please note that there is much more to Unicode than just two bytes per
character.
If you want to work in a truly multi-lingual environment, then you may want
to have a look at string libraries that handle Unicode characters properly.
The ICU library is one such library; it provides Unicode string handling
and much more. It has a class called UnicodeString that you can use to
actually handle Unicode characters.
Ioannis Vranos Guest
Posted: Thu Oct 28, 2004 2:10 pm Post subject: Re: std::string and UNICODE issues
Antoun Kanawati wrote:
| Quote: | Do you mean std::string or std::basic_string?
The first is essentially strings of 1-byte characters. The second
is a template which lets you use any sort of 'character', with any
sort of interpretation.
|
What about wstring (which is typedef basic_string<wchar_t> wstring; )?
--
Ioannis Vranos
http://www23.brinkster.com/noicys
Antoun Kanawati Guest
Posted: Fri Oct 29, 2004 2:20 pm Post subject: Re: std::string and UNICODE issues
Ioannis Vranos wrote:
| Quote: | Antoun Kanawati wrote:
| Quote: | Do you mean std::string or std::basic_string?
The first is essentially strings of 1-byte characters. The second
is a template which lets you use any sort of 'character', with any
sort of interpretation.
|
What about wstring (which is typedef basic_string<wchar_t> wstring; )?
|
This one does multi-byte characters, but remains much like the 1-byte
version. It may be suitable for storage, but it is not Unicode.
The second, and oft-forgotten, argument to the template is the traits
class. That's where 'Unicode' would go, assuming it can be done within
a character-traits interface. From limited exposure to these matters,
I know that there is much tedium, pain, and suffering involved in
doing this sort of stuff (the average i18n library weighs a few
megabytes, and the rules are mind-numbing).
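To show what the traits parameter can and cannot do, here is the well-known case-insensitive traits idiom as a sketch. It is ASCII-only and is not a Unicode implementation; the names ci_char_traits and ci_string are chosen for this example. The traits class changes how elements are compared, but each element is still compared on its own, which is part of why full Unicode rules do not fit easily into this interface.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Character traits that compare chars without regard to ASCII case.
struct ci_char_traits : std::char_traits<char> {
    static bool eq(char a, char b) { return lower(a) == lower(b); }
    static bool lt(char a, char b) { return lower(a) < lower(b); }

    static int compare(const char* a, const char* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }

    static const char* find(const char* s, std::size_t n, char c) {
        for (std::size_t i = 0; i < n; ++i) {
            if (eq(s[i], c)) return s + i;
        }
        return nullptr;
    }

private:
    static char lower(char c) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
};

typedef std::basic_string<char, ci_char_traits> ci_string;

void traits_example() {
    assert(ci_string("Hello") == ci_string("HELLO"));  // case is ignored
    assert(ci_string("Hello") != ci_string("World"));
}
```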
--
A. Kanawati
NO.antounk.SPAM (AT) comcast (DOT) net
Ioannis Vranos Guest
Posted: Sun Oct 31, 2004 11:01 am Post subject: Re: std::string and UNICODE issues
Antoun Kanawati wrote:
| Quote: | This one does multi-byte characters, but remains much like the 1-byte
version. It may be suitable for storage, but it is not Unicode.
|
wchar_t is the largest character set supported in a system. In my system
for example (.NET/Windows) it is Unicode.
--
Ioannis Vranos
http://www23.brinkster.com/noicys
Antoun Kanawati Guest
Posted: Mon Nov 01, 2004 11:53 am Post subject: Re: std::string and UNICODE issues
Ioannis Vranos wrote:
| Quote: | Antoun Kanawati wrote:
| Quote: | This one does multi-byte characters, but remains much like the 1-byte
version. It may be suitable for storage, but it is not Unicode.
|
wchar_t is the largest character set supported in a system. In my system
for example (.NET/Windows) it is Unicode.
|
It's only a multi-byte word. It's Unicode when it's associated with a
specific interpretation of the characters. Without the interpretation,
it is merely a storage mechanism that is valid for Unicode.
You still need a significant amount of machinery to ensure that every
string you build is a valid internal Unicode representation of the
input it was constructed from.
In a closed environment, where everything is Unicode, this may be a
non-problem. But the moment you start consuming data through unblessed
interfaces (e.g. a socket stream), you are stuck with the problem of
ensuring, or converting to, valid form before you can assert that all
your basic_string<wchar_t> objects hold valid Unicode representations of
the inputs.
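A sketch of the kind of check this implies for 16-bit data arriving from an untrusted source. It assumes C++11 std::u16string rather than a platform-specific wchar_t, and the function name is_valid_utf16 is invented here. It only checks surrogate pairing, which is the part of "valid form" that UTF-16 itself defines.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if every surrogate in `s` belongs to a well-formed
// high/low pair, i.e. the sequence is well-formed UTF-16.
bool is_valid_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t unit = s[i];
        if (unit >= 0xD800 && unit <= 0xDBFF) {         // high surrogate
            if (i + 1 == s.size()) return false;        // nothing follows it
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;                                        // skip the low half
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {  // stray low surrogate
            return false;
        }
    }
    return true;
}

void validation_example() {
    assert(is_valid_utf16(u"a\U0001F600b"));                      // paired
    assert(!is_valid_utf16(std::u16string(1, char16_t(0xD83D)))); // lone high
    assert(!is_valid_utf16(std::u16string(1, char16_t(0xDE00)))); // lone low
}
```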
--
A. Kanawati
NO.antounk.SPAM (AT) comcast (DOT) net
Maxim Yegorushkin Guest
Posted: Mon Nov 01, 2004 12:17 pm Post subject: Re: std::string and UNICODE issues
On 31 Oct 2004 06:01:26 -0500, Ioannis Vranos <ivr (AT) guesswh (DOT) at.grad.com>
wrote:
| Quote: | wchar_t is the largest character set supported in a system. In my system
for example (.NET/Windows) it is Unicode.
|
Not quite so.
Windows deals with UTF-16, and UTF-16 has surrogate pairs. That means a
single 2-byte Windows wchar_t cannot store every Unicode code point: code
points above U+FFFF take two wchar_t elements.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_192r.asp
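The surrogate-pair arithmetic can be sketched as follows, assuming C++11 char16_t/char32_t instead of the Windows wchar_t so that the example is portable. The function name to_utf16 is made up for this illustration, and the input is assumed to be a valid code point.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point as UTF-16 code units.
// Assumes cp <= 0x10FFFF and that cp is not itself a surrogate value.
std::u16string to_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));  // fits in one unit
    } else {
        const char32_t v = cp - 0x10000;           // 20 significant bits
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low
    }
    return out;
}

void surrogate_example() {
    assert(to_utf16(0x20AC).size() == 1);          // EURO SIGN: one unit
    const std::u16string pair = to_utf16(0x1F600); // above U+FFFF: two units
    assert(pair.size() == 2);
    assert(pair[0] == 0xD83D && pair[1] == 0xDE00);
}
```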
--
Maxim Yegorushkin
Ulrich Eckhardt Guest
Posted: Mon Nov 01, 2004 10:23 pm Post subject: Re: std::string and UNICODE issues
Ioannis Vranos wrote:
| Quote: | Antoun Kanawati wrote:
| Quote: | This one does multi-byte characters, but remains much like the 1-byte
version. It may be suitable for storage, but it is not Unicode.
|
wchar_t is the largest character set supported in a system. In my system
for example (.NET/Windows) it is Unicode.
|
You can use an 8-bit character and still 'be Unicode', like BeOS does. The
problem is that with a 16-bit wchar_t (as is present on e.g. win32), you
sometimes need more than one of these to represent a single character (IOW
a single Unicode codepoint). This is a problem, because 'some_str[42]' does
not necessarily mean the 42nd character of the string, but just the 42nd
element of the container used as storage.
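The difference between element index and code point index can be seen in a small sketch. It assumes C++11 std::u16string as a stand-in for a 16-bit wchar_t string and well-formed input; utf16_code_points is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold well-formed UTF-16:
// every unit except a low surrogate starts a new code point.
std::size_t utf16_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) ++count;
    }
    return count;
}

void indexing_example() {
    // 'a', one code point above U+FFFF (two units), then 'b'.
    const std::u16string s = u"a\U0001F600b";
    assert(s.size() == 4);               // four storage elements
    assert(utf16_code_points(s) == 3);   // but only three code points
    assert(s[2] == 0xDE00);              // element 2 is half a pair, not 'b'
    assert(s[3] == u'b');
}
```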
Uli
--
FAQ: http://parashift.com/c++-faq-lite/
/* bittersweet C++ */
default: break;
Dave Harris Guest
Posted: Tue Nov 02, 2004 9:35 pm Post subject: Re: std::string and UNICODE issues
doomster (AT) knuut (DOT) de (Ulrich Eckhardt) wrote (abridged):
| Quote: | The problem is, that with a 16-bit wchar_t(as is present on e.g.
win32), you sometimes need more than one of these to represent a
single character(IOW a single Unicode codepoint). This is a
problem, because 'some_str[42]' does not necessarily mean the
42nd character of the string, but just the 42nd element of
the container used as storage.
|
As I understand it, Unicode code points don't represent characters either.
For example, the canonical representation of u-with-an-accent is two code
points: U+0075 followed by U+0308. So the 42nd code point may not be the
42nd logical character. UTF-32 is a variable-width format, just like
UTF-16 and UTF-8.
In practice I doubt one would ever want to access some_str[42]; unless the
42 had been found by a previous sweep through the string, in which case it
would presumably always point to the start of a logical character in any
of the encodings.
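The combining-mark case can be sketched like this, assuming C++11 std::u32string so that each element is one code point. The helper approx_characters is hypothetical and deliberately crude: it only knows the Combining Diacritical Marks block (U+0300 to U+036F), whereas real text segmentation covers far more cases.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Rough count of user-perceived characters: code points in the range
// U+0300..U+036F are treated as attaching to the preceding base character.
// This is a simplification, not full grapheme cluster segmentation.
std::size_t approx_characters(const std::u32string& s) {
    std::size_t count = 0;
    for (char32_t cp : s) {
        if (cp < 0x0300 || cp > 0x036F) ++count;
    }
    return count;
}

void combining_example() {
    const std::u32string decomposed = U"u\u0308";  // U+0075 then U+0308
    assert(decomposed.size() == 2);                // two code points
    assert(approx_characters(decomposed) == 1);    // one logical character
}
```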
-- Dave Harris, Nottingham, UK
Ulrich Eckhardt Guest
Posted: Wed Nov 03, 2004 9:59 am Post subject: Re: std::string and UNICODE issues
Dave Harris wrote:
| Quote: | As I understand it, Unicode code points don't represent characters either.
For example, the canonical representation of u-with-an-accent is two code
points: U+0075 followed by U+0308. So the 42nd code point may not be the
42nd logical character. UTF-32 is a variable-width format, just like
UTF-16 and UTF-8.
|
There might be a slight difference: if you insert some element in the middle
of a multi-element sequence (be it a UTF-8 multibyte sequence or a UTF-16
surrogate pair), the resulting string is invalid. Inserting it between an
accent and the letter it was on might not make sense, but at least it will
remain a valid string.
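This distinction can be demonstrated with a sketch for the UTF-8 case. Both function names are invented for the example, and the validity check is structural only: it verifies lead and continuation bytes but does not reject overlong forms, encoded surrogates or values above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Structural UTF-8 check: each lead byte must be followed by the right
// number of continuation bytes (10xxxxxx).
bool is_structurally_valid_utf8(const std::string& s) {
    std::size_t pending = 0;  // continuation bytes still expected
    for (unsigned char b : s) {
        if (pending > 0) {
            if ((b & 0xC0) != 0x80) return false;
            --pending;
        } else if (b < 0x80) {
            // ASCII byte, nothing to track
        } else if ((b & 0xE0) == 0xC0) {
            pending = 1;
        } else if ((b & 0xF0) == 0xE0) {
            pending = 2;
        } else if ((b & 0xF8) == 0xF0) {
            pending = 3;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
    }
    return pending == 0;
}

// Returns a copy of `s` with `what` inserted at byte offset `pos`.
std::string insert_at(std::string s, std::size_t pos, const std::string& what) {
    s.insert(pos, what);
    return s;
}

void insertion_example() {
    // Splitting the two-byte sequence C3 BC (U+00FC) breaks the encoding.
    assert(!is_structurally_valid_utf8(insert_at("\xC3\xBC", 1, "x")));
    // Splitting 'u' from its combining mark U+0308 (CC 88) changes the
    // meaning, but the bytes are still well-formed UTF-8.
    assert(is_structurally_valid_utf8(insert_at("u\xCC\x88", 1, "x")));
}
```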
Uli
--
FAQ: http://parashift.com/c++-faq-lite/
/* bittersweet C++ */
default: break;
kanze@gabi-soft.fr Guest
Posted: Thu Nov 04, 2004 12:05 pm Post subject: Re: std::string and UNICODE issues
brangdon (AT) cix (DOT) co.uk (Dave Harris) wrote in message
news:<memo.20041102204220.1304A (AT) brangdon (DOT) m>...
| Quote: | doomster (AT) knuut (DOT) de (Ulrich Eckhardt) wrote (abridged):
| Quote: | The problem is, that with a 16-bit wchar_t(as is present on e.g.
win32), you sometimes need more than one of these to represent a
single character(IOW a single Unicode codepoint). This is a problem,
because 'some_str[42]' does not necessarily mean the 42nd character
of the string, but just the 42nd element of the container used as
storage.
|
As I understand it, Unicode code points don't represent characters either.
|
It depends on your definition of a character.
| Quote: | For example, the canonical representation of u-with-an-accent is two
code points: U+0075 followed by U+0308. So the 42nd code point may not
be the 42nd logical character. UTF-32 is a variable-width format, just
like UTF-16 and UTF-8.
|
The canonical representation for this letter would be the single code
point \u00FC. But you're right in saying that the issues are more
complicated than they seem; see
http://www.unicode.org/unicode/reports/tr15/, for example, including
its annex 7. My interpretation is that annex E of the C++ standard
is in conflict with it. (According to the Unicode document, the
sequence \u0075\u0308 is equivalent to \u00FC, and an identifier
containing one should compare equal to an identifier containing the
other. Annex E of the C++ standard, however, doesn't allow \u0308 in an
identifier, regardless. And of course, which one is actually in the
source file depends on the configuration of the editor, and you can't
distinguish them when the file is displayed as text.)
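The equivalence problem can be illustrated with a toy sketch. It assumes C++11 std::u32string and handles exactly one pair, the decomposed sequence U+0075 U+0308 and its precomposed lowercase form U+00FC. The function compose_u_diaeresis is invented for this example; real normalization needs the full Unicode composition data, for instance through a library such as ICU.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy composition step covering a single pair, for illustration only:
// replaces U+0075 U+0308 with the precomposed U+00FC.
std::u32string compose_u_diaeresis(const std::u32string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == 0x0075 && i + 1 < in.size() && in[i + 1] == 0x0308) {
            out.push_back(0x00FC);
            ++i;  // consume the combining mark as well
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}

void equivalence_example() {
    const std::u32string decomposed = U"u\u0308";
    const std::u32string precomposed = U"\u00FC";
    // A plain element-wise comparison treats the two spellings as different.
    assert(decomposed != precomposed);
    // After bringing both to the same form they compare equal.
    assert(compose_u_diaeresis(decomposed) == compose_u_diaeresis(precomposed));
}
```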
--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34