C++Talk.NET C++ language newsgroups
Dave Guest
|
Posted: Thu Jun 08, 2006 3:20 am Post subject: UTF8 and std::string |
|
|
A few weeks ago I looked for an implementation of std::string that can
handle UTF8 strings. I was thinking that the STL iterator abstraction
would be nice for iterating over a variable length encoded string. So
far I haven't found anything. Does anybody know of a UTF8 std::string
implementation?
I'm really curious how the char_traits template was implemented to
handle variable length character encodings.
Thanks,
Dave
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ] |
Tom Widmer Guest
|
Posted: Fri Jun 09, 2006 9:10 am Post subject: Re: UTF8 and std::string |
|
|
Dave wrote:
| Quote: | A few weeks ago I looked for an implementation of std::string that can
handle UTF8 strings. I was thinking that the STL iterator abstraction
would be nice for iterating over a variable length encoded string. So
far I haven't found anything. Does anybody know of a UTF8 std::string
implementation?
I'm really curious how the char_traits template was implemented to
handle variable length character encodings.
|
std::basic_string and std::char_traits only operate on fixed width
encodings. The general std approach is to only use variable length
encodings in storage, converting them to and from fixed length when
performing IO (using a codecvt facet).
OTOH, lots of other string libraries do handle UTF8 strings, just not
std::basic_string.
Tom
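A minimal sketch of the "convert at the boundary" approach Tom describes, assuming C++11 or later for std::u32string. It does not use a codecvt facet; a hand-written decoder stands in for one so the example does not depend on installed locales. The name decode_utf8 is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes (variable width, as stored or
// transmitted) into UTF-32 code points (fixed width, for processing).
// Throws std::invalid_argument on malformed input.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;      // code point being assembled
        std::size_t len = 0;  // total bytes in this sequence
        char32_t min = 0;     // smallest value allowed for this length
        if (lead < 0x80)                { cp = lead;        len = 1; min = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; min = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After decoding, out[n] is the n-th code point in constant time, which is the property the fixed-width representation buys.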
kanze Guest
|
Posted: Sat Jun 10, 2006 3:14 am Post subject: Re: UTF8 and std::string |
|
|
Bronek Kozicki wrote:
| Quote: | Dave wrote:
A few weeks ago I looked for an implementation of std::string that
can handle UTF8 strings. I was thinking that the STL iterator
abstraction
I suggest that for your normal data processing needs you stick with
fixed-width Unicode encodings, like UTF16 or UTF32 - most std::wstring
implementations directly support one or another. Use UTF8 only for
input/output using IO specific for your platform and/or its support
functions. The reason is simple - efficiency.
|
I'm not sure I agree. I think a lot depends on the application. For a
large set of applications, I'm pretty sure that UTF-8 strings would be
more efficient. With the correct supporting tools (e.g. a regex class
which understands them), they probably wouldn't be any harder to use.
The one case where they really lose is with random access based
strictly on the character index, e.g. accessing the 132nd character in
a string (without accessing any of the intermediate characters). But if
my applications are typical, that's something that you never do --
outside of an editor, when would you do something like that?
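To make the trade-off concrete, here is a small sketch that assumes the std::string holds valid UTF-8. The helper names are hypothetical. Finding the n-th code point needs a scan from the start, while a sequential pass over the bytes costs no more than it would for any other string.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: byte offset at which the n-th code point (0-based)
// starts, or std::string::npos if the string is too short.
// Assumes valid UTF-8. The cost grows with the offset; it is not O(1).
std::size_t offset_of_code_point(const std::string& s, std::size_t n)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_continuation(static_cast<unsigned char>(s[i])))
            continue;              // still inside the previous code point
        if (seen == n)
            return i;              // found the start of code point n
        ++seen;
    }
    return std::string::npos;
}

// Sequential processing: one pass counts every code point.
std::size_t count_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (!is_continuation(static_cast<unsigned char>(s[i])))
            ++count;
    return count;
}
```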
--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
jrm Guest
|
Posted: Sat Jun 10, 2006 4:37 am Post subject: Re: UTF8 and std::string |
|
|
Hi,
Recently I stumbled onto this class:
http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html
The interface looks very similar to std::string but I haven't tried it.
Ravi
Dave wrote:
| Quote: | A few weeks ago I looked for an implementation of std::string that can
handle UTF8 strings. I was thinking that the STL iterator abstraction
would be nice for iterating over a variable length encoded string. So
far I haven't found anything. Does anybody know of a UTF8 std::string
implementation?
I'm really curious how the char_traits template was implemented to
handle variable length character encodings.
Thanks,
Dave
|
Dave Guest
|
Posted: Sat Jun 10, 2006 4:43 am Post subject: Re: UTF8 and std::string |
|
|
Thanks for all of the helpful replies. I came to the same conclusion
after doing further research since originally posting. It looks like
std::wstring and locale conversions when doing I/O are the way to go.
That approach gives a robust solution that can read standard ASCII,
UTF8, and wide character text files equally.
I like this group. There's always good answers in here. Thanks again.
jrm Guest
|
Posted: Sun Jun 11, 2006 12:37 am Post subject: Re: UTF8 and std::string |
|
|
std::wstring might not be a good idea according to the details section
here from ustring class:
<snip
src=http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html#_details>
In a perfect world the C++ Standard Library would contain a UTF-8
string class. Unfortunately, the C++ standard doesn't mention UTF-8 at
all. Note that std::wstring is not a UTF-8 string class because it
contains only fixed-width characters (where width could be 32, 16, or
even 8 bits).
</snip>
Dave wrote:
| Quote: | Thanks for all of the helpful replies. I came to the same conclusion
after doing further research since originally posting. It looks like
std::wstring and locale conversions when doing I/O are the way to go.
That approach gives a robust solution that can read standard ASCII,
UTF8, and wide character text files equally.
I like this group. There's always good answers in here. Thanks again.
|
Wu Yongwei Guest
|
Posted: Sun Jun 11, 2006 12:48 am Post subject: Re: UTF8 and std::string |
|
|
Dave wrote:
| Quote: | Thanks for all of the helpful replies. I came to the same conclusion
after doing further research since originally posting. It looks like
std::wstring and locale conversions when doing I/O are the way to go.
That approach gives a robust solution that can read standard ASCII,
UTF8, and wide character text files equally.
I like this group. There's always good answers in here. Thanks again.
|
A gotcha under Windows: wchar_t is 2 bytes wide. Depending on your
application, this may or may not matter.
ICU is a more robust way to handle Unicode characters, I believe.
Best regards,
Yongwei
Jeff Koftinoff Guest
|
Posted: Sun Jun 11, 2006 12:52 am Post subject: Re: UTF8 and std::string |
|
|
Bronek Kozicki wrote:
| Quote: | Dave wrote:
A few weeks ago I looked for an implementation of std::string that can
handle UTF8 strings. I was thinking that the STL iterator abstraction
I suggest that for your normal data processing needs you stick with
fixed-width Unicode encodings, like UTF16 or UTF32 - most std::wstring
implementations directly support one or another. Use UTF8 only for
input/output using IO specific for your platform and/or its support functions.
The reason is simple - efficiency.
|
But UTF-16 and UTF-32 both are potentially multi-code-point per
character encodings... See the "Grapheme Boundaries" section of:
http://www.unicode.org/unicode/uni2book/ch05.pdf
And from:
http://www.unicode.org/reports/tr19/tr19-9.html
| In any event, however, Unicode code points do not necessarily match
| user-expectations for "characters". For example, the following are not
| represented by a single code point: a combining character sequence such
| as <g, acute>; a conjoining jamo sequence; or the Devanagari conjunct
| "ksha". These are better matched by grapheme boundaries, as explained in
| Chapter 5, Implementation Guidelines and in UTR #18: Unicode Regular
| Expression Guidelines.
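The <g, acute> case from the quoted report can be reproduced in a few lines, assuming C++11 for std::u32string. The helpers below are hypothetical and deliberately incomplete: they treat only the Combining Diacritical Marks block (U+0300 to U+036F) as combining, which is nowhere near the full grapheme boundary rules.

```cpp
#include <cstddef>
#include <string>

// Hypothetical, deliberately incomplete test: only U+0300..U+036F
// (Combining Diacritical Marks) is recognised as combining.
bool is_combining_mark(char32_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

// Rough count of user-perceived characters: every code point that is not
// a combining mark starts a new one. Real grapheme segmentation has many
// more rules (jamo, conjuncts, and so on).
std::size_t approx_user_char_count(const std::u32string& s)
{
    std::size_t count = 0;
    for (char32_t c : s)
        if (!is_combining_mark(c))
            ++count;
    return count;
}

// "g" followed by U+0301 COMBINING ACUTE ACCENT: one character on screen,
// two code points even in a fixed-width UTF-32 string.
std::u32string g_acute()
{
    return U"g\u0301";
}
```

Here g_acute().size() is 2 while approx_user_char_count(g_acute()) is 1, so indexing by code point is still not indexing by "character" in the user's sense.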
--jeffk++
Alf P. Steinbach Guest
|
Posted: Mon Jun 12, 2006 2:28 am Post subject: Re: UTF8 and std::string |
|
|
* jrm:
| Quote: | std::wstring might not be a good idea according to the details section
here from ustring class:
snip
src=http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html#_details
|
I see nothing there that says std::wstring with UTF-16 or UTF-32 would
be a bad choice.
However, if more than 16-bit Unicode (the original Unicode, now the
Basic Multilingual Plane of full Unicode) is required, then on a C++
implementation with 16-bit wchar_t -- such as a Windows C++ compiler
-- a std::wstring has the same potential problem as a std::string has
with UTF-8, that it doesn't support the variable length encoding.
On the third hand, if the platform is exclusively Windows (NT family),
then std::wstring corresponds directly to what's required for system
calls, so that in most cases no conversion is required, either way.
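A short sketch of why 16-bit units become a variable-length encoding once code points beyond the Basic Multilingual Plane are needed. It uses char16_t (C++11) rather than wchar_t so that it behaves the same on every platform; a std::wstring with a 16-bit wchar_t holds the same unit values. The name append_utf16 is hypothetical.

```cpp
#include <string>

// Hypothetical helper: append one Unicode code point to a UTF-16 string.
// Assumes cp is a valid scalar value (at most 0x10FFFF, not a surrogate).
void append_utf16(std::u16string& out, char32_t cp)
{
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: 20 bits split across a surrogate pair.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
}
```

Appending U+20AC adds one unit; appending U+1D11E adds two (0xD834, 0xDD1E), so size() and operator[] count units, not code points.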
--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
Bronek Kozicki Guest
|
Posted: Mon Jun 12, 2006 2:33 am Post subject: Re: UTF8 and std::string |
|
|
jrm wrote:
| Quote: | std::wstring might not be a good idea according to the details section
here from ustring class:
|
Why not? std::wstring is typically implemented on top of the Unicode support of
the target platform, and the character type used is typically some fixed-width
Unicode encoding, like UTF16 (on Windows) or UTF32 (on Linux; I do not know
about other flavours of Unix). UTF8 is not a character type (neither are UTF16
or UTF32, but at least they are fixed width, so they can map to wchar_t) but a
fancy encoding. And the typical place for data encoding is not data processing,
but input/output. Anything that can be represented in UTF8 can also be
represented in UTF32 and in UTF16 (or almost anything - there are surrogates to
compensate for the shorter characters in UTF16, but I'm not sure how much value
they provide).
B.
Bronek Kozicki Guest
|
Posted: Mon Jun 12, 2006 2:34 am Post subject: Re: UTF8 and std::string |
|
|
Jeff Koftinoff wrote:
| Quote: | But UTF-16 and UTF-32 both are potentially multi-code-point per
character encodings... See the "Grapheme Boundaries" section of:
|
They are the best one can get now.
B.
Pete Becker Guest
|
Posted: Mon Jun 12, 2006 2:40 am Post subject: Re: UTF8 and std::string |
|
|
Wu Yongwei wrote:
| Quote: |
A gotcha under Windows: wchar_t is 2 bytes wide.
|
wchar_t is a type defined by the compiler. For some Windows compilers
it's 2 bytes wide, for others it isn't.
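Since the width is up to the compiler, code that cares can ask at compile time instead of assuming. A minimal sketch, assuming C++11 for constexpr; the two constant names are hypothetical.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t in bits for the compiler building this code.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// True when a single wchar_t can hold any Unicode code point
// (the largest is U+10FFFF). False for a 16-bit wchar_t.
constexpr bool wchar_holds_any_code_point = WCHAR_MAX >= 0x10FFFF;
```

A static_assert on wchar_holds_any_code_point is one way to make a wrong assumption fail at build time rather than at run time.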
--
Pete Becker
Roundhouse Consulting, Ltd.
Pete Becker Guest
|
Posted: Mon Jun 12, 2006 2:41 am Post subject: Re: UTF8 and std::string |
|
|
jrm wrote:
| Quote: | std::wstring might not be a good idea according to the details section
here from ustring class:
snip
src=http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html#_details
In a perfect world the C++ Standard Library would contain a UTF-8
string class. Unfortunately, the C++ standard doesn't mention UTF-8 at
all. Note that std::wstring is not a UTF-8 string class because it
contains only fixed-width characters (where width could be 32, 16, or
even 8 bits).
/snip
|
Back in the olden days, the Japanese tried to work with multi-byte
representations of Japanese characters. The result of that experience
was that they insisted that C add wide character support so they
wouldn't have to.
--
Pete Becker
Roundhouse Consulting, Ltd.
Eugene Gershnik Guest
|
Posted: Wed Jun 14, 2006 3:33 am Post subject: Re: UTF8 and std::string |
|
|
Bronek Kozicki wrote:
| Quote: | jrm wrote:
std::wstring might not be a good idea according to the details section
here from ustring class:
why not? std::wstring is typically implemented on top of Unicode support of
target platform, and character type used is typically some fixed-width Unicode
encoding, like UTF16 (on Windows) or UTF32 (on Linux; I do not know about
other flavours of Unix).
|
wchar_t is locale dependent on Solaris. It is UTF-32 for UTF-8 locales
and something proprietary on others. This question has been beaten to
death in this NG in the past. The simple conclusion is standard C++
wchar_t != Unicode. IIRC P.J. Plauger once explained here why it should
be considered a good thing.
| Quote: | UTF8 is not character type (neither UTF16 or UTF32
are, but at least they are fixed width, so they can map to wchar_t) but fancy
encoding.
|
UTF-16 is *not* fixed width. It is a variable width encoding where a
Unicode character can be represented by 1 or 2 16-bit units. At least
this was so last time I checked. I wouldn't be surprised if some new
Unicode standard broke it further.
UTF-32 is the only fixed length encoding for Unicode available today.
Again see caveat above. It is also very wasteful if the bulk of your
text processing is ASCII compatible. (note that 4 bytes is the *worst*
case for UTF-8).
UTF-8 has special properties that make it very attractive for many
applications. In particular it guarantees that no byte of multi-byte
entry corresponds to a standalone single byte. Thus with UTF-8 you can
still search for English-only strings (like /, \\ or .) using
single-byte algorithms like strchr().
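A small sketch of that property, assuming the input is valid UTF-8 with no embedded NUL bytes. Every byte of a multi-byte sequence has its high bit set, so an ASCII byte such as '/' (0x2F) found by strchr() is always a real slash. The name first_component is hypothetical.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: return the text before the first '/' in a UTF-8
// path, or the whole string if it contains no '/'.
// std::strchr works byte by byte and knows nothing about UTF-8; the
// result is still correct because '/' cannot occur inside a multi-byte
// sequence.
std::string first_component(const std::string& path)
{
    const char* begin = path.c_str();
    const char* slash = std::strchr(begin, '/');
    return slash ? std::string(begin, slash) : path;
}
```

The same search on a legacy multi-byte encoding such as Shift-JIS could match the second byte of a two-byte character, which is the failure UTF-8 rules out.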
It can also be used (with caution) with std::string unlike UTF-16
and UTF-32 for which you will have to invent a character type and write
traits.
IMO UTF-8 (and UTF-8 locales) is probably the best way to use Unicode
on Unix. Apparently I am also backed by known experts
http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux
UTF-16 is a good option on platforms that directly support it like
Windows, AIX or Java. UTF-32 is probably not a good option anywhere ;-)
--
Eugene
kanze Guest
|
Posted: Wed Jun 14, 2006 3:31 pm Post subject: Re: UTF8 and std::string |
|
|
Pete Becker wrote:
| Quote: | Wu Yongwei wrote:
A gotcha under Windows: wchar_t is 2 bytes wide.
wchar_t is a type defined by the compiler. For some Windows
compilers it's 2 bytes wide, for others it isn't.
|
Is that true? I'm not that familiar with the Windows world, but
I know that a compiler for a given platform doesn't have
unlimited freedom. At the very least, it must be compatible
with the system API. (Not according to the standard, of course,
but practically, to be usable.) And I was under the impression
that the Windows API (unlike Unix) used wchar_t in some places.
--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34