UTF8 and std::string
Dave
Posted: Thu Jun 08, 2006 3:20 am    Post subject: UTF8 and std::string



A few weeks ago I looked for an implementation of std::string that can
handle UTF8 strings. I was thinking that the STL iterator abstraction
would be nice for iterating over a variable length encoded string. So
far I haven't found anything. Does anybody know of a UTF8 std::string
implementation?

I'm really curious how the char_traits template was implemented to
handle variable length character encodings.

Thanks,
Dave


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Tom Widmer
Posted: Fri Jun 09, 2006 9:10 am    Post subject: Re: UTF8 and std::string



Dave wrote:
Quote:
A few weeks ago I looked for an implementation of std::string that can
handle UTF8 strings. I was thinking that the STL iterator abstraction
would be nice for iterating over a variable length encoded string. So
far I haven't found anything. Does anybody know of a UTF8 std::string
implementation?

I'm really curious how the char_traits template was implemented to
handle variable length character encodings.

std::basic_string and std::char_traits only operate on fixed width
encodings. The general std approach is to only use variable length
encodings in storage, converting them to and from fixed length when
performing IO (using a codecvt facet).
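
As a rough illustration of that conversion step, here is a minimal sketch of a hand-written UTF-8 to UTF-32 decoder. It stands in for what a codecvt facet would do at the I/O boundary and is not the facet API itself. The name decode_utf8 is made up for this example, and validation is deliberately incomplete: overlong forms, encoded surrogates and values above U+10FFFF are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a standard function): decode UTF-8 bytes into
// fixed-width UTF-32 code points. Sketch only: truncated or malformed
// sequences throw, but overlong forms and surrogates are not checked.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    assert(i == in.size());  // every byte was consumed exactly once
    return out;
}
```

After decoding, each element of the result is one code point, so indexing and length are fixed width, which is the property the standard library's string machinery assumes.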

OTOH, lots of other string libraries do handle UTF8 strings, just not
std::basic_string.

Tom

kanze
Posted: Sat Jun 10, 2006 3:14 am    Post subject: Re: UTF8 and std::string



Bronek Kozicki wrote:
Quote:
Dave wrote:
A few weeks ago I looked for an implementation of std::string that
can handle UTF8 strings. I was thinking that the STL iterator
abstraction

I suggest that for your normal data processing needs you stick with
fixed-width Unicode encodings, like UTF16 or UTF32 - most std::wstring
implementations directly support one or another. Use UTF8 only for
input/output using IO specific for your platform and/or its support
functions. The reason is simple - efficiency.

I'm not sure I agree. I think a lot depends on the application. For a
large set of applications, I'm pretty sure that UTF-8 strings would be
more efficient. With the correct supporting tools (e.g. a regex class
which understands them), they probably wouldn't be any harder to use.
The one case where they really lose is with random access based
strictly on the character index, e.g. accessing the 132nd character in
a string (without accessing any of the intermediate characters). But if
my applications are typical, that's something that you never do --
outside of an editor, when would you do something like that?
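
To make the cost of index-based access concrete, here is a minimal sketch that finds the byte offset of the n-th code point in a UTF-8 std::string. The helper names are invented for this example, and the input is assumed to be valid UTF-8. The point is only that the lookup has to walk every byte before the target, whereas a fixed-width encoding can jump straight to it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char c) { return (c & 0xC0) == 0x80; }

// Hypothetical helper: byte offset of the n-th code point (0-based) in a
// string assumed to hold valid UTF-8, or std::string::npos if the string
// has fewer than n + 1 code points. The cost is linear in the offset.
std::size_t code_point_offset(const std::string& s, std::size_t n)
{
    // Cheap sanity check of the "valid UTF-8" assumption: a string cannot
    // start in the middle of a multi-byte sequence.
    assert(s.empty() || !is_continuation(static_cast<unsigned char>(s[0])));

    std::size_t seen = 0;  // lead bytes passed so far
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_continuation(static_cast<unsigned char>(s[i])))
            continue;      // still inside the previous code point
        if (seen == n)
            return i;
        ++seen;
    }
    return std::string::npos;
}
```

Sequential iteration, which is what most text processing does, pays this cost only once over the whole string.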

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


jrm
Posted: Sat Jun 10, 2006 4:37 am    Post subject: Re: UTF8 and std::string

Hi,

Recently I stumbled onto this class:

http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html

The interface looks very similar to std::string but I haven't tried it.

Ravi

Dave wrote:
Quote:
A few weeks ago I looked for an implementation of std::string that can
handle UTF8 strings. I was thinking that the STL iterator abstraction
would be nice for iterating over a variable length encoded string. So
far I haven't found anything. Does anybody know of a UTF8 std::string
implementation?

I'm really curious how the char_traits template was implemented to
handle variable length character encodings.

Thanks,
Dave

Dave
Posted: Sat Jun 10, 2006 4:43 am    Post subject: Re: UTF8 and std::string

Thanks for all of the helpful replies. I came to the same conclusion
after doing further research since originally posting. It looks like
std::wstring and locale conversions when doing I/O are the way to go.
That approach gives a robust solution that can read standard ASCII,
UTF8, and wide character text files equally.

I like this group. There's always good answers in here. Thanks again.


jrm
Posted: Sun Jun 11, 2006 12:37 am    Post subject: Re: UTF8 and std::string

std::wstring might not be a good idea according to the details section
here from ustring class:

<snip
src=http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html#_details>

In a perfect world the C++ Standard Library would contain a UTF-8
string class. Unfortunately, the C++ standard doesn't mention UTF-8 at
all. Note that std::wstring is not a UTF-8 string class because it
contains only fixed-width characters (where width could be 32, 16, or
even 8 bits).

</snip>

Dave wrote:
Quote:
Thanks for all of the helpful replies. I came to the same conclusion
after doing further research since originally posting. It looks like
std::wstring and locale conversions when doing I/O are the way to go.
That approach gives a robust solution that can read standard ASCII,
UTF8, and wide character text files equally.

I like this group. There's always good answers in here. Thanks again.

Wu Yongwei
Posted: Sun Jun 11, 2006 12:48 am    Post subject: Re: UTF8 and std::string

Dave wrote:
Quote:
Thanks for all of the helpful replies. I came to the same conclusion
after doing further research since originally posting. It looks like
std::wstring and locale conversions when doing I/O are the way to go.
That approach gives a robust solution that can read standard ASCII,
UTF8, and wide character text files equally.

I like this group. There's always good answers in here. Thanks again.

A gotcha under Windows: wchar_t is 2 bytes wide. Depending on your
application, this may or may not matter.

ICU is a more robust way to handle Unicode characters, I believe.

Best regards,

Yongwei


Jeff Koftinoff
Posted: Sun Jun 11, 2006 12:52 am    Post subject: Re: UTF8 and std::string

Bronek Kozicki wrote:
Quote:
Dave wrote:
A few weeks ago I looked for an implementation of std::string that can
handle UTF8 strings. I was thinking that the STL iterator abstraction

I suggest that for your normal data processing needs you stick with
fixed-width Unicode encodings, like UTF16 or UTF32 - most std::wstring
implementations directly support one or another. Use UTF8 only for
input/output using IO specific for your platform and/or its support functions.
The reason is simple - efficiency.



But UTF-16 and UTF-32 both are potentially multi-code-point per
character encodings... See the "Grapheme Boundaries" section of:
http://www.unicode.org/unicode/uni2book/ch05.pdf

And from:

http://www.unicode.org/reports/tr19/tr19-9.html

| In any event, however, Unicode code points do not necessarily match
| user-expectations for "characters". For example, the following are not
| represented by a single code point: a combining character sequence such
| as <g, acute>; a conjoining jamo sequence; or the Devanagari conjunct
| "ksha". These are better matched by grapheme boundaries, as explained
| in Chapter 5, Implementation Guidelines and in UTR #18: Unicode Regular
| Expression Guidelines.
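
The <g, acute> case from the quotation can be sketched as follows. Even in UTF-32 the sequence occupies two elements, so the string length is not a count of user-perceived characters. The helper below is a deliberately crude approximation invented for this example: it only knows about the Combining Diacritical Marks block (U+0300..U+036F) and is nowhere near real grapheme boundary detection.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for code points in the Combining Diacritical Marks block
// (U+0300..U+036F). Real grapheme segmentation covers far more than this.
inline bool is_combining_diacritic(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: rough count of user-perceived characters, obtained
// by not counting code points that merely modify the preceding one.
std::size_t approx_grapheme_count(const std::u32string& s)
{
    // Assumption of this sketch: the text does not start with a bare
    // combining mark, which would otherwise go uncounted.
    assert(s.empty() || !is_combining_diacritic(s.front()));

    std::size_t n = 0;
    for (char32_t cp : s)
        if (!is_combining_diacritic(cp))
            ++n;
    return n;
}
```

For the string U"g\u0301", size() is 2 while approx_grapheme_count returns 1. A library such as ICU is the usual choice when correct boundaries matter.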

--jeffk++


Alf P. Steinbach
Posted: Mon Jun 12, 2006 2:28 am    Post subject: Re: UTF8 and std::string

* jrm:
Quote:
std::wstring might not be a good idea according to the details section
here from ustring class:

snip
src=http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html#_details

I see nothing there that says std::wstring with UTF-16 or UTF-32 would
be a bad choice.

However, if more than 16-bit Unicode (the original Unicode, now the
Basic Multilingual Plane of full Unicode) is required, then on a C++
implementation with 16-bit wchar_t -- such as a Windows C++ compiler
-- a std::wstring has the same potential problem as a std::string has
with UTF-8, that it doesn't support the variable length encoding.

On the third hand, if the platform is exclusively Windows (NT family),
then std::wstring corresponds directly to what's required for system
calls, so that in most cases no conversion is required, either way.

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Bronek Kozicki
Posted: Mon Jun 12, 2006 2:33 am    Post subject: Re: UTF8 and std::string

jrm wrote:
Quote:
std::wstring might not be a good idea according to the details section
here from ustring class:

why not? std::wstring is typically implemented on top of the Unicode support of
the target platform, and the character type used is typically some fixed-width
Unicode encoding, like UTF16 (on Windows) or UTF32 (on Linux; I do not know about
other flavours of Unix). UTF8 is not a character type (neither are UTF16 or UTF32,
but at least they are fixed width, so they can map to wchar_t) but a fancy
encoding. And the typical place for data encoding is not in data processing, but
in input/output. Anything that can be represented in UTF8 can also be represented
in UTF32 and in UTF16 (or almost anything - there are surrogates to compensate
for the shorter characters in UTF16, but I'm not sure how much value they provide).


B.

Bronek Kozicki
Posted: Mon Jun 12, 2006 2:34 am    Post subject: Re: UTF8 and std::string

Jeff Koftinoff wrote:
Quote:
But UTF-16 and UTF-32 both are potentially multi-code-point per
character encodings... See the "Grapheme Boundaries" section of:

they are the best one can get now.


B.

Pete Becker
Posted: Mon Jun 12, 2006 2:40 am    Post subject: Re: UTF8 and std::string

Wu Yongwei wrote:

Quote:

A gotcha under Windows: wchar_t is 2 bytes wide.


wchar_t is a type defined by the compiler. For some Windows compilers
it's 2 bytes wide, for others it isn't.

--

Pete Becker
Roundhouse Consulting, Ltd.

Pete Becker
Posted: Mon Jun 12, 2006 2:41 am    Post subject: Re: UTF8 and std::string

jrm wrote:

Quote:
std::wstring might not be a good idea according to the details section
here from ustring class:

snip
src=http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html#_details

In a perfect world the C++ Standard Library would contain a UTF-8
string class. Unfortunately, the C++ standard doesn't mention UTF-8 at
all. Note that std::wstring is not a UTF-8 string class because it
contains only fixed-width characters (where width could be 32, 16, or
even 8 bits).

/snip


Back in the olden days, the Japanese tried to work with multi-byte
representations of Japanese characters. The result of that experience
was that they insisted that C add wide character support so they
wouldn't have to.

--

Pete Becker
Roundhouse Consulting, Ltd.

Eugene Gershnik
Posted: Wed Jun 14, 2006 3:33 am    Post subject: Re: UTF8 and std::string

Bronek Kozicki wrote:
Quote:
jrm wrote:
std::wstring might not be a good idea according to the details section
here from ustring class:

why not? std::wstring is typically implemented on top of Unicode support of
target platform, and character type used is typically some fixed-width Unicode
encoding, like UTF16 (on Windows) or UTF32 (on Linux; I do not know about
other flavours of Unix).

wchar_t is locale dependent on Solaris. It is UTF-32 for UTF-8 locales
and something proprietary on others. This question has been beaten to
death in this NG in the past. The simple conclusion is standard C++
wchar_t != Unicode. IIRC P.J. Plauger once explained here why it should
be considered a good thing.

Quote:
UTF8 is not character type (neither UTF16 or UTF32
are, but at least they are fixed width, so they can map to wchar_t) but fancy
encoding.

UTF-16 is *not* fixed width. It is a variable width encoding where a
Unicode character can be represented by 1 or 2 16-bit units. At least
this was so last time I checked. I wouldn't be surprised if some new
Unicode standard broke it further.
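
The "1 or 2 16-bit units" rule can be sketched directly. The function below is a minimal illustration with an invented name, assuming its argument is a valid Unicode scalar value. Code points up to U+FFFF fit in one unit, and anything above is split into a high and a low surrogate.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Code points up to U+FFFF take one unit; those above take a surrogate pair.
std::vector<char16_t> encode_utf16(char32_t cp)
{
    // Precondition: cp is a Unicode scalar value (in range, not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000)
        return { static_cast<char16_t>(cp) };

    cp -= 0x10000;  // 20 bits remain, split 10/10 across the pair
    return { static_cast<char16_t>(0xD800 + (cp >> 10)),      // high surrogate
             static_cast<char16_t>(0xDC00 + (cp & 0x3FF)) };  // low surrogate
}
```

For example, U+20AC encodes as the single unit 0x20AC, while U+1D11E encodes as the pair 0xD834 0xDD1E. A std::wstring with a 16-bit wchar_t therefore cannot treat every element as a complete character.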

UTF-32 is the only fixed length encoding for Unicode available today.
Again see caveat above. It is also very wasteful if the bulk of your
text processing is ASCII compatible. (note that 4 bytes is the *worst*
case for UTF-8).

UTF-8 has special properties that make it very attractive for many
applications. In particular it guarantees that no byte of multi-byte
entry corresponds to a standalone single byte. Thus with UTF-8 you can
still search for English-only strings (like /, \\ or .) using
single-byte algorithms like strchr().
It can also be used (with caution) with std::string unlike UTF-16
and UTF-32 for which you will have to invent a character type and write
traits.
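
A minimal sketch of that property, assuming the std::string holds valid UTF-8 with no embedded NUL bytes (the helper name is made up for this example). Because every byte of a multi-byte UTF-8 sequence has its high bit set, a byte-wise strchr() search for an ASCII character such as '/' can never match in the middle of a non-ASCII character.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: the text after the first '/' in a UTF-8 string,
// or an empty string if there is no '/'. The search is purely byte-wise.
std::string after_first_slash(const std::string& utf8)
{
    // strchr() stops at the first NUL, so embedded NULs are not supported.
    assert(utf8.find('\0') == std::string::npos);

    const char* slash = std::strchr(utf8.c_str(), '/');
    return slash ? std::string(slash + 1) : std::string();
}
```

The same search on a legacy multi-byte encoding that reuses ASCII values as trail bytes could split a character in two, which is the problem UTF-8 was designed to avoid.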
IMO UTF-8 (and UTF-8 locales) is probably the best way to use Unicode
on Unix. Apparently I am also backed by known experts
http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux

UTF-16 is a good option on platforms that directly support it like
Windows, AIX or Java. UTF-32 is probably not a good option anywhere ;-)

--
Eugene


kanze
Posted: Wed Jun 14, 2006 3:31 pm    Post subject: Re: UTF8 and std::string

Pete Becker wrote:
Quote:
Wu Yongwei wrote:

A gotcha under Windows: wchar_t is 2 bytes wide.

wchar_t is a type defined by the compiler. For some Windows
compilers it's 2 bytes wide, for others it isn't.

Is that true? I'm not that familiar with the Windows world, but
I know that a compiler for a given platform doesn't have
unlimited freedom. At the very least, it must be compatible
with the system API. (Not according to the standard, of course,
but practically, to be usable.) And I was under the impression
that the Windows API (unlike Unix) used wchar_t in some places.

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


All times are GMT
Page 1 of 4