C++Talk.NET C++ language newsgroups
Old Wolf Guest
Posted: Tue Apr 27, 2004 10:40 am Post subject: deriving from std::moneypunct facet
I am attempting to derive a facet from std::moneypunct that uses different
characters for the separators and so on. I have posted the code below.
I am expecting it to display "1c23c45c67p89" . But instead, my
linux/gcc 3.3.1 system displays "123456789" regardless of locale, and my
winnt/bcc 5.5.1 system displays "1,234,567.89" if the locale is "C",
and segfaults in the constructor of std::moneypunct<> if the locale
is anything else. The segfault still occurs if I comment out the
virtual methods in my class.
My compiler documentation included an example of doing the same thing
for "numpunct", which worked correctly on both my systems for locale de_DE
(the code for that is posted below my non-working code, for comparison). I
have tried to copy this working example as closely as possible.
Some other questions:
- is there a book that teaches locales and facets well?
(so far I'm just learning from the compiler documentation)
- what does the Intl template parameter on moneypunct signify exactly?
- how can I find out what locale names (eg. "de_DE") are supported
on my system?
#include <iostream>
#include <exception>
#include <string>
#include <locale>

template<typename charT, bool Intl = false>
class change_sep: public std::moneypunct_byname<charT,Intl>
{
public:
    explicit change_sep(const char *name, size_t refs = 0)
        : std::moneypunct_byname<charT,Intl>(name, refs) {}
protected:
    virtual charT do_thousands_sep() const { return 'c'; }
    virtual charT do_decimal_point() const { return 'p'; }
    // Group sizes are integer values, so "\2" (not "2") means groups of two.
    virtual std::string do_grouping() const { return "\2"; }
};

template<typename NumType>
std::ostream &price_put(std::ostream &os, NumType num)
{
    typedef std::money_put<char> facet_t;
    const facet_t &fac = std::use_facet<facet_t>(os.getloc());
    fac.put(std::ostreambuf_iterator<char>(os), true, os, os.fill(), num);
    return os;
}

int main()
{
    try {
        std::locale loc(std::locale("de_DE"),
                        new change_sep<char, false>("de_DE"));
        std::cout.imbue(loc);
        price_put(std::cout, 123456789);
    }
    catch(std::exception &e)
    {
        std::cout << "Error: " << e.what() << std::endl;
    }
    return 0;
}
Here is the code example from my compiler documentation, that works OK:
#include <iostream>
#include <string>
#include <locale>
using namespace std;

template <class charT>
class change_bool_names: public numpunct_byname<charT>
{
public:
    typedef basic_string<charT> string_type;
    explicit change_bool_names (const char* name,
        const charT* t, const charT* f, size_t refs=0)
        : numpunct_byname<charT> (name,refs),
          true_string(t), false_string(f) { }
protected:
    string_type do_truename () const { return true_string; }
    string_type do_falsename () const { return false_string; }
private:
    string_type true_string, false_string;
};

int main(int argc, char **)
{
    locale loc(locale("de_DE"),
               new change_bool_names<char>("de_DE","Ja.","Nein."));
    cout.imbue(loc);
    cout << "Argumente vorhanden? "
         << boolalpha << (argc > 1) << endl;
}
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Paolo Carlini Guest
Posted: Wed Apr 28, 2004 9:15 am Post subject: Re: deriving from std::moneypunct facet
Hi!
Old Wolf wrote:
| Quote: | - is there a book that teaches locales and facets well?
(so far i'm just learning from the compiler documentation)
|
Langer & Kreft, "Standard C++ IOStreams and Locales" is pretty good.
| Quote: | - what does the Intl template parameter on moneypunct signify exactly?
|
international currency symbol or domestic currency symbol.
| Quote: | - how can I find out what locale names (eg. "de_DE") are supported
on my system?
|
localedef --list-archive for glibc.
| Quote: | fac.put(std::ostreambuf_iterator<char>(os), true, os, os.fill(), num);
^^^^ |
Just change it consistently to /false/ and it works! Rather nice
example, by the way!
Paolo.
Old Wolf Guest
Posted: Thu Apr 29, 2004 11:38 am Post subject: Re: deriving from std::moneypunct facet
Paolo Carlini <pcarlini (AT) suse (DOT) de> wrote:
| Quote: |
- how can I find out what locale names (eg. "de_DE") are supported
on my system?
localedef --list-archive for glibc.
|
[OT] Is there anything vaguely portable for this? or some system
calls for common operating systems?
| Quote: | fac.put(std::ostreambuf_iterator<char>(os), true, os, os.fill(), num);
^^^^
Just change it consistently to /false/ and it works! Rather nice
example, by the way!
|
Thanks - I understand now: there are two different facets,
moneypunct<charT, false> and moneypunct<charT, true>. Since I defined
a moneypunct<char, false>, I have to pass 'false' as the intl argument
of put() if I want to invoke that facet.
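For later readers, here is a minimal sketch of the corrected call. It is not the code from the original post: the facet name demo_sep and the helper format_units are made up for illustration, the facet derives from plain std::moneypunct<char, false> rather than moneypunct_byname so that no "de_DE" locale has to be installed, and do_frac_digits() is overridden on the assumption that two fractional digits are wanted.

```cpp
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration. It replaces moneypunct<char, false>,
// so it is only consulted when put() is called with intl == false.
class demo_sep : public std::moneypunct<char, false> {
protected:
    char do_thousands_sep() const override { return 'c'; }
    char do_decimal_point() const override { return 'p'; }
    // Group sizes are integer values, hence "\2" rather than "2".
    std::string do_grouping() const override { return "\2"; }
    // Assumption: amounts are given in hundredths of the currency unit.
    int do_frac_digits() const override { return 2; }
};

// Formats `units` through money_put, using demo_sep for the punctuation.
std::string format_units(long double units) {
    std::ostringstream os;
    os.imbue(std::locale(std::locale::classic(), new demo_sep));
    typedef std::money_put<char> facet_t;
    const facet_t& fac = std::use_facet<facet_t>(os.getloc());
    // intl == false selects moneypunct<char, false>, i.e. demo_sep.
    fac.put(std::ostreambuf_iterator<char>(os), false, os, os.fill(), units);
    return os.str();
}
```

Calling format_units(123456789.0L) should give the "1c23c45c67p89" the original post was hoping for. Passing true instead would look up moneypunct<char, true>, which this locale leaves untouched.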
I still get my segfault in Windows though, so I suppose that is a
bug in my implementation (it could at least throw an exception about
invalid locale).
kanze@gabi-soft.fr Guest
Posted: Thu Apr 29, 2004 9:59 pm Post subject: Re: deriving from std::moneypunct facet
Paolo Carlini <pcarlini (AT) suse (DOT) de> wrote
| Quote: | Old Wolf wrote:
- is there a book that teaches locales and facets well?
(so far i'm just learning from the compiler documentation)
Langer & Kreft, "Standard C++ IOStreams and Locales" is pretty good.
- what does the Intl template parameter on moneypunct signify
exactly? |
| Quote: | international currency symbol or domestic currency symbol.
|
Where does it say that in the standard? About the closest I could find
is a non normative footnote which says that "for international
instantiations (second template parameter true) this is always four
characters long, usually three letters and a space", talking about the
return value of do_curr_symbol(). I've probably missed it, but I can
find no normative text whatsoever concerning the meaning of the second
template parameter. It is named International, which is suggestive, but
I'm not sure of what.
In practice, of course, two cases can occur. If I'm working in a closed
environment, with a single currency, then I have no problem (and
presumably, the facet which interests me is the one with International ==
false). As soon as the possibility of multiple currencies raises its
head, however, I need a type which contains not just the amount, but also
the currency. And part of the facet becomes pretty irrelevant with
regard to the formatting, since the actual format will depend on the
actual currency -- part of the "value" of what is being formatted: the
currency symbol, obviously, but also the number of fractional digits.
| Quote: | - how can I find out what locale names (eg. "de_DE") are supported
on my system?
localedef --list-archive for glibc.
|
It's very system dependent. Posix defines a standard format for naming,
<language_code>_<country_code>.<encoding_name>. But even Posix
compliant systems tend to support a lot of "traditional" names not in
this format, Posix allows for defaults (so that "de" might mean the same
thing as "de_DE.iso_8859_1"), and of course, knowing the format doesn't
tell you whether it is actually available on a given machine. (On my
Posix compliant Solaris machine, the only locales I have available are
"C", "POSIX", "common", "en_US.UTF-8", and "iso_8859_1"; someone removed
the others to make more space on the disk. But it does give you an idea
concerning the usefulness of the standard naming format:-).)
I think that it is usual on Unix machines for the locales to be placed
in a directory called, somewhat strangely, "locale", somewhere under
/usr. I've seen /usr/lib/locale and /usr/share/locale -- the latter is
somewhat surprising, and the locale specific directories contain shared
objects, which are not sharable in the sense used in the directory name,
i.e. between machines running different hardware. Anyway, a little work
with find should do the trick. But I wouldn't like to have to do it
from within a program.
I'm less familiar with Windows. Locale names there tend to correspond
with usual English use: "French", "German", etc. I'm not sure how this
works in practice -- a quick check showed me that in the "French"
locale, the decimal character was '.', which may be true in Quebec, but
is certainly not the usual use in France. So how would you specify the
equivalent of "ch_DE.utf-8" -- Swiss German encoded in UTF-8.
The documentation for Windows says that there are some 100 locales, and
that all of them are always installed. I've been unable to find a
complete list, but I would imagine that it is in the documentation
somewhere.
All in all, we're still not to the point where you can write portable
internationalized code.
--
James Kanze GABI Software mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Michael Karcher Guest
Posted: Fri Apr 30, 2004 11:22 am Post subject: Re: deriving from std::moneypunct facet
kanze (AT) gabi-soft (DOT) fr wrote:
| Quote: | international currency symbol or domestic currency symbol.
Where does it say that in the standard? About the closest I could find
is a non normative footnote which says that "for international
instantiations (second template parameter true) this is always four
characters long, usually three letters and a space",
|
It's all about "USD " vs. "$". Or as we had in germany before the euro
introduction "DEM " (Intl) vs. "DM" (Domestic). Now it is "EUR " (intl)
vs. the euro symbol, if supported in the character set.
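To make the difference concrete, here is a small sketch in which both instantiations are installed in one locale and the intl argument of put() picks between them. The names demo_punct and format_cents, the symbols "$" and "USD ", and the overridden format pattern are illustrative assumptions. A real named locale would supply its own values.

```cpp
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: one class template covers both instantiations,
// moneypunct<char, false> (domestic) and moneypunct<char, true> (international).
template <bool Intl>
class demo_punct : public std::moneypunct<char, Intl> {
public:
    explicit demo_punct(const std::string& symbol) : symbol_(symbol) {}
protected:
    char do_decimal_point() const override { return '.'; }
    int do_frac_digits() const override { return 2; }
    std::string do_grouping() const override { return std::string(); }
    std::string do_curr_symbol() const override { return symbol_; }
    std::string do_positive_sign() const override { return std::string(); }
    // Fixed layout so the output does not depend on library defaults.
    std::money_base::pattern do_pos_format() const override {
        std::money_base::pattern p;
        p.field[0] = std::money_base::symbol;
        p.field[1] = std::money_base::sign;
        p.field[2] = std::money_base::none;
        p.field[3] = std::money_base::value;
        return p;
    }
private:
    std::string symbol_;
};

// Formats an amount given in hundredths; `intl` chooses which facet is used.
std::string format_cents(long double units, bool intl) {
    std::locale loc(std::locale::classic(), new demo_punct<false>("$"));
    loc = std::locale(loc, new demo_punct<true>("USD "));
    std::ostringstream os;
    os.imbue(loc);
    os.setf(std::ios_base::showbase);  // the symbol is only written with showbase
    std::use_facet<std::money_put<char> >(loc).put(
        std::ostreambuf_iterator<char>(os), intl, os, os.fill(), units);
    return os.str();
}
```

With these assumptions, format_cents(1999.0L, false) should produce "$19.99" and format_cents(1999.0L, true) should produce "USD 19.99".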
Michael Karcher
Ben Hutchings Guest
Posted: Sat May 01, 2004 3:23 am Post subject: Re: deriving from std::moneypunct facet
kanze (AT) gabi-soft (DOT) fr wrote:
<snip>
| Quote: | I'm less familiar with Windows. Locale names there tend to correspond
with usual English use: "French", "German", etc. I'm not sure how this
works in practice -- a quick check showed me that in the "French"
locale, the decimal character was '.', which may be true in Quebec, but
is certainly not the usual use in France. So how would you specify the
equivalent of "ch_DE.utf-8" -- Swiss German encoded in UTF-8.
|
Internally Windows normally uses numeric locale IDs assigned by
Microsoft, though they do also have names.
The VC++ 7.1 documentation says you can use something similar to the
POSIX format:
lang ["_" country/region ["." code-page]]
So I suppose you would use "German_Switzerland.65001", though UTF-8
doesn't seem to be as fully supported in Windows as the older code
pages that use a maximum of 2 bytes per character.
I have a sneaking suspicion that the country and language names may
themselves be localised according to the system locale, though.
| Quote: | The documentation for Windows says that there are some 100 locales, and
that all of them are always installed.
snip |
This is incorrect. Each version of Windows should recognise all the
locale IDs that were assigned at the time it was built, but the data
for those locales are only selectively installed. The same goes for
code pages and their IDs. There are some Win32 functions that allow
you to enumerate recognised or installed locales, code pages etc.
Eugene Gershnik Guest
Posted: Sun May 02, 2004 1:46 am Post subject: Re: deriving from std::moneypunct facet
Ben Hutchings wrote:
| Quote: | kanze (AT) gabi-soft (DOT) fr wrote:
snip
So
how would you specify the equivalent of "ch_DE.utf-8" -- Swiss
German encoded in UTF-8.
Internally Windows normally uses numeric locale IDs assigned by
Microsoft, though they do also have names.
The VC++ 7.1 documentation says you can use something similar to the
POSIX format:
lang ["_" country/region ["." code-page]]
So I suppose you would use "German_Switzerland.65001", though UTF-8
doesn't seem to be as fully supported in Windows as the older code
pages that use a maximum of 2 bytes per character.
|
This wouldn't work for a variety of reasons. One is that Microsoft's standard
library cannot handle more than 2-byte multibyte encodings (at least
according to asserts in its code). Another is that Windows itself doesn't
allow arbitrary combinations of languages and codepages.
The original question about ch_DE.utf-8 is meaningless on Windows. It
doesn't have anything like Unix UTF-8 locales nor does it need them.
| Quote: | I have a sneaking suspicion that the country and language names may
themselves be localised according to the system locale, though.
|
They are not. The link below explains why
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/nls_8rse.asp
The standard library uses LOCALE_SABBREVCTRYNAME, LOCALE_SENGCOUNTRY,
LOCALE_SABBREVLANGNAME and LOCALE_SENGLANGUAGE to build C++ locale names.
--
Eugene
P.J. Plauger Guest
Posted: Mon May 03, 2004 9:23 am Post subject: Re: deriving from std::moneypunct facet
"Eugene Gershnik" <gershnik (AT) nospam (DOT) hotmail.com> wrote
| Quote: | Ben Hutchings wrote:
kanze (AT) gabi-soft (DOT) fr wrote:
snip
So
how would you specify the equivalent of "ch_DE.utf-8" -- Swiss
German encoded in UTF-8.
Internally Windows normally uses numeric locale IDs assigned by
Microsoft, though they do also have names.
The VC++ 7.1 documentation says you can use something similar to the
POSIX format:
lang ["_" country/region ["." code-page]]
So I suppose you would use "German_Switzerland.65001", though UTF-8
doesn't seem to be as fully supported in Windows as the older code
pages that use a maximum of 2 bytes per character.
This wouldn't work for a variety of reasons. One is that Microsoft's
standard library cannot handle more than 2-byte multibyte encodings (at least
according to asserts in its code). Another is that Windows itself doesn't
allow arbitrary combinations of languages and codepages.
The original question about ch_DE.utf-8 is meaningless on Windows. It
doesn't have anything like Unix UTF-8 locales nor does it need them.
|
This is getting murkier by the minute.
1) A "code page" essentially defines a 256-byte character set.
You can treat that set of single-byte codes as a multibyte
encoding for a (very small) subset of Unicode/ISO-10646.
IIRC, the Swiss German code page to Unicode is one of the
conversions we also provide with our CoreX library.
2) UTF-8 is yet another multibyte encoding. It differs from
a code page in that it can represent *all* Unicode characters.
It takes up to three bytes to represent a character from
the 16-bit subset (aka UCS-2), up to six bytes to represent
all possible values that can be represented in 32 bits (roughly
aka UCS-4) -- somewhere in between for the current "maximum
number of characters that will ever be defined" (aka UTF-16).
We of course provide all these variant conversions in our
CoreX library.
3) Conversions are defined between a multibyte encoding and
a wide-character encoding. The latter often is some flavor
of Unicode, but it doesn't have to be.
So if you want the UTF-8 equivalent of the Swiss German
256-character encoding, you first convert it to 16-bit
Unicode and then convert that to UTF-8.
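The two-step route in point 3 can be sketched roughly as follows. This is an illustration, not CoreX code: the function names are invented, ISO 8859-1 stands in for "the Swiss German code page" because its byte values coincide with the first 256 Unicode code points, and the encoder stops at U+10FFFF (four bytes).

```cpp
#include <string>

// Encode one Unicode code point (assumed <= U+10FFFF) as UTF-8.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                      // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes: rest of the 16-bit subset
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 4 bytes: beyond the 16-bit subset
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Step 1: code page byte -> Unicode code point (identity for ISO 8859-1,
// which is the assumption made here). Step 2: code point -> UTF-8.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char byte : in) {
        out += encode_utf8(static_cast<char32_t>(byte));
    }
    return out;
}
```

For example, the euro sign U+20AC lies in the 16-bit subset and comes out as the three bytes E2 82 AC, while a Latin-1 "ü" (byte FC) becomes the two bytes C3 BC. A different code page would need a lookup table in step 1 instead of the identity mapping.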
P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
kanze@gabi-soft.fr Guest
Posted: Mon May 03, 2004 3:09 pm Post subject: Re: deriving from std::moneypunct facet
Michael.Karcher (AT) writeme (DOT) com (Michael Karcher) wrote in message
news:<c6svsa$g1m20$1 (AT) uni-berlin (DOT) de>...
| Quote: | kanze (AT) gabi-soft (DOT) fr wrote:
international currency symbol or domestic currency symbol.
Where does it say that in the standard? About the closest I could
find is a non normative footnote which says that "for international
instantiations (second template parameter true) this is always four
characters long, usually three letters and a space",
It's all about "USD " vs. "$". Or as we had in germany before the euro
introduction "DEM " (Intl) vs. "DM" (Domestic). Now it is "EUR "
(intl) vs. the euro symbol, if supported in the character set.
|
But where does it say this? The non-normative foot-note sort of hints
at it, but only vaguely, and I can find nothing else.
(The other issue I'm wondering about is having the currency symbol
determined by the locale in an international environment, since I would
imagine that you are likely to be dealing with several different
currencies. For that matter, even without being international -- in
France today, it is frequent to display monetary values both in Euros
and in Francs.)
--
James Kanze GABI Software mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
kanze@gabi-soft.fr Guest
Posted: Mon May 03, 2004 3:17 pm Post subject: Re: deriving from std::moneypunct facet
"Eugene Gershnik" <gershnik (AT) nospam (DOT) hotmail.com> wrote
| Quote: | Ben Hutchings wrote:
kanze (AT) gabi-soft (DOT) fr wrote:
snip
So
how would you specify the equivalent of "ch_DE.utf-8" -- Swiss
German encoded in UTF-8.
Internally Windows normally uses numeric locale IDs assigned by
Microsoft, though they do also have names.
The VC++ 7.1 documentation says you can use something similar to the
POSIX format:
lang ["_" country/region ["." code-page]]
So I suppose you would use "German_Switzerland.65001", though UTF-8
doesn't seem to be as fully supported in Windows as the older code
pages that use a maximum of 2 bytes per character.
|
The Unix format uses the ISO 639 codes, in lower case, for the language,
and the ISO 3166 two letter codes, in upper case, for the country. (And
German speaking Switzerland should have been "de_CH", and not "ch_DE".)
As far as I know, the encoding names are ad hoc.
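As an illustration of that naming format, a name can be split into its three parts along these lines. The struct and function names are invented for the sketch, and it deliberately ignores "@modifier" suffixes and the traditional names that do not follow the format.

```cpp
#include <string>

// Hypothetical holder for the pieces of a name such as "de_CH.utf-8".
struct locale_name_parts {
    std::string language;   // ISO 639 code, e.g. "de"
    std::string territory;  // ISO 3166 code, e.g. "CH" (may be empty)
    std::string codeset;    // ad hoc encoding name, e.g. "utf-8" (may be empty)
};

// Splits language[_territory][.codeset]; missing parts are left empty.
locale_name_parts split_locale_name(const std::string& name) {
    locale_name_parts parts;
    const std::string::size_type dot = name.find('.');
    const std::string head = name.substr(0, dot);
    if (dot != std::string::npos) {
        parts.codeset = name.substr(dot + 1);
    }
    const std::string::size_type underscore = head.find('_');
    parts.language = head.substr(0, underscore);
    if (underscore != std::string::npos) {
        parts.territory = head.substr(underscore + 1);
    }
    return parts;
}
```

Splitting "de_CH.utf-8" this way gives language "de", territory "CH" and codeset "utf-8". Whether a locale of that name is actually installed is a separate question.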
| Quote: | This wouldn't work for a variety of reasons. One is that Microsoft's
standard library cannot handle more than 2-byte multibyte encodings
(at least according to asserts in its code).
|
It was just meant as an example -- it isn't reasonable to expect any
machine to support every possible locale. (On the other hand, any
machine connected to the Internet really should be able to support
UTF-8, since that is pretty much the standard international encoding
used.)
And of course, if the encoding in the locale doesn't correspond to that
of the font being used for display, what you see won't be what the
program thinks it is displaying.
| Quote: | Another is that Windows itself doesn't allow arbitrary combinations of
languages and codepages.
|
What I specified was the *format* of the names. Obviously, no machines
can support all combinations, nor should they. Who would use
"eu_KE.shift_jis" (Basque, used in Kenya and writing with Shift JIS
encoding) even if it existed? With the exception of various Unicode
representation formats, I think that most encodings are only valid for
certain languages, and not every language will be spoken in every
country in the world.
It would be nice to somehow be able to separate the three aspects, with
e.g. monetary formatting dependent only on the country, messages only on
the language, and encoding only on the encoding (or the fonts being
used). But I don't have any simple solutions to propose. (If I'm
formatting French Francs for an English language publication, I will
probably use . as the decimal, but standard French formatting for the
rest, for example, and a function like toupper mixes language,
country and encoding intimately.)
| Quote: | The original question about ch_DE.utf-8 is meaningless on Windows. It
doesn't have anything like Unix UTF-8 locales nor does it need them.
|
Anything which connects to the Internet needs some sort of support for
UTF-8, since it is pretty much the standard international codeset on the
Internet.
But the initial poster's question remains unanswered: how to find a list of
all supported locales. Except of course, for the somewhat vague answer:
it depends on the implementation. But I suspect that that is the best
we can do, since it does depend very strongly on the implementations.
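Short of enumerating what is installed, about the only portable thing a program can do is probe candidate names and see which ones the implementation accepts. A minimal sketch follows. The helper names are invented, and the candidate list has to come from the caller, which is exactly the non-portable part.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Returns true if the implementation accepts `name` as a locale name.
// std::locale's constructor throws std::runtime_error for unknown names.
bool locale_available(const std::string& name) {
    try {
        std::locale probe(name.c_str());
        (void)probe;  // only construction matters here
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}

// Filters a caller-supplied list of candidate names down to the usable ones.
std::vector<std::string> available_among(const std::vector<std::string>& candidates) {
    std::vector<std::string> found;
    for (const std::string& name : candidates) {
        if (locale_available(name)) {
            found.push_back(name);
        }
    }
    return found;
}
```

A program might call available_among with names such as "de_DE", "de_DE.UTF-8" and "German_Germany" and use the first hit. "C" is always accepted, so it can serve as the fallback.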
--
James Kanze GABI Software mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Eugene Gershnik Guest
Posted: Tue May 04, 2004 5:36 pm Post subject: Re: deriving from std::moneypunct facet
kanze (AT) gabi-soft (DOT) fr wrote:
| Quote: | "Eugene Gershnik" <gershnik (AT) nospam (DOT) hotmail.com> wrote
The original question about ch_DE.utf-8 is meaningless on Windows.
It
doesn't have anything like Unix UTF-8 locales nor does it need them.
Anything which connects to the Internet needs some sort of support for
UTF-8, since it is pretty much the standard international codeset on
the Internet.
|
True, but this has nothing to do with UTF-8 locales. A system can support
conversions from an internal character set to UTF-8 (either directly or
through an intermediate format as Windows does) and be able to use the
Internet without knowing what a UTF-8 locale is.
Windows never uses UTF-8 as the encoding for narrow strings in any locale.
Instead it guarantees that wchar_t encoding is locale-independent and is
always UTF-16. Conversions between UTF-16 and UTF-8 are pretty
straightforward. Thus, I'd say that a correct way to deal with UTF-8 in C++
on Windows is to work with wide streams and perform conversions in the
streambuf. This way the normal locale machinery deals with converting
between internal narrow encoding and "Unicode" while streambuf is
responsible for the "Unicode" representation on the wire i.e. UTF-8. Note
that it is very different from Unix where you must use some manual iconv()
wizardry if your user doesn't work in UTF-8 locale to begin with.
--
Eugene
kanze@gabi-soft.fr Guest
Posted: Wed May 05, 2004 7:42 pm Post subject: Re: deriving from std::moneypunct facet
"Eugene Gershnik" <gershnik (AT) nospam (DOT) hotmail.com> wrote
| Quote: | kanze (AT) gabi-soft (DOT) fr wrote:
"Eugene Gershnik" <gershnik (AT) nospam (DOT) hotmail.com> wrote
The original question about ch_DE.utf-8 is meaningless on Windows.
It doesn't have anything like Unix UTF-8 locales nor does it need
them.
Anything which connects to the Internet needs some sort of support
for UTF-8, since it is pretty much the standard international
codeset on the Internet.
True, but this has nothing to do with UTF-8 locales.
|
According to the standard, conversion on input and output depend on the
locale embedded in the filebuf. If you want to encode to or from UTF-8,
you need a locale which supports it.
| Quote: | A system can support conversions from an internal character set to
UTF-8 (either directly or through an intermediate format as Windows
does) and be able to use the Internet without knowing what a UTF-8
locale is.
|
A system can support just about anything, in addition to what the
standard requires. The standard provides a more or less standard way of
specifying the transcoding between internal and external format: the
codecvt facet of the locales. IMHO, this part of the standard library
wasn't particularly well designed, but it is what the standard says, and
I would be very unhappy about an implementation that provided support
for the functionality, but didn't offer it as well through the standard
mechanism, given that the mechanism exists.
| Quote: | Windows never uses UTF-8 as the encoding for narrow strings in any
locale.
|
There are two separate issues here: what comes with a given compiler
(VC++, Borland, etc.), and what can be added. From a posting by
Plauger, I gather that it IS possible to at least add such support to
VC++. What I do know is that imbuing an [io]fstream with a UTF-8
locale under Windows works. What I don't know is how to name the
locale, nor whether the locale comes packaged with the compiler, or must
be acquired separately.
In general, of course, the fact that you need a specific locale doesn't
mean that your system provides it. And while I don't know about
Windows, under Unix, what is available will depend on the particular
installation of the system -- you simply cannot know beforehand. I
think that it is also possible to acquire locales not normally provided
from third party sources; I would be very surprised if Dinkumware didn't
have some to cover cases the system provider didn't think of, or didn't
think necessary.
| Quote: | Instead it guarantees that wchar_t encoding is locale-independent and
is always UTF-16. Conversions between UTF-16 and UTF-8 are pretty
straightforward. Thus, I'd say that a correct way to deal with UTF-8
in C++ on Windows is to work with wide streams and perform conversions
in the streambuf.
|
This is exactly what we are talking about. And the conversion in
streambuf (actually in filebuf) depends on the locale imbued in the
streambuf.
| Quote: | This way the normal locale machinery deals with converting between
internal narrow encoding and "Unicode" while streambuf is responsible
for the "Unicode" representation on the wire i.e. UTF-8. Note that it
is very different from Unix where you must use some manual iconv()
wizardry if your user doesn't work in UTF-8 locale to begin with.
|
What's available on any given Unix machine will depend on what the
sysadmin decided to install -- by default, Solaris gives you just about
everything, but the Solaris systems I work on have small disks, and the
sysadmin stripped a lot of it out. Every Unix system I've seen recently
has had at least one UTF-8 locale installed -- by default, both Solaris
and Linux have UTF-8 versions of all of the national or language based
locales. With a conforming implementation (Sun CC, for example), you
imbue the filebuf (or the [io]fstream, which then imbues the filebuf)
with the desired locale, exactly like under Windows -- if the locale is
present (and it usually is under Unix), then it works.
The one thing that is different under Unix is that you often have to
work with older compilers -- even the latest version of Sun CC isn't as
conformant as VC++ 6.0, the 3.x branch of g++ has only recently
become stable, and the 2.95.x branch didn't support standard iostreams
at all. With older compilers, you often have to deal with the C level
locales, and set the locale globally, via setlocale, rather than imbuing
the stream. But it still worked. I've input and output UTF-8 under
both Solaris and Linux, and I've never heard of iconv.
--
James Kanze GABI Software mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Eugene Gershnik Guest
Posted: Fri May 07, 2004 12:08 pm Post subject: Re: deriving from std::moneypunct facet
kanze (AT) gabi-soft (DOT) fr wrote:
| Quote: | "Eugene Gershnik" <gershnik (AT) nospam (DOT) hotmail.com> wrote in message
news:<Q-OdndRI7N1sqQrdRVn-hg (AT) speakeasy (DOT) net>...
kanze (AT) gabi-soft (DOT) fr wrote:
"Eugene Gershnik" <gershnik (AT) nospam (DOT) hotmail.com> wrote
The original question about ch_DE.utf-8 is meaningless on Windows.
It doesn't have anything like Unix UTF-8 locales nor does it need
them.
Anything which connects to the Internet needs some sort of support
for UTF-8, since it is pretty much the standard international
codeset on the Internet.
True, but this has nothing to do with UTF-8 locales.
According to the standard, conversion on input and output depend on
the locale embedded in the filebuf. If you want to encode to or from
UTF-8, you need a locale which supports it.
|
Alternatively, I can use a custom streambuf. One will probably be required
anyway for real-life networking.
| Quote: | Windows never uses UTF-8 as the encoding for narrow strings in any
locale.
There are two separate issues here: what comes with a given compiler
(VC++, Borland, etc.), and what can be added.
|
There is a third issue. If an underlying platform supports its own concept
of locales, the C++ ones had better play nice and be interoperable with
them. The C++ library cannot encompass all possible needs, and resorting to
system specific calls may sometimes be necessary. If there is no good
mapping between a system locale and a C++ one, this may be hard if not
impossible. There is no such thing as a UTF-8 system locale on Windows, and a
C++ library should IMHO reflect this fact.
| Quote: | Instead it guarantees that wchar_t encoding is locale-independent
and is always UTF-16. Conversions between UTF-16 and UTF-8 are
pretty straightforward. Thus, I'd say that a correct way to deal
with UTF-8 in C++ on Windows is to work with wide streams and
perform conversions in the streambuf.
This is exactly what we are talking about. And the conversion in
streambuf (actually in filebuf) depends on the locale imbued in the
streambuf.
|
What I had in mind was to use a custom streambuf.
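To show the shape of the idea, here is a much simplified sketch of such a streambuf. It is an assumption-laden stand-in rather than what would be used in production: it is a narrow-character filter, not a wide one, it treats its input as ISO 8859-1, it forwards to another streambuf instead of a socket, and the class and helper names are made up.

```cpp
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical output filter: accepts ISO 8859-1 bytes and forwards them
// to `dest` as UTF-8. No put area is set, so every character reaches overflow().
class latin1_to_utf8_buf : public std::streambuf {
public:
    explicit latin1_to_utf8_buf(std::streambuf* dest) : dest_(dest) {}

protected:
    int_type overflow(int_type ch) override {
        const int_type eof = traits_type::eof();
        if (traits_type::eq_int_type(ch, eof)) {
            return traits_type::not_eof(ch);  // nothing to write
        }
        const unsigned char byte =
            static_cast<unsigned char>(traits_type::to_char_type(ch));
        if (byte < 0x80) {
            return dest_->sputc(static_cast<char>(byte));  // ASCII passes through
        }
        // Code points 0x80..0xFF need two UTF-8 bytes.
        if (traits_type::eq_int_type(
                dest_->sputc(static_cast<char>(0xC0 | (byte >> 6))), eof)) {
            return eof;
        }
        return dest_->sputc(static_cast<char>(0x80 | (byte & 0x3F)));
    }

    int sync() override { return dest_->pubsync(); }

private:
    std::streambuf* dest_;
};

// Convenience wrapper for trying the filter out on a string.
std::string transcode_latin1(const std::string& text) {
    std::ostringstream sink;
    latin1_to_utf8_buf filter(sink.rdbuf());
    std::ostream out(&filter);
    out << text;
    out.flush();
    return sink.str();
}
```

Here the transcoding lives in the buffer layer, so the stream's locale is not involved at all. A wide-character version would follow the same pattern with std::wstreambuf on the input side.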
| Quote: | This way the normal locale machinery deals with converting between
internal narrow encoding and "Unicode" while streambuf is responsible
for the "Unicode" representation on the wire i.e. UTF-8. Note that
it is very different from Unix where you must use some manual iconv()
wizardry if your user doesn't work in UTF-8 locale to begin with.
With a conforming implementation
(Sun CC, for example), you imbue the filebuf (or the [io]fstream,
which then imbues the filebuf) with the desired locale, exactly like
under Windows -- if the locale is present (and it usually is under
Unix), then it works.
[...]
I've input and
output UTF-8 under both Solaris and Linux, and I've never heard of
iconv.
|
Here is the scenario I meet quite often. Suppose you have a text file in the
user's default encoding which is _not_ UTF-8 (say EUC or Shift-JIS). You
need to read and save it in another file ('network') encoded in UTF-8. I may
be dead wrong but I don't think you can generally avoid iconv() in this
case.
--
Eugene
P.J. Plauger Guest
Posted: Sun May 09, 2004 12:15 pm Post subject: Re: deriving from std::moneypunct facet
"Eugene Gershnik" <gershnik (AT) hotmail (DOT) com> wrote
| Quote: | Here is the scenario I meet quite often. Suppose you have a text file
in the user's default encoding which is _not_ UTF-8 (say EUC or Shift-JIS).
You need to read and save it in another file ('network') encoded in
UTF-8. I may be dead wrong but I don't think you can generally avoid
iconv() in this case.
|
Unless you have a collection of handy codecvt facets that does all these
conversions, with supporting classes that make them easy to use. That's the
approach we took with our CoreX library.
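For readers who have not written one, here is a minimal sketch of what a single conversion facet of this kind can look like. It is not CoreX code: the class name `latin1_codecvt` and the helper `with_latin1` are hypothetical, and ISO-8859-1 was picked only because its mapping is trivial (each byte equals its code point). It assumes a C++11 compiler.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>

// Hypothetical facet: external bytes are ISO-8859-1, internal chars are wchar_t.
class latin1_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
public:
    explicit latin1_codecvt(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}

protected:
    // External -> internal: every byte maps to the code point of the same value.
    result do_in(std::mbstate_t&, const char* from, const char* from_end,
                 const char*& from_next, wchar_t* to, wchar_t* to_end,
                 wchar_t*& to_next) const override {
        for (; from != from_end && to != to_end; ++from, ++to)
            *to = static_cast<wchar_t>(static_cast<unsigned char>(*from));
        from_next = from;
        to_next = to;
        return from == from_end ? ok : partial;
    }

    // Internal -> external: only code points up to 0xFF are representable.
    result do_out(std::mbstate_t&, const wchar_t* from, const wchar_t* from_end,
                  const wchar_t*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        result r = ok;
        for (; from != from_end; ++from, ++to) {
            if (to == to_end) { r = partial; break; }
            if (static_cast<unsigned long>(*from) > 0xFFul) { r = error; break; }
            *to = static_cast<char>(static_cast<unsigned char>(*from));
        }
        from_next = from;
        to_next = to;
        return r;
    }

    // The encoding is stateless, so there is never a shift sequence to write.
    result do_unshift(std::mbstate_t&, char* to, char*, char*& to_next) const override {
        to_next = to;
        return noconv;
    }

    int do_encoding() const noexcept override { return 1; }        // fixed width: 1 byte
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 1; }
    int do_length(std::mbstate_t&, const char* from, const char* from_end,
                  std::size_t max) const override {
        const std::size_t avail = static_cast<std::size_t>(from_end - from);
        return static_cast<int>(avail < max ? avail : max);
    }
};

// Hypothetical helper: a copy of `base` whose wchar_t/char codecvt is Latin-1.
std::locale with_latin1(const std::locale& base) {
    return std::locale(base, new latin1_codecvt);
}
```

Imbuing a `std::wifstream` or `std::wofstream` with `with_latin1(...)` before opening the file should then make the filebuf read or write Latin-1 bytes. A multibyte encoding such as Shift-JIS needs real tables and state handling, which is where a ready-made collection of facets earns its keep.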
P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
kanze@gabi-soft.fr Guest
|
Posted: Mon May 10, 2004 10:03 am Post subject: Re: deriving from std::moneypunct facet |
|
|
"Eugene Gershnik" <gershnik (AT) hotmail (DOT) com> wrote
| Quote: | kanze (AT) gabi-soft (DOT) fr wrote:
"Eugene Gershnik" <gershnik (AT) nospam (DOT) hotmail.com> wrote in message
news:<Q-OdndRI7N1sqQrdRVn-hg (AT) speakeasy (DOT) net>...
kanze (AT) gabi-soft (DOT) fr wrote:
"Eugene Gershnik" <gershnik (AT) nospam (DOT) hotmail.com> wrote
The original question about ch_DE.utf-8 is meaningless on
Windows. It doesn't have anything like Unix UTF-8 locales nor
does it need them.
Anything which connects to the Internet needs some sort of
support for UTF-8, since it is pretty much the standard
international codeset on the Internet.
True but this has nothing to do with UTF-8 locales.
According to the standard, conversion on input and output depend on
the locale embedded in the filebuf. If you want to encode to or
from UTF-8, you need a locale which supports it.
Alternatively I can use a custom streambuf. One will probably be
required anyway for a real-life networking.
|
The logical solution to this would be to use a separate filtering
streambuf for the code translation. It's a bit of a shame that the
standard merged this (logically separate) concept into filebuf, instead
of making it generally available. I believe that some third party
libraries do provide this as an extension. Even without it, however, it
would be foolish not to leverage off the existing library code (e.g. the
codecvt facet).
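To make the idea concrete, here is a rough sketch of an output-only filtering streambuf that delegates the code translation to whatever `codecvt<wchar_t, char, mbstate_t>` facet the supplied locale carries. The names `encoding_filterbuf` and `encode` are hypothetical, it assumes C++11, and it converts one wide character at a time, so it is an illustration rather than something tuned for real use.

```cpp
#include <cwchar>
#include <locale>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical output filter: accepts wide characters and forwards their
// external encoding, as produced by the codecvt facet of `loc`, to `sink`.
class encoding_filterbuf : public std::wstreambuf {
public:
    typedef std::codecvt<wchar_t, char, std::mbstate_t> codecvt_type;

    encoding_filterbuf(std::streambuf* sink, const std::locale& loc)
        : sink_(sink), loc_(loc),
          cvt_(&std::use_facet<codecvt_type>(loc_)), state_() {}

protected:
    // No put area is set up, so every character written arrives here.
    int_type overflow(int_type c) override {
        if (traits_type::eq_int_type(c, traits_type::eof()))
            return traits_type::not_eof(c);
        const wchar_t in = traits_type::to_char_type(c);
        const wchar_t* in_next = &in;
        char out[16];
        char* out_next = out;
        const std::codecvt_base::result r =
            cvt_->out(state_, &in, &in + 1, in_next, out, out + sizeof out, out_next);
        // Anything but a complete conversion is reported as failure. A fuller
        // version would buffer input to cope with `partial` results.
        if (r != std::codecvt_base::ok || in_next != &in + 1)
            return traits_type::eof();
        const std::streamsize n = out_next - out;
        return sink_->sputn(out, n) == n ? c : traits_type::eof();
    }

    int sync() override { return sink_->pubsync(); }

private:
    std::streambuf* sink_;      // destination for the encoded bytes (not owned)
    std::locale loc_;           // keeps the facet alive
    const codecvt_type* cvt_;
    std::mbstate_t state_;
};

// Hypothetical convenience wrapper: encode a wide string into a narrow one.
std::string encode(const std::wstring& text, const std::locale& loc) {
    std::stringbuf sink;
    encoding_filterbuf filter(&sink, loc);
    std::wostream out(&filter);
    out << text;
    out.flush();
    return sink.str();
}
```

Because the sink is an arbitrary `std::streambuf*`, the same filter can sit in front of a socket streambuf, a string buffer or a file, which is exactly the generality that is lost by burying the conversion inside filebuf.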
| Quote: | Windows never uses UTF-8 as the encoding for narrow strings in any
locale.
There are two separate issues here: what comes with a given
compiler (VC++, Borland, etc.), and what can be added.
There is a third issue. If an underlying platform supports its own
concept of locales the C++ ones should better play it nice and be
interoperable with them. The C++ library cannot encompass all
possible needs and resorting to system specific calls may sometimes be
necessary. If there is no good mapping between a system locale and a
C++ one this may make it hard if not impossible. There is no such
thing as UTF-8 system locale on Windows and a C++ library should IMHO
reflect this fact.
|
I'm not quite sure what you are saying here. That we should ignore the
standard anytime the local platform has a different way of doing
something? That the C++ library should not attempt to furnish behaviors
that the local platform doesn't furnish directly and in a compatible
form?
The C++ way of handling different file encodings is by means of the
codecvt facet. The Microsoft compiler, at least since 5.0, has had a
very good implementation of this -- for whatever reasons, Microsoft was
very much in advance of most C++ implementations in this regard. At the [...]
character encoding, rather than specifying a separate character code for
each locale. But this has nothing to do with how the locales work in
C++ ; you can easily create a locale based on a language specific
locale, and then embed your UTF-8 specific facets in it. This would
seem to be the most logical and the simplest way to do things.
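As a minimal sketch of that combination, assuming a C++11 library: the two-argument `std::locale` constructor copies every facet from the base locale and replaces only the one being installed. `std::codecvt_utf8` from `<codecvt>` is used here purely for brevity (it is deprecated as of C++17, and any other UTF-8 codecvt facet could be dropped in its place). The function names `with_utf8` and `to_external` are hypothetical.

```cpp
#include <codecvt>    // std::codecvt_utf8 (C++11; deprecated since C++17)
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Copy every facet of `base` except the wchar_t/char codecvt, which is
// replaced by a UTF-8 one. The locale takes ownership of the new facet.
std::locale with_utf8(const std::locale& base) {
    return std::locale(base, new std::codecvt_utf8<wchar_t>);
}

// Hypothetical helper for inspection: encode a wide string with the
// codecvt facet found in `loc`.
std::string to_external(const std::wstring& text, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> codecvt_type;
    if (text.empty())
        return std::string();
    const codecvt_type& cvt = std::use_facet<codecvt_type>(loc);
    std::mbstate_t state = std::mbstate_t();
    std::string bytes(text.size() * cvt.max_length(), '\0');
    const wchar_t* from_next = 0;
    char* to_next = 0;
    const std::codecvt_base::result r =
        cvt.out(state, text.data(), text.data() + text.size(), from_next,
                &bytes[0], &bytes[0] + bytes.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("conversion failed");
    bytes.resize(static_cast<std::string::size_type>(to_next - &bytes[0]));
    return bytes;
}
```

In real code the base would be the language-specific locale, for example `with_utf8(std::locale("de_DE"))`. That name is platform-dependent, and the constructor throws `std::runtime_error` if the locale is not installed. The resulting locale is then imbued into a wide file stream before the file is opened.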
| Quote: | Instead it guarantees that wchar_t encoding is locale-independent
and is always UTF-16. Conversions between UTF-16 and UTF-8 are
pretty straightforward. Thus, I'd say that a correct way to deal
with UTF-8 in C++ on Windows is to work with wide streams and
perform conversions in the streambuf.
This is exactly what we are talking about. And the conversion in
streambuf (actually in filebuf) depends on the locale imbued in the
streambuf.
What I had in mind was to use a custom streambuf.
|
Fine. But it would be very strange, or at least, very un-C++ish, if it
didn't use the codecvt facet for the code translation. No sense in
reinventing the wheel.
| Quote: | This way the normal locale machinery deals with converting between
internal narrow encoding and "Unicode" while streambuf is
responsible for the "Unicode" representation on the wire
i.e. UTF-8. Note that it is very different from Unix where you
must use some manual iconv() wizardry if your user doesn't work in
UTF-8 locale to begin with.
With a conforming implementation (Sun CC, for example), you imbue
the filebuf (or the [io]fstream, which then imbues the filebuf)
with the desired locale, exactly like under Windows -- if the
locale is present (and it usually is under Unix), then it works.
[...]
I've input and
output UTF-8 under both Solaris and Linux, and I've never heard of
iconv.
Here is the scenario I meet quite often. Suppose you have a text file
in the user's default encoding which is _not_ UTF-8 (say EUC or
Shift-JIS). You need to read and save it in another file ('network')
encoded in UTF-8. I may be dead wrong but I don't think you can
generally avoid iconv() in this case.
|
You're dead wrong. The standard idiom would be to open the source file
with a locale using the user's default encoding, and to open the
destination file with a locale supporting UTF-8, and then copy.
Something like:
    // Assumes the global locale already reflects the user's default encoding.
    std::ifstream source( sourceFilename.c_str() ) ;
    std::ofstream dest( destFilename.c_str() ) ;
    dest.imbue( std::locale( std::locale(),
                             "en_US.utf-8",
                             std::locale::ctype ) ) ;
    dest << source.rdbuf() ;
(Modulo error handling, of course. You would normally verify that the
opens worked, for example.)
In theory, anyway -- in practice, Unix compilers tend to be far behind
Microsoft in terms of standard conformance.
IMHO, this is also the preferred solution under Windows. It should
work, provided you change the locale name to whatever the Windows
conventions require.
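One caveat when trying the copy idiom: the standard specifies `codecvt<char, char, mbstate_t>` as a non-converting facet, so narrow file streams pass bytes through unchanged whatever locale is imbued. To actually transcode, the copy has to go through wide streams, so that the source filebuf decodes into wchar_t and the destination filebuf encodes from it. Below is a minimal sketch under that assumption. The function name `transcode` is hypothetical, and the locales are passed in because their names differ between platforms.

```cpp
#include <fstream>
#include <locale>
#include <string>

// Copy `source_name` to `dest_name`, decoding with the codecvt facet of
// `source_loc` and encoding with the codecvt facet of `dest_loc`.
// Returns true if both files opened and the copy succeeded.
bool transcode(const std::string& source_name, const std::locale& source_loc,
               const std::string& dest_name, const std::locale& dest_loc) {
    std::wifstream source;
    source.imbue(source_loc);            // choose the encoding before opening
    source.open(source_name.c_str());

    std::wofstream dest;
    dest.imbue(dest_loc);
    dest.open(dest_name.c_str());

    if (!source.is_open() || !dest.is_open())
        return false;

    dest << source.rdbuf();              // sets failbit if nothing could be copied
    dest.close();                        // flushes; sets failbit on write errors
    return !dest.fail();
}
```

A call might look like `transcode("in.txt", std::locale(""), "out.txt", std::locale("en_US.UTF-8"))`, where `std::locale("")` asks for the user's preferred locale. Both names are platform-dependent assumptions, and `std::locale` throws `std::runtime_error` when a named locale is not available.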
--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34