 |
C++Talk.NET C++ language newsgroups
Roger Guest
|
Posted: Mon Oct 13, 2003 8:47 pm Post subject: Unicode plane detection |
|
|
I have a routine that is supposed to detect plane one unicode
characters (for a publisher - they want to detect them) embedded in an
xml file.
So far I've assumed UTF-8 encoding, so I am checking for the first
byte in a utf-8 multi-byte sequence. The code is as follows:
code starts >>
unsigned char inpx=0;
while ( (inp = getc(fd)) != EOF ) {
    const int PLANE_ONE_UNICODE = 0x01;
    // first byte of unicode sequence
    inpx=inp;
    if ( inpx >= 0xc0 && inpx <= 0xfd) {
        int nb=0;
        // find out how long the whole sequence is
        if (inp & 0x80) {
            nb++;
        }
        if (inp & 0x40) {
            nb++;
        }
        if (inp & 0x20) {
            nb++;
        }
        if (inp & 0x10) {
            nb++;
        }
        printf("%d byte unicode is found %X\n",nb,inpx);
        if (inp & PLANE_ONE_UNICODE ) {
            printf("plane 1 char found\n");
        }
    }
}
code Ends
This routine does not take account of surrogate pairs of UTF-16
characters/bytes. Does anyone know of a C++ library that will let me
process UTF-8/UTF-16 encoded Unicode, or can anyone offer some advice?
This code has to work on Unix and W2K, so I dare say there are byte
order issues to deal with too. The client has some Python code which
seems to have some Unicode routines built in (e.g. ord()); I've not
been able to find any equivalent routines for C++.
Can anyone out there help me, please? Virtual pint to those who can.
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
 |
Ben Hutchings Guest
|
Posted: Thu Oct 16, 2003 2:32 pm Post subject: Re: Unicode plane detection |
|
|
In article <dd7a565c.0310130435.67fab35b (AT) posting (DOT) google.com>,
Roger wrote:
| Quote: | I have a routine that is supposed to detect plane one unicode
characters (for a publisher - they want to detect them) embedded in an
xml file.
So far I've assumed UTF-8 encoding, so I am checking for the first
byte in a utf-8 multi-byte sequence. The code is as follows:
code starts
unsigned char inpx=0;
while ( (inp = getc(fd)) != EOF ) {
|
Presumably inp is declared with type int?
| Quote: | const int PLANE_ONE_UNICODE = 0x01;
// first byte of unicode sequence
inpx=inp;
if ( inpx >= 0xc0 && inpx <= 0xfd) {
int nb=0;
// find out how long the whole sequence is
if (inp & 0x80) {
nb++;
}
|
This test is pointless, given the condition of the enclosing if
statement.
| Quote: | if (inp & 0x40) {
nb++;
}
if (inp & 0x20) {
nb++;
}
if (inp & 0x10) {
nb++;
}
|
These if statements must be nested, because you must count only
contiguous 1-bits.
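A minimal sketch of that counting, not the original poster's code. The helper name `utf8SequenceLength` is hypothetical. It counts only the contiguous 1-bits at the top of the lead byte and makes no attempt to reject overlong lead bytes such as 0xC0.

```cpp
// Hypothetical helper (not from the thread): returns the total length in
// bytes of the UTF-8 sequence introduced by `lead`, or 0 if `lead` cannot
// start a sequence (a continuation byte 10xxxxxx, or more than four
// leading 1-bits).
int utf8SequenceLength(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) {
        return 1;  // 0xxxxxxx: plain ASCII, a one-byte sequence
    }
    // Count the contiguous 1-bits, starting at the most significant bit
    // and stopping at the first 0-bit.
    int ones = 0;
    for (unsigned mask = 0x80; mask != 0 && (lead & mask) != 0; mask >>= 1) {
        ++ones;
    }
    // Exactly one leading 1-bit marks a continuation byte; five or more
    // cannot occur in UTF-8 restricted to U+0000..U+10FFFF.
    return (ones >= 2 && ones <= 4) ? ones : 0;
}
```

With this, a lead byte of 0xE2 gives 3, while 0x80 (a continuation byte) gives 0.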
| Quote: | printf("%d byte unicode is found %X\n",nb,inpx);
if (inp & PLANE_ONE_UNICODE ) {
printf("plane 1 char found\n");
}
|
This doesn't make sense; you're testing the first byte of the
sequence and not the code that the whole sequence represents.
You need to test that nb == 4 (if it is smaller, the code value
must be in plane 0), and then combine the first two bytes of the
sequence to find whether the code is in plane 1.
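One way to sketch that combination, assuming the standard four-byte layout 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx, where the five u bits are the plane number. The function name `planeOfFourByteSequence` is made up for illustration.

```cpp
// Hypothetical helper (not from the thread): given the first two bytes of a
// four-byte UTF-8 sequence (11110uuu 10uuzzzz ...), returns the plane
// number, i.e. the code point shifted right by 16 bits. Returns -1 if the
// two bytes cannot start a valid four-byte sequence.
int planeOfFourByteSequence(unsigned char b0, unsigned char b1)
{
    if ((b0 & 0xF8) != 0xF0) {
        return -1;  // not a four-byte lead byte
    }
    if ((b1 & 0xC0) != 0x80) {
        return -1;  // second byte is not a continuation byte
    }
    // Three plane bits come from the lead byte, two from the second byte.
    int plane = ((b0 & 0x07) << 2) | ((b1 >> 4) & 0x03);
    // Plane 0 here would be an overlong encoding, and planes above 16 lie
    // outside the Unicode range.
    return (plane >= 1 && plane <= 16) ? plane : -1;
}
```

A plane 1 character is then any four-byte sequence for which this returns 1, for example one starting F0 9D (U+1D11E).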
| Quote: | }
}
code Ends
This routine does not take account of surrogate pairs of utf-16
characters/bytes.
|
The codes reserved for surrogates are illegal in UTF-8, but you
may want to detect them anyway in case the file is encoded
wrongly.
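A small sketch of such a check, under the assumption that a wrongly encoded surrogate shows up as a three-byte sequence. U+D800..U+DFFF would encode as ED A0 80 through ED BF BF, so the first two bytes are enough to spot one. The name `isEncodedSurrogate` is hypothetical.

```cpp
// Hypothetical helper (not from the thread): true if the first two bytes of
// a three-byte UTF-8 sequence would decode to a surrogate code point
// (U+D800..U+DFFF), which well-formed UTF-8 never contains.
bool isEncodedSurrogate(unsigned char b0, unsigned char b1)
{
    // Lead byte 0xED covers U+D000..U+DFFF; a second byte of 0xA0..0xBF
    // narrows that to U+D800..U+DFFF.
    return b0 == 0xED && b1 >= 0xA0 && b1 <= 0xBF;
}
```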
| Quote: | Does anyone know of a C++ library that'll enable me
to process utf-8/16 encoded unicode? or can they offer some advice.
snip |
Try IBM's open source International Components for Unicode
<http://oss.software.ibm.com/icu/>.
 |
Rob Lugt Guest
|
Posted: Fri Oct 17, 2003 9:45 am Post subject: Re: Unicode plane detection |
|
|
Hi Roger,
OpenTop [1] contains excellent support for Unicode. Here is a mini program
using OpenTop streams which will probably do what you're after. It
prints the character position (not byte position) of each Unicode
character above the Basic Multilingual Plane (BMP):
#include <ot/io/FileInputStream.h>
#include <ot/io/InputStreamReader.h>
#include <iostream>
using namespace std;
using namespace ot;
using namespace ot::io;
int main(int argc, char* argv[])
{
    try {
        RefPtr<InputStream> rpIS(new FileInputStream("input.txt"));
        RefPtr<Reader> rpRdr(new InputStreamReader(rpIS.get()));
        Character ch;
        long count=0;
        while( (ch = rpRdr->readAtomic()) != Character::EndOfFileCharacter)
        {
            // anything above 0xFFFF lies outside the BMP
            if(ch.toUnicode() > 0xFFFF)
                cout << "Unicode character " << ch.toUnicode()
                     << " in character position " << count << endl;
            count++;
        }
    }
    catch(Exception& e) {
        cout << e.toString() << endl;
    }
    return 0;
}
Best regards
Rob Lugt
[1] http://www.elcel.com/products/opentop
"Roger"
| Quote: | I have a routine that is supposed to detect plane one unicode
characters (for a publisher - they want to detect them) embedded in an
xml file.
So far I've assumed UTF-8 encoding, so I am checking for the first
byte in a utf-8 multi-byte sequence. The code is as follows:
code starts
|
 |
kanze@gabi-soft.fr Guest
|
Posted: Fri Oct 17, 2003 12:05 pm Post subject: Re: Unicode plane detection |
|
|
Ben Hutchings <do-not-spam-benh (AT) bwsint (DOT) com> wrote
| Quote: | In article <dd7a565c.0310130435.67fab35b (AT) posting (DOT) google.com>,
Roger wrote:
I have a routine that is supposed to detect plane one unicode
characters (for a publisher - they want to detect them) embedded in
an xml file. So far I've assumed UTF-8 encoding, so I am checking
for the first byte in a utf-8 multi-byte sequence. The code is as
follows: code starts
unsigned char inpx=0;
while ( (inp = getc(fd)) != EOF ) {
Presumably inp is declared with type int?
|
That is the standard idiom.
| Quote: | const int PLANE_ONE_UNICODE = 0x01;
// first byte of unicode sequence
inpx=inp;
if ( inpx >= 0xc0 && inpx <= 0xfd) {
int nb=0;
// find out how long the whole sequence is
if (inp & 0x80) {
nb++;
}
This test is pointless, given the condition of the enclosing if
statement.
|
The whole approach seems wrong to me. When I had to do the same thing, I
just created a table with 256 ints, and indexed into it. I then used
the number of bytes as an index into a second table with the mask for
the first byte. Roughly:
int ch = source.get() ;
while ( ch != EOF ) {
    int byteCount = lengthTable[ ch ] ;
    if ( byteCount <= 0 ) {
        throw IllegalCharacter() ;
    }
    -- byteCount ;      // now the number of continuation bytes
    uint32_t result = ch & maskTable[ byteCount ] ;
    while ( byteCount > 0 ) {
        ch = source.get() ;
        if ( ch == EOF ) {
            throw IncompleteCharacter() ;
        } else if ( (ch & 0xC0) != 0x80 ) {
            throw IllegalCharacter() ;
        }
        result = (result << 6) | (ch & 0x3F) ;
        -- byteCount ;  // one continuation byte consumed
    }
    store( result ) ;
    ch = source.get() ; // lead byte of the next character
}
(The code was actually more complicated, because it was designed to
implement a codecvt -- and I had to store intermediate states in a state
type, so that partial reads could work.)
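For readers who want something they can compile, here is a self-contained sketch of the same two-table idea. It is an illustration under stated assumptions, not the code described above. `decodeUtf8` and `buildLengthTable` are hypothetical names, a `std::string` stands in for `source`, `std::runtime_error` stands in for the exception types, and the length table is filled by a loop rather than written out as 256 literals. It does not reject overlong three- and four-byte forms, encoded surrogates, or values above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Total sequence length for each possible lead byte; 0 marks a byte that
// cannot start a sequence (continuation bytes, 0xC0, 0xC1, 0xF5..0xFF).
std::vector<int> buildLengthTable()
{
    std::vector<int> table(256, 0);
    for (int b = 0x00; b <= 0x7F; ++b) table[b] = 1;  // 0xxxxxxx
    for (int b = 0xC2; b <= 0xDF; ++b) table[b] = 2;  // 110xxxxx
    for (int b = 0xE0; b <= 0xEF; ++b) table[b] = 3;  // 1110xxxx
    for (int b = 0xF0; b <= 0xF4; ++b) table[b] = 4;  // 11110xxx
    return table;
}

// Payload mask for the lead byte, indexed by the number of continuation
// bytes that follow it.
const unsigned char maskTable[4] = { 0x7F, 0x1F, 0x0F, 0x07 };

// Decodes a whole UTF-8 byte string into code points.
std::vector<std::uint32_t> decodeUtf8(const std::string& bytes)
{
    static const std::vector<int> lengthTable = buildLengthTable();
    std::vector<std::uint32_t> result;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        int byteCount = lengthTable[lead];
        if (byteCount <= 0) {
            throw std::runtime_error("illegal lead byte");
        }
        --byteCount;  // now the number of continuation bytes
        std::uint32_t cp = lead & maskTable[byteCount];
        for (; byteCount > 0; --byteCount) {
            if (i == bytes.size()) {
                throw std::runtime_error("incomplete character");
            }
            unsigned char next = static_cast<unsigned char>(bytes[i++]);
            if ((next & 0xC0) != 0x80) {
                throw std::runtime_error("illegal continuation byte");
            }
            cp = (cp << 6) | (next & 0x3F);  // append six payload bits
        }
        result.push_back(cp);
    }
    return result;
}
```

Detecting characters outside plane 0 then reduces to checking each returned value against 0xFFFF.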
[...]
| Quote: | Does anyone know of a C++ library that'll enable me
to process utf-8/16 encoded unicode? or can they offer some advice.
snip
Try IBM's open source International Components for Unicode
<http://oss.software.ibm.com/icu/>.
|
IBM's open source effort is to be applauded, but it apparently got
started before Unicode 3.2 -- it only supports the UCS-2 (or UTF-16, I'm
not sure), and not full Unicode.
I have a somewhat sketchy and experimental version in the Experimental
section at my site; some of the code might be usable as well.
--
James Kanze GABI Software mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16
 |
Andy Heninger Guest
|
Posted: Sat Oct 18, 2003 9:25 am Post subject: Re: Unicode plane detection |
|
|
kanze (AT) gabi-soft (DOT) fr wrote:
| Quote: | Ben Hutchings <do-not-spam-benh (AT) bwsint (DOT) com> wrote in message
Try IBM's open source International Components for Unicode
<http://oss.software.ibm.com/icu/>.
IBM's open source effort is to be applauded, but it apparently got
started before Unicode 3.2 -- it only supports the UCS-2 (or UTF-16, I'm
not sure), and not full Unicode.
|
The ICU project tries very hard to stay completely up to date with the
Unicode standard. The latest, Unicode 4.0, was released in April of
this year; support for it in ICU came with version 2.6, released in June.
http://www.unicode.org/versions/Unicode4.0.0/
It's got a lot of characters:
| Quote: |
Graphic          96,248
Format              134
Control              65
Private Use     137,468
Surrogate         2,048
Noncharacter         66
Reserved        878,083
|
ICU uses UTF-16 internally, meaning that the non-plane-0 code points are
represented by a pair of 16 bit values.
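The pairing itself is plain arithmetic and can be sketched as follows. This illustrates the UTF-16 surrogate scheme in general and is not ICU code. The function names are hypothetical, and the caller is assumed to pass only code points in U+10000..U+10FFFF to the first function and a valid high/low pair to the second.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: splits a code point in U+10000..U+10FFFF into a
// (high, low) UTF-16 surrogate pair. The range is assumed, not checked.
std::pair<std::uint16_t, std::uint16_t> toSurrogatePair(std::uint32_t cp)
{
    std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
    std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    std::uint16_t low = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
    return std::make_pair(high, low);
}

// Hypothetical helper: recombines a high surrogate (0xD800..0xDBFF) and a
// low surrogate (0xDC00..0xDFFF) into the code point they represent.
std::uint32_t fromSurrogatePair(std::uint16_t high, std::uint16_t low)
{
    std::uint32_t top = static_cast<std::uint32_t>(high) - 0xD800;
    std::uint32_t bottom = static_cast<std::uint32_t>(low) - 0xDC00;
    return 0x10000 + (top << 10) + bottom;
}
```

For instance, U+1D11E splits into 0xD834 followed by 0xDD1E, and recombining those two values gives U+1D11E back.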
Returning to the original topic of this thread, ICU includes C #define
macros for working with plain UTF-8 and UTF-16 strings (how un-C++ can
you get?). Borrowing code from these would be a good choice for anyone
needing to pick out characters from UTF-8 encoded char * strings, even
if no other part of ICU is wanted. The file of interest is
icu/source/common/unicode/utf8.h.
Historically, the choice of UTF-16 as the primary storage format for ICU
does go way back to the very early days of Unicode, and to Java, with
its 16 bit Unicode chars.
ICU began life as a Java library, and was the original source for much
of the i18n support in the JDK. The Java classes were subsequently
ported to C++ and became ICU4C, and from there, parallel plain C APIs
were developed for use from applications that are allergic to C++ for
one reason or another. (There are still many of these around.)
-- Andy Heninger
heninger (AT) us (DOT) ibm.com
 |
Aaron Bentley Guest
|
Posted: Sat Oct 18, 2003 1:36 pm Post subject: Re: Unicode plane detection |
|
|
kanze (AT) gabi-soft (DOT) fr wrote:
| Quote: | Does anyone know of a C++ library that'll enable me
to process utf-8/16 encoded unicode? or can they offer some advice.
snip
Try IBM's open source International Components for Unicode
<http://oss.software.ibm.com/icu/>.
IBM's open source effort is to be applauded, but it apparently got
started before Unicode 3.2 -- it only supports the UCS-2 (or UTF-16, I'm
not sure), and not full Unicode.
|
ICU 2.6 supports Unicode 4.0, but since its storage is 16 bit, it's too
tricky to modify strings. If read-only access is adequate, it can be
used. (Oh, and the OP apparently just wants utf-16 support.)
Right now I'm using the ICU C library with my vector<UChar32> and it
works, but I can imagine ditching it later for something that supports
direct utf8->utf32 conversions.
Aaron
--
Aaron Bentley
www.aaronbentley.com
 |
James Kanze Guest
|
Posted: Sun Oct 19, 2003 9:45 pm Post subject: Re: Unicode plane detection |
|
|
Andy Heninger <heninger (AT) us (DOT) ibm.com> writes:
| Quote: | kanze (AT) gabi-soft (DOT) fr wrote:
IBM's open source effort is to be applauded, but it apparently
got started before Unicode 3.2 -- it only supports the UCS-2 (or
UTF-16, I'm not sure), and not full Unicode.
The ICU project tries very hard to stay completely up to date with
the Unicode standard. The latest, Unicode 4.0, was released in
April of this year; support for it in ICU came with version 2.6,
released in June.
http://www.unicode.org/versions/Unicode4.0.0/
It's got a lot of characters,
Graphic 96,248
Format 134
Control 65
Private Use 137,468
Surrogate 2,048
Noncharacter 66
Reserved 878,083
ICU uses UTF-16 internally, meaning that the non-plane-0 code points
are represented by a pair of 16 bit values.
|
Thanks for the information. I wasn't aware what the current status was;
all I did know was that their "Unicode" characters were 16 bits, and
that true effective Unicode support requires at least 21 bits (and so in
practice, 32 bits on most machines).
| Quote: | Returning to the original topic of this thread, ICU includes C
#define macros for working with plain UTF-8 and UTF-16 strings (how
un-C++ can you get?). Borrowing code from these would be a good
choice for anyone needing to pick out characters from UTF-8 encoded
char * strings, even if no other part of ICU is wanted. The file of
interest is icu/source/common/unicode/utf8.h.
|
Borrowing working code (when it is legal) is always a good idea:-). On
the other hand, converting between UTF-8 and full 32 bit ISO 10646 is
rather simple. Much of the value of ICU lies in the enormous amount of
other support it offers.
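To give an idea of how little is involved, here is a sketch of the 32-bit to UTF-8 direction. It is an illustration only, not code from ICU or from anyone in this thread. `encodeUtf8` is a hypothetical name, and rejecting surrogates and values above U+10FFFF with `std::invalid_argument` is a choice made for this example.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes one code point as a UTF-8 byte string.
std::string encodeUtf8(std::uint32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::invalid_argument("not an encodable code point");
    }
    std::string out;
    if (cp < 0x80) {
        // One byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Only code points at or above 0x10000, that is, everything outside plane 0, take the four-byte branch.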
| Quote: | Historically, the choice of UTF-16 as the primary storage format for
ICU does go way back to the very early days of Unicode, and to Java,
with its 16 bit Unicode chars.
|
I know the historical reasons. It's a real problem. Still, if I were
starting a new project in C++ today, and had a choice, I'd go with 32
bit wide characters.
--
James Kanze mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France +33 1 41 89 80 93