C++Talk.NET Forum Index C++Talk.NET
C++ language newsgroups
 

EOF problem
BH
Posted: Sun Apr 18, 2004 2:58 pm    Post subject: EOF problem



I'm writing a decryption program which reads an encrypted file and decrypts
it. However, some of the encrypted text is the same as EOF, and thus I can't
read the text that comes after it. What should I do?

I tried getline and read, but it still doesn't work. Can anyone help me?



[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
John Potter
Posted: Mon Apr 19, 2004 5:47 am    Post subject: Re: EOF problem



On 18 Apr 2004 10:58:45 -0400, "BH" <bbg (AT) cc (DOT) com> wrote:

Quote:
I'm doing a decryption prog which read a enc. file and decrypt it. However,
some of the encrypted text is the same as EOF and thus i cant read the text
that after it. What should i do?

Here is a little code that should show the answer. One of my compilers
gives a result of 25 256. Without the second argument for ofs, it gives
25 257.

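#include <fstream>
#include <iostream>
#include <iterator>
using namespace std;
int main () {
    {
        // Write all 256 byte values; ios::binary prevents newline translation.
        ofstream ofs("junk.tmp", ios::binary);
        for (int x = 0; x != 256; ++ x)
            ofs.put(x);
    }
    {
        // Count the characters seen when reading in text mode.
        ifstream ifs1("junk.tmp");
        cout << distance(istream_iterator<char>(ifs1 >> noskipws),
                         istream_iterator<char>()) << endl;
    }
    {
        // Count the characters seen when reading in binary mode.
        ifstream ifs2("junk.tmp", ios::binary);
        cout << distance(istream_iterator<char>(ifs2 >> noskipws),
                         istream_iterator<char>()) << endl;
    }
}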

John

Michael Karcher
Posted: Mon Apr 19, 2004 5:50 am    Post subject: Re: EOF problem



BH <bbg (AT) cc (DOT) com> wrote:
Quote:
I'm doing a decryption prog which read a enc. file and decrypt it. However,
some of the encrypted text is the same as EOF and thus i cant read the text
that after it. What should i do?
Open the encrypted file with the ios::binary flag.


Michael Karcher

Ben Hutchings
Posted: Mon Apr 19, 2004 5:51 am    Post subject: Re: EOF problem

BH wrote:
Quote:
I'm doing a decryption prog which read a enc. file and decrypt it. However,
some of the encrypted text is the same as EOF

That's not really possible. However, in a file opened in text mode in
DOS or Windows a byte with a value of 26 is treated as indicating
end-of-file. This is done for ancient compatibility with CP/M.

Quote:
and thus i cant read the text
that after it. What should i do?

I tried the getline and read.. but still cant work.. can anyone help me?

Open the file in binary mode, not text mode. Use the read() member
function of the stream, not getline().
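
A minimal sketch of that advice, assuming the data lives in an ordinary file. The helper names read_all_bytes and write_all_byte_values, the 4096-byte chunk size, and any file name passed in are illustrative choices, not something from the thread. The file is opened with ios::binary and read in chunks with read(), using gcount() to pick up a short final chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Reads every byte of the file at `path` (hypothetical name) in binary mode.
std::vector<char> read_all_bytes(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    std::vector<char> bytes;
    char chunk[4096];
    // read() reports failure when it cannot fill the whole buffer, but
    // gcount() still says how many bytes arrived, so keep those too.
    while (in.read(chunk, sizeof chunk) || in.gcount() > 0) {
        bytes.insert(bytes.end(), chunk, chunk + in.gcount());
    }
    return bytes;
}

// Writes the byte values 0..255, including 26 (Ctrl-Z), in binary mode.
void write_all_byte_values(const std::string& path) {
    std::ofstream out(path.c_str(), std::ios::out | std::ios::binary);
    for (int value = 0; value != 256; ++value) {
        out.put(static_cast<char>(value));
    }
}
```

Calling write_all_byte_values("some.tmp") followed by read_all_bytes("some.tmp") should hand back all 256 bytes, with the value 26 treated like any other byte.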

Dave Moore
Posted: Mon Apr 19, 2004 6:46 pm    Post subject: Re: EOF problem

"BH" <bbg (AT) cc (DOT) com> wrote

Quote:
I'm doing a decryption prog which read a enc. file and decrypt it. However,
some of the encrypted text is the same as EOF and thus i cant read the text
that after it. What should i do?

I tried the getline and read.. but still cant work.. can anyone help me?


You need to use unformatted input .. something like:

std::ifstream encfile("encfilename", std::ios::in|std::ios::binary);

should do the trick for opening the input file. Then, you need to use
the one-character-at-a-time input method of istream, namely get().
The reason getline and other multiple-character-reading functions fail
for you is that all of them rely on a terminator .. even read (c.f.
TC++PL section 21.3.4). In your case, since *any* ASCII value can
result from your encryption algorithm, there is no logical choice for
the terminator. So, just read your characters one-at-a-time using get
and all should be well.
(ASIDE ... I am not sure that opening the ifstream with the
std::ios::binary specification is really necessary when using the get
function, but I always do it myself for maximum readability. It
certainly doesn't hurt anything).

Now, there are a couple of related issues you should keep in mind:

1) Now the proper way to check for the end of your file is by
checking the state of the istream: e.g.

while (encfile) {//whatever}

or (better IMO)

while (!encfile.eof()) {//whatever}

2) Now, since you are reading in your data as raw bytes, you must
handle the type-conversion by yourself ... this would most likely be
an issue if you are using a character-type that is longer than 1 byte
(e.g. wchar_t). For example:

while (!encfile.eof()) {

    const size_t SIZE=sizeof(wchar_t);
    char raw_wchar[SIZE];
    for (int i=0; i<SIZE; ++i)
        encfile.get(raw_wchar[i]);

    const wchar_t &data = static_cast<const wchar_t &>(*raw_wchar);
    // code using data

}

Note that the above code implicitly assumes correct alignment of the
input stream (i.e. that it contains an integral number of wchar_t's)
... it would be somewhat safer to check the state of the stream after
each get operation.

HTH, Dave Moore

Dave Moore
Posted: Tue Apr 20, 2004 7:45 pm    Post subject: Re: EOF problem

dtmoore (AT) rijnh (DOT) nl (Dave Moore) wrote in message
news:<306d400f.0404190207.3c4ec8f6 (AT) posting (DOT) google.com>...
Quote:
"BH" <bbg (AT) cc (DOT) com> wrote

I'm doing a decryption prog which read a enc. file and decrypt it.
However, some of the encrypted text is the same as EOF and thus i
cant read the text that after it. What should i do?

I tried the getline and read.. but still cant work.. can anyone help me?


You need to use unformatted input .. something like:

std::ifstream encfile("encfilename", std::ios::in|std::ios::binary);

should do the trick for opening the input file. Then, you need to use
the one-character-at-a-time input method of istream, namely get().
The reason getline and other multiple-character-reading functions fail
for you is that all of them rely on a terminator .. even read (c.f.
TC++PL section 21.3.4). In your case, since *any* ASCII value can
result from your encryption algorithm, there is no logical choice for
the terminator. So, just read your characters one-at-a-time using get
and all should be well.
(ASIDE ... I am not sure that opening the ifstream with the
std::ios::binary specification is really necessary when using the get
function, but I always do it myself for maximum readability. It
certainly doesn't hurt anything).

Now, there are a couple of related issues you should keep in mind:

1) Now the proper way to check for the end of your file is by
checking the state of the istream: e.g.

while (encfile) {//whatever}

or (better IMO)

while (!encfile.eof()) {//whatever}

2) Now, since you are reading in your data as raw bytes, you must
handle the type-conversion by yourself ... this would most likely be
an issue if you are using a character-type that is longer than 1 byte
(e.g. wchar_t). For example:

while (!encfile.eof()) {

    const size_t SIZE=sizeof(wchar_t);
    char raw_wchar[SIZE];
    for (int i=0; i<SIZE; ++i)
        encfile.get(raw_wchar[i]);

    const wchar_t &data = static_cast<const wchar_t &>(*raw_wchar);
    // code using data

}

Note that the above code implicitly assumes correct alignment of the
input stream (i.e. that it contains an integral number of wchar_t's)
.. it would be somewhat safer to check the state of the stream after
each get operation.


Actually, after a bit of testing using GCC 3.4.0 and cygwin, I find that
there is no collision between EOF and any of the ASCII character set ...
running John Potter's example code from this thread produces output "256
256", with any ordering of the ASCII values. So, my solution using get (as
opposed to getline or read) is probably not necessary on all systems .. it
should work for the OP though ... I had a similar problem a while back using
(compiler != GCC) and that is how I solved it. Also, after more
reading/testing it seems it is not necessary to use std::ios::binary if you
use any of get, getline or read .. they are for character based, unformatted
input anyway.

HTH, Dave Moore

Ulrich Eckhardt
Posted: Tue Apr 20, 2004 9:47 pm    Post subject: Re: EOF problem

Dave Moore wrote:
Quote:
1) Now the proper way to check for the end of your file is by
checking the state of the istream: e.g.

while (encfile) {//whatever}

or (better IMO)

while (!encfile.eof()) {//whatever}

<Luke Skywalker> Noooooo!!!! </Luke Skywalker>

Seriously, this is dangerous advice. You should always perform a (series of)
read-operation and _AFTERWARDS_ check the streamstate. If the state has the
failbit set, you discard the values you read (or didn't read).

For formatted input (which is not the case here), you can then decide what
to do by the eofbit. If it is set, you simply reached EOF, else, some
formatting error occurred.

Therefore:

char c; // get(c) takes a char&, not an int_type&
while(encfile.get(c))
{
// use 'c' here
}
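
To make the difference concrete, here is a small sketch that counts loop iterations both ways over the same input. The function names are made up for illustration, and a std::istringstream stands in for the file so the example does not depend on anything on disk.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Anti-pattern: tests eof() before reading, so the final, failed get()
// is still counted as if it had delivered a byte.
std::size_t count_with_eof_loop(std::istream& in) {
    std::size_t n = 0;
    while (!in.eof()) {
        char c;
        in.get(c);   // fails on the last pass, but nothing checks it
        ++n;
    }
    return n;
}

// Read first, then test the stream: the body runs only for real bytes.
std::size_t count_with_get_loop(std::istream& in) {
    std::size_t n = 0;
    char c;
    while (in.get(c)) {
        ++n;
    }
    return n;
}
```

On a three-byte input the eof() loop runs four times, because eofbit is only set once a read has actually hit the end, while the get() loop runs exactly three times.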

Quote:
while (!encfile.eof()) {
const size_t SIZE=sizeof(wchar_t);
char raw_wchar[SIZE];
for (int i=0; i<SIZE; ++i)
encfile.get(raw_wchar[i]);

const wchar_t &data = static_cast<const wchar_t &>(*raw_wchar);

Sorry, but this cast is IMHO wrong (though it might slip in practice).
'*raw_wchar' yields a reference to a char. Your cast then forces it to a
wchar_t reference.
Apart from the fact that this should rather be a reinterpret_cast, the cast
should have been done earlier:

wchar_t wc;
encfile.read( reinterpret_cast<char*>(&wc), sizeof wc);
if(encfile)
{ /* use 'wc' here */ }

BTW: neither of these solutions address endianness or varying sizes of
wchar_t!!! This rather calls for a carefully applied uint*_t, lest the
portable C++ code create non-portable files.
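
One way to sidestep both issues, sketched under the assumption that the file stores 16-bit units high byte first (that byte order, and the name read_u16_be, are assumptions for illustration; the thread does not say how the data was written). The value is assembled from individual bytes with shifts, so neither the host byte order nor sizeof(wchar_t) matters.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Reads one 16-bit unit stored high byte first (assumed file format).
// Returns false, leaving `out` untouched, if two bytes were not available.
bool read_u16_be(std::istream& in, std::uint16_t& out) {
    char raw[2];
    if (!in.read(raw, 2)) {
        return false;
    }
    // Go through unsigned char so bytes >= 0x80 are not sign-extended.
    const unsigned hi = static_cast<unsigned char>(raw[0]);
    const unsigned lo = static_cast<unsigned char>(raw[1]);
    out = static_cast<std::uint16_t>((hi << 8) | lo);
    return true;
}
```

If the writer used the opposite byte order, swapping hi and lo in the final expression is the only change needed; the point is that the format is decided by the code rather than by the machine.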

Quote:
Note that the above code implicitly assumes correct alignment of the
input stream (i.e. that it contains an integral number of wchar_t's)
.. it would be somewhat safer to check the state of the stream after
each get operation.

Hmmm, there's a facility to have the stream throw an exception whenever the
streamstate hits fail or eof. Using that seems convenient but I haven't
tried it yet. Are there any drawbacks/caveats to that method?
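
One caveat can be sketched as follows (an illustration, not a full answer to the question): with failbit in the exception mask, simply running off the end of the input raises an exception, because the read that hits end of file sets failbit as well as eofbit. The function name and the `threw` out-parameter are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>

// Counts bytes with failbit and badbit in the exception mask.
// `threw` is set when the stream raised std::ios_base::failure.
std::size_t count_bytes_throwing(std::istream& in, bool& threw) {
    threw = false;
    std::size_t n = 0;
    try {
        in.exceptions(std::ios::failbit | std::ios::badbit);
        char c;
        while (in.get(c)) {
            ++n;
        }
    } catch (const std::ios_base::failure&) {
        threw = true;   // ordinary end of input arrives here as an exception
    }
    return n;
}
```

So a loop written as `while (in.get(c))` never exits through its condition in this mode; normal termination has to be handled in the catch block, or the mask limited to badbit.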

Uli

--
FAQ: http://parashift.com/c++-faq-lite/
/* bittersweet C++ */
default: break;

kanze@gabi-soft.fr
Posted: Tue Apr 20, 2004 10:39 pm    Post subject: Re: EOF problem

dtmoore (AT) rijnh (DOT) nl (Dave Moore) wrote in message
news:<306d400f.0404190207.3c4ec8f6 (AT) posting (DOT) google.com>...

Quote:
Now, there are a couple of related issues you should keep in mind:

1) Now the proper way to check for the end of your file is by
checking the state of the istream: e.g.

while (encfile) {//whatever}

or (better IMO)

while (!encfile.eof()) {//whatever}

No. The two do not mean the same thing; in practice, the first is
guaranteed to work, whereas with the second you may occasionally skip
the last element in the file, seeing eof too soon.

Note that both only check the current state of the file. Generally, the
state of the file is only interesting immediately after an attempt to
read, and it tells whether that read was successful. So usually, you
will do something like:

std::string line ;
while ( getline( encfile, line ) ) { ... }

or, for a binary file:

char aByte ;
while ( encfile.get( aByte ) ) { ... }

or

int aByte ;
aByte = encfile.get() ;
while ( aByte != EOF ) {
// ...
aByte = encfile.get() ;
}

A Pascal-like idiom is also possible:

while ( encfile.peek() != EOF ) {
char ch = encfile.get() ;
// ...
}
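
The int-returning form of get() is what keeps every byte value distinct from end of file. A short sketch, with hypothetical function names and a std::istringstream standing in for the encrypted file:

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Counts bytes using the int-returning get(), comparing against eof().
std::size_t count_bytes(std::istream& in) {
    typedef std::istream::traits_type traits;
    std::size_t n = 0;
    for (int c = in.get(); c != traits::eof(); c = in.get()) {
        ++n;
    }
    return n;
}

// True if the byte `value`, read back through get(), differs from eof().
bool byte_differs_from_eof(unsigned char value) {
    std::istringstream in(std::string(1, static_cast<char>(value)));
    return in.get() != std::istream::traits_type::eof();
}
```

Because get() returns each byte as a non-negative int, values such as 0xFF and 0x1A come back as 255 and 26 and cannot be mistaken for the end-of-file marker. Storing the result in a char before comparing would lose that distinction.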

Quote:
2) Now, since you are reading in your data as raw bytes, you must
handle the type-conversion by yourself ... this would most likely be
an issue if you are using a character-type that is longer than 1 byte
(e.g. wchar_t).

For all practical purposes, he's reading bytes, not characters. Which
means:

- he can only use std::ifstream, not std::wifstream,
- he has to open the file using binary, and
- he has to imbue the "C" locale, to ensure that there is no code
translation in the filebuf.

Quote:
For example:

while (!encfile.eof()) {

const size_t SIZE=sizeof(wchar_t);
char raw_wchar[SIZE];
for (int i=0; i<SIZE; ++i)
encfile.get(raw_wchar[i]);

const wchar_t &data = static_cast<const wchar_t &>(*raw_wchar);
// code using data

}

Have you actually tried this? It shouldn't even compile, and the
technique for converting a sequence of bytes to something larger is very
fragile, and will fail in many configurations.

If the encrypted stream in fact contains wchar_t data, then he may or
may not have to convert it, depending on how it was written. But until
now, that wasn't his problem; his problem was that he wasn't recovering
the bytes he wrote out. If the original data was wchar_t, of course, it
depends. It's possible that he has and uses a locale which reads and
writes wchar_t without any real code translation; in this case, there is
no problem. It's also possible, and generally preferable, to write
anything binary as a pure byte stream. In this case, how he reads it
depends on how it was written -- the two have to be compatible.

In fact, if he is dealing with wchar_t internally (and the wchar_t is a
form of Unicode, UTF-16 or UCS-2), the best solution is probably to
write it out as UTF-8, and encrypt that. The way locales were
integrated into the iostream doesn't make this particularly easy,
however, since the conversion is built into the filebuf object, which
writes directly to a file, rather than being a separate filter, which
could then output to an encrypting filter which outputs to the final
filebuf. (Bill Plauger has indicated several times here that Dinkumware
has a set of extensions to streambufs which handle this. I don't know
what they charge for them, but it is certainly less than it would cost
to develop them yourself. There's also a good chance that they work
correctly; given the complexities of codecvt, getting code that actually
works is far from trivial, and the advantage should not be overlooked.)

Quote:
Note that the above code implicitly assumes correct alignment of the
input stream (i.e. that it contains an integral number of wchar_t's)

It also supposes a lot about the way the bytes were written, such as
byte order. Most of what it supposes can't reasonably be counted on,
since it varies greatly from one implementation to the next (as does the
size of a wchar_t, for example -- which is why I would go for UTF-8).

Quote:
.. it would be somewhat safer to check the state of the stream after
each get operation.

That is a prerequisite for correct code. Or more correctly, you check
the state of the stream before using any data you've read.

--
James Kanze GABI Software mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Dave Moore
Posted: Wed Apr 21, 2004 7:29 pm    Post subject: Re: EOF problem

Ulrich Eckhardt <doomster (AT) knuut (DOT) de> wrote

Quote:
Dave Moore wrote:
1) Now the proper way to check for the end of your file is by
checking the state of the istream: e.g.

while (encfile) {//whatever}

or (better IMO)

while (!encfile.eof()) {//whatever}

<Luke Skywalker> Noooooo!!!! </Luke Skywalker>

Seriously, this is dangerous advice. You should always perform a (series of)
read-operation and _AFTERWARDS_ check the streamstate. If the state has the
failbit set, you discard the values you read (or didn't read).

Of course you are right .. I even mentioned this as an afterthought in
my post. Also, on going back and checking my own code to do a similar
task, I found that I used:

char c;
while (infile.get(c)) {//use c}

as correctly pointed out here ... sorry for the error.

Quote:
For formatted input (which is not the case here), you can then decide what
to do by the eofbit. If it is set, you simply reached EOF, else, some
formatting error occurred.

This is exactly what I was trying to convey, since it is no longer
completely trivial for the OP because the EOF char now can mean
something else. I realize that this is not Standard C++, but since
the OP was already having the EOF collision, it is probably relevant
for him.

Quote:

Therefore:

std::istream::int_type c;
while(encfile.get(c))
{
// use 'c' here
}

while (!encfile.eof()) {
const size_t SIZE=sizeof(wchar_t);
char raw_wchar[SIZE];
for (int i=0; i<SIZE; ++i)
encfile.get(raw_wchar[i]);

const wchar_t &data = static_cast<const wchar_t &>(*raw_wchar);

Sorry, but this cast is IMHO wrong(though it might slip in practice).
'*raw_wchar' yields a reference to a char. Your cast then forces it to a
wchar_t reference.

Isn't this exactly what we want? By making the static_cast<const
wchar_t &>, we are simply saying that the memory address represented
by raw_wchar can be used as a (unmodifiable) wchar_t .. of course we
are responsible for making sure that the above statement is true in
all cases, but AFAIK, this sort of thing is usually unavoidable when
dealing with bytewise binary data.

Quote:
Apart from the fact that this should rather be a reinterpret_cast, the cast
should have been done earlier:

wchar_t wc;
encfile.read( reinterpret_cast<char*>(&wc), sizeof wc);
if(encfile)
{ /* use 'wc' here */ }


First of all, the read call above probably won't work for the OP,
because of the EOF collision he is apparently having. Since read uses
EOF as a default terminator, if it is available from the input, it
will terminate the read .. the OP said as much in his original post.

Second, regarding the reinterpret_cast ... I take Stroustrup's advice
to avoid type-casting whenever possible, and when it is necessary to
prefer static_cast over reinterpret_cast (c.f. TC++PL v3, sec. 6.2.7).
In this case, my test case using the static_cast compiles correctly
even at the highest warning level, and test code appears to give the
correct result. I realize this is far from a rigorous test, but I
still cannot see why the static_cast I used was incorrect ... perhaps
you can enlighten me further?

Quote:
BTW: neither of these solutions address endianness or varying sizes of
wchar_t!!! This rather calls for a carefully applied uint*_t, lest the
portable C++ code create non-portable files.

Ok .. endian-ness is, as you say, not addressed, but the use of SIZE
variable I used in my sample code should allow portability to systems
with a different size of wchar_t.

Quote:
Note that the above code implicitly assumes correct alignment of the
input stream (i.e. that it contains an integral number of wchar_t's)
.. it would be somewhat safer to check the state of the stream after
each get operation.

Hmmm, there's a facility to have the stream throw an exception whenever the
streamstate hits fail or eof. Using that seems convenient but I haven't
tried it yet. Are there any drawbacks/caveats to that method?

Uli

kanze@gabi-soft.fr
Posted: Wed Apr 21, 2004 7:30 pm    Post subject: Re: EOF problem

Ulrich Eckhardt <doomster (AT) knuut (DOT) de> wrote


Quote:
Dave Moore wrote:
[...]
const wchar_t &data = static_cast<const wchar_t &>(*raw_wchar);

Sorry, but this cast is IMHO wrong(though it might slip in practice).

If this compiles, then the compiler is seriously broken.

Or is it? I seem to remember some discussion concerning this, and that
the standard literally says "An expression e can be explicitly converted
to a type T using a static_cast of the form static_cast<T>(e) if the
declaration "T t(e);" is well formed, for some invented temporary
variable t. The effect of such an explicit conversion is the same as
performing the declaration and initialization and then using the
temporary variable as the result of the conversion."

In this case, of course "wchar_t const& t( *raw_wchar ) ; " is definitely
legal. It converts the first element of the array raw_wchar to a
wchar_t, putting the results in a temporary variable, and generates a
reference to it. Thus, if the original Unicode character were a wavy
dash ('\u3030'), he will convert it to a '0' ('\u0030'). (Not all
compilers get this right, however. To be truthful, one somehow expects
that when casting an lvalue to an lvalue, the result of the cast refer
to the original lvalue.)

What Dave Moore was probably trying to do requires a reinterpret_cast.
But of course, what he was trying to do doesn't really work, except in a
few special cases.
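
The difference can be seen in a small sketch. The function names are made up, and std::memcpy is used here in place of the reinterpret_cast mentioned above, as a swapped-in way of reinterpreting the bytes that avoids alignment concerns; it still says nothing about byte order or the size of wchar_t.

```cpp
#include <cassert>
#include <cstring>

// What the static_cast does: converts the single char *raw to a wchar_t
// temporary. The remaining bytes of the buffer are never examined.
wchar_t via_static_cast(const char* raw) {
    return static_cast<const wchar_t&>(*raw);
}

// Reinterpreting the bytes instead: copy all sizeof(wchar_t) bytes into a
// real wchar_t object. The result depends on the host byte order.
wchar_t via_memcpy(const char* raw) {
    wchar_t wc;
    std::memcpy(&wc, raw, sizeof wc);
    return wc;
}
```

With a buffer whose bytes are all 0x30, via_static_cast yields 0x30 (the character '0') no matter how wide wchar_t is, whereas via_memcpy yields a value built from every byte in the buffer.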

Quote:
Note that the above code implicitly assumes correct alignment of the
input stream (i.e. that it contains an integral number of wchar_t's)
.. it would be somewhat safer to check the state of the stream after
each get operation.

Hmmm, there's a facility to have the stream throw an exception
whenever the streamstate hits fail or eof. Using that seems convenient
but I haven't tried it yet. Are there any drawbacks/caveats to that
method?

Error conditions in a stream are sticky, so you can read four (or more)
characters, then check the error condition just once.
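
A sketch of that pattern, with a hypothetical helper read_four: four get() calls in a row, followed by a single test of the stream. Once one get() fails, the later ones fail as well, so the one check covers them all.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Attempts four get() calls and tests the stream once afterwards.
// `out` is written only if all four bytes were read.
bool read_four(std::istream& in, char (&out)[4]) {
    char tmp[4] = {0, 0, 0, 0};
    for (int i = 0; i < 4; ++i) {
        in.get(tmp[i]);          // after a failure, later calls fail too
    }
    if (!in) {
        return false;            // a single check covers all four reads
    }
    for (int i = 0; i < 4; ++i) {
        out[i] = tmp[i];
    }
    return true;
}
```

Reading into a scratch buffer and copying only on success keeps partially read data from leaking into the caller's buffer.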

--
James Kanze GABI Software mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

kanze@gabi-soft.fr
Posted: Wed Apr 21, 2004 7:31 pm    Post subject: Re: EOF problem

dtmoore (AT) rijnh (DOT) nl (Dave Moore) wrote in message
news:<306d400f.0404191949.6ca6419 (AT) posting (DOT) google.com>...

Quote:
Actually, after a bit of testing using GCC 3.4.0 and cygwin, I find
that there is no collision between EOF and any of the ASCII character
set ... running John Potters' example code from this thread produces
output "256 256", with any ordering of the ASCII values.

How an implementation maps the actual data in a file to text is
implementation defined. The two most widespread mappings are:

Unix:
One to one, each byte in the file is transmitted as is. The end of
line indicator is the single character LF, which is the same as
'\n', and there is no end of file indicator.

Windows:
The end of line indicator is the two character sequence CRLF, which
generally corresponds to '\r', '\n', and the control Z character
(0x1A) is interpreted as end of file.

Others exist, but tend to have limited use (except maybe for the old Mac
convention, which uses CR ('\r') as an end of line separator).

Note that my calling them Unix and Windows should only be taken as an
indication concerning the environment where they are common; the
mappings are NOT implemented in the OS, but in the compiler libraries,
and there is nothing to prevent a Unix compiler from using the Windows
conventions, and vice-versa. Technically -- if your program writes
using the Windows conventions under Unix, or vice versa, other programs
may have trouble reading what it wrote. By default, I would expect any
compiler to read and write text using the local conventions. (By
default only. Given the widespread use of shared filesystems, there is
a strong argument for allowing some external factor to change this.)

Quote:
So, my solution using get (as opposed to getline or read) is probably
not necessary on all systems ..

Technically, reading binary data as text, or vice versa, is not
specified by the standard. It happens to work under Unix, but just
because something happens to work under certain conditions doesn't mean
that your program is correct.

Quote:
it should work for the OP though ... I had a similar problem a while
back using (compiler != GCC) and that is how I solved it. Also, after
more reading/testing it seems it is not necessary to use
std::ios::binary if you use any of get, getline or read .. they are
for character based, unformatted input anyway.

You are mixing up separate issues. External format is managed at three
different levels in an istream -- to read raw bytes, you have to ensure
transparency at all three levels:

1. The file being read can be opened in binary mode or in text mode.
This basically conditions the representation and the interpretation
of line separators and end of file. To read raw bytes, you *must*
open the file in binary mode. (The default is text.) In binary
mode, you are not guaranteed correct recognition of end of file --
you might get extra bytes at the end, and you cannot reliably break
the input up into lines (read with getline). In practice, this is
not a problem with any modern OS I am familiar with (but I have used
OS's in the past in which binary mode gave you extra bytes at the
end).

2. The bytes read from the file in 1, above, will be translated using
the codecvt facet of the imbued locale. It is necessary to ensure
that the imbued locale uses a transparent translation; that
codecvt::always_noconv() returns true. This is guaranteed for the
"C" locale. This is NOT the default. (The default is the current
global locale. Since you've probably started the program with
something like "std::locale::global( std::locale( "" ) )", this will
typically depend on your external environment. On my machine, using
my usual environment, it just happens that the codecvt is
non-converting too, so if I forget to imbue, I'm very likely not to
notice the error. I suspect that this is the usual situation for
most people living in western Europe or the Americas. But just
because the code seems to work in your local environment doesn't
mean it is correct.)

3. The translated bytes are "interpreted" by istream. An istream can
do two types of input, formatted (using >>) and unformatted
(everything else). Obviously, you read raw bytes using unformatted
input -- get, etc.

Again, the fact that omitting one of these steps happens to work in one
particular environment doesn't mean that the program is correct. It
only means that the error is hidden for the moment.
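
The three levels can be put together in one sketch. The helper names (is_transparent, open_raw, slurp) and the choice to imbue before opening are assumptions made for illustration; the facet query uses the standard codecvt<char, char, mbstate_t> facet.

```cpp
#include <cassert>
#include <cwchar>
#include <fstream>
#include <istream>
#include <locale>
#include <string>

// Level 2 check: true if the locale's narrow codecvt facet never converts.
bool is_transparent(const std::locale& loc) {
    typedef std::codecvt<char, char, std::mbstate_t> facet;
    return std::use_facet<facet>(loc).always_noconv();
}

// Opens `path` (hypothetical name) for raw byte input.
bool open_raw(std::ifstream& in, const std::string& path) {
    in.imbue(std::locale::classic());                        // level 2: the "C" locale
    in.open(path.c_str(), std::ios::in | std::ios::binary);  // level 1: binary mode
    return in.is_open();
}

// Level 3: unformatted input only, one byte at a time.
std::string slurp(std::istream& in) {
    std::string bytes;
    char c;
    while (in.get(c)) {
        bytes.push_back(c);
    }
    return bytes;
}
```

is_transparent(std::locale::classic()) returns true, which is the guarantee described in step 2; passing in.getloc() instead shows whether a stream that was not explicitly imbued happens to be transparent in the current environment.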

Finally, for those who have to deal with older iostreams as well: in the
classic IO streams, there was no step 2, but otherwise, all of the above
holds. (Frankly, I would have expected filebuf to suppress code
translation if the file was opened in binary as well. But that's not
what the standard requires.)

--
James Kanze GABI Software mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Dave Moore
Posted: Wed Apr 21, 2004 7:36 pm    Post subject: Re: EOF problem

kanze (AT) gabi-soft (DOT) fr wrote in message news:<d6652001.0404200355.1e97ca5d (AT) posting (DOT) google.com>...
Quote:
dtmoore (AT) rijnh (DOT) nl (Dave Moore) wrote in message
news:<306d400f.0404190207.3c4ec8f6 (AT) posting (DOT) google.com>...

Now, there are a couple of related issues you should keep in mind:

1) Now the proper way to check for the end of your file is by
checking the state of the istream: e.g.

while (encfile) {//whatever}

or (better IMO)

while (!encfile.eof()) {//whatever}

No. The two do not mean the same thing; in practice, the first is
guaranteed to work, whereas with the second you may occasionally skip
the last element in the file, seeing eof too soon.

Note that both only check the current state of the file. Generally, the
state of the file is only interesting immediately after an attempt to
read, and it tells whether that read was successful. So usually, you
will do something like:

*examples deleted*

Yes, you are right ... I should have been more careful with my
examples .. see my response to the preceding post.

Quote:


2) Now, since you are reading in your data as raw bytes, you must
handle the type-conversion by yourself ... this would most likely be
an issue if you are using a character-type that is longer than 1 byte
(e.g. wchar_t).

For all practical purposes, he's reading bytes, not characters. Which
means:

- he can only use std::ifstream, not std::wifstream,
- he has to open the file using binary, and
- he has to imbue the "C" locale, to ensure that there is no code
translation in the filebuf.

Point taken (at least I think so) .. actually this is beyond my
personal experience, so I didn't know about it. I guess it
invalidates my other post in this thread saying that it is ok to omit
the std::ios::binary when opening the file. However, it seems like
one or the other of the last two options should be enough ... if
std::ios::binary is used, shouldn't the filebuf be left alone in any
case? Conversely, if you imbue the "C" locale, that should ensure
that a character is a byte, in which case bytewise==characterwise.

Quote:

For example:

while (!encfile.eof()) {

const size_t SIZE=sizeof(wchar_t);
char raw_wchar[SIZE];
for (int i=0; i<SIZE; ++i)
encfile.get(raw_wchar[i]);

const wchar_t &data = static_cast<const wchar_t &>(*raw_wchar);
// code using data

}

Have you actually tried this?

Yes, at least for compilation .. I didn't take the time to write a
proper example to see if it gives the correct result.

Quote:
It shouldn't even compile,

Can you please explain why not ... upon initially reading your post I
was worried that I had some misunderstanding, but upon reflection and
reading, I cannot see why my static_cast is invalid. (See also my
comments in the reply to the preceding post about reinterpret_cast).

Quote:
and the
technique for converting a sequence of bytes to something larger is very
fragile, and will fail in many configurations.

Could you please explain in more detail .. excepting the obvious
points about my unsafe error checking of the input stream? I am not
being petulant here .. I really want to understand why my approach is
incorrect (or at least ill-advised), so I don't make a similar error
down the line.

TIA, Dave Moore

Dave Moore
Posted: Fri Apr 23, 2004 12:18 am    Post subject: Re: EOF problem

kanze (AT) gabi-soft (DOT) fr wrote in message news:<d6652001.0404210233.65db3432 (AT) posting (DOT) google.com>...
Quote:
Ulrich Eckhardt <doomster (AT) knuut (DOT) de> wrote in message
news:<c62l2h$79ute$1 (AT) ID-178288 (DOT) news.uni-berlin.de>...

Dave Moore wrote:
[...]
const wchar_t &data = static_cast<const wchar_t &>(*raw_wchar);

Sorry, but this cast is IMHO wrong(though it might slip in practice).

If this compiles, then the compiler is seriously broken.

Or is it? I seem to remember some discussion concerning this, and that
the standard literally says "An expression e can be explicitly converted
to a type T using a static_cast of the form static_cast<T>(e) if the
declaration "T t(e);" is well formed, for some invented temporary
variable t. The effect of such an explicit conversion is the same as
performing the declaration and initialization and then using the
temporary variable as the result of the conversion."

In this case, of course "wchar_t const& t( *raw_wchar ) ; " is definitely
legal. It converts the first element of the array raw_wchar to a
wchar_t, putting the results in a temporary variable, and generates a
reference to it. Thus, if the original Unicode character were a wavy
dash ('\u3030'), he will convert it to a '0' ('\u0030').

Yes, you are exactly right, as I found out myself when I did more
testing ... it is actually a difficult error to detect, since the code
seems correct (at least on a little-endian system) as long as the data
fits entirely into the smaller element, as was the case for my initial
test examples. I finally got wise and tried to reproduce
numeric_limits<wchar_t>::max() using my flawed code, and that is when
the light dawned.
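
A minimal sketch of the effect described above, assuming the bytes in the buffer were produced on the same platform that reads them. The function names are invented for illustration, and std::memcpy stands in here for the reinterpret_cast discussed in the thread because it sidesteps alignment questions; it is not the code anyone posted.

```cpp
#include <cassert>
#include <cstring>

// What the static_cast from the thread computes: a temporary wchar_t
// initialised from the FIRST char only. The value is copied out inside
// the same expression, so no reference to the temporary escapes.
wchar_t first_char_only(const char* raw_wchar) {
    assert(raw_wchar != 0);
    return static_cast<const wchar_t&>(*raw_wchar);
}

// Reassembles the whole object representation instead. This assumes the
// buffer holds sizeof(wchar_t) bytes written with the same size and byte
// order as the reading platform uses.
wchar_t all_bytes(const char* raw_wchar) {
    assert(raw_wchar != 0);
    wchar_t result;
    std::memcpy(&result, raw_wchar, sizeof result);
    return result;
}
```

For a buffer filled from a wchar_t whose value does not fit in one byte, first_char_only returns only the value of raw_wchar[0], while all_bytes returns the original value. This matches the observation that small test values hide the problem on a little-endian machine.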

Quote:
(Not all
compilers get this right, however. To be truthful, one somehow expects
that when casting an lvalue to an lvalue, the result of the cast refer
to the original lvalue.)

This was exactly what confused me ... however I then remembered that
you cannot use static_cast to convert between incompatible pointer
types, and that is when I remembered the bit about the temporary
variable from the standard.

(c.f.
http://groups.google.com/groups?selm=MPG.17df12ae8756e04b9896e0%40news.hevanet.com

and

http://groups.google.com/groups?q=g:thl4012587187d&dq=&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=MPG.17d854d67ee5dbf69896dc%40news.hevanet.com
)

Quote:
What Dave Moore was probably trying to do requires a reinterpret_cast.

Yes, as was pointed out by another poster in this thread. Actually
this was a useful exercise for me, since it crystalized the (or at
least a) difference between static_cast and reinterpret_cast ...
something I had not appreciated from reading descriptions in texts and
in the standard.

Quote:
But of course, what he was trying to do doesn't really work, except in a
few special cases.


I guess I understand why now, having read all of your other posts in
this thread ... thank you for the explanations. If I understood them
correctly, you seem to be saying that there is no way (using Standard
C++) to ensure portability of code that converts back and forth
between binary and formatted data, EVEN IF you can guarantee that
everything will happen on a single (arbitrary) platform. However,
let's assume that I am just trying to solve the problem on a particular
platform where I can find out all of the nitty-gritty details I need
to properly do the conversion. Then, will my approach (using
reinterpret_cast instead of static_cast) still only work in a "few
special cases"? If not, can you point out a better way to do it?

This is a problem looming in my future, since my output-file sizes
(currently ASCII) are getting to be quite large, and I need to think
about using binary format. So, I would really appreciate any further
critical advice you can give on this topic ... simple pointers to
websites would also be helpful. I have done a few online searches,
but none have turned up a truly useful site.

TIA, Dave Moore

Ulrich Eckhardt
Guest





Posted: Fri Apr 23, 2004 12:24 am    Post subject: Re: EOF problem

Dave Moore wrote:
Quote:
Ulrich Eckhardt <doomster (AT) knuut (DOT) de> wrote:
Apart from the fact that this should rather be a reinterpret_cast, the
cast should have been done earlier:

wchar_t wc;
encfile.read( reinterpret_cast<char*>(&wc), sizeof wc);
if(encfile)
{ /* use 'wc' here */ }


First of all, the read call above probably won't work for the OP,
because of the EOF collision he is apparently having. Since read uses
EOF as a default terminator, if it is available from the input, it
will terminate the read .. the OP said as much in his original post.

I don't think so, but I think I need to clarify something first...
EOF is a macro which is inherited from the C API. It is also an abbreviation
for 'end of file', which was the only way I meant to use it. There is no
reason to use the macro with C++ IOStreams.
Now, istream::read() uses the underlying streambuffer's methods and that
streambuffer has a char_traits which has a function called eof(). eof()
returns an int_type, which is bigger than the stream's native char_type
exactly in order to be able to hold all valid values for a char_type _plus_
one special value for signalling the EOF. In other words, the file simply
cannot hold a byte which is equal to char_traits::eof()!
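
The claim can be checked with a small sketch, assuming an ordinary platform where int is wider than char. The function name is hypothetical; to_int_type(), eq_int_type() and eof() are the standard std::char_traits<char> members.

```cpp
#include <cassert>
#include <climits>
#include <string>

// Counts how many char values map to the same int_type as eof().
// On an ordinary implementation the answer is zero.
int bytes_colliding_with_eof() {
    typedef std::char_traits<char> traits;
    int collisions = 0;
    for (int v = CHAR_MIN; v <= CHAR_MAX; ++v) {
        const char c = static_cast<char>(v);
        if (traits::eq_int_type(traits::to_int_type(c), traits::eof()))
            ++collisions;
    }
    return collisions;
}
```

A result of zero means that every byte read from a file stays distinguishable from end of file, provided the value returned by get() is kept in int_type.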

Why and how the OP stumbled across some special value in the file is
something I can only guess. One guess is this:

#define EOF (-1) // common on many systems
char c; // being a signed char on that system
istream& in = cin; // stands for any input stream
c = in.get(); // in.get() returns an int_type, truncated to char
if(c==EOF)
{ ... }

The code above makes a byte with the value 0xff look like the end of the
file. Note that this or similar code could also be hidden in the
stdlibrary, where errors are not unheard of.
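
A hedged sketch of that guess next to the usual fix. All function names are invented for illustration; the only assumption about the OP's program is the one stated above, namely that the result of get() was narrowed to char before the comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Keeps the result of get() in int_type, so a 0xFF byte stays distinct
// from end of file.
std::size_t count_bytes(std::istream& in) {
    typedef std::istream::traits_type traits;
    std::size_t n = 0;
    for (traits::int_type c = in.get();
         !traits::eq_int_type(c, traits::eof());
         c = in.get())
        ++n;
    return n;
}

// The suspected bug: the result is narrowed to char before the test.
// Where char is signed, a 0xFF byte becomes -1 and compares equal to eof().
// The extra !in test stops the loop at the real end of input where char
// is unsigned.
std::size_t count_bytes_truncating(std::istream& in) {
    std::size_t n = 0;
    for (;;) {
        const char c = static_cast<char>(in.get());
        if (!in || c == std::istream::traits_type::eof())
            break;
        ++n;
    }
    return n;
}

// Hypothetical helper for trying either function on an in-memory string.
std::size_t count_in(const std::string& data, bool truncating) {
    std::istringstream in(data);
    return truncating ? count_bytes_truncating(in) : count_bytes(in);
}
```

For the three bytes 'a', 0xFF, 'b', count_bytes sees all three. On a platform with signed char, the truncating version stops at the second byte.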

Quote:
Second, regarding the reinterpret_cast ... I take Stroustrup's advice
to avoid type-casting whenever possible, and when it is necessary to
prefer static_cast over reinterpret_cast (c.f. TC++PL v3, sec. 6.2.7).
In this case, my test case using the static_cast compiles correctly
even at the highest warning level, and test code appears to give the
correct result. I realize this is far from a rigorous test, but I
still cannot see why the static_cast I used was incorrect ... perhaps
you can enlighten me further?

The point is that you only change the type, and for that I use
reinterpret_cast. The reason I would not use static_cast here is that
static_cast applies much more 'force' than reinterpret_cast. The latter only
lets you change the type, using a static_cast with a reference will also let
you cast away constness. Using as few casts as possible is the rule, and
reinterpret_cast is less cast than static_cast with a reference.

Another point was that my solution was less typing. Both points are a matter
of personal taste though.

Quote:
BTW: neither of these solutions address endianness or varying sizes of
wchar_t!!! This rather calls for a carefully applied uint*_t, lest the
portable C++ code create non-portable files.

Ok .. endian-ness is, as you say, not addressed, but the use of SIZE
variable I used in my sample code should allow portability to systems
with a different size of wchar_t.

True. What I wanted to point out was that portable C++ code (which we have
here) is not all. There's also the question of portability of the created
files. If one system uses 16bit, little-endian wchar_t and another uses
32bit or big-endian you won't be able to read files from one another.

Now, let's talk about systems where a char does not have eight bits .... ;)

Uli

--
FAQ: http://parashift.com/c++-faq-lite/

/* bittersweet C++ */
default: break;

kanze@gabi-soft.fr
Guest





Posted: Fri Apr 23, 2004 12:37 am    Post subject: Re: EOF problem

dtmoore (AT) rijnh (DOT) nl (Dave Moore) wrote in message
news:<306d400f.0404210456.4b2b31ae (AT) posting (DOT) google.com>...
Quote:
kanze (AT) gabi-soft (DOT) fr wrote in message
news:<d6652001.0404200355.1e97ca5d (AT) posting (DOT) google.com>...
dtmoore (AT) rijnh (DOT) nl (Dave Moore) wrote in message
news:<306d400f.0404190207.3c4ec8f6 (AT) posting (DOT) google.com>...

[...]
Quote:
2) Now, since you are reading in your data as raw bytes, you must
handle the type-conversion by yourself ... this would most likely
be an issue if you are using a character-type that is longer than
1-byte (e.g. wchar_t).

For all practical purposes, he's reading bytes, not characters.
Which means:

- he can only use std::ifstream, not std::wifstream,
- he has to open the file using binary, and
- he has to imbue the "C" locale, to ensure that there is no code
translation in the filebuf.

Point taken (at least I think so) .. actually this is beyond my
personal experience, so I didn't know about it. I guess it
invalidates my other post in this thread saying that it is ok to omit
the std::ios::binary when opening the file.

It's definitely not valid to omit std::ios::binary; he's reading binary
data, not text.

Quote:
However, it seems like one or the other of the last two options should
be enough ...

It would seem like that should be the case, wouldn't it? That would be
logical -- if I ask for binary data, I get the binary data, and not some
translation of it.

Logical or not, that's not what the standard says. The use of the
codecvt facet is independent of whether the file is opened in binary
mode or not.

Quote:
if std::ios::binary is used, shouldn't the filebuf be left alone in
any case? Conversely, if you imbue the "C" locale, that should ensure
that a character is a byte, in which case bytewise==characterwise.

No. I think I posted the complete explanation in another posting (but
maybe that was in the French group). Basically, when reading from an
ifstream, there are three separate levels of "translation" that have to
be dealt with:

- representation of line/record separators and end of file,
- internal code translation, and
- unformatting (skipping white space, converting to int, etc.).

The first is controlled by the ios::binary flag. Without it, the file is
read in text mode. Under Unix, this means nothing, and the flag is, in
fact, a no-op. Under Windows, it means that any two character sequence
0x0D, 0x0A seen will be converted into a single '\n', and that any 0x1A
will be interpreted as end of file. On some mainframes, it means that
each (possibly fixed-length) record in the file will have any trailing
spaces stripped, and then a '\n' appended. ('\n' characters don't
appear in the file in any form.)

The second is controlled by the imbued locale of the filebuf. This can
become very tricky; if you play around with the non-const version of
rdbuf, for example, it is possible for the imbued locale of the filebuf
to be different from the imbued locale of the istream. By
default, the imbued locale (of both) is the current global locale at the
time the objects are constructed. In practice, it would be a very poor
program that didn't start by setting up the global locale according to
the user's environment (with the exception of servers and such, which run
without an attached terminal, and in a user independent environment).
I'm not sure what the situation is under Windows, but most of the usual
locales under Linux use UTF-8. Which means that if you don't take any
particular precautions, you'll be translating your input from UTF-8 into
(probably) ISO 8859-1 (or ISO 10646, if reading wchar_t).

The third element is controlled by your choice of functions in istream:
the >> operator unformats, the other read functions don't.
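
Taken together, the three levels suggest something like the following sketch: binary mode for the first, the classic "C" locale for the second, and get() instead of >> for the third. The function names and any file name passed in are assumptions made for illustration only.

```cpp
#include <cassert>
#include <fstream>
#include <ios>
#include <locale>
#include <string>
#include <vector>

// Writes every byte value 0..255 once, with no text-mode or locale
// translation on the way out.
void write_all_byte_values(const std::string& path) {
    std::ofstream out;
    out.imbue(std::locale::classic());   // level 2: no code translation
    out.open(path.c_str(), std::ios::out | std::ios::binary);  // level 1
    for (int x = 0; x != 256; ++x)
        out.put(static_cast<char>(x));
}

// Reads the file back byte by byte using unformatted input.
std::vector<unsigned char> read_raw_bytes(const std::string& path) {
    std::ifstream in;
    in.imbue(std::locale::classic());    // imbue before open
    in.open(path.c_str(), std::ios::in | std::ios::binary);
    std::vector<unsigned char> bytes;
    char c;
    while (in.get(c))                    // level 3: get(), not >>
        bytes.push_back(static_cast<unsigned char>(c));
    return bytes;
}
```

Read back this way, a file containing 0x1A and 0xFF should come back with all 256 bytes intact, which is the case the original poster was struggling with.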

Quote:
For example:

while (!encfile.eof()) {

const size_t SIZE=sizeof(wchar_t);
char raw_wchar[SIZE];
for (int i=0; i<SIZE; ++i)
encfile.get(raw_wchar[i]);

const wchar_t &data = static_cast<const wchar_t &>(*raw_wchar);
// code using data

}

Have you actually tried this?

Yes, at least for compilation .. I didn't take the time to write a
proper example to see if it gives the correct result.

It shouldn't even compile,

Can you please explain why not ... upon initially reading your post I
was worried that I had some misunderstanding, but upon reflection and
reading, I cannot see why my static_cast is invalid. (See also my
comments in the reply to the preceding post about reinterpret_cast).

I was taking my wishes for reality. I'm not convinced that the original
intent in this case was to allow this -- IMHO, a static_cast of an
lvalue to an lvalue type should end up referring to the same object, and
I don't like the fact that the cast to wchar_t const& is legal, but the
cast to wchar_t& isn't, either. But that's all personal opinion. What
actually got written into the standard is that for the cast you wrote,
the compiler must generate a temporary of wchar_t type, initialize it
with *raw_wchar, and return a reference to that temporary.

So the cast is legal, it just doesn't do what you want (since it totally
ignores all but the first char in raw_wchar). I find the standard
semantics counterintuitive, and I wouldn't be surprised if
some compilers get it wrong, either forbidding the cast, or worse,
implementing it as if it were a reinterpret_cast. (Sun CC is in the
latter case. Both g++ and VC++ get it right, however.)

Quote:
and the technique for converting a sequence of bytes to something
larger is very fragile, and will fail in many configurations.

Could you please explain in more detail .. excepting the obvious
points about my unsafe error checking of the input stream? I am not
being petulant here .. I really want to understand why my approach is
incorrect (or at least ill-advised), so I don't make a similar error
down the line.

The problem is that it depends too much on the representation of a
wchar_t, which can vary greatly from one implementation to the next, or
even from one version of a compiler to the next. Note, for example,
that Microsoft C/C++ has changed the representation of both long and
long double at various times in the past. And it is so easy to do
correctly: decide what representation you want, and implement it. For
example, if I want to transmit a 32 bit wchar_t as four bytes, in
network byte order, I might write:

dest.put( (word >> 24) & 0xFF ) ;
dest.put( (word >> 16) & 0xFF ) ;
dest.put( (word >> 8) & 0xFF ) ;
dest.put( (word ) & 0xFF ) ;

Note that this results in a strictly defined byte order in the file,
regardless of the byte order on the machine at hand.

Reading just does the reverse:

result = source.get() << 24 ;
result |= source.get() << 16 ;
result |= source.get() << 8 ;
result |= source.get() ;

(Attention: this cannot be written as a single expression. You need
those sequence points.)
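
A self-contained sketch of the same idea, under the assumption that the value fits in 32 bits. The names put_u32_be, get_u32_be and encoded are invented here; the use of std::uint32_t and the end-of-input check are additions to the fragments above.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Writes a 32-bit value as four bytes, most significant byte first
// (network byte order), whatever the byte order of the machine.
void put_u32_be(std::ostream& dest, std::uint32_t word) {
    dest.put(static_cast<char>((word >> 24) & 0xFF));
    dest.put(static_cast<char>((word >> 16) & 0xFF));
    dest.put(static_cast<char>((word >> 8) & 0xFF));
    dest.put(static_cast<char>(word & 0xFF));
}

// Reads four bytes back in the same order. Returns false, leaving
// result untouched, if the input ends before four bytes were read.
bool get_u32_be(std::istream& source, std::uint32_t& result) {
    typedef std::istream::traits_type traits;
    std::uint32_t value = 0;
    for (int i = 0; i != 4; ++i) {
        const traits::int_type c = source.get();
        if (traits::eq_int_type(c, traits::eof()))
            return false;
        assert(c >= 0 && c <= 0xFF);  // get() yields the byte as 0..255
        value = (value << 8) | static_cast<std::uint32_t>(c);
    }
    result = value;
    return true;
}

// Hypothetical helper: the four bytes that would land in the file.
std::string encoded(std::uint32_t word) {
    std::ostringstream dest;
    put_u32_be(dest, word);
    return dest.str();
}
```

Each get() sits in its own statement inside the loop, which keeps the order of the reads well defined. Working in an unsigned type also avoids shifting a byte of 0x80 or more into the sign bit of an int.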

Note too that the only potential problem with the representation that
this handles is byte order. I'm a pragmatist at heart: byte order
varies greatly, and I regularly use machines with different byte orders.
And byte order is easy to handle. Other aspects, such as the
representation of negative values, don't vary in practice on modern
machines (unless perhaps you have to deal with mainframe Unisys
processors), and are very difficult to handle. When I defined my own
external formats, I used signed magnitude, because it is easy to
generate regardless of the internal format; the Internet standard is 2's
complement, however, which corresponds to the internal format on every
machine in production today except Unisys mainframes (which use 1's
complement, I think).

--
James Kanze GABI Software mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
