 |
C++Talk.NET C++ language newsgroups
Seungbeom Kim Guest
|
Posted: Tue May 16, 2006 11:21 am Post subject: peek() vs unget(): which is better? |
|
|
I'm writing a simple lexer. It has to determine when to stop reading
for the current token, and it seems to have basically two options:
(1) peek(), and if valid for the current token, get() and continue
(2) get(), and if not valid for the current token, unget() and continue
Which is better? Or are they equally good?
It seems to me that (1) makes the code more cluttered and incurs two
unformatted input function calls per character. But I have read somewhere
that unget() is not guaranteed to work across buffer boundaries, so
I suspect (2) is rather unsafe though simple. Is this correct?
Comments about any other part of the implementation are welcome, too.
Thank you in advance.
------------------------------------------------------------------------
token get_token(std::istream& is)
{
    typedef std::istream::traits_type traits;
    char c;
    int i;
    // skip whitespaces
    while (is.get(c) && std::isspace(c)) { }
    if (!is) {
        // end of input
    }
    else if (std::isalpha(c)) {
        std::string s(1, c);
        // Approach (1) is used here: peek() and get()
        while ((i = is.peek()) != traits::eof()
               && std::isalnum(c = traits::to_char_type(i))) {
            is.get(c);
            s.push_back(c);
        }
        // got a string
    }
    else if (std::isdigit(c)) {
        std::string s(1, c);
        // Approach (2) is used here: get() and unget()
        while (is.get(c) && std::isdigit(c))
            s.push_back(c);
        if (is) is.unget();
        // got an integer
    }
    // and so on
}
------------------------------------------------------------------------
--
Seungbeom Kim
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ] |
kanze Guest
|
Posted: Wed May 17, 2006 2:21 am Post subject: Re: peek() vs unget(): which is better? |
|
|
Seungbeom Kim wrote:
| Quote: | I'm writing a simple lexer. It has to determine when to stop reading
for the current token, and it seems to have basically two options:
(1) peek(), and if valid for the current token, get() and continue
(2) get(), and if not valid for the current token, unget() and continue
Which is better? Or are they equally good?
|
Better in what sense? For various reasons, I feel more at home
using peek(), so I use peek. Somehow, it seems more rational to
look at the character, then consume it, rather than to consume
it, then put it back. More generally, I think it's usually
clearer in the code when you need a new character than it is
when you have to reject the character you already have. (And of
course: would you systematically write *iter++, and then do
--iter if you found you'd gone too far, or would you write
*iter, and then ++iter when you wanted to advance.)
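The iterator analogy can be made concrete with a small sketch. The function name scan_word is an assumption made for this example; it looks at *it first and advances only when the character belongs to the token.

```cpp
#include <cctype>
#include <string>

// Collects a leading run of alphanumeric characters from [it, end).
// `it` is advanced only past characters that were accepted.
std::string scan_word(std::string::const_iterator& it,
                      std::string::const_iterator end)
{
    std::string word;
    while (it != end && std::isalnum(static_cast<unsigned char>(*it))) {
        word.push_back(*it);  // look at the character ...
        ++it;                 // ... and only then consume it
    }
    return word;
}
```

When the loop stops, the iterator still refers to the first rejected character, so there is nothing to undo.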
| Quote: | It seems to me that (1) makes the code more cluttered and
incurs two unformatted input function calls per character. But I
have read somewhere that unget() is not guaranteed to work
across buffer boundaries, so I suspect (2) is rather unsafe
though simple. Is this correct?
|
At least one character of unget() is guaranteed. Typically,
unget() will work as long as you don't cross buffer boundaries,
but this isn't guaranteed. (For that matter, the input might be
unbuffered -- which means in practice a single character
buffer.)
| Quote: | Comments about any other part of the implementation are
welcome, too. Thank you in advance.
------------------------------------------------------------------------
token get_token(std::istream& is)
{
    typedef std::istream::traits_type traits;
    char c;
    int i;
    // skip whitespaces
    while (is.get(c) && std::isspace(c)) { }
|
Which results in undefined behavior. You can't call the
one-parameter version of isspace with a char, and expect to get
away with it. (In practice, both Solaris and the ctype.h used
by g++ under Linux make it work for all characters except 'ÿ'.
But it's still undefined behavior according to the standard.)
I'd write:
    while ( isspace( is.peek() ) ) {
        is.get() ;
    }
More likely, I'd write something a little more complicated,
using std::ctype, so that my code would be independent of the
global locale. But I'd definitely use peek() like this.
Unless, of course, performance raised its head. In that case,
I'd use the streambuf directly, e.g.:
    streambuf* sb = is.rdbuf() ;
    if ( sb == NULL ) {
        // Handle error, probably shouldn't happen...
    }
    while ( isspace( sb->sgetc() ) ) {
        sb->sbumpc() ;
    }
Typically, the low level streambuf functions are inline, and
have a very low cost, but if for some reason, I didn't want to
call them more than necessary :
    int lookAhead = sb->sgetc() ;
    while ( isspace( lookAhead ) ) {
        lookAhead = sb->snextc() ;
    }
The use of a variable lookAhead and sb->snextc() is probably the
fastest solution available, and IMHO, is also very readable.
The one place you have to watch out is to ensure that eofbit
gets set in is if you see an end of file here.
Using <locale>, of course, this would become:
    typedef std::ctype< char >
                        CType ;
    CType const&        ctype
        = std::use_facet< CType >( std::locale::classic() ) ;
    //  or
    //  = std::use_facet< CType >( is.getloc() ) ;
    //  depending on whether you are imposing an encoding, or
    //  you want to accept that of the file.
    int lookAhead = sb->sgetc() ;
    while ( lookAhead != EOF
            && ctype.is( CType::space, (char)lookAhead ) ) {
        lookAhead = sb->snextc() ;
    }
(In a stand-alone application, I'd probably force the global
C-style locale, and use ::isspace( int ). Unless I wanted to
handle different input encodings. But then, neither <locale> nor
<locale.h> are much help; in UTF-8, the multibyte encoding 0xC2,
0xA0 is a space, for example.)
The rest should follow from the strategy used in skipping
blanks. Just be careful -- you have three different ways to
check for the type of a character in C++, and the simplest
(which you are apparently trying to use) doesn't work with a
variable of type char. Basically, it's:
    ::isxxx( int ch ) / ::iswxxx( wint_t ch )
        Requires ch == EOF || (ch >= 0 && ch <= UCHAR_MAX) for the
        char version. All functions return 0 for EOF, which can be
        used to avoid an external check. Depends on the global
        locale -- depending on the application, that's either not a
        problem, or it can cause all sorts of problems.

    template< typename charT >
    std::isxxx( charT ch, locale const& )
        Defined only if charT is char or wchar_t, doesn't work for
        EOF (because EOF is not representable in a character type),
        and requires two parameters. I suspect that it's also
        fairly slow; it must call std::use_facet for each
        invocation.
        In fact, I think this one was only designed for occasional
        use.

    template< typename charT >
    std::ctype< charT >.is( std::ctype_base::mask test, charT ch )
        Defined only if charT is char or wchar_t. Requires
        explicitly extracting the ctype facet from the locale
        beforehand. Doesn't work for EOF.
        There are also functions in std::ctype for scanning over
        characters which are/are not xxx. Regretfully, they only
        work on charT const*, which makes them pretty useless here
        (and in just about any code I write).
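The three ways of testing a character can be put side by side in a short sketch. The wrapper names space_c, space_locale and space_facet are made up for this example; the calls inside them are the standard ones described above.

```cpp
#include <cctype>
#include <locale>

// (1) C-style function: the argument must be EOF or an unsigned char value,
//     so a plain char is converted through unsigned char first.
inline bool space_c(char ch)
{
    return std::isspace(static_cast<unsigned char>(ch)) != 0;
}

// (2) Two-argument template from <locale>: takes the character type itself.
inline bool space_locale(char ch, const std::locale& loc)
{
    return std::isspace(ch, loc);
}

// (3) Facet obtained once by the caller, then reused for every character.
inline bool space_facet(char ch, const std::ctype<char>& facet)
{
    return facet.is(std::ctype_base::space, ch);
}
```

For (3) the caller would typically write something like std::use_facet< std::ctype<char> >( std::locale::classic() ) once, outside the scanning loop.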
None of the above handle multibyte encodings, like UTF-8; the
only way to do that within standard C++ is to read from a
wistream, with the appropriate locale to convert the UTF-8 into
Unicode (UCS-4), and use the wchar_t versions of the above
functions. Supposing such a locale exists in your
implementation, of course. (And that it supports UCS-4 -- in some
implementations, wchar_t is only 16 bits, which makes such
support impossible.)
--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Thomas Richter Guest
|
Posted: Wed May 17, 2006 2:21 am Post subject: Re: peek() vs unget(): which is better? |
|
|
Seungbeom Kim wrote:
| Quote: | I'm writing a simple lexer. It has to determine when to stop reading
for the current token, and it seems to have basically two options:
(1) peek(), and if valid for the current token, get() and continue
(2) get(), and if not valid for the current token, unget() and continue
Which is better? Or are they equally good?
It seems to me that (1) makes the code more cluttered and incurs two
unformatted input function calls per character. But I have read somewhere
that unget() is not guaranteed to work across buffer boundaries, so
I suspect (2) is rather unsafe though simple. Is this correct?
|
No. unget() is guaranteed to un-do one get(), and only that. But
this part is perfectly safe. What you cannot do is to un-get more
than one character at a time, i.e.:
get(), get(), unget(), unget() might or might not work, whereas
get(), unget(), get(), unget() is fine.
So long,
Thomas
Carl Barron Guest
|
Posted: Wed May 17, 2006 2:21 am Post subject: Re: peek() vs unget(): which is better? |
|
|
In article <e4bkdg$4$1 (AT) news (DOT) Stanford.EDU>, Seungbeom Kim
<musiphil (AT) bawi (DOT) org> wrote:
| Quote: | I'm writing a simple lexer. It has to determine when to stop reading
for the current token, and it seems to have basically two options:
(1) peek(), and if valid for the current token, get() and continue
(2) get(), and if not valid for the current token, unget() and continue
Which is better? Or are they equally good?
It seems to me that (1) makes the code more cluttered and incurs two
unformatted input function calls per character. But I have read somewhere
that unget() is not guaranteed to work across buffer boundaries, so
I suspect (2) is rather unsafe though simple. Is this correct?
Comments about any other part of the implementation are welcome, too.
Thank you in advance.
|
You are not using formatted input, so I'd drop down to the streambuf.
Since the loops are sequential, a std::istreambuf_iterator<char>
provides an input iterator [one pass through the input], and using
istreambuf_iterators allows simple for loops to implement the scanning
loops. The only time you need to put back a char is if the loops below
exit with begin != end [begin == end means you have either an eof or an
input error (bad disk etc.)].
For example:
#include <streambuf>
#include <string>
#include <iterator>
#include <cctype>

const int STRING_TOKEN = 256;
const int INT_TOKEN = 257;
const int EOF_TOKEN = 258;

int lexer(std::streambuf *sb, std::string &value)
{
    value.clear();
    std::istreambuf_iterator<char> begin(sb), end;
    // skip initial whitespace
    // (the <cctype> functions need an unsigned char value, hence the casts)
    while (begin != end && std::isspace((unsigned char)(*begin)))
        ++begin;
    if (begin != end)
    {
        if (std::isalpha((unsigned char)(*begin)))
        {
            value += *begin;
            for (++begin; begin != end && std::isalnum((unsigned char)(*begin)); ++begin)
                value += *begin;
            if (begin != end)   // put invalid char back
                sb->sungetc();
            return STRING_TOKEN;
        }
        else if (std::isdigit((unsigned char)(*begin)))
        {
            value += *begin;
            for (++begin; begin != end && std::isdigit((unsigned char)(*begin)); ++begin)
                value += *begin;
            if (begin != end)   // put invalid char back
                sb->sungetc();
            return INT_TOKEN;
        }
        else
            return *begin;      // other chars +- etc.
    }
    return EOF_TOKEN;           // end of input
}
kanze Guest
|
Posted: Wed May 17, 2006 3:21 pm Post subject: Re: peek() vs unget(): which is better? |
|
|
Carl Barron wrote:
| Quote: | In article <e4bkdg$4$1 (AT) news (DOT) Stanford.EDU>, Seungbeom Kim
<musiphil (AT) bawi (DOT) org> wrote:
I'm writing a simple lexer. It has to determine when to
stop reading for the current token, and it seems to have
basically two options:
(1) peek(), and if valid for the current token, get() and continue
(2) get(), and if not valid for the current token, unget() and
continue
Which is better? Or are they equally good?
It seems to me that (1) makes the code more cluttered and incurs two
unformatted input function calls per character. But I have read somewhere
that unget() is not guaranteed to work across buffer boundaries, so
I suspect (2) is rather unsafe though simple. Is this correct?
You are not using formatting so I'd drop to streambuffer
and since it is sequential in loops an
std::istreambuf_iterator<char> provides an input iterator
[one pass thru the input].
|
That's an interesting idea. I'd probably keep the istream at
the interface level, however, and make sure I set failbit (and
eofbit) in it when appropriate. At least in more or less
generic code, which I expected other people to use -- if the
code is within a project, and I know that only the lexer will be
used to read from the file, it probably isn't worth bothering
about.
My real question, however, is what the istreambuf_iterator buys
you compared to using the streambuf functions sgetc(), sbumpc()
and snextc()? Particularly as you are using the old, C-style
functions from <ctype.h>, which can be passed the results of the
streambuf functions directly, and handle EOF implicitly. (If
you're using std::ctype<char>::is(), it becomes more a question
of taste, since you have to test for end of file separately
anyway. And while I don't particularly like the two iterator
idiom here, it's probably a lot better known amongst "average"
C++ programmers than streambuf is, and the actual names of the
streambuf functions don't make things any easier for those that
don't know it. On the other hand, it's still one extra level of
abstraction which doesn't really do anything.)
| Quote: | using istreambuf_iterators allow simple for loops to
implement the loops. The only time you need to put back a
char is if the loops below exit with begin != end [begin ==
end means either you have an eof or an input error [bad
disk etc...]
|
That's also true for the peek()/get() idiom, if used correctly.
There's an almost 100% correspondence :
    iterator    istream      streambuf
    *in         in.peek()    in->sgetc()
    *in ++      in.get()     in->sbumpc()
    *++ in      ---          in->snextc()
Of course, most of the time, you'll probably end up just
incrementing, e.g. ++ in, or ignoring the return value of
in.get() or in->sbumpc(). In this sense, the iterator is
perhaps marginally clearer -- but I still prefer using a
sentinel value for EOF, and not having to test for it
separately.
--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Carl Barron Guest
|
Posted: Thu May 18, 2006 11:21 pm Post subject: Re: peek() vs unget(): which is better? |
|
|
In article <1147873052.622734.238390 (AT) j33g2000cwa (DOT) googlegroups.com>,
kanze <kanze@gabi-soft.fr> wrote:
| Quote: |
My real question, however, is what the istreambuf_iterator buys
you compared to using the streambuf functions sgetc(), sbumpc()
and snextc()? Particularly as you are using the old, C-style
functions from <ctype.h>, which can be passed the results of the
streambuf functions directly, and handle EOF implicitly
|
The iterator approach is probably easier to follow and write off the
cuff. :)
Code that uses sbumpc() and <cctype> directly, with do-while loops,
can save both space and time.
If <locale>'s ctype is used, then hold a const reference to the ctype
facet and test chars via this reference, rather than via the convenience
functions that look like those in <cctype>. In that case, less is gained
by directly calling sb->sbumpc() over the iterator approach.
kanze Guest
|
Posted: Sat May 20, 2006 9:21 pm Post subject: Re: peek() vs unget(): which is better? |
|
|
Carl Barron wrote:
| Quote: | In article <1147873052.622734.238390 (AT) j33g2000cwa (DOT) googlegroups.com>,
kanze <kanze@gabi-soft.fr> wrote:
My real question, however, is what the istreambuf_iterator buys
you compared to using the streambuf functions sgetc(), sbumpc()
and snextc()? Particularly as you are using the old, C-style
functions from <ctype.h>, which can be passed the results of the
streambuf functions directly, and handle EOF implicitly
The iterator approach is probably easier to follow and
write off the cuff.
|
Not for me. Using two iterators, as opposed to a single object?
Having to make a separate test for EOF, rather than it being a
sentinel value that is read like any other?
Of course, if you're not 100% familiar with streambuf, the names
of the functions aren't necessarily going to help understanding
what is going on:-).
| Quote: | The code using sbumpc() and <cctype> directly and using do
while loops can shorten space and time.
If <locale>'s ctype is used then hold a const reference of
the ctype facet and test chars via this reference, rather than
convenience functions that look like <cctype>. Then less is
gained by directly accessing sb->sbumpc() over the iterator
approach.
|
You still gain in only needing a single object, rather than two.
For a long time, I've used an IteratorStreambuf< Iter > for
parsing, in order to handle the case where the data was
delivered to me in the form of an iterator. More recently, I
devised a ParserSource hierarchy, to avoid all of the extra
overhead (buffering, etc.) of a streambuf when I actually had
iterators. But the ParserSource conforms to the streambuf idiom
of 1) reference semantics (necessary if virtual functions are to
work) and 2) returning an int_type, and using a sentinel value
to signal the end.
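The actual ParserSource code is not shown in this thread, so the following is only a guess at what such an interface might look like. Every name in it (the members peek() and next(), the END sentinel, the IteratorSource class, count_leading) is hypothetical; the only points taken from the description are reference semantics through virtual functions and an int return value with a sentinel for the end.

```cpp
#include <string>

// Hypothetical interface: results are either a character value in the
// range of unsigned char, or the sentinel END.
class ParserSource {
public:
    enum { END = -1 };
    virtual ~ParserSource() {}
    virtual int peek() = 0;   // current character, not consumed
    virtual int next() = 0;   // consume current, return the following one
};

// Hypothetical adapter over a pair of iterators, with no buffering layer.
template <typename Iter>
class IteratorSource : public ParserSource {
public:
    IteratorSource(Iter begin, Iter end) : cur_(begin), end_(end) {}
    virtual int peek()
    {
        return cur_ == end_ ? END : static_cast<unsigned char>(*cur_);
    }
    virtual int next()
    {
        if (cur_ != end_) ++cur_;
        return peek();
    }
private:
    Iter cur_;
    Iter end_;
};

// Counts leading characters equal to `ch`; the sentinel alone ends the loop,
// so no separate end-of-input test is needed.
int count_leading(ParserSource& src, char ch)
{
    int n = 0;
    for (int c = src.peek(); c == static_cast<unsigned char>(ch); c = src.next())
        ++n;
    return n;
}
```

A parser written against ParserSource& then works unchanged whether the data comes from iterators or, with another derived class, from a streambuf.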
--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34