C++Talk.NET C++ language newsgroups
dima@inotech.ru Guest
Posted: Sat Sep 24, 2005 2:04 am Post subject: Reading data from file in accordance with regular expression
Hello All!
I want to read data from a file according to a regular expression.
For example:
.....
.....
std::ifstream ifs("very_big_file.txt");
std::string str;
while ( str = ifs.GetLineByRegEx("^[NM]+$") )
{
std::cout << str << std::endl;
}
.....
.....
The standard C++ class std::ifstream doesn't provide this feature :(
How can I implement this in C++?
Very important:
I can't load the whole file into memory, because it is VERY big!
I want to read records directly from the file (via a stream, etc.).
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Timo Geusch Guest
Posted: Sun Sep 25, 2005 1:43 pm Post subject: Re: Reading data from file in accordance with regular expres
dima (AT) inotech (DOT) ru wrote:
> Hello All!
> I want to read data from file in accordance with regular
> expression.
> For example:
> ....
> ....
> std::ifstream ifs("very_big_file.txt");
> std::string str;
> while ( str = ifs.GetLineByRegEx("^[NM]+$") )
> {
> std::cout << str << std::endl;
> }
Why not split the operation in two, read a line from the file using the
provided operations and then use something like boost::regex to check if
the line matches the expression before processing it?
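A minimal sketch of this two-step approach follows. It assumes the input really is line-oriented and uses std::regex (standard since C++11) in place of boost::regex so that it has no external dependency; the helper name for_each_matching_line is hypothetical. Only one line is held in memory at a time.

```cpp
#include <istream>
#include <regex>
#include <string>

// Hypothetical helper: reads `in` one line at a time and calls
// `on_match(line)` for every line that matches `re` in full.
// The file is never loaded as a whole; only the current line is buffered.
template <typename Callback>
void for_each_matching_line(std::istream& in, const std::regex& re,
                            Callback on_match)
{
    std::string line;
    while (std::getline(in, line)) {
        // regex_match succeeds only if the entire line matches.
        if (std::regex_match(line, re)) {
            on_match(line);
        }
    }
}
```

With a std::ifstream opened on the file and the pattern "^[NM]+$", the callback would simply print each line. This does not help if a single line is itself too large to buffer.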
dima@inotech.ru Guest
Posted: Tue Sep 27, 2005 11:28 am Post subject: Re: Reading data from file in accordance with regular expres
Because the input file may contain only one very, very big line!
kanze Guest
Posted: Wed Sep 28, 2005 11:07 am Post subject: Re: Reading data from file in accordance with regular expres
dima (AT) inotech (DOT) ru wrote:
> I want to read data from file in accordance with regular
> expression.
> For example:
> ....
> ....
> std::ifstream ifs("very_big_file.txt");
> std::string str;
> while ( str = ifs.GetLineByRegEx("^[NM]+$") )
> {
> std::cout << str << std::endl;
> }
> ....
> ....
> Standard C++ class std::ifstream doesn't provide this feature :(
> How to implement my problem using C++?
> Very important:
> I can't store the file in memory fully, because it is VERY big!
> I want to read records directly from file (over stream, etc...).
Well, boost::regex requires a bidirectional iterator, so you
can't use an istream_iterator with it.
The match functions in the current implementation of my regular
expression class don't use templates (one takes a
std::string::const_iterator), because I have to support an older
compiler where they caused problems:-); I have compiled a
version with a templated iterator in the past, using g++,
however (and will give it another try soon -- we've upgraded Sun
CC, so maybe this time).
I have never tested it with anything but a random access
iterator, but the algorithm I use should support input
iterators, and a quick glance at the code suggests that it does;
the only operation I perform on the iterator is *begin++. I do
save it at times, but the saved value is only used as part of
the result.
Note that you would still have several hurdles to jump:
-- The algorithm is greedy, returning the longest sequence
which matches, which means that it will read characters up
until it has seen a sequence which doesn't match. And it
will be impossible to reread those characters later; if they
should be part of the following match, you're out of luck.
-- You cannot use the iterator returned (the end of match), nor
a copy of the iterator passed in. The normal way of
obtaining the matched string is saving the initial iterator,
then using the two-iterator constructor of std::string with
the initial iterator and the returned end of match iterator,
but this doesn't work with input iterators. You would
probably have to design your own iterator, based on an
istream_iterator, which appends the character to a string
each time it is dereferenced.
For file streams, it should be possible to design a forward
iterator, using seekg() and tellg(). Unless done very, very
carefully, however, this will be either very slow, with a seek
for each character access, or will contain subtle errors -- you
probably need some sort of shared state for all iterators over
the same file. On the other hand, I think that the "saving"
iterator described in the second point, above, is pretty
straightforward: it would contain the istream_iterator (or an
istreambuf_iterator), and a pointer to an std::string; all
operations except * simply forward directly to the
istream_iterator, and operator* does a push_back on the string
with the results before returning it. (I'll bet that there is
something in Boost iterator adapters which would make this a two
or three liner.)
Two final comments, concerning my regular expression class:
-- I have finally acquired a web site, and will be putting my
library up on it in the next week or two. (There were a
couple additional things I wanted to do, but I figure that
even in its current state, it's better than nothing.) So
you will be able to download the code. (I'll post an
announcement here when it is actually available.)
-- I haven't found some wondrous new algorithm that allows me
to do things the Boost regular expressions don't. My
regular expression class doesn't begin to support all of the
things that Boost's does -- most significantly, in this
case, it doesn't support saved groupings (using the matched
contents of (...) later). By doing less, it has more
freedom in the implementation. Given the Boost
implementation, I would retire it, except that it does do
one thing that Boost can't (given the options they support):
build a full DFA and dump the state tables for use later in
a different program. (The fact that it can work with an
input iterator is a side effect of the fact that it uses a
DFA in the implementation. Another side effect is that it
cannot support reuse of matched (...). Yet another case
where one size doesn't fit all.)
--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34