Question on best way to parse tab-delimited file

 
George Slappy
Posted: Fri Feb 20, 2004 3:11 pm    Post subject: Question on best way to parse tab-delimited file



[Long time lurker, first time caller.]

Ladies and gentlemen,

I've done a few searches across this and other newsgroups for
information on parsing tab-delimited files, and from the excellent
answers provided on how to read them in, but I could use some advice
on the best way to go for a current hobby project I'm using to teach
myself C++.

Among other things, I have a set of 16-20 tab-delim. text files that I
search multiple times to get various information. Currently, I use
ifstreams, which means I have to incur overhead as I hit the disk each
time (neglecting buffering) I rewind and re-search. I was thinking
about loading the entire files into memory (< 5MB for all together).

The problem is, they all have a few text columns and a few numeric
ones. Plus, each file has a first line that is a header. (OK, that's
easy - skip the first line when reading the file.) How would you
suggest I proceed?

- Should I make a set of N vector<string> objects for each file?
- Should I read into an NxM old-school array of strings? (sounds
kludgy)
- Some other C++ guru whiz-bang gee-whiz approach that's even
better?

My criteria, again, are to be able to load all files into memory
(without hard-coding dimensionality of the files, as they may
*possibly* change), be able to search a given column for a value and
then index across to another column in the same row to read a
corresponding value.

I can provide more details if necessary, but anything you can suggest
to help me get a handle on this would be well appreciated. I've
learned a lot so far, including iostreams/fstreams/strstreams and so
on, and I'm really excited!

Thanks folks,
George

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Ulrich Eckhardt
Posted: Sat Feb 21, 2004 4:12 am    Post subject: Re: Question on best way to parse tab-delimited file



George Slappy wrote:
Quote:
I've done a few searches across this and other newsgroups for
information on parsing tab-delimited files, and from the excellent
answers provided on how to read them in, but I could use some advice
on the best way to go for a current hobby project I'm using to teach
myself C++.

I assume that values are tab-delimited and records are newline
delimited (i.e. they are rows). It's mostly a matter of how you implement
operator>>.

Quote:
The problem is, they all have a few text columns and a few numeric
ones.

I assume that all rows still have the same layout, or can at least be
represented by the same C++ datatype.

Quote:
Plus, each file has a first line that is a header.[...]
How would you suggest I proceed?

- Should I make a set of N vector<string> objects for each file?
[...]
- Some other C++ guru whiz-bang gee-whiz approach that's even
better?

My criteria, again, are to be able to load all files into memory
(without hard-coding dimensionality of the files, as they may
*possibly* change), be able to search a given column for a value and
then index across to another column in the same row to read a
corresponding value.

// performance
std::ios_base::sync_with_stdio(false);
// the binary is necessary so that the backend can use
// memory mapping easier
std::ifstream file("file.txt", std::ios_base::binary);
// the only piece of error-handling I'll show here
validate_header(file);

// copy to a container
// you need operator>> overloaded for struct row
std::istream_iterator<row> begin(file), end;
std::list<row> rows(begin, end);

// functor to find a first column with a certain value
struct comp_first_column
{
    // first column is numeric
    comp_first_column(int n) : m_n(n) {}
    bool operator()(row const& r) const
    { return r.col1 == m_n; }
    int m_n; // value to look for
};
std::list<row>::iterator it =
    std::find_if(rows.begin(), rows.end(),
                 comp_first_column(42));
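
The snippet above relies on a row type with an overloaded operator>>, which the post does not show. What follows is only a minimal sketch of what that could look like, assuming a hypothetical three-column layout (an int, a text field, a double). The member names col2 and col3 and the layout itself are assumptions for illustration; only col1 appears in the post.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical record layout: numeric key, text field, numeric value.
struct row
{
    int col1;
    std::string col2;
    double col3;
};

// Read one tab-delimited line into a row. On a malformed line the
// stream's failbit is set, which also stops std::istream_iterator<row>.
std::istream& operator>>(std::istream& in, row& r)
{
    std::string line;
    if (!std::getline(in, line))
        return in;                       // end of input or read error

    std::istringstream fields(line);
    std::string c1, c3;
    if (std::getline(fields, c1, '\t') &&
        std::getline(fields, r.col2, '\t') &&
        std::getline(fields, c3))
    {
        std::istringstream n1(c1), n3(c3);
        if ((n1 >> r.col1) && (n3 >> r.col3))
            return in;                   // all three fields parsed
    }
    in.setstate(std::ios_base::failbit); // could not parse this line
    return in;
}
```

Splitting with std::getline and '\t' keeps spaces inside the text column intact, which plain operator>> on a std::string would not.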

hth

Uli

--
FAQ: http://parashift.com/c++-faq-lite/

/* bittersweet C++ */
default: break;


Francis Glassborow
Posted: Sat Feb 21, 2004 4:13 am    Post subject: Re: Question on best way to parse tab-delimited file



In message <b361e50.0402192011.6ea30808 (AT) posting (DOT) google.com>, George
Slappy <slappy_g (AT) hotmail (DOT) com> writes
Quote:
The problem is, they all have a few text columns and a few numeric
ones. Plus, each file has a first line that is a header. (OK, that's
easy - skip the first line when reading the file.) How would you
suggest I proceed?

My personal choice in your circumstances would be to read each one into
a stringstream. Now you can use those stringstream objects almost
exactly as you did your file stream ones.
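
One way this could look, as a sketch only: copy the file's buffer into a std::stringstream once, then rewind the in-memory stream for each new search. The function names load_into_stream and rewind_stream are made up for this illustration.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Copy a whole file into an in-memory stream. Returns false if the file
// cannot be opened; `out` is left untouched in that case.
// Note: an empty file leaves failbit set on `out`.
bool load_into_stream(const std::string& path, std::stringstream& out)
{
    std::ifstream file(path.c_str(), std::ios_base::binary);
    if (!file)
        return false;
    out << file.rdbuf();   // bulk copy of the file's contents
    return true;
}

// Rewind an in-memory stream so it can be searched again from the top.
void rewind_stream(std::istream& in)
{
    in.clear();            // drop eofbit/failbit left by the last pass
    in.seekg(0);
}
```

The clear() call matters: once a read loop has run to the end, the stream stays in a failed state and seekg alone would not make it readable again.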

--
Francis Glassborow ACCU
Author of 'You Can Do It!' see http://www.spellen.org/youcandoit
For project ideas and contributions: http://www.spellen.org/youcandoit/projects


Alexander J. Oss
Posted: Sat Feb 21, 2004 11:05 am    Post subject: Re: Question on best way to parse tab-delimited file

"George Slappy" <slappy_g (AT) hotmail (DOT) com> wrote

[snip]
Quote:
Among other things, I have a set of 16-20 tab-delim. text files that I
search multiple times to get various information. Currently, I use
ifstreams, which means I have to incur overhead as I hit the disk each
time (neglecting buffering) I rewind and re-search. I was thinking
about loading the entire files into memory (< 5MB for all together).

How about loading the file into a single string object, and then creating an
istringstream from that string, and using the same istream logic you used on
the ifstream?
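
Later code in this thread calls a helper named GetFileContent without showing it. A possible version, assuming it takes a file name and returns the whole file as a std::string (the signature and the choice to throw on failure are guesses, not something the posts specify):

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: return the entire contents of a file as one string.
// Throws std::runtime_error if the file cannot be opened.
std::string GetFileContent(const std::string& filename)
{
    std::ifstream file(filename.c_str(), std::ios_base::binary);
    if (!file)
        throw std::runtime_error("cannot open " + filename);

    // istreambuf_iterator reads raw characters, so whitespace is kept.
    return std::string(std::istreambuf_iterator<char>(file),
                       std::istreambuf_iterator<char>());
}
```

The returned string can then be handed to an istringstream, so the existing istream-based parsing keeps working unchanged.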

George Slappy
Posted: Sat Feb 21, 2004 7:20 pm    Post subject: Re: Question on best way to parse tab-delimited file

slappy_g (AT) hotmail (DOT) com (George Slappy) wrote in message news:<b361e50.0402192011.6ea30808 (AT) posting (DOT) google.com>...
Quote:
[Long time lurker, first time caller.]

Ladies and gentlemen,

I've done a few searches across this and other newsgroups for
information on parsing tab-delimited files, and from the excellent
answers provided on how to read them in, but I could use some advice
on the best way to go for a current hobby project I'm using to teach
myself C++.

Among other things, I have a set of 16-20 tab-delim. text files that I
search multiple times to get various information. Currently, I use
ifstreams, which means I have to incur overhead as I hit the disk each
time (neglecting buffering) I rewind and re-search. I was thinking
about loading the entire files into memory (< 5MB for all together).

The problem is, they all have a few text columns and a few numeric
ones. Plus, each file has a first line that is a header. (OK, that's
easy - skip the first line when reading the file.) How would you
suggest I proceed?

- Should I make a set of N vector<string> objects for each file?
- Should I read into an NxM old-school array of strings? (sounds
kludgy)
- Some other C++ guru whiz-bang gee-whiz approach that's even
better?

My criteria, again, are to be able to load all files into memory
(without hard-coding dimensionality of the files, as they may
*possibly* change), be able to search a given column for a value and
then index across to another column in the same row to read a
corresponding value.

I can provide more details if necessary, but anything you can suggest
to help me get a handle on this would be well appreciated. I've
learned a lot so far, including iostreams/fstreams/strstreams and so
on, and I'm really excited!

Thanks folks,
George

Guys,

Thanks for the answers posted so far!

Ulrich - I'm going to sit down and review yours today (I'm still a bit
of a newbie, so I'm pretty sure I see what you have done, but I want
to make sure I've got my brain around it.)

Francis and Alexander - you both suggest I simply read the file into a
stringstream (via a string). This makes sense, and it would get my
files loaded into RAM, which is good. The only problem I have, and
maybe I'm just getting lazy as hell (but I don't *think* I am), is
that to search a particular column for a string becomes (linearly?)
time intensive. Essentially, if I have the layout:

Col1 Col2 Col3 Col4 ....
a b c d
a2 b2 d c
d c a3 b4

I can't think of a simple way to search for c in row 2(column 4).
Essentially, my modus operandi has been to read in a line, get the
offset of tab N and tab N+1 - using string.find() - and then see if
the value matches. My reason for asking about vectors was that I
could simply search through vector<string> col4.

But I'll go with your kind suggestions, since after seeing the depth
of some of the posts you've made here, it's blatantly obvious how well
you know this stuff compared to me. :)

Thanks again, and let me know if you have any alternate ideas.

Alexander J. Oss
Posted: Sun Feb 22, 2004 1:43 am    Post subject: Re: Question on best way to parse tab-delimited file


"George Slappy" <slappy_g (AT) hotmail (DOT) com> wrote

Quote:
Francis and Alexander - you both suggest I simply read the file into a
stringstream (via a string). This makes sense, and it would get my
files loaded into RAM, which is good. The only problem I have, and
maybe I'm just getting lazy as hell (but I don't *think* I am), is
that to search a particular column for a string becomes (linearly?)
time intensive. Essentially, if I have the layout:

Col1 Col2 Col3 Col4 ....
a b c d
a2 b2 d c
d c a3 b4

I can't think of a simple way to search for c in row 2(column 4).
Essentially, my modus operandi has been to read in a line, get the
offset of tab N and tab N+1 - using string.find() - and then see if
the value matches. My reason for asking about vectors was that I
could simply search through vector<string> col4.

You can load your vector<string> structure while using the istream logic
reading from the istringstream.

<uncompiled and untested, assuming four elements per line>
string fileContent(GetFileContent(filename));
istringstream is(fileContent);
vector<vector<string> > v;
unsigned i = 0;
string element;
// test the extraction itself, so the last element is kept even
// when the stream reaches end-of-file while reading it
while (is >> element)
{
    if (i == 0)
        v.push_back(vector<string>());
    v.back().push_back(element);
    ++i;
    if (i >= 4)
        i = 0;
}

If you didn't know the number of elements per line, you could do something
like

vector<vector<string> > v;
string fileContent(GetFileContent(filename));
istringstream is(fileContent);
string line;
while (getline(is, line))
{
    v.push_back(vector<string>());
    istringstream lineStream(line);
    string element;
    // note: operator>> splits on any whitespace, not only tabs
    while (lineStream >> element)
        v.back().push_back(element);
}

Of course, if you're working with a tab-delimited file, you could take a
look at boost::tokenizer when reading lines from the file.
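
For readers without Boost, here is a rough standard-library alternative (not boost::tokenizer itself) that splits on tab characters only, so fields containing spaces or empty fields survive. The names Row, Table, split_tabs and parse_table are invented for this sketch, and skipping the first line assumes the header row described in the original question.

```cpp
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::string> Row;
typedef std::vector<Row> Table;

// Split one line on tab characters only; empty fields are kept.
Row split_tabs(const std::string& line)
{
    Row fields;
    std::string::size_type start = 0;
    for (;;)
    {
        std::string::size_type tab = line.find('\t', start);
        if (tab == std::string::npos)
        {
            fields.push_back(line.substr(start));   // last field
            break;
        }
        fields.push_back(line.substr(start, tab - start));
        start = tab + 1;
    }
    return fields;
}

// Parse tab-delimited text into rows, skipping the header line.
Table parse_table(const std::string& text)
{
    Table table;
    std::istringstream in(text);
    std::string line;
    std::getline(in, line);              // discard the header
    while (std::getline(in, line))
    {
        if (!line.empty() && line[line.size() - 1] == '\r')
            line.erase(line.size() - 1); // tolerate CRLF line endings
        table.push_back(split_tabs(line));
    }
    return table;
}
```

Rows may end up with different lengths if the file is ragged, so code that indexes a column should check the row's size first.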

Allan Odgaard
Posted: Sun Feb 22, 2004 10:58 am    Post subject: Re: Question on best way to parse tab-delimited file

slappy_g (AT) hotmail (DOT) com (George Slappy) wrote in message news:<b361e50.0402210856.4b9456ed (AT) posting (DOT) google.com>...
Quote:
slappy_g (AT) hotmail (DOT) com (George Slappy) wrote in message news:<b361e50.0402192011.6ea30808 (AT) posting (DOT) google.com>...

- Should I make a set of N vector<string> objects for each file?
- Should I read into an NxM old-school array of strings? (sounds
kludgy)

I would go with the matrix -- although not use a C array. The best is
to write your own matrix class (or find one on the net).

The reason is that you probably would like to a) have different types
for each row (e.g. int, std::string etc.) and b) you want to provide
both row and column iterators (the C array can only provide one of
these).

If you get a container geared toward the queries you need to make, it
will simplify the code which works with the container, and that, I
would think, is the main priority. I do not see any advantage in the
other suggestions (other than having the file cached, but the OS will
probably do that for you anyway).
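
To make "a container geared toward the queries" concrete, here is a small sketch of such a class. It is only one possible shape, assuming every field is stored as a string; the class name Table and its member functions are hypothetical, chosen to match the two queries in the original question (find a value in a column, then read another column of the same row).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container: rows of string fields plus two simple queries.
class Table
{
public:
    static const std::size_t npos = static_cast<std::size_t>(-1);

    void add_row(const std::vector<std::string>& fields)
    { rows_.push_back(fields); }

    std::size_t row_count() const { return rows_.size(); }

    // Index of the first row whose column `col` equals `value`, or npos.
    // Rows too short to have that column are skipped.
    std::size_t find_in_column(std::size_t col,
                               const std::string& value) const
    {
        for (std::size_t r = 0; r != rows_.size(); ++r)
            if (col < rows_[r].size() && rows_[r][col] == value)
                return r;
        return npos;
    }

    // Field at (row, col); throws std::out_of_range on a bad index.
    const std::string& at(std::size_t row, std::size_t col) const
    { return rows_.at(row).at(col); }

private:
    std::vector<std::vector<std::string> > rows_;
};
```

The search is linear in the number of rows. If the same column is queried very often, an index such as a std::map from value to row number might be worth adding, but for files of a few megabytes the simple loop is probably enough.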

Quote:
maybe I'm just getting lazy as hell (but I don't *think* I am), is
that to search a particular column for a string becomes (linearly?)
time intensive. Essentially, if I have the layout:

Col1 Col2 Col3 Col4 ....
a b c d
a2 b2 d c
d c a3 b4

I can't think of a simple way to search for c in row 2(column 4).

I agree with you. Using the simple C matrix on the other hand will
help, e.g. if we swap the normal order of cols and rows you can use
std::find to find values like this:

// setup
std::string m[COLS][ROWS];
parse_file(filename, m);

// "iterators" for column number 'col'
std::string* first = &m[col][0];
std::string* last = &m[col][ROWS];

std::string* res = std::find(first, last, "c");
if(res != last)
printf("found in row: %d\n", int(res - first));

Here I use raw pointers as iterators (you can write templated helper
functions to obtain them).

An alternative to the C-style matrix is "vector<vector<string> > m",
which would make it dynamic. For efficiency you would probably init it
with "vector<vector<string> >(ROWS, vector<string>(COLS))" (that
allocates memory for ROWS x COLS elements).

You can then get column iterators using "m[row].begin()" and
"m[row].end()" -- m.begin() and m.end() would iterate vectors (each
representing an entire column), by using something like boost's
transform iterator (http://www-eleves-isia.cma.fr/documentation/BoostDoc/boost_1_29_0/libs/utility/transform_iterator.htm)
you can transform that to an actual value in the column, and thereby
be able to apply all the standard algorithms on both rows and columns
of your matrix.

Again I would suggest writing helper functions to actually create
these iterators.

So in the end you should be able to write code like:

// empty slots in column 5
int emptyCnt = std::count(begin_col(5, m), end_col(5, m), "");

// sum of elements in row 3
int sum = std::accumulate(begin_row(3, m), end_row(3, m), 0);

hmm... the latter requires the elements to be ints... either a
transform iterator should be used or simply std::transform:

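// sum of elements in row 3
// (the output iterator comes before the function; atoi takes a
// const char*, so std::string elements need a small wrapper)
vector<int> v;
std::transform(begin_row(3, m), end_row(3, m),
back_inserter(v), atoi);
int sum = std::accumulate(v.begin(), v.end(), 0);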

Here std::transform invokes atoi on each element and stores the result
in v, on which we then calculate the sum...
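
A compilable version of the same idea, as a sketch: because std::atoi expects a const char*, a small wrapper is needed when the elements are std::string. The names to_int and sum_row are made up here, and the row is assumed to be a vector<string> rather than the result of the begin_row/end_row helpers.

```cpp
#include <algorithm>
#include <cstdlib>
#include <iterator>
#include <numeric>
#include <string>
#include <vector>

// std::atoi expects a const char*, so adapt it to std::string.
int to_int(const std::string& s)
{
    return std::atoi(s.c_str());
}

// Sum the fields of one row, treating each field as a decimal integer.
// Fields that are not numbers count as 0, which is how atoi behaves.
int sum_row(const std::vector<std::string>& row)
{
    std::vector<int> values;
    std::transform(row.begin(), row.end(),
                   std::back_inserter(values), to_int);
    return std::accumulate(values.begin(), values.end(), 0);
}
```

Note the argument order of std::transform: input range first, then the output iterator, then the function.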

George Slappy
Posted: Sun Feb 22, 2004 5:32 pm    Post subject: Re: Question on best way to parse tab-delimited file

Duff (AT) DIKU (DOT) DK (Allan Odgaard) wrote in message news:<689e217.0402212210.79c57ded (AT) posting (DOT) google.com>...
Quote:
I would go with the matrix -- although not use a C array. The best is
to write your own matrix class (or find one on the net).

The reason is that you probably would like to a) have different types
for each row (e.g. int, std::string etc.) and b) you want to provide
both row and column iterators (the C array can only provide one of
these).

If you get a container geared toward the queries you need to make, it
will simplify the code which work with the container, and that, I
would think, is the main priority. I do not see any advantage in the
other suggestions (other than having the file cached, but the OS will
probably do that for you anyway).

Your reasoning makes sense. I had noticed one of the other followups
to my posts contained a vector<vector<string> > in it. This makes
sense to me as a good way to go, since both dimensions are dynamic.
Hopefully the container overhead is not going to be too high.

<code section removed>

Quote:
You can then get column iterators using "m[row].begin()" and
"m[row].end()" -- m.begin() and m.end() would iterate vectors (each
representing an entire column), by using something like boost's
transform iterator (http://www-eleves-isia.cma.fr/documentation/BoostDoc/boost_1_29_0/libs/utility/transform_iterator.htm)
you can transform that to an actual value in the column, and thereby
be able to apply all the standard algorithms on both rows and columns
of your matrix.

Thank you very much for the link! I'd seen people talking about
boost::whatever and was never quite sure where to look for that stuff.
I'll check out that page quite a bit more today!

I appreciate your help! I'm starting to feel a bit better about
getting a handle on this task I set for myself. In two weeks, I've
gotten to the point of binary file I/O, arbitrary bit-field
reads/writes, and file stream I/O. I may not be the fastest learner
around, but this is really starting to be fun!

Thanks much,
George

Allan Odgaard
Posted: Mon Feb 23, 2004 11:53 am    Post subject: Re: Question on best way to parse tab-delimited file

slappy_g (AT) hotmail (DOT) com (George Slappy) wrote in message news:<b361e50.0402220853.4d89d4b4 (AT) posting (DOT) google.com>...
Quote:
Duff (AT) DIKU (DOT) DK (Allan Odgaard) wrote in message news:<689e217.0402212210.79c57ded (AT) posting (DOT) google.com>...

Thank you very much for the link! I'd seen people talking about
boost::whatever and was never quite sure where to look for that stuff.

In the future, Google Is Your Friend! :)

Quote:
I'll check out that page quite a bit more today!

Actually, I was thinking that the transform_iterator trick would also
work with native C arrays, so I tried it, but I could not make the
boost iterator work.

So I ended up writing my own -- I included it below.

template <class UnaryFunction, class Iterator>
struct trans_iter : public std::iterator<
typename std::iterator_traits<Iterator>::iterator_category,
typename UnaryFunction::result_type,
typename std::iterator_traits<Iterator>::difference_type>
{
typedef trans_iter self;
typedef typename UnaryFunction::result_type value_type;

Iterator cur;
UnaryFunction func;

trans_iter (Iterator it, UnaryFunction f) : cur(it), func(f) { }

self& operator++ () { ++cur; return *this; }
self& operator-- () { --cur; return *this; }
self operator+ (int i) { return self(cur+i, func); }
self operator- (int i) { return self(cur-i, func); }
self& operator+= (int i) { cur += i; return *this; }
self& operator-= (int i) { cur -= i; return *this; }
int operator- (self const& rhs) const { return cur - rhs.cur; }

value_type operator* () const { return func(*cur); }
value_type operator[] (int i) const { return func(cur[i]); }

bool operator== (self const& rhs) const { return cur == rhs.cur; }
bool operator!= (self const& rhs) const { return cur != rhs.cur; }
bool operator< (self const& rhs) const { return cur < rhs.cur; }
};

This however, is probably not the interesting thing, and maybe you can
get the boost version to work (which may have more functionality/be
better implemented/whatever).

The other part was creating the actual iterators (for a 2D C array),
first there are the row iterators, these are easy, as we will just use
pointers to the begin/end of the row. This can be done like this:

template <typename T, int ROWS, int COLS>
T* begin_row (int row, T(&m)[ROWS][COLS])
{ return &m[row][0]; }

template <typename T, int ROWS, int COLS>
T* end_row (int row, T(&m)[ROWS][COLS])
{ return &m[row][COLS]; }

Next are the tricky column iterators. What we are going to do is
iterate the 2D array as a 1D array of arrays -- then we need a
function to pull out the actual element we want (from this "inner"
array). The function is written as an "adaptable function":

template <typename T, int COLS>
struct select_col : std::unary_function<T(&)[COLS], T>
{
int col;
select_col (int c) : col(c) { }
T operator() (T(&v)[COLS]) const { return v[col]; }
};

I.e. we are given a reference to an array of size COLS and we return
the col'th element.

Now to construct the column iterators:

template <typename T, int ROWS, int COLS>
trans_iter<select_col<T, COLS>, T(*)[COLS]>
begin_col (int col, T(&m)[ROWS][COLS])
{
    return trans_iter<select_col<T, COLS>, T(*)[COLS]>(&m[0], select_col<T, COLS>(col));
}

template <typename T, int ROWS, int COLS>
trans_iter<select_col<T, COLS>, T(*)[COLS]>
end_col (int col, T(&m)[ROWS][COLS])
{
    return trans_iter<select_col<T, COLS>, T(*)[COLS]>(&m[ROWS], select_col<T, COLS>(col));
}

The trick lies in turning the reference to the 2D array with type
T(&)[R][C] into a pointer to a 1D array with type T(*)[C], that way,
when we increment that pointer, we actually skip the entire 1D array
(of size C).
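
The pointer-to-row stepping can be seen in isolation in the following sketch. It is restricted to int matrices for brevity, and the function names ints_per_step and sum_column are invented for the illustration.

```cpp
#include <cstddef>

// How many ints one increment of a pointer-to-row moves past.
// `int (*)[COLS]` points at a whole row, so the step is COLS ints.
template <std::size_t ROWS, std::size_t COLS>
std::size_t ints_per_step(int (&)[ROWS][COLS])
{
    int (*p)[COLS] = 0;                 // only used inside sizeof
    return sizeof(*p) / sizeof(int);
}

// Read column `col` by stepping a pointer-to-row down the matrix.
template <std::size_t ROWS, std::size_t COLS>
int sum_column(int (&m)[ROWS][COLS], std::size_t col)
{
    int sum = 0;
    for (int (*p)[COLS] = &m[0]; p != &m[0] + ROWS; ++p)
        sum += (*p)[col];               // same column, next row
    return sum;
}
```

For a matrix declared as int m[3][4], each ++p advances by four ints, so (*p)[col] visits m[0][col], m[1][col] and m[2][col] in turn.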

And that is basically it... now the fun stuff can start:

// declare 2D matrix
const int ROWS = 20, COLS = 10;
int m[ROWS][COLS];

// fill with 1's and then do a partial sum
fill(begin_row(0, m), begin_row(ROWS, m), 1);
partial_sum(begin_row(0, m), begin_row(ROWS, m), begin_row(0, m));

// copy row 2 to stdout
copy(begin_row(2, m), end_row(2, m),
ostream_iterator<int>(cout, ", "));
cout << endl;

// copy col 2 to stdout
copy(begin_col(2, m), end_col(2, m),
ostream_iterator<int>(cout, ", "));
cout << endl;

// create a string matrix of same size
string strM[ROWS][COLS];

// transform the int matrix to ascii representation
transform(begin_row(0, m), begin_row(ROWS, m),
begin_row(0, strM), as_string);

// using this function
string as_string (int i)
{
string s = "<";
return isprint(i) ? (s += i, s += ">") : ("<?>");
}

// function to dump the entire matrix to stdout
template <typename T, int ROWS, int COLS>
void dump (T(&m)[ROWS][COLS])
{
for(int row = 0; row != ROWS; ++row)
{
copy(begin_row(row, m), end_row(row, m),
ostream_iterator<T>(cout, ", "));
cout << endl;
}
}

// so dump the string matrix
dump(strM);

There is one problem with the above -- the row iterators do support
assignment, but the column iterators do not! That is because the
select_col functor returns by-value, it can be changed to return
by-reference, but that will break the trans_iter, because it defines
(by means of std::iterator) pointer and reference types (with * and &)
based on the value_type (which is now a reference) -- you can modify
trans_iter to not use * and & for these types, this will make it a
little hardcoded to the above, but it will make it work.

I don't know if there is a better way?!?
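
One possible alternative, offered as a sketch rather than as the answer: instead of adapting a functor, write a dedicated column iterator that holds a pointer-to-row plus a fixed column index and dereferences to a real T&. That makes assignment through the iterator work. The class name col_iter and the helper fill_col are hypothetical; begin_col and end_col here are new versions written for this iterator, and only forward iteration is implemented.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Column iterator for a built-in 2D array. It walks a pointer-to-row down
// the matrix and dereferences to a real T&, so assignment works.
template <typename T, std::size_t COLS>
class col_iter
{
public:
    typedef std::forward_iterator_tag iterator_category;
    typedef T value_type;
    typedef std::ptrdiff_t difference_type;
    typedef T* pointer;
    typedef T& reference;

    col_iter() : row_(0), col_(0) {}
    col_iter(T (*row)[COLS], std::size_t col) : row_(row), col_(col) {}

    reference operator*() const { return (*row_)[col_]; }
    pointer operator->() const { return &(*row_)[col_]; }
    col_iter& operator++() { ++row_; return *this; }
    col_iter operator++(int) { col_iter old(*this); ++row_; return old; }

    bool operator==(const col_iter& rhs) const { return row_ == rhs.row_; }
    bool operator!=(const col_iter& rhs) const { return row_ != rhs.row_; }

private:
    T (*row_)[COLS];      // current row
    std::size_t col_;     // fixed column within each row
};

template <typename T, std::size_t ROWS, std::size_t COLS>
col_iter<T, COLS> begin_col(std::size_t col, T (&m)[ROWS][COLS])
{ return col_iter<T, COLS>(&m[0], col); }

template <typename T, std::size_t ROWS, std::size_t COLS>
col_iter<T, COLS> end_col(std::size_t col, T (&m)[ROWS][COLS])
{ return col_iter<T, COLS>(&m[0] + ROWS, col); }

// Example use: overwrite every element of one column.
template <typename T, std::size_t ROWS, std::size_t COLS>
void fill_col(std::size_t col, T (&m)[ROWS][COLS], const T& value)
{ std::fill(begin_col(col, m), end_col(col, m), value); }
```

The comparison operators look only at the row pointer, so iterators for different columns of the same row compare equal; that is fine as long as begin and end are always created for the same column.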

All times are GMT.