C++Talk.NET Forum Index
C++ language newsgroups

Please Help!!more string manipulation Qs...in C++

 
C++Talk.NET Forum Index -> C++ language (comp.lang.c++)
Hp
Posted: Tue Oct 25, 2005 2:45 am    Post subject: Please Help!!more string manipulation Qs...in C++

Hi All,
Thanks a lot for all your replies.

My requirement is as follows:
I need to read a text file, eliminate certain special characters (like !
, - = +), convert the text to lower case, and then remove certain
stopwords (like and, a, an, by, the, etc.), which are listed in another
text file.
Then I need to run it through a stemmer (a program which converts words
like "running" to "run", i.e., reduces them to their root words).
Then I need to create a term-by-document matrix, that is, a
matrix where M(i,j) gives the number of times the term j occurs
in the document i.
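
For illustration only, a term-by-document matrix of this kind could be built roughly as in the sketch below. The names `TermDocMatrix` and `buildMatrix` are made up for this example, and the sketch assumes every document has already been cleaned, stemmed, and split into words.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical container for this illustration: terms[j] names column j,
// counts[i][j] is how often term j occurs in document i.
struct TermDocMatrix {
    std::vector<std::string> terms;
    std::vector<std::vector<int> > counts;
};

// Each inner vector is assumed to hold the already-cleaned words of one document.
TermDocMatrix buildMatrix(const std::vector<std::vector<std::string> >& docs)
{
    // First pass: collect every distinct term (std::map keeps them sorted).
    std::map<std::string, std::size_t> column;
    for (std::size_t i = 0; i < docs.size(); ++i)
        for (std::size_t k = 0; k < docs[i].size(); ++k)
            column[docs[i][k]] = 0;

    // Assign each term its column index in sorted order.
    TermDocMatrix m;
    for (std::map<std::string, std::size_t>::iterator it = column.begin();
         it != column.end(); ++it) {
        it->second = m.terms.size();
        m.terms.push_back(it->first);
    }

    // Second pass: count occurrences, one row per document.
    m.counts.assign(docs.size(), std::vector<int>(m.terms.size(), 0));
    for (std::size_t i = 0; i < docs.size(); ++i)
        for (std::size_t k = 0; k < docs[i].size(); ++k)
            ++m.counts[i][column[docs[i][k]]];
    return m;
}
```

With this layout, `m.counts[i][j]` plays the role of M(i,j), and `m.terms[j]` tells you which term column j stands for.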

My situation as of now is as below:
I have read the file contents into a string variable, replaced
the special characters with a space using the replace function, and
then converted the string completely to lower case using the transform
function.

I would really appreciate any help. Thanks in advance.

Thanks,
Hp

Bob Hairgrove
Posted: Tue Oct 25, 2005 11:03 am    Post subject: [OT] Please Help!!more string manipulation Qs...in C++

On 24 Oct 2005 19:45:33 -0700, "Hp" <prasanna.hariharan (AT) gmail (DOT) com>
wrote:

Quote:
[full quote of the original post snipped]

Is this homework??? Sure sounds like it.

If not, why do you have to use C++ at all? Perl or awk, using regular
expressions, is probably much easier for something like this.

At any rate, your question has to do with algorithms, not with the
language itself. Therefore, it is off-topic in this NG.

--
Bob Hairgrove
NoSpamPlease (AT) Home (DOT) com

Hp
Posted: Tue Oct 25, 2005 5:49 pm    Post subject: Re: Please Help!!more string manipulation Qs...in C++

It is a project where I'm stuck at a particular point and I don't know
how to proceed. I know the algorithm; it's just the implementation that
I can't get, and hence it deserves a post in the C++ newsgroups.
Hey Bob, I would appreciate a solution to my question and can do
without unnecessary comments!

int2str@gmail.com
Posted: Tue Oct 25, 2005 5:56 pm    Post subject: Re: Please Help!!more string manipulation Qs...in C++

Hp wrote:
Quote:
It is a project where I'm stuck at a particular point and I don't know
how to proceed. I know the algorithm; it's just the implementation that
I can't get, and hence it deserves a post in the C++ newsgroups.
Hey Bob, I would appreciate a solution to my question and can do
without unnecessary comments!

Why don't you show some code?
You have not shown any code for any of your "project" problems.

Do something! Get stuck, then ask questions!

The comments you get are not unnecessary. You are on a C++ _language_
newsgroup. Figure something out. Post again when you have _specific_
problems with a language construct, and not a "write my program for me"
request!

Cheers,
Andre


Hp
Posted: Tue Oct 25, 2005 6:05 pm    Post subject: Re: Please Help!!more string manipulation Qs...in C++

I am sorry not to have posted my code, I apologize for that.

Here is the code:
-----------------------------------------------------------------
#include <iostream>
#include <string>
#include <fstream>
#include <vector>
#include <set>
#include <algorithm>
#include <cctype>

using namespace std;
using std::string;

int main(int argc, char *argv[])
{
using std::cout;
using std::endl;


int var_len;

FILE *fp;
long len;
char *buf;
fp=fopen("01t.txt","rb");
fseek(fp,0,SEEK_END);
len=ftell(fp);
fseek(fp,0,SEEK_SET);
buf=(char *)malloc(len);
fread(buf,len,1,fp);
fclose(fp);
string file;
file=buf;

cout<< file <<endl;
vector<string> files;
vector<string> punct;//Vector of strings to remove the punctuations from each files
cout<<"This is a sample program"<<endl;
punct.push_back(",");punct.push_back(":");punct.push_back(";");
punct.push_back("'");
punct.push_back("'");punct.push_back("=");punct.push_back("-");
punct.push_back(".");punct.push_back(",");punct.push_back(",");

for (int i=0;i<punct.size();i++)
{
cout<<punct[i]<<endl;
}

std::replace(file.begin(),file.end(),',','');
std::replace(file.begin(),file.end(),';',' ');
std::replace(file.begin(),file.end(),':','');
std::replace(file.begin(),file.end(),'-',' ');
std::replace(file.begin(),file.end(),'=','');
std::replace(file.begin(),file.end(),'+',' ');
std::replace(file.begin(),file.end(),')','');
std::replace(file.begin(),file.end(),'(',' ');
std::replace(file.begin(),file.end(),'&','');
std::replace(file.begin(),file.end(),'!',' ');
std::replace(file.begin(),file.end(),'.','');
std::replace(file.begin(),file.end(),'/',' ');
//Removing single and double quotes
std::replace(file.begin(),file.end(),'\'','');
std::replace(file.begin(),file.end(),'"',' ');

std::transform(file.begin(),file.end(),file.begin(),tolower);



/*if((pos=file.find(remword,0))!=string::npos)
{
file.erase(pos,remword.length());
}
cout << "After removing 'the'" << endl;
*/

}
-----------------------------------------------------------------------------------

int2str@gmail.com
Posted: Tue Oct 25, 2005 6:31 pm    Post subject: Re: Please Help!!more string manipulation Qs...in C++

Hp wrote:
Quote:
I am sorry not to have posted my code, I apologize for that.

Here is the code:

_Compiling_ code would be nice, too...

Quote:
using namespace std;
using std::string;

This is redundant. If you include the full namespace (std), you don't
need to list the individual ones. Pick one.

Quote:
int var_len;

Unused?

Quote:

FILE *fp;
long len;
char *buf;
fp=fopen("01t.txt","rb");
fseek(fp,0,SEEK_END);
len=ftell(fp);
fseek(fp,0,SEEK_SET);
buf=(char *)malloc(len);
fread(buf,len,1,fp);
fclose(fp);
string file;
file=buf;

This code is pretty much unreadable. You should not mix variable
declaration with code to read in a file like that. Some error checking
would be useful as well.
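
As a rough illustration of the error checking mentioned above, reading a whole file into a string might look like the sketch below. It uses `std::ifstream` instead of `fopen()`, and the function name `readFile` is made up for this example. Note also that the quoted code assigns `buf` to a `std::string` although the buffer is never null-terminated (`malloc(len)` leaves no room for a terminator), so the stream-based version sidesteps that problem as well.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Reads the whole file into `out`; returns false if the file cannot be opened.
// (readFile is an illustrative name, not part of any library.)
bool readFile(const std::string& path, std::string& out)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        return false;           // report the failure instead of crashing later

    std::ostringstream buffer;
    buffer << in.rdbuf();       // copy the entire stream contents
    out = buffer.str();
    return true;
}
```

The caller can then test the return value and print an error message before doing any further processing.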

fopen() feels very "C". You could use a more C++ approach here, like
"ifstream".

Quote:
vector<string> files;

Unused?

Quote:
vector<string> punct;//Vector of strings to remove the punctuations
from each files

Looks like you fill this vector but then decided to replace them all
manually anyway?

It may be simpler (if you don't want to use boost::regex) to put all the
unwanted characters into a simple string (not a vector) and iterate
over that.
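
One possible reading of that suggestion, as a minimal sketch: keep the unwanted characters in a single string and let `find_first_of` locate each occurrence. The function name `blankOut` is made up for this example.

```cpp
#include <string>

// Replaces every character of `text` that appears in `unwanted` with a space.
void blankOut(std::string& text, const std::string& unwanted)
{
    std::string::size_type pos = text.find_first_of(unwanted);
    while (pos != std::string::npos) {
        text[pos] = ' ';
        pos = text.find_first_of(unwanted, pos + 1);
    }
}
```

A call such as `blankOut(file, ",.!?;:=()+-'\"&/");` would then replace the whole series of `std::replace` calls.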

Quote:
std::replace(file.begin(),file.end(),',','');

You can't replace with a non-character...

Quote:
std::transform(file.begin(),file.end(),file.begin(),tolower);

"tolower" is unfortunately ambiguous. You'll have to cast it like this:


std::transform(file.begin(),file.end(),file.begin(),(int(*)(int))std::tolower);
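
An alternative to the cast, sketched here under the assumption that a small helper function is acceptable: wrap the call in a plain function, which has exactly one signature and so cannot be ambiguous. The names `lowerChar` and `toLowerInPlace` are made up for this example.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// A plain function has a single signature, so std::transform can deduce it.
// The cast to unsigned char avoids undefined behaviour for negative char values.
char lowerChar(char c)
{
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Converts the whole string to lower case in place.
void toLowerInPlace(std::string& s)
{
    std::transform(s.begin(), s.end(), s.begin(), lowerChar);
}
```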

Quote:
/*if((pos=file.find(remword,0))!=string::npos)
{
file.erase(pos,remword.length());
}
*/

You'll need a loop here. A single if won't do.

Cheers,
Andre


Hp
Posted: Wed Oct 26, 2005 4:15 am    Post subject: Re: Please Help!!more string manipulation Qs...in C++

int2str (AT) gmail (DOT) com wrote:
Quote:
Hp wrote:
I am sorry not to have posted my code, I apologize for that.

Here is the code:

_Compiling_ code would be nice, too...

using namespace std;
using std::string;

This is redundant. If you include the full namespace (std), you don't
need to list the individual ones. Pick one.

int var_len;

Unused?

I had declared it for future use.

Quote:
FILE *fp;
long len;
char *buf;
fp=fopen("01t.txt","rb");
fseek(fp,0,SEEK_END);
len=ftell(fp);
fseek(fp,0,SEEK_SET);
buf=(char *)malloc(len);
fread(buf,len,1,fp);
fclose(fp);
string file;
file=buf;

This code is pretty much unreadable. You should not mix variable
declaration with code to read in a file like that. Some error checking
would be useful as well.

fopen() feels very "C". You could use a more C++ approach here, like
"ifstream".

vector<string> files;

Unused?

I had used this vector to read a set of files, reading each file into a
string, which gives me a vector of strings for the files that I need to
read and modify.

Quote:
vector<string> punct;//Vector of strings to remove the punctuations
from each files

Looks like you fill this vector but then decided to replace them all
manually anyway?

It may be simpler (if you don't want to use boost::regex) to put all the
unwanted characters into a simple string (not a vector) and iterate
over that.

Thank you, I think I will do this.

Quote:
std::replace(file.begin(),file.end(),',','');

You can't replace with a non-character...

This is a typo. I do replace it with a space, but the space got lost
while cutting and pasting.

Quote:
std::transform(file.begin(),file.end(),file.begin(),tolower);

"tolower" is unfortunately ambiguous. You'll have to cast it like this:

std::transform(file.begin(),file.end(),file.begin(),(int(*)(int))std::tolower);

Ironically, the code I have written works :-).

Quote:
/*if((pos=file.find(remword,0))!=string::npos)
{
file.erase(pos,remword.length());
}
*/

You'll need a loop here. A single if won't do.

The above piece of code doesn't work. I had initialized remword = "the",
but it was removing 'the' from 'there' too, which I don't want. Also, I
want all the occurrences of it to be removed, which I can achieve
through a loop.
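
Both points (removing every occurrence, but leaving 'there' alone) could be handled roughly as in the sketch below. It assumes that a "word" is delimited by non-alphanumeric characters or by the ends of the string; the names `isBoundary` and `removeWord` are made up for this example.

```cpp
#include <cctype>
#include <string>

// True if pos lies outside the text or holds a non-alphanumeric character.
static bool isBoundary(const std::string& text, std::string::size_type pos)
{
    return pos >= text.size()
        || !std::isalnum(static_cast<unsigned char>(text[pos]));
}

// Erases every whole-word occurrence of `word` from `text`.
void removeWord(std::string& text, const std::string& word)
{
    if (word.empty())
        return;

    std::string::size_type pos = 0;
    while ((pos = text.find(word, pos)) != std::string::npos) {
        const bool startOk = (pos == 0) || isBoundary(text, pos - 1);
        const bool endOk = isBoundary(text, pos + word.size());
        if (startOk && endOk)
            text.erase(pos, word.size());   // keep pos: the text shifted left
        else
            ++pos;                          // e.g. "the" inside "there": skip it
    }
}
```

The surrounding spaces are left in place, so a later split on whitespace still yields the remaining words.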
Quote:

Cheers,
Andre


int2str@gmail.com
Posted: Wed Oct 26, 2005 7:49 am    Post subject: Re: Please Help!!more string manipulation Qs...in C++

Hp wrote:
Quote:
[snipped posted code]
The above piece of code doesn't work.

Alright, even though I run a pretty high risk of doing your
homework, I'll post my version of the program, which reads in a file
and removes the stopwords.

The program reads only one file, though, and doesn't build the
document/term matrix for you; that's still up to you.

Please try to understand the code and discuss as necessary to help you
learn something from it.

Here ya go:

#include <iostream>
#include <ostream>
#include <fstream>
#include <sstream>
#include <algorithm>
#include <cctype>
#include <string>
#include <map>

using namespace std;

// Characters that are replaced by a space before tokenizing.
const string InvalidChars = ",.!?;:=()+-'\"&";

// Maps unwanted characters to a space and everything else to lower case.
char sanitizeChar( const char & c )
{
    for( string::const_iterator inv=InvalidChars.begin();
         inv!=InvalidChars.end(); ++inv)
    {
        if ( *inv == c )
            return ' ';
    }

    // Go through unsigned char: tolower() is undefined for negative values.
    return static_cast<char>( tolower( static_cast<unsigned char>( c ) ) );
}

int main()
{
    ifstream ff_swords( "stopwords.txt" );
    ifstream ff_text( "test.txt" );

    // TODO: Check if files are open here....

    // The map is used as a set: only the keys matter.
    map<string,char> stopwords;

    string token;

    while( ff_swords >> token )
        stopwords[ token ] = 1;

    while( ff_text >> token )
    {
        transform( token.begin(), token.end(), token.begin(),
                   sanitizeChar );

        // Sanitizing may have split the token into several words.
        istringstream ss( token );
        while( ss >> token )
        {
            if ( stopwords.find( token ) != stopwords.end() )
                continue;

            // TODO: Run token through stemmer here.

            // TODO: Add stemmed token to your custom matrix now...

            cout << token << endl; // <-- Debug
        }
    }
}


Karl Heinz Buchegger
Posted: Thu Oct 27, 2005 7:23 am    Post subject: Re: Please Help!!more string manipulation Qs...in C++

Hp wrote:
Quote:

I am sorry not to have posted my code, I apologize for that.

Here is the code:
-----------------------------------------------------------------
#include <iostream>
#include <string>
#include <fstream>
#include <vector>
#include <set>
#include <algorithm>
#include <cctype>

using namespace std;
using std::string;

int main(int argc, char *argv[])
{
using std::cout;
using std::endl;

int var_len;

FILE *fp;
long len;
char *buf;
fp=fopen("01t.txt","rb");
fseek(fp,0,SEEK_END);
len=ftell(fp);
fseek(fp,0,SEEK_SET);
buf=(char *)malloc(len);
fread(buf,len,1,fp);
fclose(fp);
string file;
file=buf;

cout<< file <<endl;

Your problem gets much easier, if you don't do this:
read the entire file into one single string variable.

Why don't you break the input stream into individual words
right at the input stage?

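ifstream Input( "01t.txt" );
if( !Input ) {
    // bla, bla, bla, error opening file, etc
    return EXIT_FAILURE;   // EXIT_FAILURE needs <cstdlib>
}

string Word;
vector< string > Words;

// operator>> skips whitespace, so each iteration reads exactly one word
while( Input >> Word )
    Words.push_back( Word );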


// now you have a vector of words. It is easy to manipulate
// each one of them, eg. discard special characters, transform
// every one of the words to lowercase, and of course, discard
// words which are listed in a second vector or map
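
As a sketch of that last step, the stopwords could be kept in a `std::set` and removed from the vector with the erase/remove_if idiom. The names `IsStopword` and `dropStopwords` are made up for this example, and a `std::set` is only one possible choice for the second container.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Predicate object: true for words contained in the stopword set.
struct IsStopword {
    explicit IsStopword(const std::set<std::string>& s) : stop(&s) {}
    bool operator()(const std::string& w) const { return stop->count(w) != 0; }
    const std::set<std::string>* stop;
};

// Removes all stopwords from `words`, keeping the order of the remaining ones.
void dropStopwords(std::vector<std::string>& words,
                   const std::set<std::string>& stopwords)
{
    words.erase(std::remove_if(words.begin(), words.end(),
                               IsStopword(stopwords)),
                words.end());
}
```

Because the words are compared as whole strings, "the" is dropped while "there" stays untouched.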

--
Karl Heinz Buchegger
kbuchegg (AT) gascad (DOT) at

All times are GMT