C++Talk.NET C++ language newsgroups
michael@preece.net Guest
|
Posted: Fri Oct 28, 2005 1:29 am Post subject: Re: Handling delimited strings |
|
|
websnarf (AT) gmail (DOT) com wrote:
| Quote: | michael (AT) preece (DOT) net wrote:
What would be the best programming languages for doing the following on
*nix and/or Microsoft?
Well, you might like to look at "The Better String Library" which is a
library written in C and C++, which adds useful string functionality to
those languages.
I'd like a command that can be performed/executed/run (ASAP) with the
following parameters:
ACTION : Input : One of COUNT, LOCATE, INSERT, DELETE, REPLACE, EXTRACT
STRING : Input : Any variable length type-less string.
In Bstrlib you can call bgets() or similar functions on top of fgetc.
FILENAME : The full path name of a flat ASCII text file
You would need to add parsing restrictions yourself. It's not hard, but
you might like to add in another library called PCRE (Perl Compatible
Regular Expressions) to help you with this.
POSITION : A reference, or pointer, to a sub-string in FILENAME
C has ftell().
ORDER : Ascending or Descending (only relevant for LOCATE)
Bstrlib has bstrcmp().
JUSTIFICATION : Right or Left (only relevant for LOCATE)
Bstrlib has bJustifyRight(), bJustifyLeft(), bJustifyMargin(),
bJustifyCenter().
RETURN_VALUE : Output : Only applicable to COUNT, LOCATE & EXTRACT.
I don't understand what this is. You can pass out bstrings as
return values from functions with no trouble.
The COUNT action would return the number of delimiters in FILENAME at
the specified POSITION.
Bstrlib has a number of scanning and splitting functions from which you
can derive this.
The LOCATE action would return the position of STRING within the larger
string at POSITION in FILENAME, according to the specified ORDER and
JUSTIFICATION.
The INSERT action would insert STRING into FILENAME at POSITION.
binsert()!
The DELETE action would remove the string at POSITION from FILENAME.
bdelete()!
The REPLACE action would replace the string at POSITION in FILENAME
with STRING.
breplace()!
The EXTRACT action would return the string at POSITION in FILENAME.
FILENAME would contain nested strings. The highest-level delimiter
would be ASCII char 254. Within strings delimited by char 254s would be
strings delimited by char 253s, which would in turn contain strings
delimited by char 252s... etc.., all the way down to char 128s.
This is an interesting idea. As I said, Bstrlib does have good
scanning functions. So it would be fairly straightforward to write a
recursive function that did this inside of the Bstrlib scanning
mechanisms.
POSITION would refer to any substring. For example, POSITION "1,2,3"
would refer to the third substring delimited by char 252, within the
second substring delimited by char 253, within the first substring
delimited by char 254.
Right. One way to do this is to write a vararg function that is -1
delimited, and you would just iterate through the parameters and get
references to each sequence substring via bsplitcb(). Building nested
references in Bstrlib is no problem.
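For readers following along, the POSITION lookup described above can be sketched in a few lines. This is a minimal illustration in Python rather than Bstrlib, assuming the path is read from the top-level delimiter (char 254) downward and that indexes are 1-based; the names `extract` and `TOP_DELIMITER` are made up for the example.

```python
TOP_DELIMITER = 254  # assumed: depth 0 uses char 254, depth 1 uses char 253, ...


def extract(data: str, position: list[int]) -> str:
    """Return the substring addressed by a 1-based POSITION path."""
    current = data
    for depth, index in enumerate(position):
        delimiter = chr(TOP_DELIMITER - depth)
        parts = current.split(delimiter)
        if not 1 <= index <= len(parts):
            raise IndexError(f"no substring {index} at depth {depth + 1}")
        current = parts[index - 1]
    return current


D254, D253, D252 = chr(254), chr(253), chr(252)
# Two top-level strings; the first holds two char-253 groups of three items each.
record = f"a{D252}b{D252}c{D253}d{D252}e{D252}f{D254}second"
print(extract(record, [1, 2, 3]))  # prints "f"
```

A shorter path such as `[2]` returns a whole top-level string, which is how EXTRACT at a coarser level would behave under these assumptions.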
Part of your problem is that it seems like you want to operate in
files. That's fine, and Bstrlib can still help you with that, but it's
going to be really slow. Deleting and inserting segments of data in
flat files is just going to be fairly slow no matter what. (Perhaps
this is not so true with ReiserFS v4.) It would probably be better to
operate on in-memory structures and have "READ" and "WRITE"
operations.
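A rough sketch of that read, edit in memory, write approach, again in Python rather than Bstrlib. Everything here is an assumption for illustration: the function names `read_file`, `write_file` and `edit`, the 1-based POSITION list, and the choice that INSERT places the new string before the addressed one.

```python
import os
import tempfile

TOP_DELIMITER = 254  # assumed: depth 0 uses char 254, depth 1 uses char 253, ...


def read_file(path: str) -> str:
    """READ: load the whole file; latin-1 maps bytes 0-255 to code points 0-255."""
    with open(path, encoding="latin-1", newline="") as handle:
        return handle.read()


def write_file(path: str, data: str) -> None:
    """WRITE: store the edited string back in one go."""
    with open(path, "w", encoding="latin-1", newline="") as handle:
        handle.write(data)


def edit(data: str, position: list[int], action: str, value: str = "", depth: int = 0) -> str:
    """Apply INSERT, DELETE or REPLACE at a 1-based POSITION path, in memory."""
    delimiter = chr(TOP_DELIMITER - depth)
    parts = data.split(delimiter)
    index = position[0] - 1
    # INSERT may also append, so it is allowed one slot past the end.
    limit = len(parts) + 1 if (len(position) == 1 and action == "INSERT") else len(parts)
    if not 0 <= index < limit:
        raise IndexError(f"no substring {position[0]} at depth {depth + 1}")
    if len(position) > 1:
        parts[index] = edit(parts[index], position[1:], action, value, depth + 1)
    elif action == "INSERT":
        parts.insert(index, value)
    elif action == "DELETE":
        del parts[index]
    elif action == "REPLACE":
        parts[index] = value
    else:
        raise ValueError(f"unknown action {action!r}")
    return delimiter.join(parts)


D254, D253 = chr(254), chr(253)
path = os.path.join(tempfile.mkdtemp(), "demo.dat")
write_file(path, f"a{D253}b{D254}c")

data = read_file(path)
data = edit(data, [1, 2], "REPLACE", "B")
data = edit(data, [2], "INSERT", "new")
data = edit(data, [3], "DELETE")
write_file(path, data)
```

The file is touched only twice, once to read and once to write, however many edits happen in between.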
--
Paul Hsieh
http://www.pobox.com/~qed
http://bstring.sf.net/
|
Thanks for the advice Paul - and I hope you don't mind that I
cross-posted this response. I'm not a C or C++ or even Perl programmer
and, although I'm advised this is not too difficult for someone with
the required skill-set, I'd like to canvass more suggestions before
tackling the ascent of the learning curve.
Cheers
Mike.
|
michael@preece.net Guest
|
Posted: Sat Oct 29, 2005 6:16 am Post subject: Re: Handling delimited strings |
|
|
spinoza1111 (AT) yahoo (DOT) com wrote:
| Quote: | One problem I see: your delimiters being in the range 128..255 rather
assume that "real" file identifiers will not contain characters with
these values, and this is not the case in Windows.
International experience (the use of Chinese characters in Windows file
identifiers) has shown me that owing to the proprietary character of
Windows, the file identifier's syntax was never defined, to my
knowledge, formally and instead a minimal file syntax applies where ANY
Unicode character other than the semicolon, backslash, asterisk and
question mark can be and will be accepted by most Windows installations
as part of the file id.
It is well known also that the period doesn't left-delimit the file
type; instead the file name to the left of the type can contain
multiple periods, with the rightmost period delimiting the type.
If Microsoft means Microsoft, then I suggest you BNF formulate the minimal
syntax of a file identifier and use this to parse the file identifier.
|
Sorry. I'm a bit confused. I was only looking for something to handle
delimited text strings within a single file. How do Microsoft's file naming
"conventions" come into it? Were you expanding on the idea of using
ReiserFS instead of a program? I realise that the characters within
each string will be limited to the ASCII chars 0-127 inclusive (except
that I'd also like to exclude char0).
If you're wondering where I'm heading with this, think of nested data -
like XML (only far more compact). I guess you could say that any
characters allowed in XML should be allowed. Further.. think of two
associated delimited strings - one to hold markup etc., the other the
data.
Mike.
|
Steve O'Hara-Smith Guest
|
Posted: Sat Oct 29, 2005 7:39 am Post subject: Re: Handling delimited strings |
|
|
On 28 Oct 2005 23:16:53 -0700
michael (AT) preece (DOT) net wrote:
| Quote: | If you're wondering where I'm heading with this, think of nested data -
like XML (only far more compact).
|
If that's the goal look into ASN1.
--
C:>WIN | Directable Mirror Arrays
The computer obeys and wins. | A better way to focus the sun
You lose and Bill collects. | licences available see
|
spinoza1111@yahoo.com Guest
|
Posted: Sun Oct 30, 2005 11:42 am Post subject: Re: Handling delimited strings |
|
|
michael (AT) preece (DOT) net wrote:
| Quote: | spinoza1111 (AT) yahoo (DOT) com wrote:
One problem I see: your delimiters being in the range 128..255 rather
assume that "real" file identifiers will not contain characters with
these values, and this is not the case in Windows.
International experience (the use of Chinese characters in Windows file
identifiers) has shown me that owing to the proprietary character of
Windows, the file identifier's syntax was never defined, to my
knowledge, formally and instead a minimal file syntax applies where ANY
Unicode character other than the semicolon, backslash, asterisk and
question mark can be and will be accepted by most Windows installations
as part of the file id.
It is well known also that the period doesn't left-delimit the file
type; instead the file name to the left of the type can contain
multiple periods, with the rightmost period delimiting the type.
If Microsoft means Microsoft, then I suggest you BNF formulate the minimal
syntax of a file identifier and use this to parse the file identifier.
Sorry. I'm a bit confused. I was only looking for something to handle
delimited text strings within a single file. How do Microsoft's file naming
"conventions" come into it? Were you expanding on the idea of using
ReiserFS instead of a program? I realise that the characters within
each string will be limited to the ASCII chars 0-127 inclusive (except
that I'd also like to exclude char0).
|
OK, my mistake. Thought you were parsing a file name. You said "in
filename" and not "in the file".
| Quote: |
If you're wondering where I'm heading with this, think of nested data -
like XML (only far more compact). I guess you could say that any
characters allowed in XML should be allowed. Further.. think of two
associated delimited strings - one to hold markup etc., the other the
data.
|
|
michael@preece.net Guest
|
Posted: Mon Oct 31, 2005 12:37 am Post subject: Re: Handling delimited strings |
|
|
spinoza1111 (AT) yahoo (DOT) com wrote:
| Quote: |
OK, my mistake. Thought you were parsing a file name. You said "in
filename" and not "in the file".
|
I guess you see now that I meant "in the file called FILENAME". Sorry
for the confusion. The capitals aren't meant to be read aloud btw - it's
just a kind of notation that has become a habit, where the capitalized
word relates to a declared variable (or constant). Well - I know what I
mean ;-)
Cheers
Mike.
|
michael@preece.net Guest
|
Posted: Mon Oct 31, 2005 1:02 am Post subject: Re: Handling delimited strings |
|
|
Steve O'Hara-Smith wrote:
| Quote: | If you're wondering where I'm heading with this, think of nested data -
like XML (only far more compact).
If that's the goal look into ASN1.
|
Isn't ASN1 mostly about encoding data *type* - along with the data?
That's a separate issue. I'm looking to handle nested delimited strings
of any, or no specified, type. The data type (required for conversion
to/from ASN1, say) of each delimited string, or group of strings, along
with any other metadata such as markup, can be described or defined in
an associated nested delimited string, or two, or three, or whatever.
Nested data is all around. Indented program code, newsgroups and the
threads within them, folders/directories, etc. etc.. It would be nice
to have a really simple way to represent and manipulate nested
structures up to 128 levels deep - much simpler than ASN1 and much more
compact than XML, and yet easily transformable into either, or any
other, format.
If you take any nested data in any format - XML is an obvious example -
it should be possible to represent it as a simple delimited string as I
described in my OP. It would be good, I reckon, if I (with a little
help) can come up with simple cross-platform tools to perform the
functions also described in my OP.
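To make the claim concrete, here is a small Python sketch that flattens a nested list of strings into the delimited form and parses it back. The names `to_delimited` and `from_delimited` are invented for the example, and the mapping (depth 0 joined with char 254, depth 1 with char 253, and so on) is an assumption based on the OP. It also shows one limitation that seems inherent to the scheme: a list holding a single string flattens to the same text as the bare string, so some shapes do not survive the round trip.

```python
TOP_DELIMITER = 254   # assumed: depth 0 uses char 254, depth 1 uses char 253, ...
LOWEST_DELIMITER = 128


def to_delimited(tree, depth: int = 0) -> str:
    """Flatten nested lists of strings into one delimited string."""
    if isinstance(tree, str):
        return tree
    code = TOP_DELIMITER - depth
    if code < LOWEST_DELIMITER:
        raise ValueError("nesting is deeper than the available delimiters")
    return chr(code).join(to_delimited(child, depth + 1) for child in tree)


def from_delimited(data: str, depth: int = 0):
    """Rebuild nested lists; text without any delimiter comes back as a string."""
    if not any(ord(ch) >= LOWEST_DELIMITER for ch in data):
        return data
    code = TOP_DELIMITER - depth
    if code < LOWEST_DELIMITER:
        raise ValueError("unexpected character above 127")
    return [from_delimited(part, depth + 1) for part in data.split(chr(code))]


tree = [["title", "Handling delimited strings"], ["tags", ["c", "strings"]]]
flat = to_delimited(tree)
print(from_delimited(flat) == tree)        # this shape survives the round trip

# Ambiguity: a one-element list and a bare string flatten to the same text.
print(to_delimited(["x"]) == to_delimited("x"))
```

If the distinction between a single string and a list of one string matters, it would have to be recorded somewhere else, for instance in the associated markup string mentioned earlier in the thread.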
Cheers
Mike.
|
Steve O'Hara-Smith Guest
|
Posted: Mon Oct 31, 2005 10:05 am Post subject: Re: Handling delimited strings |
|
|
On 30 Oct 2005 17:02:25 -0800
michael (AT) preece (DOT) net wrote:
| Quote: |
Steve O'Hara-Smith wrote:
If you're wondering where I'm heading with this, think of nested data -
like XML (only far more compact).
If that's the goal look into ASN1.
Isn't ASN1 mostly about encoding data *type* - along with the data?
|
I looked around for some references to give you and I found
it hard to spot the nested tag-length-value mechanism I met as ASN.1
around 1990 in the documentation for ASN.1 now. I think it's still there
under the hood of standard types and constructions though.
The essence of what I was thinking about was nested TLV
structures which always seemed to me to be more robust than the
paired delimiters of XML.
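For anyone who has not met the idea, a nested TLV structure can be sketched as below. This is not ASN.1 or any standard encoding, just a hypothetical layout assumed for illustration: one type byte (`LEAF` or `NODE`, both invented codes), a 4-byte big-endian length, then the value, where a node's value is simply the concatenation of its encoded children.

```python
import struct

LEAF, NODE = 0x01, 0x02   # hypothetical type codes, not taken from any standard
HEADER = ">BI"            # 1-byte type followed by a 4-byte big-endian length
HEADER_SIZE = struct.calcsize(HEADER)


def encode(tree) -> bytes:
    """Encode a string, or a (nested) list of strings, as type-length-value."""
    if isinstance(tree, str):
        tag, value = LEAF, tree.encode("ascii")
    else:
        tag, value = NODE, b"".join(encode(child) for child in tree)
    return struct.pack(HEADER, tag, len(value)) + value


def decode(buf: bytes, offset: int = 0):
    """Decode one element at offset; return (tree, offset just past it)."""
    tag, length = struct.unpack_from(HEADER, buf, offset)
    start = offset + HEADER_SIZE
    end = start + length
    if end > len(buf):
        raise ValueError("length field runs past the end of the buffer")
    if tag == LEAF:
        return buf[start:end].decode("ascii"), end
    if tag == NODE:
        children, pos = [], start
        while pos < end:
            child, pos = decode(buf, pos)
            children.append(child)
        if pos != end:
            raise ValueError("child overruns its parent")
        return children, end
    raise ValueError(f"unknown type code {tag}")


tree = ["a", ["b", "c"], ""]
blob = encode(tree)
decoded, consumed = decode(blob)
print(decoded == tree, consumed == len(blob))
```

Because lengths are explicit, the values can contain any byte at all and a reader can skip an element without scanning its contents; the price is a fixed header on every element.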
--
C:>WIN | Directable Mirror Arrays
The computer obeys and wins. | A better way to focus the sun
You lose and Bill collects. | licences available see
|
Michael Wojcik Guest
|
Posted: Tue Nov 01, 2005 5:35 pm Post subject: Re: Handling delimited strings |
|
|
[Followups restricted to comp.programming.]
In article <20051031100540.3a7c3281.steveo (AT) eircom (DOT) net>, Steve O'Hara-Smith <steveo (AT) eircom (DOT) net> writes:
| Quote: |
The essence of what I was thinking about was nested TLV
structures which always seemed to me to be more robust than the
paired delimiters of XML.
|
What would make TLV (by which I assume you mean type-length-value
vectors, presumably with binary, fixed-length encodings for type and
length) more robust than XML? It has less redundancy, and therefore
less capacity for error detection and correction.
A trivial example: say type is a single octet, and all 256 type codes
are defined. Then it is impossible to detect if a type value is
wrong (for whatever reason - program error, transmission error, etc),
without additional context.
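The trivial example can be tried out with a toy comparison in Python. The TLV layout here (one type octet, one length octet, then the value) and the helper names are assumptions made up for the demonstration; the XML side uses the standard library parser. A single flipped bit in the type octet yields another perfectly valid record, while a comparable corruption in an XML start tag no longer matches its end tag and is rejected.

```python
import xml.etree.ElementTree as ET


def parse_tlv(record: bytes):
    """Parse a toy record: 1 type octet, 1 length octet, then the value."""
    type_code, length = record[0], record[1]
    value = record[2:2 + length]
    if len(value) != length:
        raise ValueError("truncated value")
    return type_code, value


def xml_is_well_formed(text: str) -> bool:
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = bytes([7, 3]) + b"abc"
corrupt = bytes([good[0] ^ 0x01]) + good[1:]   # one bit flipped in the type octet
print(parse_tlv(good))       # (7, b'abc')
print(parse_tlv(corrupt))    # (6, b'abc'): accepted, nothing signals the error

print(xml_is_well_formed("<name>abc</name>"))   # True
print(xml_is_well_formed("<nbme>abc</name>"))   # False: start and end tags differ
```

The redundancy only catches some damage, of course: corruption inside the element content, or the same change in both tags, would pass in XML as well.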
XML makes many tradeoffs, and there are certainly applications where
a TLV encoding of some sort is preferable due to various plausible
constraints. But TLV is not "more robust" than XML in general.
That said, I agree that nested TLV structures look like a better
choice for representing arbitrary structured data than the OP's
proposal of in-band signalling with special flag bytes. That means
restricting the domain of ordinary data values, which means some
kind of shift-encoding of values that are outside that domain, and
that's invariably a mess, error-prone, difficult to enhance while
maintaining backward compatibility, and inefficient.
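As an illustration of the kind of escaping this implies, here is one hypothetical scheme in Python; it is not something proposed in the thread. It assumes ESC (0x1B) as the escape byte and writes any byte that is zero, ESC itself, or above 127 as ESC followed by two hex digits, so the escaped data never collides with the delimiter range 128-254.

```python
ESCAPE = 0x1B  # assumed escape byte; any value in 1-127 would need the same care


def escape(raw: bytes) -> bytes:
    """Rewrite raw so that it contains only bytes 1-127."""
    out = bytearray()
    for byte in raw:
        if byte == 0 or byte == ESCAPE or byte >= 128:
            out.append(ESCAPE)
            out += f"{byte:02X}".encode("ascii")
        else:
            out.append(byte)
    return bytes(out)


def unescape(data: bytes) -> bytes:
    """Reverse escape(); raises ValueError on a damaged escape sequence."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESCAPE:
            pair = data[i + 1:i + 3]
            if len(pair) != 2:
                raise ValueError("truncated escape sequence")
            out.append(int(pair, 16))
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


raw = b"caf\xe9 \xfe\x1b!"
safe = escape(raw)
print(len(raw), len(safe))           # 8 14
print(unescape(safe) == raw)         # True
```

Each escaped byte triples in size, and every reader and writer of the format has to agree on the escape rules, which gives a feel for the mess and inefficiency described above.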
--
Michael Wojcik michael.wojcik (AT) microfocus (DOT) com
Unfortunately, as a software professional, tradition requires me to spend New
Years Eve drinking alone, playing video games and sobbing uncontrollably.
-- Peter Johnson
|
Dave Thompson Guest
|
Posted: Mon Nov 14, 2005 7:26 am Post subject: Re: Handling delimited strings |
|
|
On 30 Oct 2005 17:02:25 -0800, michael (AT) preece (DOT) net wrote:
| Quote: |
Steve O'Hara-Smith wrote:
If you're wondering where I'm heading with this, think of nested data -
like XML (only far more compact).
If that's the goal look into ASN1.
Isn't ASN1 mostly about encoding data *type* - along with the data?
That's a separate issue. I'm looking to handle nested delimited strings
of any, or no specified, type. The data type (required for conversion
to/from ASN1, say) of each delimited string, or group of strings, along
with any other metadata such as markup, can be described or defined in
an associated nested delimited string, or two, or three, or whatever.
|
Not inherently. ASN.1 is about encoding any structure defined in a
(specified) data language.
You could certainly do n-ary trees of character strings as array of
(discriminated) either string or (recursively) tree of strings. And
since these types have different primitive tags, you don't need any
added application tags. IIRC, may not be exactly right, I don't
currently have tools or references at hand to check:
StringTree ::= SEQUENCE OF CHOICE { IA5String, StringTree }
or to include the (trivial) case of only one string
StringTree ::= CHOICE { IA5String, SEQUENCE OF StringTree }
ASN.1 is frequently, I think probably more often than not, _used_ in
applications where it is desirable to encode data with type to allow
for extensibility and upgradability in distributed applications. For
example in crypto applications, the ones I have mostly worked on, when
we want to transmit or store a key, what is in the key depends on the
algorithm used, and we know from experience that over time new
algorithms will be created and wanted, so standards like X.509 and
PKCS 4, 10, 8/12 have ASN.1 constructs roughly equivalent to:
struct { OID-identifying-algorithm , data-depending-on-that-OID }
That way when some subset of the users and systems add a new
algorithm, the other ones can unambiguously recognize that it's
something they don't know (yet); and with only a little care in
defining the ASN.1 they can skip the data they don't understand, and
as long as they don't actually need to process that data (only store
or forward it etc.) can proceed OK without even being upgraded. This
is useful for applications that want it, but not mandatory.
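The skip-what-you-don't-know behaviour can be sketched without real ASN.1 tooling. The Python below uses a made-up record layout (1-byte name length, ASCII algorithm name, 2-byte big-endian data length, data) standing in for the OID-plus-data construct; the names `pack_record`, `read_records` and the algorithm labels are all hypothetical, and this is not DER.

```python
import struct


def pack_record(algorithm: str, data: bytes) -> bytes:
    """Build one record: name length, name, data length, data."""
    name = algorithm.encode("ascii")
    return struct.pack(">B", len(name)) + name + struct.pack(">H", len(data)) + data


def read_records(buf: bytes, known: set):
    """Return (algorithm, data) pairs; data is None for algorithms not in known."""
    results = []
    offset = 0
    while offset < len(buf):
        (name_len,) = struct.unpack_from(">B", buf, offset)
        offset += 1
        name = buf[offset:offset + name_len].decode("ascii")
        offset += name_len
        (data_len,) = struct.unpack_from(">H", buf, offset)
        offset += 2
        data = buf[offset:offset + data_len]
        offset += data_len          # the length lets us step over unknown data
        results.append((name, data if name in known else None))
    return results


stream = (pack_record("alg-old", b"\x01\x02")
          + pack_record("alg-new", b"\xff" * 10)
          + pack_record("alg-old", b"\x03"))
for name, data in read_records(stream, known={"alg-old"}):
    print(name, "skipped" if data is None else data)
```

An older reader that only knows "alg-old" still finds the third record correctly, because the explicit length tells it how far to jump over the data it cannot interpret.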
That said, I basically concur with mwojcik: ASN.1 is _a_ choice, with
advantages and disadvantages; there are others. One of the features,
IMO often a disadvantage, it shares with XML is that both are designed
very generally, to handle essentially everything anybody wants, so
tools that handle that generality are usually complex and arguably
bloated. But if you don't use those tools and develop your own more
limited specific ones you (must) reimplement quite a few wheels.
- David.Thompson1 at worldnet.att.net
|