+------ Dan Trevino wrote (Sat, 17-Feb-2007, 17:43 -0500):
|
| I need to parse a tab delimited text file of several thousand lines. The
| first part is easy;
|
| cut -f8 file
This bit of processing can often be moved into whatever interpreter
does the rest of the processing, but keeping it in cut neatly separates
this function and may be more maintainable for whoever inherits it.
| field 8 of this file contains multiple, variable length, sentences enclosed
| in double quotes. Example returned by the cut command above:
|
| "this is sentence one. this i sentence two. this ""is a quote that may be""
| in sentence three."
|
| I need to grab the first sentence for further processing (without the
| period, without the beginning quote mark) into a variable, but am having
| difficulty. Can anyone suggest an easy way to do this? I'm open to
| bash,perl,python solutions, but prefer bash.
Well, sed is the easiest tool for the problem as stated:
cut -f8 file | sed 's/^"//; s/\..*//'
I might choose another tool depending on the rest of the processing
to be done and on whether the problem domain might evolve. The sed
command is a bit brittle relative to changing requirements. Much the
same could be done in bash directly, but the sed command is simpler
and faster.
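For comparison, here is a rough sketch of the pure-bash route using
parameter expansion. The function name first_sentence and the sample
field are made up for illustration, and it only covers the simple case
where the first period ends the first sentence:

```shell
#!/usr/bin/env bash
# first_sentence is a hypothetical helper name, not something from your script.
first_sentence() {
    local field=$1
    field=${field#\"}     # drop one leading double quote, if present
    field=${field%%.*}    # drop everything from the first period onward
    printf '%s\n' "$field"
}

# Capture the result in a variable for further processing.
s=$(first_sentence '"this is sentence one. this i sentence two."')
echo "$s"
```

In a real script you would call this once per line, e.g. inside
cut -f8 file | while IFS= read -r field; do ...; done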
If the first sentence might contain an embedded period, say in a
literal floating point number, you might match on the ". " string
instead, and then delete the closing quote and trailing period that
remain if there is only one sentence, i.e.:
cut -f8 file | sed 's/^"//; s/\. .*//; s/"$//; s/\.$//'
And, if exclamation and question marks may terminate sentences in
the input (no backslash is needed for the period inside brackets):
cut -f8 file | sed 's/^"//; s/[.!?] .*//; s/"$//; s/[.!?]$//'
This still doesn't handle ellipsis marks... 8-)
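A quick way to see how this behaves is to feed it a few sample fields.
The inputs below are invented for illustration, not taken from your
file, and first_sentence is just a hypothetical wrapper name. The
variant here includes an s/"$// step to drop the closing quote left
on a single-sentence field:

```shell
# first_sentence: hypothetical filter; reads one field per line on stdin.
first_sentence() {
    sed 's/^"//; s/[.!?] .*//; s/"$//; s/[.!?]$//'
}

# Embedded period in a number is kept; prints: The value is 3.14 exactly
printf '%s\n' '"The value is 3.14 exactly. Next sentence."' | first_sentence

# Single sentence ending in a question mark; prints: Is this one sentence
printf '%s\n' '"Is this one sentence?"' | first_sentence

# Ellipsis is cut at its last dot, leaving a stray period; prints: Wait.
printf '%s\n' '"Wait... what? Yes."' | first_sentence
```

The last case shows the ellipsis problem: the match stops at the first
punctuation mark followed by a space, which is the final dot of "...".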
How constrained and well behaved is the input?
Best Regards,
Chuck
|