LINUX-L Archives

LINUX-L@LISTS.UFL.EDU


Subject: Re: file parsing help
From: Charles Seeger <[log in to unmask]>
Reply-To: Platform Independent Linux List! <[log in to unmask]>
Date: Sun, 18 Feb 2007 01:24:11 -0500
Content-Type: text/plain

+------ Dan Trevino wrote (Sat, 17-Feb-2007, 17:43 -0500):
|
| I need to parse a tab delimited text file of several thousand lines. The
| first part is easy;
|
| cut -f8 file

This step can often be folded into whatever interpreter does the rest
of the processing, but a separate cut does isolate the field extraction
neatly and may be easier to maintain for whoever inherits it.
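
As a rough sketch of folding the field extraction into the interpreter
that does the rest of the work, here is the same job done in one awk
process. The function name first_sentence_awk is made up for this
example, and awk is just one possible choice of interpreter; it assumes
the simplest rule, that the sentence ends at the first period.

```shell
# Hypothetical helper: print the first sentence of tab-separated field 8
# for each input line (files given as arguments, or stdin if none).
first_sentence_awk() {
    awk -F'\t' '{
        s = $8
        sub(/^"/, "", s)     # drop the opening quote
        sub(/\..*/, "", s)   # drop the first period and everything after it
        print s
    }' "$@"
}

# Made-up sample line with eight tab-separated fields.
printf '1\t2\t3\t4\t5\t6\t7\t"this is sentence one. this is sentence two."\n' |
    first_sentence_awk
```

On the sample line this prints "this is sentence one" (without the
quotes). With real data it would be called as: first_sentence_awk file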

| field 8 of this file contains multiple, variable length, sentences enclosed
| in double quotes. Example returned by the cut command above:
|
| "this is sentence one. this i sentence two. this ""is a quote that may be""
| in sentence three."
|
| I need to grab the first sentence for further processing (without the
| period, without the beginning quote mark) into a variable, but am having
| difficulty. Can anyone suggest an easy way to do this? I'm open to
| bash,perl,python solutions, but prefer bash.

Well, sed is the easiest tool for the problem as stated:

cut -f8 file | sed 's/^"//; s/\..*//'

I might choose another tool depending on the rest of the processing
to be done and on whether the problem domain might evolve. The sed
command is a bit brittle relative to changing requirements. Much the
same could be done in bash directly, but the sed command is simpler
and faster.
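
For comparison, a minimal sketch of the "bash directly" route, using
only parameter expansion. The function name first_sentence and the
sample input are made up for illustration; it assumes the simple rule
that the sentence ends at the first period.

```shell
# Hypothetical helper: strip the opening quote, then cut at the first
# period, using shell parameter expansion only.
first_sentence() {
    local s=$1
    s=${s#\"}     # remove one leading double quote, if present
    s=${s%%.*}    # remove the longest suffix that starts with a period
    printf '%s\n' "$s"
}

# Made-up sample field value; with real data, replace the printf
# with: cut -f8 file
printf '%s\n' '"this is sentence one. this is sentence two."' |
while IFS= read -r field; do
    sentence=$(first_sentence "$field")
    printf '%s\n' "$sentence"
done
```

This prints "this is sentence one" (without the quotes). Note that a
while loop on the right side of a pipe runs in a subshell, so the
sentence variable is only usable inside the loop body; in bash, feeding
the loop with done < <(cut -f8 file) avoids that.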

If the first sentence might contain an embedded period, say in a
literal floating point number, you might match on the ". " string
and then delete any trailing period (along with the closing quote
after it), which would remain if there were only one sentence, i.e.:

cut -f8 file | sed 's/^"//; s/\. .*//; s/\."*$//'

And, if exclamation and question marks may terminate sentences in
the input (a period needs no backslash inside a bracket expression):

cut -f8 file | sed 's/^"//; s/[.!?] .*//; s/[.!?]"*$//'

This still doesn't handle ellipsis marks... 8-)
How constrained and well-behaved is the input?
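
One way to probe that question is to run a handful of representative
field values through the command before trusting it on the whole file.
The sketch below uses a variant of the last sed command, with the
bracket expression written as [.!?] and an optional closing quote
removed along with the final punctuation; the function name and the
sample inputs are made up for illustration.

```shell
# Hypothetical helper: variant of the last sed command, reading one
# field value per line on stdin.
first_sentence_sed() {
    sed 's/^"//; s/[.!?] .*//; s/[.!?]"*$//'
}

# Made-up sample inputs: embedded decimal point, question mark,
# single sentence, and an ellipsis.
printf '%s\n' \
    '"pi is 3.14 today. second sentence."' \
    '"is it done? yes."' \
    '"only one sentence."' \
    '"wait... what happened. next."' |
    first_sentence_sed
```

The first three lines come out as "pi is 3.14 today", "is it done" and
"only one sentence" (without the quotes). The ellipsis case comes out
as "wait." because the third dot followed by a space is taken as the
sentence end, which is the limitation noted above.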

Best Regards,
Chuck
