***** To join INSNA, visit http://www.sfu.ca/~insna/ *****
>I'm currently working on a research employing social network analysis
>on an e-mail archive. My problem is how to extract automatically from
>the "from", "to" and "cc" fields of EACH e-mail a relational matrix.
>Does anybody of you have an idea or suggestion (some software or short
I had to solve a similar problem myself and found it necessary to write
the software to do it. (I was extracting networks showing who is
co-mentioned with whom in about 60,000 single-spaced pages of archived
As other posters have suggested, the task is easier if you can put a
decent number of e-mails together in a largeish file.
Once you have the file, read it closely and develop a graph that
captures the syntax of the fields you care about. The main thing is
that you need to know when messages begin and end and what you want from
them. Since the stuff you care about occurs after well-defined tokens
that come at the beginning of lines, the parsing logic is pretty
simple. A super-simple version would go something like this:
1. Get a line from your input file.
2. Until it starts with "From:", dump it and get another.
3. Extract everything to the end of line as the from data
4. Get a line from your input file.
5. Until it starts with "To:", dump it and get another.
6. Extract everything to the end of line as the to data
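For what it's worth, here is a minimal Python sketch of those six steps. The function name extract_from_to and the sample addresses are made up for illustration. It assumes each header sits on a single line and is spelled exactly "From:" or "To:", so folded (multi-line) headers and other capitalizations would need extra handling.

```python
import io

def extract_from_to(lines):
    """Yield (from_data, to_data) pairs by following the six steps."""
    lines = iter(lines)
    for line in lines:
        # Steps 1-2: dump lines until one starts with "From:".
        if not line.startswith("From:"):
            continue
        # Step 3: everything after the token is the from data.
        from_data = line[len("From:"):].strip()
        # Steps 4-5: dump lines until one starts with "To:".
        for line in lines:
            if line.startswith("To:"):
                # Step 6: everything after the token is the to data.
                yield from_data, line[len("To:"):].strip()
                break

# Hypothetical sample input standing in for a large file of e-mails.
sample = io.StringIO(
    "From: alice@example.org\n"
    "Subject: budget\n"
    "To: bob@example.org, carol@example.org\n"
    "\n"
    "Body text here.\n"
    "From: bob@example.org\n"
    "To: alice@example.org\n"
)
pairs = list(extract_from_to(sample))
```

Both loops pull from the same iterator, so the inner loop picks up where the outer one stopped, which is all the "get another line" steps really need.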
Of course, you'll want to tease apart multiple recipients, and you may
want other fields as well. The more you may change what you want, the
more it makes sense to specify what you are looking for by using a graph
that describes the order of the fields you care about.
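As a rough illustration of teasing apart multiple recipients, the sketch below uses getaddresses from Python's standard email.utils module to split the field data into individual addresses, then tallies one sender-to-recipient tie per address and lays the tallies out as a matrix. The helper name add_ties, the lowercasing of addresses, and the sample data are my own assumptions, not part of any particular package.

```python
from collections import Counter
from email.utils import getaddresses

def add_ties(ties, from_data, *recipient_fields):
    """Add one sender -> recipient tie for every address in the given fields."""
    senders = [addr.lower() for _, addr in getaddresses([from_data]) if addr]
    recipients = [addr.lower()
                  for _, addr in getaddresses(list(recipient_fields)) if addr]
    for sender in senders:
        for recipient in recipients:
            ties[(sender, recipient)] += 1

# Hypothetical field data, as it might come out of the To: and CC: lines.
ties = Counter()
add_ties(ties, "Alice <alice@example.org>",
         "bob@example.org, Carol <carol@example.org>", "dave@example.org")
add_ties(ties, "bob@example.org", "alice@example.org")

# Lay the ties out as a square matrix: rows are senders, columns recipients.
people = sorted({person for pair in ties for person in pair})
index = {person: i for i, person in enumerate(people)}
matrix = [[0] * len(people) for _ in people]
for (sender, recipient), count in ties.items():
    matrix[index[sender]][index[recipient]] = count
```

Whether To: and CC: recipients should count as the same kind of tie is a research decision; here they are simply pooled.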
Say you wanted to allow for link decay and you wanted to capture notions
of principals, stakeholders and confidantes, you'd also want the date,
cc: and bcc: fields. We can write the graph as text by putting each
token you care about at the beginning of a line and then putting the
tokens that could follow on the same line. That way, your syntax graph
could look like this:
To: CC: BCC: Reply-To: From:
CC: BCC: Reply-To: From:
BCC: Reply-To: From:
(This is all linear text, so you don't have to worry about cycles.)
Just as above, if you only care about the field info that follows the
token marking the data, you can just get lines and eat them if they
don't start with tokens that match nodes in your syntax graph. Doing
this makes it possible to change what you are looking for without
rewriting your code.
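One way this could look in Python is sketched below. The graph is kept as text in the same one-node-per-line layout, and the parser only accepts a line if it starts with a token that the graph lists as a follower of the last token accepted. The "From:" and "Reply-To:" rows, the rule that "From:" opens a new record, and the names load_graph and parse are my assumptions for the sake of a complete example.

```python
GRAPH_TEXT = """\
From: To:
To: CC: BCC: Reply-To: From:
CC: BCC: Reply-To: From:
BCC: Reply-To: From:
Reply-To: From:
"""

def load_graph(text):
    """First token on each line is a node; the rest are its allowed followers."""
    graph = {}
    for row in text.splitlines():
        tokens = row.lower().split()
        if tokens:
            graph[tokens[0]] = set(tokens[1:])
    return graph

def parse(lines, graph, start="from:"):
    """Return one dict per message, keyed by the lowercased field token."""
    records, current, state = [], None, None
    for line in lines:
        allowed = {start} if state is None else graph.get(state, set())
        lowered = line.lower()
        token = next((t for t in allowed if lowered.startswith(t)), None)
        if token is None:
            continue  # eat lines that don't match an allowed node
        if token == start:
            current = {}  # the start token opens a new record
            records.append(current)
        current[token] = line[len(token):].strip()
        state = token
    return records

# Hypothetical sample input.
sample = [
    "From: alice@example.org", "Date: Mon, 3 Mar 2003", "To: bob@example.org",
    "Subject: hi", "CC: carol@example.org", "", "Body text.",
    "From: bob@example.org", "To: alice@example.org",
]
records = parse(sample, load_graph(GRAPH_TEXT))
```

To pick up another field, you would edit GRAPH_TEXT rather than the parsing code. Note that with this particular graph a message lacking a "To:" line stalls the parser until the next "To:" appears, so the graph has to reflect what your archive actually contains.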
If you're not programming-savvy, CS undergrads who have taken compilers
and done well can usually handle this kind of task without too much
agony. Perl, Java, C++ and even Visual Basic are all reasonable
languages for the job.
If you are developing very many networks with techniques like this, I
recommend you embed routines for computing the network measures you want
into your program. Saves a lot of pointing and clicking in UCINet,
Pajek, or whatever else. For that, I recommend NetStat+ if you are
using C or C++ (that's what I did -- it works great) or JUNG if you are
using Java.
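To give a sense of what embedding a measure can mean, here is a small Python sketch that computes weighted out-degree and in-degree straight from a table of sender-to-recipient tie counts. It does not use NetStat+ or JUNG; the function name degree_counts and the sample ties are hypothetical.

```python
from collections import Counter

def degree_counts(ties):
    """Return (out_degree, in_degree) Counters from {(sender, recipient): weight}."""
    out_degree, in_degree = Counter(), Counter()
    for (sender, recipient), weight in ties.items():
        out_degree[sender] += weight
        in_degree[recipient] += weight
    return out_degree, in_degree

# Hypothetical tie counts, e.g. as tallied by the extraction step.
ties = {
    ("alice@example.org", "bob@example.org"): 2,
    ("alice@example.org", "carol@example.org"): 1,
    ("bob@example.org", "alice@example.org"): 1,
}
out_degree, in_degree = degree_counts(ties)
```

Running something like this at the end of the extraction program means each new network comes out with its measures already attached.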
Hope that helps and good luck,
Mark T. Kennedy, Ph.D.
Department of Management & Organization
Marshall School of Business | University of Southern California