The following may be of interest to this listserv.  The content is from:

Peter Hook

PhD Student, SLIS, Indiana University--Bloomington

Enron Email Dataset

This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.

The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of integrity problems. A number of folks at SRI, notably Melinda Gervasio, worked hard to correct these problems, and it is thanks to them (not me) that the dataset is available. The dataset here does not include attachments, and some messages have been deleted "as part of a redaction effort due to requests from affected employees". Invalid email addresses were converted to something of the form [log in to unmask] whenever possible (i.e., recipient is specified in some parse-able format like "Doe, John" or "Mary K. Smith") and to [log in to unmask] when no recipient was specified.

I am distributing this dataset as a resource for researchers who are interested in improving current email tools, or understanding how email is currently used. This data is valuable; to my knowledge it is the only substantial collection of "real" email that is public. Other datasets are not public because of privacy concerns. In using this dataset, please be sensitive to the privacy of the people involved (and remember that many of these people were certainly not involved in any of the actions which precipitated the investigation).

Some experiments associated with this data are described on Ron Bekkerman's home page. Also, a paper describing it was presented at the 2004 CEAS conference.

* March 2, 2004 version of the dataset (about 400 MB, tarred and gzipped).


William W. Cohen, CALD, CMU
Last modified: Thu Aug 26 15:50:10 EDT 2004

SOCNET is a service of INSNA, the professional association for social
network researchers. To unsubscribe, send
an email message to [log in to unmask] containing the line
UNSUBSCRIBE SOCNET in the body of the message.