***** To join INSNA, visit http://www.insna.org *****
This has been a PITA for me.
Although 90% of it would be fixed if the psychologists, comm sci folks, etc. stopped
using initials only.
Barry Wellman S.D. Clark Professor of Sociology NetLab Director
Centre for Urban & Community Studies University of Toronto
455 Spadina Avenue Toronto Canada M5S 2G8 fax:+1-416-978-7162
wellman at chass.utoronto.ca http://www.chass.utoronto.ca/~wellman
for fun: http://chass.utoronto.ca/oldnew/cybertimes.php
New system solves the 'who is J. Smith' puzzle
Thursday, December 14, 2006
Penn State researchers have developed an automated system that can
determine which "J. Smith" is authoring papers on computer science -- the
one who teaches at Penn State or the one who teaches at M.I.T -- as well
as whether "J. Smith" is John Smith, Jane Smith, Joanna L. Smith or James
The system, which retrieves classes of authors with similar names,
considers not just names in making its determination but also other
information such as co-authors, dates of publications, citations and
When tested with 3,355 academic papers written by 490 authors, the system
correctly identified authors 90.6 percent of the time.
"It works very similarly to how humans would figure out authors' identity
-- by looking at affiliations, topics, publications," said C. Lee Giles,
the David Reese professor of Information Sciences and Technology and
"The system works by using machine-learning methods to cluster together
names that the system believes to be similar. If you think there's another
parameter that's relevant, you can change the algorithm and include it,"
The system is explained in a paper, "Efficient Name Disambiguation for
Large-Scale Databases," presented at the recent 17th European Conference
on Machine Learning and the 10th European Conference on Principles and
Practice of Knowledge Discovery in Databases in Berlin. Co-authors were
Jian Huang, a doctoral student in the College of Information Sciences and
Technology, and Seyda Ertekin, a doctoral student in the Department of
Computer Science and Engineering.
Even in academic publications, figuring out an author's identity can be
difficult as publications vary in how individuals' names are presented.
For instance, some publications opt just for first initial and last name
as in "J. Smith." Others include full name -- C. Lee Giles, for instance.
But if the surname is common, as in "Smith" or "Chen," first names may not
suffice to accurately identify the author.
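
To make the ambiguity concrete, here is a minimal sketch (not the Penn State system) that reduces each name to a surname plus first initial and groups names that collide. The helper names `blocking_key` and `group_by_key` and the sample list are assumptions made up for illustration.

```python
from collections import defaultdict


def blocking_key(name):
    """Return (surname, first initial) for a name such as 'J. Smith'.

    Hypothetical helper: treats the last token as the surname and the
    first character of the first token as the initial.
    """
    parts = name.replace(".", " ").split()
    return parts[-1].lower(), parts[0][0].lower()


def group_by_key(names):
    """Group names that share the same (surname, first initial) key."""
    groups = defaultdict(list)
    for name in names:
        groups[blocking_key(name)].append(name)
    return dict(groups)


# Illustrative names taken from the examples in the article text.
names = ["J. Smith", "John Smith", "Jane Smith", "Joanna L. Smith", "C. Lee Giles"]
groups = group_by_key(names)

for key, members in groups.items():
    print(key, members)
```

All four Smiths fall into the same group, which is why the name alone cannot settle who wrote what and other evidence is needed.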
Confusion also can occur because of how entities are listed, with some
publications choosing Penn State, The Pennsylvania State University or
PSU. The researchers' algorithm can clear up ambiguities surrounding
entities whether institutions, businesses, funding agencies or
"This method will work on many entity disambiguation problems," Giles
The algorithm uses a clustering method to train computers to extract
information based on similar properties. Each time information is
clustered, the resulting grouping gets smaller.
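
As a rough illustration of clustering on similar properties, the sketch below merges paper records whose co-author sets overlap. It is a toy single-link merge using Jaccard similarity, not the researchers' algorithm; the function names, the 0.3 threshold, and the sample records are all assumptions invented for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_records(records, threshold=0.3):
    """Single-link clustering: two records end up in the same cluster
    when a chain of pairs with co-author similarity >= threshold links them.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            sim = jaccard(records[i]["coauthors"], records[j]["coauthors"])
            if sim >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record["paper"])
    return sorted(clusters.values())


# Made-up records: four papers all signed "J. Smith".
records = [
    {"paper": "P1", "coauthors": {"A. Lee", "B. Kim"}},
    {"paper": "P2", "coauthors": {"A. Lee", "B. Kim", "C. Diaz"}},
    {"paper": "P3", "coauthors": {"D. Rao"}},
    {"paper": "P4", "coauthors": {"D. Rao", "E. Novak"}},
]

print(cluster_records(records))
```

With these made-up records, P1 and P2 share two co-authors and P3 and P4 share one, so the four papers split into two presumed authors. Other evidence mentioned in the article, such as affiliations or topics, could be folded into the similarity function in the same way.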
The algorithm will be a part of the next-generation CiteSeer, the largest
academic search engine for computer and information-science literature.
Giles was co-creator of CiteSeer when he was at NEC.
The research was supported by the National Science Foundation and