*****  To join INSNA, visit http://www.insna.org  *****

Distributional clustering tells us that words that appear in similar
contexts share some semantics. For example, consider the query
"studied at". The first 10 hits returned by Google contain these
examples:

... studied at Texas A&M ...
... studied at UR ...
... studied at Mallinckrodt Institute ...
... studied at the FU Berlin ...
... studied at Harvard ...
... studied at UC Davis ...

Similarly, given the query "at Harvard University", one gets:

... library preservation programs at Harvard University ...
... The Department of Near Eastern Languages & Civilizations at
Harvard University ...
... Center for International Development at Harvard University ...
... The Civil Rights Project at Harvard University ...

A simple algorithm based on distributional clustering could work as
follows:

Input: set S1 of phrases
Output: set S2 of phrases such that S2 is a superset of S1

Algorithm:

1. Consider the relation P//Q (P appears in the context of Q). For
   example, "studied at" appears in the context of "Texas A&M".

2. Build a bipartite graph that corresponds to P//Q.

3. Identify a small but dense subgraph G of P//Q such that the
   right-hand mode of G is equal to S1. The left-hand mode of G is the
   set of phrases that co-occur with S1.

4. Find the projection G' of G onto the right mode. Rank the resulting
   phrases by the strength of the projection. Let S2 be the set of the
   highest-ranking such phrases.

Illustration:

S1 = {Harvard, Princeton, Stanford, Yale}

G = {"studied at"//Harvard
     "studied at"//Princeton
     "president of"//Harvard
     "president of"//Yale
     "provost of"//Stanford
     "provost of"//Yale}

G' = G union {"studied at"//Oxford
              "president of"//Columbia}

S2 = {Harvard, Princeton, Stanford, Yale, Oxford, Columbia}
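
The steps above can be sketched in a few lines of Python. This is only
a toy sketch under my own assumptions, not a claim about what Google
Sets does: the function name expand_set, the top_k cutoff, and the
scoring rule (a context is weighted by the number of seeds it co-occurs
with, and a candidate phrase is scored by the summed weight of its
contexts) are all hypothetical choices. The "flew to" pairs are invented
noise added to show that contexts unrelated to S1 contribute nothing.

```python
from collections import defaultdict


def expand_set(pairs, seeds, top_k=2):
    """Grow seed set S1 into S2 from (context, phrase) co-occurrence pairs.

    Hypothetical helper: the scoring rule is an assumption, not a known algorithm.
    """
    seeds = set(seeds)

    # Step 2: bipartite graph, stored as context -> set of phrases.
    phrases_by_context = defaultdict(set)
    for context, phrase in pairs:
        phrases_by_context[context].add(phrase)

    # Step 3: left mode of G = contexts that co-occur with at least one seed,
    # weighted by how many seeds each context co-occurs with.
    context_weight = {
        c: len(ps & seeds) for c, ps in phrases_by_context.items() if ps & seeds
    }

    # Step 4: score every non-seed phrase reachable through those contexts.
    scores = defaultdict(int)
    for c, w in context_weight.items():
        for p in phrases_by_context[c] - seeds:
            scores[p] += w

    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return seeds | set(ranked[:top_k]), dict(scores)


S1 = {"Harvard", "Princeton", "Stanford", "Yale"}
pairs = [
    ("studied at", "Harvard"), ("studied at", "Princeton"),
    ("president of", "Harvard"), ("president of", "Yale"),
    ("provost of", "Stanford"), ("provost of", "Yale"),
    ("studied at", "Oxford"), ("president of", "Columbia"),
    ("flew to", "Boston"), ("flew to", "Chicago"),  # invented noise pairs
]

S2, scores = expand_set(pairs, S1)
print(sorted(S2))
# ['Columbia', 'Harvard', 'Oxford', 'Princeton', 'Stanford', 'Yale']
```

On real data one would probably want to normalize the weights, since
very frequent contexts such as "at" or "of" would otherwise dominate
the ranking.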


--
Drago


> 
> Hi, John,
> 
> I think this is a fascinating but difficult question, especially when
> the challenge is to guess an answer that Google keeps hermetically
> sealed as a commercial secret. I remember once doing a Google search
> on this, and the only things I found were a couple of blog posts at
> questsin.blogspot.com:
> 
> http://questsin.blogspot.com/2005/05/reverse-engineering-google-sets.html
> http://questsin.blogspot.com/2005/06/generic-algorithm-for-classification.html
> http://questsin.blogspot.com/2005/08/dealing-with-clustering-of-common.html
> 
> but I didn't find the idea of just parsing lists and tables very
> convincing. There should be something more to it.
> 
> Now in terms of your hints, it's hard to judge. My feeling is that
> here we're dealing more with lexical semantics (e.g., recognizers over
> ontologies) than relational semantics (as static interpretation
> schemes although this might be employed as an inference tool in sets).
> 
> I would also like to know more.
> 
> Regards,
> 
> --Moses
> 
> 
> On 10/1/06, Scott, John <[log in to unmask]> wrote:
> > Barry Wellman recently discovered a possible network algorithm in
> > Google. Under 'More' on the Google search page you can find 'Google
> > Labs', and under that is 'Google Sets'. Google Sets takes the first two
> > items in a series and grows it into a network of connections. It does
> > this through Google searches on the items.
> >
> > For example, Barry input 'Barry Wellman' and 'Beverley Wellman'. Google
> > responded with the following network:
> >
> > Barry Wellman
> > Ronald S Burt
> > Alain Degenne
> > John Scott
> > Nan Lin
> >
> > Those of us in Barry's Google set had a brief discussion about what the
> > algorithm might be and whether there were any social network analysis
> > applications of the search tool. Our best guess was that the algorithm
> > was a search on the starter items that looked at a frequency count of
> > words appearing in the searches before choosing the next search item.
> > Crucial for this particular set seems to be book purchasing choices
> > through Amazon. Searches on the names come up with Amazon and similar
> > pages that have lots of built in cross-references such as pages recently
> > consulted, people who bought X also bought Y, if you liked A, then you
> > will like B, and so on.
> >
> > It looks as if you need to have the two starter names connected in some
> > way that Google will find easily in order to generate a meaningful set.
> > Inputting two loosely connected names seems to produce a large and
> > apparently meaningless set. It would be interesting to know the actual
> > search algorithm to see if it is coming up with n-cliques, clusters, or
> > whatever. It might then be a method for generating some sociometric
> > data. Are there any suggestions out there about what the algorithm is
> > and how it might be used?
> >
> > To get you thinking, here are a couple of other Google-generated
> > networks.
> >
> > Inputting the names of the editors of the ASR and the AJS (Jerry Jacobs
> > and Andrew Abbott) generated this list:
> >
> > Jerry Jacobs
> > Andrew Abbott
> > Wayne C Booth
> > Charles Tilly
> > Erving Goffman
> > Robert R Alford
> > Robert K Yin
> >
> > What are the links there? And are they sociometrically relevant?
> >
> > Starting the list with INSNA and the ASA generated a list of drug and
> > airline names, suggesting that some search items become crucial pivots -
> > because of their frequency score - in determining the direction of the
> > network: ASA shifts the network in the direction of drugs, while SAS
> > shifts it in the direction of airlines. (ASA is, apparently, a generic
> > brand name for aspirin in some countries.) Here is the full set:
> >
> > INSNA
> > ASA
> > acetylsalicylic acid
> > salicylsalicylic acid
> > ephedrine
> > phenobarbital
> > rifampin
> > phenytoin
> > indomethacin
> > SSS
> > SAS
> > ASHTARAK KAT
> > American Eagle
> > Northwest Airlink
> >
> > I'd be interested to know what people think is going on in these
> > networks and what uses (if any) they think Google Sets might have in
> > social network analysis.
> >
> > John Scott
> >
> > --------------------------------------------------------
> >
> > Professor John Scott
> > Department of Sociology,
> > University of Essex,
> > Colchester CO4 3SQ
> > United Kingdom
> >
> > 01206-872640
> > Web site: http://privatewww.essex.ac.uk/~scottj/
> >
> >
> >
> > _____________________________________________________________________
> > SOCNET is a service of INSNA, the professional association for social
> > network researchers (http://www.insna.org). To unsubscribe, send
> > an email message to [log in to unmask] containing the line
> > UNSUBSCRIBE SOCNET in the body of the message.
> >
> 
> 
> 


-- 
Dragomir R. Radev                    Associate Professor
SI, CSE, Ling                     U. Michigan, Ann Arbor 
http://www.eecs.umich.edu/~radev         [log in to unmask]              
