***** To join INSNA, visit http://www.sfu.ca/~insna/ *****
My first day on this list, but I think this is an appropriate forum
to announce a website I first built three years ago, using
Amazon's affinity info on customers' buying patterns. In early 2000
I gathered info on 3000 musical artists; a year later I picked up
about 5000 artists, with about 20,000 links. Then Amazon changed
their format slightly, a sys-admin rearranged files on my machine,
I was busy with work, and I left the site alone, languishing with
data that got more outdated every week.
Then last summer Amazon announced their Web Services program (AWS), which
made their data available directly without having to parse HTML.
So I brought my server back to life, wrote a couple of programs, and
now have a database of about 440,000 items and 3.6 million links.
The items currently cover books, CDs, and videos.
The site is http://www.baconizer.com/ -- Jon Udell wrote it up a
month ago in his infoworld.com blog, along with a paragraph on
Valdis Krebs' work. I haven't found any business applications
for my site; it's basically a cool diversion. I'm reading up
on clustering algorithms, but haven't implemented anything yet.
But using the site definitely shows clusters. A query like
will make this obvious.
The Baconizer draws shortest paths between any two nodes in its
database. Of April's 400,000 nodes, 380,000 were in one strongly connected
component, which means that any one of them can be reached from any other.
Given two nodes A and B in two different components, there might be a path
from A to B, or from B to A, but not both -- otherwise they would be in the
same component. I haven't calculated the May data's components, but
I've noticed that items like Buffy videos that were in a remote
component are now in the main one.
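The notion of component used above (every node reachable from every
other, following directed links) can be sketched in a few lines. This is
a minimal illustration using Kosaraju's two-pass algorithm, not the
Baconizer's own code; the function name and the toy edge list are
assumptions made up for the example.

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Return a list of sets, one per strongly connected component.

    `edges` is an iterable of (source, target) pairs for directed links.
    Hypothetical helper: an iterative version of Kosaraju's algorithm.
    """
    graph = defaultdict(list)
    reverse = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        reverse[dst].append(src)
        nodes.update((src, dst))

    # Pass 1: depth-first search on the forward graph, recording each
    # node when all of its outgoing links have been explored.
    finished = []
    seen = set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                finished.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in reverse finishing order; each
    # sweep collects exactly one strongly connected component.
    components = []
    assigned = set()
    for start in reversed(finished):
        if start in assigned:
            continue
        assigned.add(start)
        component = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    stack.append(prev)
        components.append(component)
    return components

# Toy data: A, B and C link in a cycle; D is reachable from C but has no
# link back, so it sits in a component of its own.
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
toy_components = strongly_connected_components(toy_edges)
print(sorted(sorted(c) for c in toy_components))  # [['A', 'B', 'C'], ['D']]
```

In the toy data there is a path from A to D but none from D to A, which
is the one-way situation described above.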
Computationally, finding a shortest path is easy -- it's about
50 lines of C code. The clustering is harder, but I can run
a process for days, if that's what it takes. The next part is
labeling the clusters. I should look to see if Amazon is returning
genre information now. I wrote the gatherer against version 1 of
the AWS, and they're on version 3 now.
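For readers who haven't written one, a shortest-path search over
unweighted directed links is a breadth-first search. The sketch below is
in Python rather than C and is only an illustration under assumed data
structures: the graph is a dict mapping each item to the list of items
it links to, and the item names are placeholders.

```python
from collections import deque

def shortest_path(graph, source, target):
    """Return the list of nodes on a shortest directed path, or None.

    `graph` maps a node to the nodes it links to (an assumed layout).
    """
    if source == target:
        return [source]
    parent = {source: None}      # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt in parent:
                continue
            parent[nxt] = node
            if nxt == target:
                # Walk the parent pointers back to the source.
                path = [nxt]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None                  # target is not reachable from source

# Placeholder items, not real catalogue entries.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
print(shortest_path(toy_graph, "a", "e"))  # ['a', 'b', 'd', 'e']
print(shortest_path(toy_graph, "e", "a"))  # None
```

Because the links are directed, the second call finds nothing even
though the first one succeeds.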
I update the data on a monthly basis -- one of the terms of using the
Amazon web service is that a developer shouldn't make more than
one query per second, which means pulling down 400,000 items
will take about five days, allowing for network timeouts/retries.
In practice it takes much longer than that, because I don't run the
gatherer at the full one request/second, as the live web site uses the
web service as well.
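The five-day figure follows from simple arithmetic: 400,000 requests at
one per second is 400,000 seconds, or about 4.6 days before any retries.
The sketch below spells that out and adds a minimal client-side
throttle. The function names, the retry fraction, and the injectable
clock and sleep parameters are all assumptions for illustration; nothing
here is part of Amazon's API.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60

def crawl_days(items, requests_per_second=1.0, retry_fraction=0.0):
    """Estimate days needed to fetch `items` records, one request each.

    `retry_fraction` is a made-up knob: the share of requests that have
    to be repeated because of timeouts.
    """
    requests = items * (1.0 + retry_fraction)
    return requests / requests_per_second / SECONDS_PER_DAY

def throttled(items, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Yield items no faster than one per `min_interval` seconds."""
    last = None
    for item in items:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item

print(round(crawl_days(400_000), 2))                           # 4.63
print(round(crawl_days(400_000, requests_per_second=0.5), 2))  # 9.26
```

A gatherer would wrap its list of item IDs as `for item_id in
throttled(ids):` and issue one request per iteration. Halving the rate
to leave room for other traffic doubles the estimate to a little over
nine days.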
The numerical data is interesting. For one thing, the links are
directed, and the distribution of links per node follows a power law,
with an exponent of around -2 if I recall correctly. I seed the graph
with albums by the Bacon Brothers, and every month reach about
400,000 items. This time I merged the May data with April's --
40,000 nodes from April weren't found, but 40,000 new ones were.
This fits with an aging model for the sort of items Amazon sells.
I just merge the data together to make the site richer, but keep
the original data sets intact.
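The month-to-month comparison and merge described above come down to set
operations. Here is a minimal sketch, assuming each monthly snapshot is
held as a set of (source, target) link pairs; that representation and
the names are assumptions, and the toy data is invented.

```python
def nodes_of(links):
    """Collect every node that appears at either end of a link."""
    return {node for link in links for node in link}

def merge_snapshots(old_links, new_links):
    """Compare two monthly snapshots and build a merged link set.

    Both inputs are sets of (source, target) pairs and are left
    unmodified, so the original data sets stay intact.
    """
    old_nodes = nodes_of(old_links)
    new_nodes = nodes_of(new_links)
    return {
        "dropped": old_nodes - new_nodes,   # seen last month, not this one
        "added": new_nodes - old_nodes,     # new this month
        "merged_links": old_links | new_links,
    }

# Invented toy snapshots.
april = {("a", "b"), ("b", "c")}
may = {("a", "b"), ("b", "d")}
result = merge_snapshots(april, may)
print(sorted(result["dropped"]))       # ['c']
print(sorted(result["added"]))         # ['d']
print(len(result["merged_links"]))     # 3
```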
The average shortest-path distance is around 11, with the longest
shortest path at 53 for May. The nodes at the center tend to be history
textbooks, blues albums, and mainstream best-sellers. At the periphery
I find genres like romance, pre-teen girls' series like the
Baby-Sitters Club, TV series videos, books on comic books and
Japanese anime and manga characters, live CDs from prolific bands
like Phish and Pearl Jam, and various kinds of Bob Marley remixes.
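Figures like the average distance and the longest shortest path can be
computed by running a breadth-first search from every node. The sketch
below is a minimal illustration on a toy graph, assuming a dict that
lists every node as a key; it counts only ordered pairs where a path
exists. On a graph of 400,000 nodes this all-pairs approach is slow, so
sampling source nodes may be more practical.

```python
from collections import deque

def distance_stats(graph):
    """Return (average shortest distance, longest shortest distance).

    `graph` maps every node to the nodes it links to (assumed layout).
    Only ordered pairs with a directed path between them are counted.
    """
    total = 0
    pairs = 0
    longest = 0
    for source in graph:
        # Breadth-first search gives hop counts from this source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in graph.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        for node, hops in dist.items():
            if node != source:
                total += hops
                pairs += 1
                longest = max(longest, hops)
    average = total / pairs if pairs else 0.0
    return average, longest

# Toy directed cycle a -> b -> c -> d -> a.
cycle = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
print(distance_stats(cycle))  # (2.0, 3)
```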
It gives an interesting view of North American pop culture. I'm
wondering if anyone here is interested in pursuing this as a
research tool. Also, unlike many of the other networks I've
seen, this one is built based on what people are doing with their
hard-earned cash right now. I find the patterns that emerge from
that kind of data more interesting than which actors worked with
which others, which are drier facts. That didn't stop me from
naming the site after the Oracle of Bacon.
I'm not a researcher -- I work for a small software company
building development tools that let programmers use open-source
languages in the new Windows .NET environment. About the only
application of the Baconizer for my work is that I now can spot a
cyclic graph from miles away.