***** To join INSNA, visit http://www.insna.org *****
In my experience, the existing crawlers were limiting and did not suit
what I was looking for, so in the end I decided to write my own: that
way I control the code and what it does. With a bit of sample source
code, you can easily write a crawler in Python; I found Python faster
and easier to write than C++ or Java.
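To give a sense of what I mean, here is a minimal sketch of such a crawler using only the Python standard library. The function names and the depth limit are illustrative; real use would add politeness delays, robots.txt handling, and better error reporting.

```python
# Minimal breadth-first crawler sketch: fetch pages up to a fixed
# depth and record each page's outbound links.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html, base_url):
    """Return absolute URLs for every <a href> in the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def crawl(start_url, max_depth=2):
    """Breadth-first crawl; returns {url: [outbound links]}."""
    seen, graph = {start_url}, {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue  # skip unreachable or non-decodable pages
        graph[url] = extract_links(html, url)
        if depth < max_depth:
            for link in graph[url]:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return graph
```

The resulting dict maps each visited URL to its outbound links, which is essentially the edge list you need for a network dataset.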
Alvin
On Fri, Jul 10, 2009 at 11:32 PM, Brian Ulicny<[log in to unmask]> wrote:
>
> Crawlers usually let you specify a crawl depth, so you can control the
> size of the graph to some degree.
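[As an inline aside: a quick back-of-the-envelope in Python shows why the depth setting matters so much; the branching factor used here is purely illustrative.]

```python
# With an average of b outbound links per page, a depth-d crawl can
# touch on the order of 1 + b + b^2 + ... + b^d pages, so the graph
# grows geometrically with depth.
def max_pages(branching, depth):
    """Upper bound on pages reachable in a crawl of the given depth."""
    return sum(branching ** k for k in range(depth + 1))
```

With 10 links per page, depth 2 already bounds the crawl at 111 pages, and each extra level multiplies that by roughly the branching factor.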
>
> Brian
>
> On Fri, Jul 10, 2009 at 11:14 AM, Carl Nordlund<[log in to unmask]> wrote:
>>
>> Thank you all for your suggestions! I have checked them out - I'll take a
>> closer look at Nutch and see whether I can adapt it to suit my needs, but
>> I'm also keen on developing my own crawler client in C#. I'll most likely
>> use a couple of blog portals to identify the most popular and active blogs
>> first, and then scan these for linkages between them. Although it should be
>> possible to let it 'free-crawl', in combination with language identification
>> functions and IP range restrictions, I guess that would nevertheless result
>> in a way too large dataset...
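[A hypothetical sketch in Python of the "scan for linkages" step: given pages already fetched from a fixed set of seed blogs, count directed links between each pair of blog domains. The names and structure are illustrative only, not a prescription for the C# client.]

```python
# Count directed inter-blog links, restricted to a known seed set.
from collections import Counter
from urllib.parse import urlparse

def blog_edge_counts(pages, seed_domains):
    """pages: {source_url: [outbound urls]}.
    Returns Counter mapping (source_domain, target_domain) to the
    number of links, ignoring self-links and non-seed targets."""
    edges = Counter()
    for source_url, links in pages.items():
        src = urlparse(source_url).netloc
        if src not in seed_domains:
            continue
        for link in links:
            dst = urlparse(link).netloc
            if dst in seed_domains and dst != src:
                edges[(src, dst)] += 1
    return edges
```

Restricting both endpoints to the seed set is what keeps the dataset bounded, in contrast to a free crawl.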
>>
>> Ben Spigel wrote:
>>>
>>> Issuecrawler can export to a UCINET format. I found it a useful tool
>>> for small-scale network analysis, but if you're looking to crawl more
>>> than, say, 200 webpages I think you'd be better off with software that
>>> you can run yourself on your own machine. That way you can customize
>>> it to your needs, and you're not taking over a small non-profit's
>>> bandwidth.
>>>
>>> Ben Spigel
>>> PhD Student
>>> Department of Geography
>>> University of Toronto
>>>
>>> On Fri, Jul 10, 2009 at 10:36 AM, Alvin Chin<[log in to unmask]>
>>> wrote:
>>>
>>>>
>>>> When I used IssueCrawler, I was able to extract the social graph,
>>>> as far as I know.
>>>>
>>>> Alvin
>>>>
>>>>
>>>> On Fri, Jul 10, 2009 at 9:04 PM, Brian Ulicny<[log in to unmask]>
>>>> wrote:
>>>>
>>>>>
>>>>> We've used Nutch in our work in analyzing the Malaysian blogosphere at
>>>>> VIStology. Nutch is an open-source, customizable web crawler in Java.
>>>>> See Nutch.org.
>>>>>
>>>>> Nutch works out of the box to crawl websites, but you'll have to do
>>>>> some (fairly easy) customization to extract the link structure.
>>>>>
>>>>> You might want to look at issuecrawler.net for a hosted crawling
>>>>> service. I'm not sure if the link structure is available as output.
>>>>>
>>>>> Best,
>>>>>
>>>>> Brian Ulicny
>>>>> VIStology, Inc.
>>>>> Framingham, MA
>>>>> USA
>>>>>
>>>>> On 7/10/09, Lukas Zenk <[log in to unmask]> wrote:
>>>>>
>>>>>>
>>>>>> Hi Carl,
>>>>>>
>>>>>> If you'd like to crawl, e.g., Google blogs, you could use the software
>>>>>> Condor:
>>>>>> http://www.galaxyadvisors.com/documents/condor.pdf
>>>>>>
>>>>>> Regards,
>>>>>> Lukas
>>>>>>
>>>>>> ---
>>>>>> Lukas Zenk, PhD.cand.
>>>>>> Member of the scientific staff
>>>>>> Department of Knowledge and Communication Management
>>>>>> Danube University Krems - Austria / Europe
>>>>>> www.donau-uni.ac.at
>>>>>>
>>>>>> On Jul 10, 2009, at 2:56 AM, Carl Nordlund wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi!
>>>>>>> Inspired by the amazing work done by the people at the Berkman Center at
>>>>>>> Harvard
>>>>>>>
>>>>>>> (http://cyber.law.harvard.edu/publications/2008/Mapping_Irans_Online_Public/interactive_blogosphere_map
>>>>>>>
>>>>>>> ), I've been thinking about how to gather blogosphere data, i.e.
>>>>>>> the creation of a (national) network dataset in which each node is a
>>>>>>> blog and where the edges/links are the number of directional links
>>>>>>> (external) from-to each pair of blogs. I have started working on a
>>>>>>> PHP script that recursively crawls a website, checks for external
>>>>>>> links, and builds a dataset - this of course has to be combined with
>>>>>>> a check on the nationality of the blog (comparing with national IP
>>>>>>> ranges and/or language analysis of a sample text).
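[An inline note on the IP-range part of that check: in Python this can be sketched with the stdlib ipaddress module. The CIDR range in the test is purely illustrative, and a real national check would load the ranges from a registry database.]

```python
# Check whether a blog's host resolves into a set of national IP ranges.
import ipaddress
import socket

def ip_in_ranges(ip_str, national_ranges):
    """Pure check: is a dotted-quad IP inside any of the CIDR ranges?"""
    ip = ipaddress.ip_address(ip_str)
    return any(ip in ipaddress.ip_network(cidr) for cidr in national_ranges)

def host_in_ranges(hostname, national_ranges):
    """Resolve a blog's hostname and apply the range check."""
    try:
        return ip_in_ranges(socket.gethostbyname(hostname), national_ranges)
    except (socket.gaierror, ValueError):
        return False  # unresolvable hosts are treated as out of scope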
>>>>>>>
>>>>>>> But perhaps I'm trying to reinvent the wheel. Is there any
>>>>>>> suitable web crawling software that can do the trick? As I have
>>>>>>> understood it, the consulting firm Morningside Analytics helped the
>>>>>>> Berkman group in their mapping - judging by the rather large
>>>>>>> dataset, I assume that they used some sort of web crawler. Does
>>>>>>> anyone know anything more about this?
>>>>>>>
>>>>>>> Yours,
>>>>>>> Carl Nordlund
>>>>>>> ---
>>>>>>> Carl Nordlund, BA, PhD student
>>>>>>> carl.nordlund(at)hek.lu.se
>>>>>>> Human Ecology Division, Lund university
>>>>>>> www.hek.lu.se
>>>>>>>
>>>>>>> _____________________________________________________________________
>>>>>>> SOCNET is a service of INSNA, the professional association for social
>>>>>>> network researchers (http://www.insna.org). To unsubscribe, send
>>>>>>> an email message to [log in to unmask] containing the line
>>>>>>> UNSUBSCRIBE SOCNET in the body of the message.
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Carl Nordlund, BA, PhD student
>> carl.nordlund(at)hek.lu.se
>> Human Ecology Division, Lund university
>> www.hek.lu.se