Open Source Web Crawlers, any advice?
11 08 Jul 2015 20:54 by u/shazzbot
Hi all,
I'm looking into making a low-latency web crawler to crawl for "breaking news" posts from a website.
I've done some research and seen Apache Nutch and Storm Crawler.
Apache Nutch looks to be quite a mature project (5+ years old), whereas Storm Crawler is much younger (1+ years old).
Also, Storm Crawler is an SDK, so it will require some configuration and development to get up and running, whereas Apache Nutch will work out of the box (after some tinkering).
However, Storm Crawler seems to be lower latency and seems to have nicer documentation.
Have any of you guys used either of these technologies? Or do you have any other suggestions?
Any advice/tips or tricks would be much appreciated!
Thanks,
Shazzbot
10 comments
3 u/Psycoth 08 Jul 2015 21:26
I don't have too much experience with it, but a short while back I did write a quick and dirty web scraper using Scrapy. As I recall it was fairly easy to implement.
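A minimal spider looks something like this (the site, selectors, and field names are placeholders I've made up; adjust them for whatever page you're actually targeting):

```python
# Minimal Scrapy spider sketch -- the URL, selectors, and field names
# are hypothetical placeholders; point them at the real page you want.
import scrapy

class NewsSpider(scrapy.Spider):
    name = "news"
    start_urls = ["http://example.com/breaking-news"]  # placeholder URL

    def parse(self, response):
        # Pull the headline and link out of each article block on the page.
        for article in response.css("div.article"):
            yield {
                "title": article.css("h2::text").extract_first(),
                "url": article.css("a::attr(href)").extract_first(),
            }
```

Run it with `scrapy runspider news_spider.py -o items.json` and you get the scraped items dumped as JSON.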
0 u/shazzbot [OP] 08 Jul 2015 22:12
Thanks for the link! I'll check it out :)
1 u/YoukaiCountry 08 Jul 2015 23:43
This is my go-to for writing crawlers. It's really simple to learn, and coding in Python is always a pleasure.
2 u/b3k 08 Jul 2015 21:43
I used Nutch for a project last year. There's a bit of a learning curve and it requires some configuration, but as you said it's mature. It works quite well for web crawling, can be set up to automatically pull metadata and full text out of crawled documents, and can easily dump the crawl results into an index like Elasticsearch (rough sketch of what that looks like below).
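Nutch's indexer plugin handles the indexing for you, but just to illustrate the shape of what ends up in ES, here's roughly the equivalent done by hand with the official Python client (index name and fields are made up):

```python
# Hypothetical sketch of indexing one crawled page into Elasticsearch.
# Nutch's indexer plugin does this for you; this just shows the data shape.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumes a local ES node

doc = {
    "url": "http://example.com/news/article-1",  # placeholder values
    "title": "Example headline",
    "content": "Full text extracted from the page...",
}
es.index(index="crawl", doc_type="page", id=doc["url"], body=doc)
```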
That being said, it sounds like you're targeting a single website looking for specific information, which seems like more of a scraping task than a crawling task. I'd look at Scrapy.
0 u/shazzbot [OP] 08 Jul 2015 22:11
Yeah, my first proof of concept will work with just one website, so I might be running before I can walk, so to speak. But I would really like to scale up from one website to a larger number.
What did you end up using Nutch for? I managed to do a test run last night on a news website but the results were a bit hit and miss. Everything apart from the title was in the content tag as one long string.
Do you know if it's possible to create a custom parser, so I would be able to parse the raw data (not sure what format it is)?
0 u/b3k 09 Jul 2015 15:38
I used Nutch to map a corporate intranet and feed the data into Elasticsearch. Nutch, to me, seems like a better tool when you don't have specific sites targeted, though it can work for that.
It's been a year since I used it, but I don't remember this being a problem. Like I said, I fed the data right into ES. There was already a plugin (either for Nutch or ES) for parsing that data properly into the index.
0 u/shazzbot [OP] 09 Jul 2015 17:17
Ah I see that makes sense! Thanks!
1 u/lenwood 09 Jul 2015 22:11
Python's Requests library, paired with lxml, is the easiest thing I've ever used for scraping. Usually 2-3 lines of code for Requests plus an XPath statement gets me exactly what I need; see the sketch below. I've used these for half a dozen or so projects, very easy to work with. They're also both mature (stable) projects.
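Something like this (the URL and XPath are made-up placeholders, not from a real site):

```python
# Quick Requests + lxml sketch -- URL and XPath are hypothetical.
import requests
from lxml import html

resp = requests.get("http://example.com/breaking-news")  # placeholder URL
tree = html.fromstring(resp.content)
# Hypothetical XPath; inspect the real page to find the right expression.
headlines = tree.xpath('//h2[@class="headline"]/text()')
print(headlines)
```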