04-06-15, 04:30 PM   #4
Barjack
A Black Drake
Join Date: Apr 2009
Posts: 89
I tried contacting Wowhead a couple of times about scraping their data but never got a response. In the end I decided to go ahead and do it, since I wasn't going to use it for any nefarious purpose.

In my case I need item information, so it's a far bigger scrape than what you're doing. I need both the XML and HTML versions of each page, and there are a lot of items in WoW (almost 100,000), so I end up making about 200,000 requests. I wrote a custom Ruby script to download them all using a configurable number of simultaneous threads. On my cable connection with the thread count set to 20, the XML half takes only about an hour, but the HTML half takes about 3 hours.
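For anyone wanting to try the same approach, here is a minimal sketch of a threaded downloader built on a fixed pool of worker threads. It is not the actual script described above: `fetch_all`, its `thread_count` and `fetcher` parameters, and the placeholder URL in the usage comment are all hypothetical names chosen for illustration. It uses only the Ruby standard library.

```ruby
require "net/http"
require "uri"

# Hypothetical helper (not the original script): runs `fetcher` over every URL
# using a fixed number of worker threads and returns a { url => result } hash.
# The default fetcher does a plain HTTP GET; pass your own lambda to add
# retries, throttling, or writing to disk.
def fetch_all(urls, thread_count: 20, fetcher: ->(url) { Net::HTTP.get(URI(url)) })
  queue = Queue.new
  urls.each { |url| queue << url }
  queue.close # once closed and drained, pop returns nil and the workers stop

  results = {}
  lock = Mutex.new

  workers = Array.new(thread_count) do
    Thread.new do
      while (url = queue.pop)
        body = begin
          fetcher.call(url)
        rescue StandardError => e
          e # keep the exception as the result so failed URLs can be retried later
        end
        lock.synchronize { results[url] = body }
      end
    end
  end

  workers.each(&:join)
  results
end

# Usage (placeholder URLs, assumed for illustration only):
#   urls  = (1..100).map { |id| "http://example.invalid/item=#{id}" }
#   pages = fetch_all(urls, thread_count: 20)
```

Storing failures in the result hash instead of raising means one bad request does not abort a multi-hour run; afterwards you can select the entries that are exceptions and feed them back in.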

Then I parse them all with another Ruby script that uses Nokogiri (an XML/HTML parser) and simple regular expressions. The site uses AJAX heavily, so various pages have JSON objects sitting in them that you can load into memory with a JSON parser, too. These are sometimes more convenient than scraping the HTML.
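As a rough illustration of the regex-plus-JSON idea, the sketch below pulls a JSON object out of a script tag and parses it with Ruby's standard `json` library. The sample HTML, the `itemData` variable name, its fields, and the `extract_embedded_json` helper are all made up for the example and are not taken from the real site. Nokogiri is left out so the snippet runs with the standard library alone.

```ruby
require "json"

# Made-up page fragment; the variable name and fields are assumptions.
SAMPLE_HTML = <<~HTML
  <html><body>
  <h1>Example Sword</h1>
  <script>
  var itemData = {"id": 12345, "name": "Example Sword", "level": 60};
  </script>
  </body></html>
HTML

# Hypothetical helper: finds `var <var_name> = {...};` in the page source and
# parses the object literal as JSON. Returns nil if the variable is missing
# or the literal is not valid JSON.
def extract_embedded_json(html, var_name)
  pattern = /var\s+#{Regexp.escape(var_name)}\s*=\s*(\{.*?\});/m
  match = html.match(pattern)
  return nil unless match

  JSON.parse(match[1])
rescue JSON::ParserError
  nil
end

item = extract_embedded_json(SAMPLE_HTML, "itemData")
puts "#{item["name"]} (item level #{item["level"]})" if item
```

The non-greedy match stops at the first `};`, so it would cut short an object whose string values contain that sequence. Embedded data that is a JavaScript literal rather than strict JSON (unquoted keys, for instance) will fail to parse and come back as nil, so some pages may need cleanup first.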