website scraping help please?
I was talking to the staff at wowdb.com, and they gave permission for me to scrape their mailbox data. wowhead.com does not track mailboxes, if people wondered.
My problem is that I have no idea how to scrape the data, because I have never done any scraping. I looked at DataTools on Curse, and I got headaches. If someone wants to teach me, I would like to learn how DT makes an addon and populates it with data from wowhead, wowdb, the game's files, etc. On the other hand, if someone wants to present me with "here you go", I'll take that happily as well.
The format I want the data in is something like the following example, as I will need to make corrections such as moving mailboxes that are in the wrong location, putting icons on the correct floors, etc. The LocationMapper line for the mailbox link above starts on line 7067, and I understand it doesn't follow the table format I need. If I could even break it down into something readable and then copy/paste, that's fine too. Lua Code:
|
If the game client itself keeps track of the locations, that would be awesome, but I don't think it does.
|
I wouldn't really call this a scraping problem, though I guess this sort of data conversion could be part of one. It seems like all the data you need exists in the table passed to that "Mapper" call; all you need to do is convert it to a format you can use in Lua.
If you want to do this in an automated way, you'll probably want to use a scripting language that can load JSON: load that table, parse it, and output a Lua table you can include in your addon. Perhaps a local install of Lua can do this, but I really don't know. I imagine Python, Perl or Ruby is how most people would go about it.
I looked at http://static-azeroth.cursecdn.com/1...574/js/core.js to understand what the "pins" array is. Running the Mapper function through a pretty printer shows this:
Code:
function f(G) {
Also, it may be worth noting that if this data isn't something you'll need to convert often, you could probably get some of the work done with regular expression find-and-replace on that huge table, instead of loading it as JSON and walking the tree. But this might make converting pins more difficult if that is needed in the pre-Lua stage.
As for how to convert its "floors" and zone names to something easier for you to use, I'm not sure what your options are. There may be Lua libraries to help you there, but there may not be. |
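The JSON-to-Lua conversion described above could look roughly like the following Python sketch. It is only an illustration: the sample string, the pin field names (`x`, `y`, `floor`, `label`) and the helper names are assumptions, not the real layout of WoWDB's Mapper data, so the regular expression and field list would need adjusting against the actual page.

```python
import json
import re

# Hypothetical page fragment; the real Mapper call and pin fields may differ.
SAMPLE_JS = (
    'new Mapper({"pins": ['
    '{"x": 66.57, "y": 47.21, "floor": 1, "label": "Mailbox"}, '
    '{"x": 40.1, "y": 52.0, "floor": 2, "label": "Mailbox"}]});'
)


def extract_mapper_json(js_text):
    """Return the JSON object passed to the first Mapper(...) call."""
    match = re.search(r"Mapper\((\{.*\})\)", js_text, re.DOTALL)
    if match is None:
        raise ValueError("no Mapper call found")
    return json.loads(match.group(1))


def pins_to_lua(pins, table_name="mailboxes"):
    """Render a list of pin dicts as Lua table source text."""
    lines = ["local {} = {{".format(table_name)]
    for pin in pins:
        # Assumes labels contain no quotes or backslashes; escape them otherwise.
        lines.append(
            '\t{{ x = {x}, y = {y}, floor = {floor}, label = "{label}" }},'.format(**pin)
        )
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    data = extract_mapper_json(SAMPLE_JS)
    print(pins_to_lua(data["pins"]))
```

Loading the data as JSON first, rather than editing the raw text, keeps each pin available as a dictionary, so corrections can be applied before the Lua is written out.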
Quote:
|
That's completely useless for the OP's purpose, though, as those (1) only appear when you are in minimap range of the mailbox, and (2) are not accessible by addons in any way.
|
Worst case I can perform data entry by mousing over icons per zone and floor. It will take some time, but it will work. I was just hoping to learn if someone was willing and able to teach.
|
Quote:
|
I've done a lot of similar parsing for _NPCScan.Overlay, so I rearranged my Python 3 scripts to pull your mailbox locations into a Lua source file. I left out my MPQ parsing code since it requires you to build a DLL, so this version reads DBC files directly. The attached zip contains:
Note that you'll need Python 3.2+ to run the scripts, and you must install BeautifulSoup4 to interpret WoWDB's HTML. Here's how it works in summary:
Feel free to ask any questions about the script if it interests you. If not though, I think the included Lua source should be good enough to use. |
Thank you. I'm going to take a poke at this. Mailboxes are a first step; I want to eventually parse out NPCs who repair, train classes, train skills, and vend certain things. But I need to learn how to parse the data first.
|
Wait... "Attached zip".... did I miss something, because I don't see one on your post.
|
I think he just forgot it. ;)
|
Oops, I had added it to the attachment manager window, but forgot to hit the "upload" button. I've added it to my original post in an edit.
|
That looks.... AWESOME!! :eek::banana::cool::D You are correct that WoWDB does not break mailboxes into factions, which means I would have to copy and edit, saving one as the "original backup for updating". No big deal there.
I have Python 3.3.x 64-bit installed, but even after reading the page's instructions, I could not figure out how to install BeautifulSoup. Also, when I want to parse an update, do I run the batch file, or mailboxes.py? I am guessing the batch file, as its code looks like it creates the latter file.
As for floors, I noticed that Dalaran lists 1 and 2, Orgrimmar is 0 and 1, and the Shrine of Two Moons is 1 and 2 but Shrine of Seven Stars is 3 and 4. Is that a parse issue, or does the game return those values for those zones' floors? Just curious whether I need to edit those, or whether they are correct, yet odd.
Two last questions for now: how do/should I look at WorldMapArea.dbc, and if the game client does not save mailbox locations in its cache, what is this file used for? |
Quote:
|
That would make sense if the Shrines did not have different mapIDs, but they do. I guess if I want accurate data, I would skip mapID[811] Vale of Eternal Blossoms and stick to mapID[903] Shrine of Two Moons and mapID[905] Shrine of Seven Stars. When plugging data into Astrolabe and HandyNotes, [903][1] and [903][2] are correct for Moon's floors, while [905][3] and [905][4] are correct for Stars'? Or does Astrolabe use floors [1][2] for Stars'? I will have to test I suppose.
The reason I'm asking is that right now, the HandyNotes plugins for innkeepers, vendors, trainers, bankers, etc. use [811] as their mapID, which is not correct for either city, and it messes up the zone map and each of the city map floors. The icons are all in weird places, and I want to avoid that if possible.
Hey, it occurs to me to wonder, why are there eight digits in the coordinates rather than six? More accurate, yes, but if I wanted to use user data for missing mailboxes, all the coordinate addons I've seen read as 66.5, 47.2 and not 66.57, 47.21. To give me further questions, the example for GetPlayerMapPosition() uses even longer numbers, and between 0 and 1 at that. |
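The [mapID][floor] indexing discussed above (for example [903][1] or [905][3]) can be sketched in Python as a grouping step before the Lua is written. The records and function names here are made up for illustration, and whether Astrolabe or HandyNotes expects exactly this nesting is an assumption that would need testing in game.

```python
from collections import defaultdict

# Hypothetical records as (mapID, floor, x, y); not real mailbox positions.
RECORDS = [
    (903, 1, 66.57, 47.21),
    (903, 2, 40.10, 52.00),
    (905, 3, 30.25, 61.80),
]


def group_by_map_and_floor(records):
    """Nest coordinates as grouped[mapID][floor] -> list of (x, y)."""
    grouped = defaultdict(lambda: defaultdict(list))
    for map_id, floor, x, y in records:
        grouped[map_id][floor].append((x, y))
    return grouped


def to_lua(grouped, name="mailboxes"):
    """Render the nested mapping as Lua table source text."""
    lines = ["local {} = {{".format(name)]
    for map_id in sorted(grouped):
        lines.append("\t[{}] = {{".format(map_id))
        for floor in sorted(grouped[map_id]):
            coords = ", ".join(
                "{{ {:.2f}, {:.2f} }}".format(x, y) for x, y in grouped[map_id][floor]
            )
            lines.append("\t\t[{}] = {{ {} }},".format(floor, coords))
        lines.append("\t},")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_lua(group_by_map_and_floor(RECORDS)))
```

Keeping the floor as a key makes it easy to renumber floors for one map (say, 3 and 4 to 1 and 2) in a single place if testing shows the addon expects different values.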
Actually I assumed that the horde city areaID was the same as the map, because that is the case for the alliance city, hehe.
Regarding coordinate precision, most addons show one decimal because it's close enough, but you should use two decimals if you want to be precise. It just takes more space to store that extra digit. ;) |
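To make the precision question concrete, here is a small Python sketch that turns a 0 to 1 map fraction (the range GetPlayerMapPosition() is described as using above) into a percentage with one or two decimals. The `pack_coord` helper is a hypothetical packing scheme, included only to show how two-decimal precision for both axes adds up to eight digits; the thread does not confirm that any particular addon stores coordinates this way.

```python
def to_percent(fraction, decimals=1):
    """Convert a 0..1 map fraction to a percentage with the given decimals."""
    return round(fraction * 100, decimals)


def pack_coord(x_fraction, y_fraction):
    """Hypothetical packing: four digits per axis in one eight-digit integer."""
    x_part = int(round(x_fraction * 10000))
    y_part = int(round(y_fraction * 10000))
    return x_part * 10000 + y_part


if __name__ == "__main__":
    x, y = 0.6657, 0.4721  # made-up position as map fractions
    print(to_percent(x), to_percent(y))        # one decimal each
    print(to_percent(x, 2), to_percent(y, 2))  # two decimals each
    print(pack_coord(x, y))
```

With one decimal per axis, the same scheme would only need three digits per axis, which is where a six-digit form would come from.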