WoWInterface


myrroddin 10-25-12 09:41 AM

website scraping help please?
 
I was talking to the staff at wowdb.com, and they gave permission for me to scrape their mailbox data. wowhead.com does not track mailboxes, if people wondered.

My problem is that I have no idea how to scrape the data because I have never done any scraping. I looked at DataTools on Curse, and I got headaches.

If someone wants to teach me, I would like to learn how DT makes an addon and populates it with data from wowhead, wowdb, or the game's files. On the other hand, if someone wants to present me with "here you go", I'll take that happily as well. The format I want the data in is something like the following example, as I will need to make corrections such as moving mailboxes that are in the wrong location and putting icons on the correct floors.

The LocationMapper line for the mailbox link above starts on line 7067, and I understand it doesn't follow the table format I need. If I could even break it down into something readable and then copy/paste, that's fine too.

Lua Code:
  -- HandyNotes_PostService.lua
  local HPS = LibStub("AceAddon-3.0"):NewAddon("HandyNotes_PostService")

  do
      function HPS:ParseData()
          local mailboxes = HPS:Data()
          -- blah blah
      end
  end

  -- AceAddon calls OnInitialize (note the spelling) once the addon has loaded
  function HPS:OnInitialize()
      -- blah
  end

  -- PostService_Mailboxes.lua
  local HPS = LibStub("AceAddon-3.0"):GetAddon("HandyNotes_PostService")

  function HPS:Data()
      local mailboxes = {
          ["StormwindCity"] = { -- Astrolabe's mapFile name, but it could be Blizzard's mapID and I can convert
              [0] = "1|62207440|" -- [mapFloor] = "factionNum|coord|factionNum|coord|" where Alliance = 1, Horde = 2, Other = 3
          }
      }
      return mailboxes
  end

myrroddin 10-25-12 09:42 AM

If the game client itself keeps track of the locations, that would be awesome, but I don't think it does.

Barjack 10-25-12 10:57 AM

I wouldn't really call this a scraping problem, but I guess you could call this sort of data conversion part of a scraping problem. It seems like all the data you need exists in the table passed to that "Mapper" call; all you need to do is convert it to a format you can use in Lua.

If you want to do this in an automated way, you'll probably want to use some sort of scripting language that can load JSON, load that table in there, parse it that way, and output some Lua table you can include in your addon. Perhaps a local install of Lua can do things like this, but I really don't know. I imagine languages like Python, Perl or Ruby would be how most people go about things like that.
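The "output some Lua table" step can be sketched in Python as below. This is a minimal illustration, not anyone's actual script from this thread: the helper name `to_lua` is made up, and it assumes the data has already been loaded into nested dicts and lists holding only plain-text strings, numbers, and booleans.

```python
import json


def to_lua(value, indent=0):
    """Render nested dicts/lists of strings, numbers and booleans as Lua table source."""
    pad = "    " * indent
    inner = "    " * (indent + 1)
    if isinstance(value, dict):
        lines = ["{"]
        for key, item in value.items():
            # Every key is written in the explicit ["name"] / [0] form.
            lines.append(f"{inner}[{to_lua(key)}] = {to_lua(item, indent + 1)},")
        lines.append(pad + "}")
        return "\n".join(lines)
    if isinstance(value, (list, tuple)):
        lines = ["{"]
        for item in value:
            lines.append(f"{inner}{to_lua(item, indent + 1)},")
        lines.append(pad + "}")
        return "\n".join(lines)
    if isinstance(value, bool):  # must be checked before int: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        # Assumption: plain text only. JSON escapes for quotes, backslashes and
        # newlines are also valid Lua, but other control characters are not handled.
        return json.dumps(value, ensure_ascii=False)
    raise TypeError(f"cannot convert {type(value).__name__} to Lua")


# Sample data copied from the table format in the first post.
print("local mailboxes = " + to_lua({"StormwindCity": {0: "1|62207440|"}}))
```

Trailing commas after the last entry are legal in Lua tables, which keeps the writer simple.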

I looked at http://static-azeroth.cursecdn.com/1...574/js/core.js to understand what the "pins" array is. Running the Mapper function through a pretty printer shows this:
Code:

        function f(G) {
                var F = G >> 9;
                var H = G & 511;
                return [F / 5, H / 5]
        }

This is what turns a "pin" into a coordinate pair. For example the "pin" 83265 in Ironforge results in x = (83265 >> 9) / 5 = 32.4, and y = (83265 & 511) / 5 = 64.2. That results in a pin at 32.4,64.2 which is correct. You could probably do that part in Lua or in your pre-processing, whichever works best for you.
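If the conversion is done in pre-processing rather than in Lua, the same arithmetic could look like this in Python. It is a direct port of the `f(G)` function above; the name `decode_pin` is made up for illustration.

```python
def decode_pin(pin):
    """Split a packed pin into an (x, y) pair of map percentages, mirroring f(G)."""
    x_raw = pin >> 9   # everything above the low 9 bits
    y_raw = pin & 511  # the low 9 bits
    return (x_raw / 5, y_raw / 5)


# The Ironforge pin from the example above.
print(decode_pin(83265))  # (32.4, 64.2)
```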

Also it may be worth noting that if this data isn't something you'll need to convert often, you could probably do some amount of work just with some regular expression find-and-replace on that huge table, instead of loading it as JSON and running the tree, etc. But this might make converting pins more difficult if that is needed in the pre-Lua stage.

As for stuff like how to convert its "floors" and zone names to something easier for you to use, I'm not sure what your options are. There may be Lua libraries or something to help you there, but there may not be either.

SDPhantom 10-25-12 11:35 AM

Quote:

Originally Posted by myrroddin (Post 267676)
If the game client itself keeps track of the locations, that would be awesome, but I don't think it does.

There is a tracking option for mailboxes in the default UI; it adds icons for them on the minimap when you're close to one.

Phanx 10-25-12 06:23 PM

That's completely useless for the OP's purpose, though, as those (1) only appear when you are in minimap range of the mailbox, and (2) are not accessible by addons in any way.

myrroddin 10-26-12 07:04 AM

Worst case, I can perform data entry by mousing over icons per zone and floor. It will take some time, but it will work. I was just hoping to learn if someone was willing and able to teach.

Phanx 10-26-12 05:33 PM

Quote:

Originally Posted by Barjack (Post 267681)
It seems like all the data you need exists in the table passed to that "Mapper" call, all you need to do is convert it to a format you can use in Lua. ... Perhaps a local install of Lua can do things like this ...

Definitely. If someone can post the table, I can convert it for you or give you a Lua script to convert it. However, I've never done anything remotely related to JSON or website scraping, so I have no idea how to obtain said table.

Saiket 10-27-12 12:52 AM

1 Attachment(s)
I've done a lot of similar parsing for _NPCScan.Overlay, so I rearranged my Python 3 scripts to pull your mailbox locations into a Lua source file. I left out my MPQ parsing code since it requires you to build a DLL, so this version reads DBC files directly. The attached zip contains:
  • WorldMapArea.dbc - Extracted for you. You may need to get the latest version from your data files if new maps with mailboxes get added in a patch.
  • mailboxes.py - The actual scraping script.
  • dbc.py - A simple module for reading DBC files.
  • mailboxes.bat - Windows batch file to run the above script with default parameters.
  • mailboxes.lua - The sample output file my run created.

Note that you'll need Python 3.2+ to run the scripts, and you must install BeautifulSoup4 to interpret WoWDB's HTML.

Here's how it works in summary:
  1. Download the raw HTML for object 142075 (mailbox) from WoWDB.
  2. Interpret the text as HTML using a forgiving XML parser.
  3. Search the resulting document tree for a div with ID "mapper-container" with BeautifulSoup.
  4. The script tag following that div contains JavaScript defining map points. Strip off the "Mapper" constructor call with a regex, and parse the contained argument table as JSON.
  5. The table's contents are pretty straightforward, but maps are represented by their AreaTable IDs (no WoW API exposes these) instead of their MapArea IDs (what you get from GetCurrentMapAreaID). This is where WorldMapArea.dbc comes in to convert to IDs you can use within WoW.
  6. Write it.
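Step 4 of the summary above can be sketched with only the Python standard library. This is not the attached mailboxes.py, and the real page content is not shown in this thread, so the sample script text, its `new Mapper(...)` wrapper, and the keys inside it are all invented for illustration. Steps 2 and 3 (finding the script tag with BeautifulSoup) are left out.

```python
import json
import re

# Hypothetical script text. The area key "999" and the "pins" layout are
# placeholders, not the real WoWDB structure.
SAMPLE_SCRIPT = 'var m = new Mapper({"999": {"0": {"pins": [83265]}}});'

# Capture everything between "new Mapper(" and the final closing parenthesis.
MAPPER_CALL = re.compile(r"new\s+Mapper\((.*)\)\s*;?\s*$", re.DOTALL)


def extract_mapper_argument(script_text):
    """Strip the constructor call and parse its argument as JSON."""
    match = MAPPER_CALL.search(script_text.strip())
    if match is None:
        raise ValueError("no Mapper(...) call found in script text")
    return json.loads(match.group(1))


data = extract_mapper_argument(SAMPLE_SCRIPT)
print(data["999"]["0"]["pins"])
```

This only works if the argument is strict JSON (double-quoted keys, no trailing commas); a looser JavaScript object literal would need a different parser.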

Feel free to ask any questions about the script if it interests you. If not though, I think the included Lua source should be good enough to use.

myrroddin 10-27-12 09:58 AM

Thank you. I'm going to take a poke at this. Mailboxes are a first step; I want to eventually parse out NPCs who repair, train classes, train skills, and vend certain things. But I need to learn how to parse the data first.

myrroddin 10-27-12 10:17 AM

Wait... "Attached zip".... did I miss something, because I don't see one on your post.

Seerah 10-27-12 11:37 AM

I think he just forgot it. ;)

Saiket 10-27-12 02:53 PM

Oops, I had added it to the attachment manager window, but forgot to hit the "upload" button. I've added it to my original post in an edit.

myrroddin 10-27-12 09:07 PM

That looks.... AWESOME!! :eek::banana::cool::D You are correct that WoWDB does not break mailboxes into factions, which means I would have to copy and edit, saving one as the "original backup for updating". No big deal there.

I have Python 3.3.x 64 bit installed, but even after reading the page's instructions, I could not figure out how to install BeautifulSoup. Also, when I want to parse an update, do I run the batch file, or mailboxes.py? I am guessing the batch file, as its code looks like it creates the latter file.

As for floors, I noticed that Dalaran lists 1 and 2, Orgrimmar is 0 and 1, and the Shrine of Two Moons is 1 and 2 but Shrine of Seven Stars is 3 and 4. Is that a parse issue, or does the game return those values for those zones' floors? Just curious if I need to edit those, or if they are correct, yet odd.

Two last questions for now: how do/should I look at WorldMapArea.dbc, and if the game client does not save mailbox locations in its cache, what is this file used for?
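One way to look inside a .dbc file is to read its header and dump the raw records. The sketch below is not the dbc.py module from the attachment (which is not shown in this thread). It relies on the classic WDBC layout: a 4-byte "WDBC" magic followed by four little-endian 32-bit integers (record count, field count, record size, string block size). It also assumes every field is 4 bytes wide and makes no claim about what each WorldMapArea.dbc column means. String and float fields will show up as raw integers.

```python
import struct

# magic, record count, field count, record size, string block size
DBC_HEADER = struct.Struct("<4s4I")


def iter_records(blob):
    """Yield each record as a tuple of unsigned 32-bit integers."""
    magic, records, fields, record_size, _string_size = DBC_HEADER.unpack_from(blob, 0)
    if magic != b"WDBC":
        raise ValueError("not a WDBC file")
    if record_size != 4 * fields:
        raise ValueError("this sketch only handles 4-byte fields")
    row = struct.Struct("<%dI" % fields)
    for index in range(records):
        yield row.unpack_from(blob, DBC_HEADER.size + index * record_size)


# Synthetic two-record file built in memory, purely for demonstration.
blob = DBC_HEADER.pack(b"WDBC", 2, 3, 12, 1) + struct.pack("<6I", 1, 2, 3, 4, 5, 6) + b"\x00"
for record in iter_records(blob):
    print(record)
```

For a real file, replace the synthetic `blob` with the bytes read from disk, for example `open("WorldMapArea.dbc", "rb").read()`.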

Vlad 10-27-12 11:14 PM

Quote:

Originally Posted by myrroddin (Post 267863)
As for floors, I noticed that Dalaran lists 1 and 2, Orgrimmar is 0 and 1, and the Shrine of Two Moons is 1 and 2 but Shrine of Seven Stars is 3 and 4. Is that a parse issue, or does the game return those values for those zones' floors? Just curious if I need to edit those, or if they are correct, yet odd.

That's because the zone has 4 floors, 2 for Horde and 2 for Alliance; floor 0 is the regular map. They just used the same areaID for the zone and the floors of the capitals rather than adding more areaIDs just for the city floors. ;)

myrroddin 10-28-12 02:23 AM

That would make sense if the Shrines did not have different mapIDs, but they do. I guess if I want accurate data, I would skip mapID[811] Vale of Eternal Blossoms and stick to mapID[903] Shrine of Two Moons and mapID[905] Shrine of Seven Stars. When plugging data into Astrolabe and HandyNotes, are [903][1] and [903][2] correct for Two Moons' floors, and [905][3] and [905][4] correct for Seven Stars'? Or does Astrolabe use floors [1] and [2] for Seven Stars'? I will have to test I suppose.

The reason I'm asking is because right now, the HandyNotes plugins for Innkeepers, vendors, trainers, bankers, etc use [811] as their mapID, which is not correct for either city, and it messes up the zone map and each of the city map floors. The icons are all in weird places, and I want to avoid that if possible.

Hey, it occurs to me to wonder, why are there eight digits in each coordinate rather than six? More accurate, yes, but if I wanted to use user data for missing mailboxes, all the coordinate addons I've seen read as 66.5, 47.2 and not 66.57, 47.21. To raise further questions, the example for GetPlayerMapPosition() uses even longer numbers, and between 0 and 1 at that.

Vlad 10-28-12 07:47 AM

Actually I assumed that the horde city areaID was the same as the map, because that is the case for the alliance city, hehe.

Regarding coordinate precision, most have one decimal because it's close enough, but you should use two decimals if you want to be precise. It just takes more space to store that extra digit. ;)
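The relationship between the 0 to 1 fractions and an eight-digit coordinate can be sketched in Python as follows. The helper names are made up, and the packing is an assumption inferred from the "62207440" sample in the first post, read as x = 62.20 and y = 74.40 with four digits per axis.

```python
def pack_coord(x, y):
    """Pack 0-1 map fractions into one integer, four digits per axis.

    Assumes 0 <= x < 1 and 0 <= y < 1 so each axis fits in four digits.
    """
    return round(x * 10000) * 10000 + round(y * 10000)


def unpack_coord(coord):
    """Split a packed coordinate back into 0-1 map fractions."""
    x_part, y_part = divmod(coord, 10000)
    return (x_part / 10000, y_part / 10000)


print(pack_coord(0.6220, 0.7440))  # 62207440
print(unpack_coord(62207440))      # (0.622, 0.744)
```

A one-decimal reading such as 66.5, 47.2 corresponds to fractions 0.665 and 0.472, so user-reported coordinates still fit this format; the last two digits of each axis are simply less precise.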


All times are GMT -6.
