09-17-15, 09:21 PM   #1
MunkDev
A Scalebane Royal Guard
 
Join Date: Mar 2015
Posts: 431
Pattern matching optimization

My understanding of pattern matching and string manipulation in Lua is limited at best. I'm currently developing a library that will use a tree-structured dictionary to look up possible suggestions as the user types text into an EditBox.

The dictionary will be split into two parts; one static, default dictionary with pre-defined entries and one dynamic dictionary which uses the current text of the EditBox to further add suggestions.
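For readers unfamiliar with the idea, here is a minimal sketch of what a tree-structured dictionary (a prefix tree) could look like in plain Lua. The library's actual layout is not shown in this thread, so every name here (`newNode`, `insert`, `suggest`) is hypothetical.

```lua
-- Hypothetical prefix tree; none of these names come from the library itself.
local function newNode()
    return { children = {}, isWord = false }
end

-- Walk the tree one character at a time, creating nodes as needed.
local function insert(root, word)
    local node = root
    for i = 1, #word do
        local c = word:sub(i, i)
        node.children[c] = node.children[c] or newNode()
        node = node.children[c]
    end
    node.isWord = true
end

-- Collect every stored word at or below `node` into `results`.
local function collect(node, prefix, results)
    if node.isWord then
        results[#results + 1] = prefix
    end
    for c, child in pairs(node.children) do
        collect(child, prefix .. c, results)
    end
end

-- Return a sorted list of stored words that start with `prefix`.
local function suggest(root, prefix)
    local node = root
    for i = 1, #prefix do
        node = node.children[prefix:sub(i, i)]
        if not node then return {} end
    end
    local results = {}
    collect(node, prefix, results)
    table.sort(results)
    return results
end

local root = newNode()
for _, word in ipairs({ "gsub", "gmatch", "strsplit", "strrep" }) do
    insert(root, word)
end
local suggestions = suggest(root, "str")
```

With a structure like this, the static and dynamic parts could either share one root or live in two separate trees that are both queried.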

My question is how best to strip a lengthy piece of text of all characters that are not part of words. My current approach uses a table of "forbidden" characters, each of which is replaced with whitespace wherever it occurs. After that, all runs of multiple spaces are replaced by a single space before the string is split into table entries. This results in a lot of repeats in the returned substrings, because syntax is rarely used just once. Can this be done more efficiently?

This is what the code looks like:
Lua Code:
-- Byte table with forbidden characters
local splitByte = {
     [1]   = true, -- no idea
     [10]  = true, -- newline
     [32]  = true, -- space
     [34]  = true, -- ""
     [35]  = true, -- #
     [37]  = true, -- %
     [39]  = true, -- '
     [40]  = true, -- (
     [41]  = true, -- )
     [42]  = true, -- *
     [43]  = true, -- +
     [44]  = true, -- ,
     [45]  = true, -- -
     [46]  = true, -- .
     [47]  = true, -- /
     [48]  = true, -- 0
     [49]  = true, -- 1
     [50]  = true, -- 2
     [51]  = true, -- 3
     [52]  = true, -- 4
     [53]  = true, -- 5
     [54]  = true, -- 6
     [55]  = true, -- 7
     [56]  = true, -- 8
     [57]  = true, -- 9
     [58]  = true, -- :
     [59]  = true, -- ;
     [60]  = true, -- <
     [62]  = true, -- >
     [61]  = true, -- =
     [91]  = true, -- [
     [93]  = true, -- ]
     [94]  = true, -- ^
     [123] = true, -- {
     [124] = true, -- |
     [125] = true, -- }
     [126] = true, -- ~
}

local n = CodeMonkeyNotepad -- the editbox
local text = n:GetText() -- get the full text string
local space = strchar(32)

-- Replace with space
for k, v in pairsByKeys(splitByte) do
     -- treat numbers differently
     if k < 48 or k > 57 then
          text = text:gsub("%"..strchar(k), space)
     else
          text = text:gsub(strchar(k), space)
     end
end

-- Remove multiple spaces
for i=10, 2, -1 do
     text = text:gsub(strrep(space, i), space)
end

-- Collect words in table
local words = {}
for k, v in pairsByKeys({strsplit(space, text)}) do
     -- ignore single letters
     if v:len() > 1 then
          words[v] = true
     end
end
Here's an example output, using the actual code as text inside the EditBox:

Note: pairsByKeys is just a custom table iterator that sorts by key.
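The author's `pairsByKeys` is not shown in the thread. One common way to write such an iterator, given here only as an assumption about what it might look like, is to copy the keys into an array, sort them, and step through that array:

```lua
-- Assumed implementation; the thread does not include the author's version.
local function pairsByKeys(t, comp)
    local keys = {}
    for k in pairs(t) do
        keys[#keys + 1] = k
    end
    table.sort(keys, comp) -- comp is optional; defaults to ascending order
    local i = 0
    return function()
        i = i + 1
        local k = keys[i]
        if k ~= nil then
            return k, t[k]
        end
    end
end

local order = {}
for k in pairsByKeys({ [46] = true, [10] = true, [32] = true }) do
    order[#order + 1] = k
end
```

Note that every call builds and sorts a fresh key list, which is extra work when the iteration order does not actually matter.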

Last edited by MunkDev : 09-17-15 at 09:27 PM.
09-18-15, 07:13 AM   #2
Lombra
A Molten Giant
 
Join Date: Nov 2006
Posts: 554
Not entirely sure how you're wanting to define "words". If it's just letters, you can use the 'non letter' class %A.

Code:
text:gsub("%A+", " ")
+ matches one or more contiguous occurrences of the preceding character class.

You can do the same for spaces, instead of checking for different lengths in separate steps.
Code:
text:gsub("%s+", " ")
%s matches whitespace.

If you're looking to extract keywords and variables though, you might want to also consider underscores and non leading digits.
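A minimal sketch of that idea: instead of deleting unwanted characters, pull out identifier-like tokens directly with `gmatch`. The function name `extractIdentifiers` and the sample string are made up for illustration.

```lua
-- [%a_] requires a letter or underscore first;
-- [%w_]* then allows letters, digits and underscores to follow.
local function extractIdentifiers(text)
    local found = {}
    for word in text:gmatch("[%a_][%w_]*") do
        found[#found + 1] = word
    end
    return found
end

local ids = extractIdentifiers("ContainerFrame1Item1 = 42 + my_var2")
```

Here the digits inside `ContainerFrame1Item1` and `my_var2` survive, while the standalone `42` is never matched because it does not start with a letter or underscore.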
Grab your sword and fight the Horde!
09-18-15, 07:56 AM   #3
MunkDev
A Scalebane Royal Guard
 
Join Date: Mar 2015
Posts: 431
Originally Posted by Lombra:
Not entirely sure how you're wanting to define "words". If it's just letters, you can use the 'non letter' class %A.

If you're looking to extract keywords and variables though, you might want to also consider underscores and non leading digits.
So let's say I want to omit standalone numbers, but keep stuff like "ContainerFrame1Item1" without it being split into "ContainerFrame" and "Item", how would I go about doing that, considering "%A+" will just get rid of all those numbers?
09-18-15, 08:00 AM   #4
Phanx
Cat.
 
Join Date: Mar 2006
Posts: 5,617
Character classes like %A are not localized, so that approach wouldn't work for non-English users.

For removing multiple spaces, I'd strongly recommend Lombra's solution, though I'd amend it to only bother performing a replacement if there's more than one space:
Code:
text = gsub(text, "%s%s+", " ")
For removing "forbidden" characters, rather than using a table and a bunch of gsub and strchar operations, just use a character class and a single operation:
Code:
text = gsub(text, "[\1\10\32\34#%'\(\)\*\+,\-\./%d:;<>=\[\]^{|}~]", "")
If you still need the table for other purposes, you can build the character class at load-time by iterating over the table, but don't forget to escape "magic characters".
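A rough sketch of building the class from the table at load time follows. It uses a shortened `splitByte` for brevity, and `buildClass` is a hypothetical helper. It relies on the fact that in Lua patterns `%` followed by any non-alphanumeric character matches that character literally, so escaping every non-alphanumeric byte is safe.

```lua
-- Shortened table for the example; the real one has more entries.
local splitByte = {
    [10] = true, -- newline
    [32] = true, -- space
    [37] = true, -- %
    [45] = true, -- -
    [93] = true, -- ]
    [94] = true, -- ^
}

-- Hypothetical helper: turn a byte table into a "[...]" character class.
local function buildClass(bytes)
    local keys = {}
    for byte in pairs(bytes) do
        keys[#keys + 1] = byte
    end
    table.sort(keys) -- stable output regardless of pairs() order

    local parts = {}
    for _, byte in ipairs(keys) do
        local c = string.char(byte)
        -- Escape everything that is not a letter or digit.
        if c:find("%W") then
            c = "%" .. c
        end
        parts[#parts + 1] = c
    end
    return "[" .. table.concat(parts) .. "]"
end

local class = buildClass(splitByte)
local cleaned = ("a-b%c]d^e"):gsub(class, " ")
```

The class only needs to be built once, after which each cleanup is a single `gsub` call.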

Edit: To remove only standalone numbers, use a frontier pattern:
Code:
text = gsub(text, "%f[%w]%d+%f[%W]", "")
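To illustrate how a frontier pattern behaves, here is a small sketch that uses `%f[%w]` and `%f[%W]` as the word boundaries. `stripStandaloneNumbers` is a hypothetical name, and the boundaries are an assumption about what "standalone" should mean (a run of digits with no letter or digit on either side).

```lua
local function stripStandaloneNumbers(text)
    -- %f[%w] matches where an alphanumeric run begins,
    -- %f[%W] matches where it ends (including the end of the string).
    -- A digit run is removed only if it fills the whole alphanumeric run.
    return (text:gsub("%f[%w]%d+%f[%W]", ""))
end

local out = stripStandaloneNumbers("ContainerFrame1Item1 = 42 + x2 * 7")
```

The digits in `ContainerFrame1Item1` and `x2` are preceded by a letter, so the opening frontier never matches there and they are left alone. The surrounding spaces are not touched, so collapsing whitespace is still a separate step.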
Retired author of too many addons.
Message me if you're interested in taking over one of my addons.
Don’t message me about addon bugs or programming questions.
09-18-15, 08:14 AM   #5
MunkDev
A Scalebane Royal Guard
 
Join Date: Mar 2015
Posts: 431
Code:
text = gsub(text, "[\1\10\32\34#%'\(\)\*\+,\-\./%d:;<>=\[\]^{|}~]", "")
This doesn't remove any of the characters supplied in the character class.
09-18-15, 08:19 AM   #6
Phanx
Cat.
 
Join Date: Mar 2006
Posts: 5,617
Oh, oops, I've been writing too much JavaScript at work. Try escaping the characters correctly for Lua.

Code:
text = gsub(text, "[\1\10\32\34#%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", "")
09-18-15, 08:25 AM   #7
MunkDev
A Scalebane Royal Guard
 
Join Date: Mar 2015
Posts: 431
Originally Posted by Phanx:
Oh, oops, I've been writing too much JavaScript at work. Try escaping the characters correctly for Lua.

Code:
text = gsub(text, "[\1\10\32\34#%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", "")
Yeah, I just edited the pattern to this:
Code:
text = text:gsub("[\1\10\32\34%\\#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)
... and it seemed to do the trick.
One thing missing from the pattern you supplied is an escaped backslash!

One last thing, omitting single letters. At this point, using these three:
Code:
text = text:gsub("[\1\10\32\34\92#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)
text = text:gsub("%s%s+", space)
text = text:gsub("%f[%w]%d+%f[%W]", "")
... there will be only words left, apart from single letter variables. They are pretty useless in a dictionary.
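One possible way to drop single letters without another cleanup pass is to switch techniques and collect words directly with `gmatch`, requiring at least two characters per match. This is only a sketch and differs from the three `gsub` calls above: `collectWords` is a hypothetical name, and treating underscores as word characters is an assumption.

```lua
local function collectWords(text)
    local words = {}
    -- %f[%w_]    : start only at the beginning of a word-like run
    -- [%a_]      : first character must be a letter or underscore
    -- [%w_]+     : at least one more character, so single letters never match
    for word in text:gmatch("%f[%w_][%a_][%w_]+") do
        words[word] = true -- table keys deduplicate repeated words
    end
    return words
end

local words = collectWords("local n = CodeMonkeyNotepad; local i, ContainerFrame1Item1 = 1, 22")
```

Because the result is keyed by word, repeated tokens such as `local` are stored only once, and no intermediate string with replaced characters has to be built.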

Last edited by MunkDev : 09-18-15 at 08:51 AM.