Thread Tools Display Modes
09-17-15, 09:21 PM   #1
MunkDev
A Scalebane Royal Guard
 
MunkDev's Avatar
AddOn Author - Click to view addons
Join Date: Mar 2015
Posts: 431
Pattern matching optimization

My understanding of pattern matching and string manipulation in lua is limited at best. I'm currently developing a library that will use a tree structured dictionary to look up possible suggestions as the user inputs text into an EditBox.

The dictionary will be split into two parts; one static, default dictionary with pre-defined entries and one dynamic dictionary which uses the current text of the EditBox to further add suggestions.

My question is regarding how I can strip a lengthy piece of text of all characters that are not words in the best way. My current approach is using a table of "forbidden" characters which is then used to replace each occurrence with whitespace. After that, all multiple occurences of whitespace are replaced by single whitespaces before splitting the string into table entries. This results in a lot of repeats in the returned substrings because syntax is rarely used just once. Can this be done more efficiently?

This is what the code looks like:
Lua Code:
  1. -- Byte table with forbidden characters
  2. local splitByte = {
  3.      [1]   = true, -- no idea
  4.      [10]  = true, -- newline
  5.      [32]  = true, -- space
  6.      [34]  = true, -- ""
  7.      [35]  = true, -- #
  8.      [37]  = true, -- %
  9.      [39]  = true, -- '
  10.      [40]  = true, -- (
  11.      [41]  = true, -- )
  12.      [42]  = true, -- *
  13.      [43]  = true, -- +
  14.      [44]  = true, -- ,
  15.      [45]  = true, -- -
  16.      [46]  = true, -- .
  17.      [47]  = true, -- /
  18.      [48]  = true, -- 0
  19.      [49]  = true, -- 1
  20.      [50]  = true, -- 2
  21.      [51]  = true, -- 3
  22.      [52]  = true, -- 4
  23.      [53]  = true, -- 5
  24.      [54]  = true, -- 6
  25.      [55]  = true, -- 7
  26.      [56]  = true, -- 8
  27.      [57]  = true, -- 9
  28.      [58]  = true, -- :
  29.      [59]  = true, -- ;
  30.      [60]  = true, -- <
  31.      [62]  = true, -- >
  32.      [61]  = true, -- =
  33.      [91]  = true, -- [
  34.      [93]  = true, -- ]
  35.      [94]  = true, -- ^
  36.      [123] = true, -- {
  37.      [124] = true, -- |
  38.      [125] = true, -- }
  39.      [126] = true, -- ~
  40. }
  41.  
  42. local n = CodeMonkeyNotepad -- the editbox
  43. local text = n:GetText() -- get the full text string
  44. local space = strchar(32)
  45.  
  46. -- Replace with space
  47. for k, v in pairsByKeys(splitByte) do
  48.      -- treat numbers differently
  49.      if k < 48 or k > 57 then
  50.           text = text:gsub("%"..strchar(k), space)
  51.      else
  52.           text = text:gsub(strchar(k), space)
  53.      end
  54. end
  55.  
  56. -- Remove multiple spaces
  57. for i=10, 2, -1 do
  58.      text = text:gsub(strrep(space, i), space)
  59. end
  60.  
  61. -- Collect words in table
  62. local words = {}
  63. for k, v in pairsByKeys({strsplit(space, text)}) do
  64.      -- ignore single letters
  65.      if v:len() > 1 then
  66.           words[v] = true
  67.      end
  68. end
Here's an example output, using the actual code as text inside the EditBox:

Note: pairsByKeys is just a custom table iterator that sorts by key.
__________________

Last edited by MunkDev : 09-17-15 at 09:27 PM.
  Reply With Quote
09-18-15, 07:13 AM   #2
Lombra
A Molten Giant
 
Lombra's Avatar
AddOn Author - Click to view addons
Join Date: Nov 2006
Posts: 554
Not entirely sure how you're wanting to define "words". If it's just letters, you can use the 'non letter' class %A.

Code:
text:gsub("%A+", " ")
+ matches occurences of one or more continguous characters.

You can do the same for spaces, instead of checking for different lengths in separate steps.
Code:
text:gsub("%s+", " ")
%s matches whitespace.

If you're looking to extract keywords and variables though, you might want to also consider underscores and non leading digits.
__________________
Grab your sword and fight the Horde!
  Reply With Quote
09-18-15, 07:56 AM   #3
MunkDev
A Scalebane Royal Guard
 
MunkDev's Avatar
AddOn Author - Click to view addons
Join Date: Mar 2015
Posts: 431
Originally Posted by Lombra View Post
Not entirely sure how you're wanting to define "words". If it's just letters, you can use the 'non letter' class %A.

If you're looking to extract keywords and variables though, you might want to also consider underscores and non leading digits.
So let's say I want to omit standalone numbers, but keep stuff like "ContainerFrame1Item1" without it being split into "ContainerFrame" and "Item", how would I go about doing that, considering "%A+" will just get rid of all those numbers?
__________________
  Reply With Quote
09-18-15, 08:00 AM   #4
Phanx
Cat.
 
Phanx's Avatar
AddOn Author - Click to view addons
Join Date: Mar 2006
Posts: 5,617
Character classes like %A are not localized, so that approach wouldn't work for non-English users.

For removing multiple spaces, I'd strongly recommend Lombra's solution, though I'd amend it to only bother performing a replacement if there's more than one space:
Code:
text = gsub(text, "%s%s+", " ")
For removing "forbidden" characters, rather than using a table and a bunch of gsub and strchar operations, just use a character class and a single operation:
Code:
text = gsub(text, "[\1\10\32\34#%'\(\)\*\+,\-\./%d:;<>=\[\]^{|}~]", "")
If you still need the table for other purposes, you can build the character class at load-time by iterating over the table, but don't forget to escape "magic characters".

Edit: To remove only standalone numbers, use a frontier pattern:
Code:
text = gsub(text, "%f[%a%d]%d+%f[%A%D]", "")
__________________
Retired author of too many addons.
Message me if you're interested in taking over one of my addons.
Don’t message me about addon bugs or programming questions.
  Reply With Quote
09-18-15, 08:14 AM   #5
MunkDev
A Scalebane Royal Guard
 
MunkDev's Avatar
AddOn Author - Click to view addons
Join Date: Mar 2015
Posts: 431
Code:
text = gsub(text, "[\1\10\32\34#%'\(\)\*\+,\-\./%d:;<>=\[\]^{|}~]", "")
This doesn't remove any of the characters supplied in the character class.
__________________
  Reply With Quote
09-18-15, 08:19 AM   #6
Phanx
Cat.
 
Phanx's Avatar
AddOn Author - Click to view addons
Join Date: Mar 2006
Posts: 5,617
Oh, oops, I've been writing too much JavaScript at work. Try escaping the characters correctly for Lua.

Code:
text = gsub(text, "[\1\10\32\34#%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", "")
__________________
Retired author of too many addons.
Message me if you're interested in taking over one of my addons.
Don’t message me about addon bugs or programming questions.
  Reply With Quote
09-18-15, 08:25 AM   #7
MunkDev
A Scalebane Royal Guard
 
MunkDev's Avatar
AddOn Author - Click to view addons
Join Date: Mar 2015
Posts: 431
Originally Posted by Phanx View Post
Oh, oops, I've been writing too much JavaScript at work. Try escaping the characters correctly for Lua.

Code:
text = gsub(text, "[\1\10\32\34#%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", "")
Yeah, I just edited the pattern to this:
Code:
text = text:gsub("[\1\10\32\34%\\#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)
... and it seemed to do the trick.
One thing missing in the pattern you supplied, is escaping backslash!

One last thing, omitting single letters. At this point, using these three:
Code:
text = text:gsub("[\1\10\32\34\92#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)
text = text:gsub("%s%s+", space)
text = text:gsub("%f[%a%d]%d+%f[%A%D]", "")
... there will be only words left, apart from single letter variables. They are pretty useless in a dictionary.
__________________

Last edited by MunkDev : 09-18-15 at 08:51 AM.
  Reply With Quote
09-18-15, 10:17 AM   #8
MunkDev
A Scalebane Royal Guard
 
MunkDev's Avatar
AddOn Author - Click to view addons
Join Date: Mar 2015
Posts: 431
Code:
text = text:gsub("[\1\10\32\34\92#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)
text = text:gsub("%s%a%s", space)
text = text:gsub("%s%a%s", space)
text = text:gsub("%s%s+", space)
This works, but repeats the operation twice to get rid of the cases where two single letters are separated by only one whitespace, such as..
Lua Code:
  1. -- Original code
  2. function someFunction(t,f) print(t,f) end
  3.  
  4. -- Step 1
  5. function someFunction t f print t f end
  6.  
  7. -- Step 2
  8. function someFunction f print f end
  9.  
  10. -- Step 3
  11. function someFunction  print  end
  12.  
  13. -- End result
  14. function someFunction print end
__________________

Last edited by MunkDev : 09-18-15 at 10:26 AM.
  Reply With Quote
09-18-15, 01:20 PM   #9
SDPhantom
A Pyroguard Emberseer
 
SDPhantom's Avatar
AddOn Author - Click to view addons
Join Date: Jul 2006
Posts: 2,326
Originally Posted by MunkDev View Post
Code:
text = text:gsub("[\1\10\32\34\92#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)
You might want to remove \32 because it ends up replacing spaces with itself.
The way Lua handles strings, there's no need to store it in a constant. It's actually using up more resources to do so.



Originally Posted by MunkDev View Post
Code:
local space = strchar(32)
You should never use this function with a static number.
Just use a literal string in the gsub() call as noted previously.
__________________
WoWInterface AddOns
"All I want is a pretty girl, a decent meal, and the right to shoot lightning at fools."
-Anders (Dragon Age: Origins - Awakening)

Last edited by SDPhantom : 09-18-15 at 01:36 PM.
  Reply With Quote
09-18-15, 02:04 PM   #10
semlar
A Pyroguard Emberseer
 
semlar's Avatar
AddOn Author - Click to view addons
Join Date: Sep 2007
Posts: 1,060
It's possible I'm misunderstanding what the goal is here, but why aren't you just matching the word pattern you're searching for?

eg.
Lua Code:
  1. for word in string.gmatch('function someFunction(t,f) print(t,f) end', '[_%a][_%w]+') do print(word) end

Code:
function
someFunction
print
end

Last edited by semlar : 09-18-15 at 02:42 PM.
  Reply With Quote
09-19-15, 02:24 AM   #11
MunkDev
A Scalebane Royal Guard
 
MunkDev's Avatar
AddOn Author - Click to view addons
Join Date: Mar 2015
Posts: 431
Originally Posted by semlar View Post
It's possible I'm misunderstanding what the goal is here, but why aren't you just matching the word pattern you're searching for?

eg.
Lua Code:
  1. for word in string.gmatch('function someFunction(t,f) print(t,f) end', '[_%a][_%w]+') do print(word) end
This is a very simple and effective approach. Could you please break down how the pattern works?
__________________
  Reply With Quote
09-19-15, 01:02 PM   #12
semlar
A Pyroguard Emberseer
 
semlar's Avatar
AddOn Author - Click to view addons
Join Date: Sep 2007
Posts: 1,060
Originally Posted by MunkDev View Post
This is a very simple and effective approach. Could you please break down how the pattern works?
%a represents all letters, %w represents all letters and numbers.

[_%a][_%w]+ says to match an underscore or a letter, followed by at least one underscore or letter or number.

Without defining word boundaries this approach isn't perfect, if your source text contains something like 123InvalidVariable-Name, it will match "InvalidVariable".

Lua doesn't support lookahead/behind, but you may be able to surround your pattern with a character class including whitespace, parentheses, commas, etc. which must come before and after your word to be considered valid.

Here's a potential pattern you can try based on phanx's character class earlier..
Lua Code:
  1. for word in string.gmatch([[function
  2. someFunction(t,f) print(to,fr) _ContainerFrame1_1 end]], '%f[^\0%s\'\"\\/#%%()*+,.:;<>=%[%]{|}~]([_%a][_%w]+)%f[\0%s\'\"\\/#%%()*+,.:;<>=%[%]{|}~]') do print(word) end

Last edited by semlar : 09-19-15 at 01:58 PM.
  Reply With Quote
09-20-15, 05:48 PM   #13
SDPhantom
A Pyroguard Emberseer
 
SDPhantom's Avatar
AddOn Author - Click to view addons
Join Date: Jul 2006
Posts: 2,326
How about something like this?

Code:
for word in string.gmatch(" "..code,"[ ;]+([_%a][_%w]*) do
This allows single character names by making the second character optional (* defines 0 or more, tries to match as many as possible). This also injects a delimiter before the string to make the matching work correctly. There are also problems that could happen depending on where spaces are put that is valid Lua code, but would mess up this tokenizer.
__________________
WoWInterface AddOns
"All I want is a pretty girl, a decent meal, and the right to shoot lightning at fools."
-Anders (Dragon Age: Origins - Awakening)

Last edited by SDPhantom : 09-20-15 at 05:51 PM.
  Reply With Quote
09-20-15, 06:18 PM   #14
MunkDev
A Scalebane Royal Guard
 
MunkDev's Avatar
AddOn Author - Click to view addons
Join Date: Mar 2015
Posts: 431
I can see how in the interest of writing code, it would seem appropriate to filter out words attached to malformed snippets, but really, this library should aim to be as generally applicable as possible. I will port it for use with other stuff, like a chat addon or taking notes on screen. Might even be useful for my controller addon, considering I haven't built chat functionality for it yet. Besides, my editor debugs code automatically and will throw an error when the user has invalid variable names.



Let me ask this instead, as a programmer using an IDE, would you rather it store "ContainerFrame" and "Item" in suggestions, or "ContainerFrame4Item16", if that was what you originally entered? I can't make up my mind on that point, considering there are some cases where you want to consistently use the same numbered variable and some cases where you want to change the numbers.
__________________

Last edited by MunkDev : 09-20-15 at 06:25 PM.
  Reply With Quote
09-20-15, 08:19 PM   #15
semlar
A Pyroguard Emberseer
 
semlar's Avatar
AddOn Author - Click to view addons
Join Date: Sep 2007
Posts: 1,060
Originally Posted by MunkDev View Post
Let me ask this instead, as a programmer using an IDE, would you rather it store "ContainerFrame" and "Item" in suggestions, or "ContainerFrame4Item16", if that was what you originally entered?
I think it should only be suggesting full variable names that actually exist within the current document, a partial match would only be confusing.
  Reply With Quote
09-20-15, 10:07 PM   #16
kurapica.igas
A Chromatic Dragonspawn
Join Date: Aug 2011
Posts: 152
Normally, we only need the keywords and identifiers for auto-completation.

I see you are using the kristofer's Indent.lua. I think you can record those words when the lib scan for identifiers(tokentype == tokens.TOKEN_IDENTIFIER, record it) .

Using a dichotomy search can quickly check if the identifier existed or not, or just keep a cache table for checking.

Making an editor is fun, but with wow, it's a pain in the ass.

Click image for larger version

Name:	AutoComplete.jpg
Views:	425
Size:	21.3 KB
ID:	8650
  Reply With Quote
09-23-15, 08:34 AM   #17
MunkDev
A Scalebane Royal Guard
 
MunkDev's Avatar
AddOn Author - Click to view addons
Join Date: Mar 2015
Posts: 431
Originally Posted by kurapica.igas View Post
Normally, we only need the keywords and identifiers for auto-completation.

I see you are using the kristofer's Indent.lua. I think you can record those words when the lib scan for identifiers(tokentype == tokens.TOKEN_IDENTIFIER, record it) .

Using a dichotomy search can quickly check if the identifier existed or not, or just keep a cache table for checking.

Making an editor is fun, but with wow, it's a pain in the ass.

Attachment 8650
Do you have a code snippet for that? I was thinking of thinning out the table by matching each letter, which will of course remove unwanted results with each iteration, but it's by no means a binary algorithm.
__________________
  Reply With Quote
09-23-15, 09:43 AM   #18
kurapica.igas
A Chromatic Dragonspawn
Join Date: Aug 2011
Posts: 152
Well, I don't use the Indent.lua, my code can be found in CodeEditor

I don't use a tree table to store the keywords, only one sorted table is used, like self.AutoCompleteList contains all identifiers(self is the editor).

I give each match words a weight, that based on the input word, to make sure the first word is the most wanted word.

To create a list based on the input word, the core func in the file is
  • Compare -- Used to compare words without case.
  • CompareWeight -- Used to compare two word's weight that based on the input word.
  • GetIndex -- Use dichotomy query to find the index of the word
  • TransMatchWord -- Create a match pattern based on the input word, 'pt' -> '^([pP])([%w_]-)([tT])([%w_]-)$', this would be used on match words, so weight and color can be applied to those words(check the pic).
  • ApplyColor -- Apply color to the match word.
  • ApplyAutoComplete -- The main func.

The filter job is done in those code :

Lua Code:
  1. local uword = "^" .. word:gsub("[%w_]", TransMatchWord) .. "$"  -- the pattern
  2. local header = word:sub(1, 1)       -- the first char of the match word must be the same to the input word
  3.  
  4. local lst = self.AutoCompleteList    -- the sorted identifiers list based on the Compare func
  5. local sIdx = GetIndex(lst, header) -- get start index based on the first char, where we start checking
  6.  
  7. if sIdx == 0 then sIdx = 1 end
  8.  
  9. for i = sIdx, #lst do
  10.     local value = lst[i]
  11.     if #value == 0 or Compare(header, value:sub(1, 1)) then break end  -- no need to continue
  12.  
  13.     _AutoCheckWord = value  -- Keep word to be used in ApplyColor
  14.  
  15.     if _AutoCheckWord:match(uword) then
  16.         _AutoCheckWord:gsub(uword, ApplyColor)
  17.  
  18.         tinsert(_AutoCacheKeys, _AutoCheckWord)
  19.     end
  20. end
  21.  
  22. -- ...
  23. Array.Sort(_AutoCacheKeys, CompareWeight)  -- Sort match words based on their weight

May it can help you.
  Reply With Quote

WoWInterface » Developer Discussions » Lua/XML Help » Pattern matching optimization


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off