Pattern matching optimization

MunkDev · 09-17-15, 09:21 PM

My understanding of pattern matching and string manipulation in Lua is limited at best. I'm currently developing a library that will use a tree-structured dictionary to look up possible suggestions as the user types text into an EditBox.

The dictionary will be split into two parts: a static default dictionary with pre-defined entries, and a dynamic dictionary that uses the current text of the EditBox to add further suggestions.

My question is how best to strip a lengthy piece of text of every character that is not part of a word. My current approach uses a table of "forbidden" characters and replaces each occurrence of them with a space. After that, every run of multiple spaces is replaced by a single space before the string is split into table entries. This results in many repeated entries among the returned substrings, because syntax is rarely used just once. Can this be done more efficiently?

This is what the code looks like:

Lua Code:

-- Byte table with forbidden characters
local splitByte = {
     [1]   = true, -- no idea
     [10]  = true, -- newline
     [32]  = true, -- space
     [34]  = true, -- "
     [35]  = true, -- #
     [37]  = true, -- %
     [39]  = true, -- '
     [40]  = true, -- (
     [41]  = true, -- )
     [42]  = true, -- *
     [43]  = true, -- +
     [44]  = true, -- ,
     [45]  = true, -- -
     [46]  = true, -- .
     [47]  = true, -- /
     [48]  = true, -- 0
     [49]  = true, -- 1
     [50]  = true, -- 2
     [51]  = true, -- 3
     [52]  = true, -- 4
     [53]  = true, -- 5
     [54]  = true, -- 6
     [55]  = true, -- 7
     [56]  = true, -- 8
     [57]  = true, -- 9
     [58]  = true, -- :
     [59]  = true, -- ;
     [60]  = true, -- <
     [62]  = true, -- >
     [61]  = true, -- =
     [91]  = true, -- [
     [93]  = true, -- ]
     [94]  = true, -- ^
     [123] = true, -- {
     [124] = true, -- |
     [125] = true, -- }
     [126] = true, -- ~
}
 
local n = CodeMonkeyNotepad -- the editbox 
local text = n:GetText() -- get the full text string
local space = strchar(32)
 
-- Replace with space
for k, v in pairsByKeys(splitByte) do
     -- treat numbers differently
     if k < 48 or k > 57 then
          text = text:gsub("%"..strchar(k), space)
     else
          text = text:gsub(strchar(k), space)
     end
end
 
-- Remove multiple spaces
for i=10, 2, -1 do
     text = text:gsub(strrep(space, i), space)
end
 
-- Collect words in table
local words = {}
for k, v in pairsByKeys({strsplit(space, text)}) do
     -- ignore single letters
     if v:len() > 1 then
          words[v] = true
     end
end

Here's an example output, using the actual code as text inside the EditBox:

Note: pairsByKeys is just a custom table iterator that sorts by key.

Lombra · 09-18-15, 07:13 AM

Not entirely sure how you're wanting to define "words". If it's just letters, you can use the 'non letter' class %A.

Code:

text:gsub("%A+", " ")

+ matches one or more contiguous occurrences of the preceding character class.

You can do the same for spaces, instead of checking for different lengths in separate steps.

Code:

text:gsub("%s+", " ")

%s matches whitespace.
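
A minimal sketch of the two replacements above, runnable in plain Lua. The sample strings are made up for illustration:

```lua
-- Hypothetical sample input, chosen only to show the effect.
local text = "local n = CodeMonkeyNotepad -- the editbox"

-- Replace every run of non-letters with a single space.
-- gsub returns two values; assigning to one local keeps only the string.
local lettersOnly = text:gsub("%A+", " ")

-- Collapse any run of whitespace (spaces, tabs, newlines) into one space.
local collapsed = ("a  \t b\n\nc"):gsub("%s+", " ")

print(lettersOnly)
print(collapsed)
```

Because %A+ already consumes whole runs of non-letters, including spaces, the first call alone leaves at most one space between words.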

If you're looking to extract keywords and variables though, you might want to also consider underscores and non leading digits.

MunkDev · 09-18-15, 07:56 AM

Originally Posted by Lombra

Not entirely sure how you're wanting to define "words". If it's just letters, you can use the 'non letter' class %A.

If you're looking to extract keywords and variables though, you might want to also consider underscores and non leading digits.

So let's say I want to omit standalone numbers, but keep stuff like "ContainerFrame1Item1" without it being split into "ContainerFrame" and "Item", how would I go about doing that, considering "%A+" will just get rid of all those numbers?

Phanx · 09-18-15, 08:00 AM

Character classes like %A are not localized, so that approach wouldn't work for non-English users.

For removing multiple spaces, I'd strongly recommend Lombra's solution, though I'd amend it to only bother performing a replacement if there's more than one space:

Code:

text = gsub(text, "%s%s+", " ")
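
A small sketch of how the two whitespace patterns differ, assuming a made-up sample string. One thing to keep in mind is that "%s%s+" leaves a lone tab or newline untouched, which may or may not matter if the text is later split on spaces only:

```lua
-- Hypothetical sample: a double space, a lone newline, and a triple space.
local text = "one  two\nthree   four"

-- "%s+" matches every whitespace run, including single characters,
-- so the lone newline is turned into a space as well.
local all, allCount = text:gsub("%s+", " ")

-- "%s%s+" only matches runs of two or more, so fewer replacements
-- are performed and the lone newline survives.
local runsOnly, runsCount = text:gsub("%s%s+", " ")

print(all, allCount)
print(runsOnly, runsCount)
```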

For removing "forbidden" characters, rather than using a table and a bunch of gsub and strchar operations, just use a character class and a single operation:

Code:

text = gsub(text, "[\1\10\32\34#%'\(\)\*\+,\-\./%d:;<>=\[\]^{|}~]", "")

If you still need the table for other purposes, you can build the character class at load-time by iterating over the table, but don't forget to escape "magic characters".
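
Building the class from the table could look roughly like this. It is only a sketch: buildClass is a hypothetical helper name, the table is a shortened stand-in for splitByte, and string.char is used instead of WoW's strchar alias so the snippet also runs in plain Lua:

```lua
-- Shortened stand-in for the splitByte table (space, parens, minus, dot).
local forbidden = { [32] = true, [40] = true, [41] = true, [45] = true, [46] = true }

-- Hypothetical helper: turn a byte set into a Lua pattern character class.
local function buildClass(bytes)
    -- Sort the keys so the resulting pattern is deterministic.
    local keys = {}
    for byte in pairs(bytes) do
        keys[#keys + 1] = byte
    end
    table.sort(keys)

    local parts = {}
    for _, byte in ipairs(keys) do
        local char = string.char(byte)
        if char:find("%w") then
            -- Letters and digits must NOT get a "%" prefix ("%d" is a class).
            parts[#parts + 1] = char
        else
            -- "%" followed by any non-alphanumeric character matches it literally.
            parts[#parts + 1] = "%" .. char
        end
    end
    return "[" .. table.concat(parts) .. "]"
end

local class = buildClass(forbidden)
local cleaned = ("a(b)-c.d"):gsub(class, " ")
print(class, cleaned)
```

Escaping every non-alphanumeric byte is more than strictly necessary, but it avoids having to special-case ^, -, ] and %.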

Edit: To remove only standalone numbers, use a frontier pattern:

Code:

text = gsub(text, "%f[%w]%d+%f[%W]", "")
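
A sketch of the frontier approach on a made-up line of code. %f[set] matches at a position where the previous character is not in the set and the current one is, so the closing frontier needs the complement of the opening set, written %W or [^%a%d]. A set such as [%A%D] is the union of "non-letter" and "non-digit" and therefore contains every character, so a frontier on it can never match:

```lua
-- Hypothetical sample: digits inside identifiers plus standalone numbers.
local text = "ContainerFrame1Item1 = 42 + slot2 * 7"

-- Remove digit runs that are bounded by non-alphanumerics on both sides.
local stripped = text:gsub("%f[%w]%d+%f[%W]", "")

-- For comparison: with [%A%D] as the closing set, nothing is ever replaced.
local _, noMatches = text:gsub("%f[%a%d]%d+%f[%A%D]", "")

print(stripped)
print(noMatches)
```

Digits embedded in names such as ContainerFrame1Item1 and slot2 are kept, while 42 and 7 are removed. The surrounding spaces remain, so a whitespace-collapsing pass would still be needed afterwards.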

MunkDev · 09-18-15, 08:14 AM

Code:

text = gsub(text, "[\1\10\32\34#%'\(\)\*\+,\-\./%d:;<>=\[\]^{|}~]", "")

This doesn't remove any of the characters supplied in the character class.

Phanx · 09-18-15, 08:19 AM

Oh, oops, I've been writing too much JavaScript at work. Try escaping the characters correctly for Lua.

Code:

text = gsub(text, "[\1\10\32\34#%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", "")
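
For reference, a short sketch of the escaping rule in play here: backslash escapes such as \10 are resolved by the Lua string literal, while magic characters in a pattern are escaped with %. The sample strings are made up:

```lua
-- An unescaped "." is the "any character" class, so this removes everything.
local wiped = ("a.b"):gsub(".", "")

-- "%." matches a literal dot only.
local dotless = ("a.b"):gsub("%.", "")

-- Inside a set, prefixing punctuation with "%" is always safe.
local cleaned = ("f(x) = a.b + c"):gsub("[%(%)%.%+]", "")

print(wiped, dotless, cleaned)
```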

MunkDev · 09-18-15, 08:25 AM

Originally Posted by Phanx

Oh, oops, I've been writing too much JavaScript at work. Try escaping the characters correctly for Lua.

Code:

text = gsub(text, "[\1\10\32\34#%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", "")

Yeah, I just edited the pattern to this:

Code:

text = text:gsub("[\1\10\32\34%\\#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)

... and it seemed to do the trick.

One thing missing from the pattern you supplied is the backslash, which has to be escaped!

One last thing, omitting single letters. At this point, using these three:

Code:

text = text:gsub("[\1\10\32\34\92#%%%'%(%)%*%+,%-%./%d:;<>=%[%]^{|}~]", space)
text = text:gsub("%s%s+", space)
text = text:gsub("%f[%w]%d+%f[%W]", "")

... there will be only words left, apart from single-letter variables. They are pretty useless in a dictionary.
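
One possible way to drop single letters is to skip the replace-and-split steps and collect identifiers directly with gmatch. This is an alternative technique rather than the approach used above, and collectWords is a hypothetical helper name:

```lua
-- Hypothetical helper: collect identifier-like words into a set table.
local function collectWords(text)
    local words = {}
    -- %f[%w_] anchors at the start of a token; the token must then begin
    -- with a letter or underscore, followed by letters, digits, underscores.
    for word in text:gmatch("%f[%w_][%a_][%w_]*") do
        -- Ignore single-letter words.
        if #word > 1 then
            words[word] = true
        end
    end
    return words
end

-- Made-up sample input.
local words = collectWords("local x = ContainerFrame1Item1 + 42 - my_var * n")
for word in pairs(words) do
    print(word)
end
```

Using the words as table keys removes duplicates automatically. Standalone numbers never match because a token has to start with a letter or underscore, while digits inside a name such as ContainerFrame1Item1 are kept.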