Thread Tools Display Modes
11-06-15, 11:34 AM   #1
MunkDev
A Scalebane Royal Guard
 
MunkDev's Avatar
AddOn Author - Click to view addons
Join Date: Mar 2015
Posts: 431
Efficient string and table manipulation for an in-game dictionary

Hello, friends.

I've written code to auto-complete words in editboxes, using the global environment to scan for and break apart strings to generate a weighted dictionary. On an enUS client, the initial run generates approximately 5000 words, which is a pretty good starting point.

Now, when iterating the table of words to find potential matches, I'm using these conditions:
  • Literal matches in words
  • The union of characters in the current input as a length barrier
  • The existence of the union of characters of the current input
  • The dictionary word is not exactly equal to the input word
  • The priority of suggestions is based in similarity, length difference and frequency

The downside to my current approach is:
  • If the input word is misspelled, suggestion for anticipated word will be omitted
  • If the input characters are in the wrong order, the first suggestion might be unexpected

Example:
  • Misspell: "abhieve" -> "achieve", "achievement", "achievements" (these are omitted)
  • Skewed: "abhieve" -> "behaving", "behavior", "behaviors" (these are suggested)

Also, to reduce "unnecessary" calculations, I've put a 20 index max on the generated suggestion list, meaning it will stop comparing words of the dictionary to the suggestion list after hitting the 20th index. The word will still be pushed to the list, though. Without optimizing to some degree, there will be a small FPS stutter when using extremely common combinations of characters, such as "es", "er", "et", "ae", "an", etc.

I was wondering if anyone has ideas to improve this algorithm and make it smarter.

This is the code used to generate suggestions:
Lua Code:
  1. function Keyboard:GetSuggestions()
  2.     local word = self:GetCurrentWord() -- get the currently chosen word in the editbox
  3.     wipe(suggestions) -- remove all old suggestions before proceeding
  4.     if word then
  5.         word = strlower(word)
  6.         local length = word:len()
  7.         local dictionary = self.Dictionary
  8.         local chars, numChars = Union(word) -- returns the union of characters in a word
  9.  
  10.         local this, priority, valid
  11.  
  12.         -- Iterate through dictionary and push valid words to suggestion list
  13.         for thisWord, thisWeight in pairs(dictionary) do
  14.             valid = true
  15.  
  16.             -- skip exact matches and matches that are vastly longer than the union of input characters
  17.             if thisWord == word or numChars * 4 < thisWord:len() then
  18.                 valid = false
  19.             end
  20.  
  21.             -- check if word contains all characters to elicit a match
  22.             for c in chars:gmatch(".") do
  23.                 if not strfind(thisWord, c) then
  24.                     valid = false
  25.                     break
  26.                 end
  27.             end
  28.  
  29.             if valid then
  30.                 priority = 1
  31.  
  32.                 this = {
  33.                     word = thisWord,
  34.                     weight = thisWeight,
  35.                     match = strfind( thisWord, word ),
  36.                     length = abs( length - thisWord:len() )
  37.                 }
  38.  
  39.                 -- calculate priority in relation to already suggested words
  40.                 for index, compare in pairs(suggestions) do
  41.  
  42.                     -- don't calculate order beyond the 20th index, just push to list
  43.                     if ( index >= 20 ) then
  44.                         break
  45.                     end
  46.  
  47.                     -- fix: if the best suggestion was pushed first, make sure its priority isn't nudged down
  48.                     priority = #suggestions > 1 and index or 2
  49.  
  50.                     -- words with literal matches have higher priority
  51.                     if  ( this.match and not compare.match ) then
  52.                         break
  53.                     -- if both have literal match and this word is more favorable, or neither have literal match
  54.                     elseif  ( this.match and compare.match and ( this.match <= compare.match ) ) or
  55.                             ( not this.match and not compare.match ) then
  56.  
  57.                         -- if this word is shorter, or equally long but has a higher weight
  58.                         if  ( this.length < compare.length or
  59.                             ( this.length == compare.length and this.weight > compare.weight ) ) then
  60.                             break
  61.                         end
  62.  
  63.                     end
  64.                 end
  65.                 tinsert( suggestions, priority, this )
  66.             end
  67.         end
  68.     end
  69.     self:SetSuggestions(1)
  70. end

This is the code used to generate the dictionary:
Lua Code:
  1. function Keyboard:GenerateDictionary()
  2.    -- generates a localized game-oriented dictionary (5000+ words on enUS clients)
  3.    -- scan the global environment for strings
  4.    local genv = getfenv(0)
  5.    local dictionary = {}
  6.    for index, object in pairs(genv) do
  7.       if type(object) == "string" then
  8.          -- remove escape sequences, uses a table of patterns for gsub
  9.          object = Unescape(object)
  10.          -- scan each string for individual words
  11.          for word in object:gmatch("[%a][%w']*[%w]+") do
  12.             word = strlower(word)
  13.             if not dictionary[word] then -- first occurence
  14.                dictionary[word] = 1
  15.             else -- weight increment
  16.                dictionary[word] = dictionary[word] + 1
  17.             end
  18.          end
  19.       end
  20.    end
  21.    return dictionary
  22. end
Pastebin: dictionary dump
__________________

Last edited by MunkDev : 11-07-15 at 09:39 PM.
  Reply With Quote
11-07-15, 10:56 AM   #2
Barjack
A Black Drake
AddOn Author - Click to view addons
Join Date: Apr 2009
Posts: 89
Interesting stuff. And a novel approach with the dictionary building.

I'm not sure about the best way to find misspelt words. But I did read this article a while ago: http://www.norvig.com/spell-correct.html

It implements a very simple spell checker in Python, and at the bottom are links to other languages as well (since we're talking Lua perhaps a Javascript one would be the closest equivalent). I'm not sure how CPU-heavy it is, though... Perhaps you'd need to only check at certain intervals or reduce the algorithm's complexity if it must run more often. But maybe it's a starting point.

Let us know what you end up doing, it's an interesting problem so I'd like to know what you find out.
  Reply With Quote
11-09-15, 11:42 PM   #3
kurapica.igas
A Chromatic Dragonspawn
Join Date: Aug 2011
Posts: 152
I try it a little, without optimization or deeply consideration :

Lua Code:
  1. dict = {"AbandonQuest","AbandonSkill","AcceptAreaSpiritHeal","AcceptBattlefieldPort","AcceptDuel","AcceptGroup","AcceptGuild","AcceptLevelGrant","AcceptQuest","AcceptResurrect","AcceptSockets","AcceptXPLoss","ActionHasRange","AddAutoQuestPopUp","AddChatWindowChannel","AddChatWindowMessages","AddFriend","AddIgnore","AddMute","AddOrDelIgnore","AddOrDelMute","AddOrRemoveFriend","AddQuestWatch","AddTrackedAchievement","AddTradeMoney" }
  2.  
  3. -- Init the missspell diff check based on keybord
  4. _Keyboard = {
  5.     {"q", "w", "e", "r", "t", "y", "u", "i", "o", "p"},
  6.     {"a", "s", "d", "f", "g", "h", "j", "k", "l"},
  7.     {"", "z", "x", "c", "v", "b", "n", "m" }
  8. }
  9.  
  10. for i, v in ipairs(_Keyboard) do
  11.     for j, c in ipairs(v) do
  12.         _Keyboard[c] = {i, j}
  13.     end
  14. end
  15.  
  16. function misspellDiff(a, b)
  17.     local ka = _Keyboard[a]
  18.     local kb = _Keyboard[b]
  19.  
  20.     return math.abs(ka[1] - kb[1]) + math.abs(ka[2] - kb[2])
  21. end
  22.  
  23. function calcDiff(tar, match)
  24.     if #tar > #match then return 9999 end
  25.  
  26.     local diff = 0
  27.     local chr
  28.     local pos
  29.     local tarChecked = {}
  30.     local matchUsed = {}
  31.     local minDiff = 9999
  32.     local minPos
  33.     local chkDiff
  34.  
  35.     -- Ignore the case
  36.     tar = tar:lower()
  37.     match = match:lower()
  38.  
  39.     -- Match what can be matched
  40.     for i = 1, #tar do
  41.         chr = tar:sub(i, i)
  42.  
  43.         pos = match:find(chr)
  44.  
  45.         while true do
  46.             if pos then
  47.                 if matchUsed[pos] then
  48.                     pos = match:find(chr, pos+1)
  49.                 else
  50.                     tarChecked[i] = pos
  51.                     matchUsed[pos] = i
  52.                     break
  53.                 end
  54.             else
  55.                 break
  56.             end
  57.         end
  58.     end
  59.  
  60.     -- Check the missspell
  61.     for i = 1, #tar do
  62.         if not tarChecked[i] then
  63.             chr = tar:sub(i, i)
  64.  
  65.             for j = 1, #match do
  66.                 if not matchUsed[j] then
  67.                     chkDiff = misspellDiff(chr, match:sub(j, j))
  68.  
  69.                     if chkDiff < minDiff then
  70.                         minDiff = chkDiff
  71.                         minPos = j
  72.                     end
  73.                 end
  74.             end
  75.  
  76.             -- Ten times so the missspell take high diff value
  77.             diff = diff + minDiff * 10
  78.             tarChecked[i] = minPos
  79.             matchUsed[minPos] = i
  80.         end
  81.     end
  82.  
  83.     -- Calc the ignored
  84.     for i = 2, #tarChecked do
  85.         diff = diff + math.abs(tarChecked[i] - tarChecked[i - 1]) - 1
  86.     end
  87.  
  88.     -- Calc the order diff
  89.     for i = 1, #tarChecked do
  90.         for j = i + 1, #tarChecked do
  91.             if tarChecked[j] < tarChecked[i] then
  92.                 diff = diff + 1
  93.             end
  94.         end
  95.     end
  96.  
  97.     return diff
  98. end
  99.  
  100. -- Main
  101. tar = [[AsTrdecht]]
  102.  
  103. local result = {}
  104. local minDiff = 9999
  105. local minMatch
  106.  
  107. for _, match in ipairs(dict) do
  108.     local diff = calcDiff(tar, match)
  109.  
  110.     print(match, diff)
  111.  
  112.     if diff < minDiff then
  113.         minDiff = diff
  114.         minMatch = match
  115.     end
  116. end
  117.  
  118. print("---------Result---------")
  119. print(tar, "->", minMatch)

Result is
---------Result---------
AsTrdecht -> AddTrackedAchievement
First find all matched chars, then check the misspell character, diff them with their distance on the keyboard(the diff has to be multiplied compared to the wrong orders).

Then calc the ignored chars like 'nd' for 'need', 'ee' is ignored. At the last, calc the diff caused by wrong orders.

Then we can get the most expected words with the least diff value.

The code is not well designed, may it can provide a little help for you.
  Reply With Quote

WoWInterface » Developer Discussions » General Authoring Discussion » Efficient string and table manipulation for an in-game dictionary

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off