注册 登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

Koala++'s blog

计算广告学 RTB

 
 
 

日志

 
 

双数组Trie的一种实现[2]  

2009-12-20 11:21:55|  分类: 搜索引擎 |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |

Double-Array Trie

The tripple-array structure for implementing trie appears to be well defined, but is still not practical to keep in a single file. The next/check pool may be able to keep in a single array of integer couples, but the base array does not grow in parallel to the pool, and is therefore usually split.

三数组结构来实现trie看真似为是一个很好的方式,但是将它保存到一个单一文件还是不现实的,next/check pool也许可以保存在整型对的一个单个数组,但是base数组不与pool平行地增长,所以通常分离。

To solve this problem, [Aoe1989] reduced the structure into two parallel arrays. In the double-array structure, the base and next are merged, resulting in only two parallel arrays, namely, base and check.

解决这个问题,[Aoe1989]减少这个结构到二个平行的数组,在这个双数组结构中,将basenext合并,它只使用两个平行的数组为,basecheck

Structure

Instead of indirectly referencing through state numbers as in tripple-array trie, nodes in double-array trie are linked directly within the base/check pool.

不同于三数组trie中通过状态号间接引用,在双数组trie是直接在base / check pool中链接。

Definition 2. For a transition from state s to t which takes character c as the input, the condition maintained in the double-array trie is:

check[base[s] + c] = s
base[s] + c = t

 

定义2. 对于一个接收字符c从状态s移动到t的转移,在双数组中保存的条件是:

check[base[s] + c] = s
base[s] + c = t

           双数组Trie的一种实现[2] - quweiprotoss - Koala++s blog

Walking

According to definition 2, the walking algorithm for a given state s and the input character c is:

根据定义2,对于从状态s接收输入c的移动算法如下:

  t := base[s] + c;

  if check[t] = s then

      next state := t

  else

      fail

  endif

 

Construction

The construction of double-array trie is in principle the same as that of tripple-array trie. The difference is the base relocation:

构造双数组trie在原理上是与三数组trie相同的,不同点在于base的重新定位。

Procedure Relocate(s : state; b : base_index)

{ Move base for state s to a new place beginning at b }

begin

    foreach input character c for the state s

    { i.e. foreach c such that check[base[s] + c]] = s }

    begin

        check[b + c] := s;     { mark owner }

        base[b + c] := base[base[s] + c];     { copy data }

        { the node base[s] + c is to be moved to b + c;

          Hence, for any i for which check[i] = base[s] + c, update check[i] to b + c }

        foreach input character d for the node base[s] + c

        begin

            check[base[base[s] + c] + d] := b + c

        end;

        check[base[s] + c] := none     { free the cell }

    end;

    base[s] := b

end

 

               双数组Trie的一种实现[2] - quweiprotoss - Koala++s blog

Suffix Compression

[Aoe1989] also suggested a storage compression strategy, by splitting non-branching suffixes into single string storages, called tail, so that the rest non-branching steps are reduced into mere string comparison.

[Aoe1989]也建议了一种压缩策略,通过拆分non-branching后缀到单一字符串保存,称为tail,这样剩余的non-branching步骤就减为字符串压缩了。

With the two separate data structures, double-array branches and suffix-spool tail, key insertion and deletion algorithms must be modified accordingly.

使用两种不同的数据结构,双数组分枝和suffix-spool tail,关键词插入和删除算法也必须相应的修改。

Key Insertion

To insert a new key, the branching position can be found by traversing the trie with the key one by one character until it gets stuck. The state where there is no branch to go is the very place to insert a new edge, labeled by the failing character. However, with the branch-tail structure, the insertion point can be either in the branch or in the tail.

插入一个新的关键词,分枝位置可以通过关键词中的字符遍历trie,直到它困住。这种状态就是没有分枝可以走,要插入一个新的边的位置,这个边标记为那个失败的字符。然而,用branch-tail结构,插入点可以在branch或是tail

1. When the branching point is in the double-array structure

Suppose that the new key is a string a1a2...ah-1ahah+1...an, where a1a2...ah-1 traverses the trie from the root to a node sr in the double-array structure, and there is no edge labeled ah that goes out of sr. The algorithm called A_INSERT in [Aoe1989] does as follows:

假设一个新的key是一个字符串a1a2...ah-1ahah+1...an,其中a1a2...ah-1trie的根遍历到结点sr,并且没有边标记为ah可以走出sr。这个算法称为A_INSERT[Aoe1989]中如下操作:

From sr, insert edge labeled ah to new node st;

Let st be a separate node poining to a string ah+1...an in tail pool.

 

                双数组Trie的一种实现[2] - quweiprotoss - Koala++s blog

2. When the branching point is in the tail pool

Since the path through a tail string has no branch, and therefore corresponds to exactly one key, suppose that the key corresponding to the tail is

a1a2...ah-1ah...ah+k-1b1...bm,

where a1a2...ah-1 is in double-array structure, and ah...ah+k-1b1...bm is in tail. Suppose that the substring a1a2...ah-1 traverses the trie from the root to a node sr.

And suppose that the new key is in the form

a1a2...ah-1ah...ah+k-1ah+k...an,

where ah+k <> b1. The algorithm called B_INSERT in
[Aoe1989] does as follows:

当路径经过tail字符串没有branch,所以对应一个关键词,假设在tail这个关键词对应的是:

a1a2...ah-1ah...ah+k-1b1...bm,

其中a1a2...ah-1在双数组结构中,而ah...ah+k-1b1...bmtail。假设子串a1a2...ah-1遍历trie从根到sr

并假设这个新的关键词是如下形式:

a1a2...ah-1ah...ah+k-1ah+k...an,
其中ah+k <> b1。这个算法在[Aoe1989]中被称为B_INSERT如下操作:

From sr, insert straight path with ah...ah+k-1, ending at a new node st;

From st, insert edge labeled b1 to new node su;

Let su be separate node pointing to a string b2...bm in tail pool;

From st, insert edge labeled ah+k to new node sv;

Let sv be separate node pointing to a string ah+k+1...an in tail pool.

 

              双数组Trie的一种实现[2] - quweiprotoss - Koala++s blog

Key Deletion

To delete a key from the trie, all we need to do is delete the tail block occupied by the key, and all double-array nodes belonging exclusively to the key, without touching any node belonging to other keys.

trie中删除一个关键词,我们仅需要的是删除被这个关键词占有的tail块,和所有只属于这个关键词的双数组结点,不改变其它关键词的结点。

Consider a trie which accepts a language K = {pool#, prepare#, preview#, prize#, produce#, producer#, progress#} :

考虑一个trie,它接收语言K = {pool#, prepare#, preview#, prize#, produce#, producer#, progress#} :

      双数组Trie的一种实现[2] - quweiprotoss - Koala++s blog

The key "pool#" can be deleted by removing the tail string "ol#" from the tail pool, and node 3 from the double-array structure. This is the simplest case.

关键词"pool#"可以通过从tail pool移除tail字符串"ol#",即双数组结构中的结点3,这是最简单的情况。

To remove the key "produce#", it is sufficient to delete node 14 from the double-array structure. But the resulting trie will not obay the convention that every node in the double-array structure, except the separate nodes which point to tail blocks, must belong to more than one key. The path from node 10 on will belong solely to the key "producer#".

移除关键词"produce#",从双数组结构中删除结点14就可以了,但是产生的trie不再遵循trie的定义,即除指向tail块的分离的结点外,必须属于多个关键词。从结点10的路径仅属于关键词"producer#"

But there is no harm violating this rule. The only drawback is the uncompactnesss of the trie. Traversal, insertion and deletion algoritms are intact. Therefore, this should be relaxed, for the sake of simplicity and efficiency of the deletion algorithm. Otherwise, there must be extra steps to examine other keys in the same subtree ("producer#" for the deletion of "produce#") if any node needs to be moved from the double-array structure to tail pool.

但是违反这个规则没有什么坏处。唯一的缺点是trie不紧凑。遍历,插入,删除算法都是不改变的。所以,出于删除算法简单和高效,这是可以放宽的。不然,必须有额外的步骤来检查其它在相同子树的关键词,看是否其它结点需要从双数组结构删除到 tail pool

Suppose further that having removed "produce#" as such (by removing only node 14), we also need to remove "producer#" from the trie. What we have to do is remove string "#" from tail, and remove nodes 15, 13, 12, 11, 10 (which now belong solely to the key "producer#") from the double-array structure.

进一步假设删除的除了"produce#"(只删除结点14),我们还需要从trie中删除"producer#"。我们需要做的是从tail中删除"#",并删除15, 13, 12, 11, 10(它们仅属于关键词"producer#")

We can thus summarize the algorithm to delete a key k = a1a2...ah-1ah...an, where a1a2...ah-1 is in double-array structure, and ah...an is in tail pool, as follows :

我们可以总结这个算法,其中删除了一个关键词k = a1a2...ah-1ah...an,其中a1a2...ah-1是在双数组结构中,且ah...antail pool中, 如下:

Let sr := the node reached by a1a2...ah-1;

Delete ah...an from tail;

s := sr;

repeat

    p := parent of s;

    Delete node s from double-array structure;

    s := p

until s = root or outdegree(s) > 0.

 

Where outdegree(s) is the number of children nodes of s.

其中outdegree(s)s的子结点数。
  评论这张
 
阅读(3441)| 评论(0)
推荐 转载

历史上的今天

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2017