注册 登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

Koala++'s blog

计算广告学 RTB

 
 
 

日志

 
 

双数组Trie的一种实现[3]  

2009-12-20 11:33:35|  分类: 搜索引擎 |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |

Double-Array Pool Allocation

When inserting a new branch for a node, it is possible that the array element for the new branch has already been allocated to another node. In that case, relocation is needed. The efficiency-critical part then turns out to be the search for a new place. A brute force algorithm iterates along the check array to find an empty cell to place the first branch, and then assure that there are empty cells for all other branches as well. The time used is therefore proportional to the size of the double-array pool and the size of the alphabet.

当对一个结点插入一个新的分枝,有可能数组中为新的分枝的位置已经分配给了其它结点,在这种情况下,需要重新定位,影响效率的部分就是寻找一个新的位置,一种brute force(蛮力)算法迭代check数组找出一个空的单元来放置第一个分枝,然后假设还有空的单元来放其它的分枝。所以时间复杂性与双数组的大小和字母集的大小成正比。

Suppose that there are n nodes in the trie, and the alphabet is of size m. The size of the double-array structure would be n + cm, where c is a coefficient which is dependent on the characteristic of the trie. And the time complexity of the brute force algorithm would be O(nm + cm2).

假设在trie中有n个结点,字母集的大小为m。双数组结构的大小为n+cm,其中c是一个依赖trie字符形式的一个系数。Brute force算法的时间复杂性是O(nm + cm2)

[Aoe1989] proposed a free-space list in the double-array structure to make the time complexity independent of the size of the trie, but dependent on the number of the free cells only. The check array for the free cells are redefined to keep a pointer to the next free cell (called G-link) :

[Aoe1989]提出了在双数组结构中一个free-space list使时间复杂性与trie的大小无关,但仅与空闲单元的方法。Check数组的空闲单元被重定义为保存一下指向下一个空闲单元的指针(称为G-link)

Definition 3. Let r1, r2, ... , rcm be the free cells in the double-array structure, ordered by position. G-link is defined as follows :

check[0] = -r1
check[ri] = -ri+1 ; 1 <= i <= cm-1
check[rcm] = -1

 

定义3. r1, r2, ... , rcm为双数组结构中的空闲单元,以位置排序。G-link定义如下:

check[0] = -r1
check[ri] = -ri+1 ; 1 <= i <= cm-1
check[rcm] = -1

By this definition, negative check means unoccupied in the same sense as that for "none" check in the ordinary algorithm. This encoding scheme forms a singly-linked list of free cells. When searching for an empty cell, only cm free cells are visited, instead of all n + cm cells as in the brute force algorithm.

根据这个定义,负check意味着空闲与在普通算法中的"none" check是同一个意思,这种编码方式形成了一个空闲单无单一链接的链表。当搜索一个空闲单元,仅cm空闲单元被访问,而不是在brute force算法中的所有n+cm

This, however, can still be improved. Notice that for those cells with negative check, the corresponding base's are not given any definition. Therefore, in our implementation, Aoe's G-link is modified to be doubly-linked list by letting base of every free cell points to a previous free cell. This can speed up the insertion and deletion processes. And, for convenience in referencing the list head and tail, we let the list be circular. The zeroth node is dedicated to be the entry point of the list. And the root node of the trie will begin with cell number one.

这仍然是可以提高的,注意那些是负check的单元,对应的base单元没有任何给定的定义,所以,在我们的实现中,Aoe’s G-link被修改为双链接的列表,使base中的每一个空闲单元指到前一个空闲单元。这可以提高插入和删除的速度。并且为了方便引用列表的头和尾,我们使用了循环链表,第0个结点被指定为列表的入口。并且trie的根结点开始于第1个单元。

Definition 4. Let r1, r2, ... , rcm be the free cells in the double-array structure, ordered by position. G-link is defined as follows :

check[0] = -r1
check[ri] = -ri+1 ; 1 <= i <= cm-1
check[rcm] = 0
base[0] = -rcm
base[r1] = 0
base[ri+1] = -ri ; 1 <= i <= cm-1

 

定义4. r1, r2, ... , rcm为双数组结构中的空闲单元,以位置排序,G-link的定义如下:

 

check[0] = -r1
check[ri] = -ri+1 ; 1 <= i <= cm-1
check[rcm] = 0
base[0] = -rcm
base[r1] = 0
base[ri+1] = -ri ; 1 <= i <= cm-1

 

Then, the searching for the slots for a node with input symbol set P = {c1, c2, ..., cp} needs to iterate only the cells with negative check :

那么,为输入符号集P = {c1, c2, ..., cp}寻找一个结点的单元,只需要在有负check的单元中迭代。

{find least free cell s such that s > c1}

s := -check[0];

while s <> 0 and s <= c1 do

    s := -check[s]

end;

if s = 0 then return FAIL;  {or reserve some additional space}

 

{continue searching for the row, given that s matches c1}

while s <> 0 do

    i := 2;

    while i <= p and check[s + ci - c1] < 0 do

        i := i + 1

    end;

    if i = p + 1 then return s - c1;  {all cells required are free, so return it}

    s := -check[-s]

end;

return FAIL;  {or reserve some additional space}

 

The time complexity for free slot searching is reduced to O(cm2). The relocation stage takes O(m2). The total time complexity is therefore O(cm2 + m2) = O(cm2).

搜索slot的时间复杂度下降到O(cm2)。重定位步骤占O(m2)。所以全部的时间复杂度是O(cm2 + m2) = O(cm2)

It is useful to keep the free list ordered by position, so that the access through the array becomes more sequential. This would be beneficial when the trie is stored in a disk file or virtual memory, because the disk caching or page swapping would be used more efficiently. So, the free cell reusing should maintain this strategy :

free list以位置排序是有好处的,这样访问位置可以是顺序访问,当trie保存在磁盘文件或是虚拟内存中,这会是有好处的,因为磁盘缓存和交换页的使用会更有效。所以,空闲单元重用应保持这个策略。

t := -check[0];

while check[t] <> 0 and t < s do

    t := -check[t]

end;

{t now points to the cell after s' place}

check[s] := -t;

check[-base[t]] := -s;

base[s] := base[t];

base[t] := -s;

 

Time complexity of freeing a cell is thus O(cm).

释放一个单元的时间复杂度是O(cm)

An Implementation

In my implementation, I exploit the concept of virtual memory to manage large and permenent trie. A base class called VirtualMem divides the data file into pages, and swaps them in as needed. Memory economy and disk caching is thus achieved. Different memory accessing methods are provided so that the page swapping mechanism is hidden from the class users. Meanwhile, byte order is internally managed on-the-fly inside the methods to achieve data portability.

Double-array structure and tail pool are then built upon the VirtualMem base class. Trie class is then created to contain the two data structures, with basic interfaces for trie manipulation.

Download

Update: The double-array trie implementation has been simplified and rewritten from scratch in C, and is now named libdatrie. It is now available under the terms of GNU Lesser General Public License (LGPL):

SVN: svn co http://linux.thai.net/svn/software/datrie

The old C++ source code below is under the terms of GNU Lesser General Public License (LGPL):

References

  1. [Knuth1972] Knuth, D. E. The Art of Computer Programming Vol. 3, Sorting and Searching. Addison-Wesley. 1972.
  2. [Fredkin1960] Fredkin, E. Trie Memory. Communication of the ACM. Vol. 3:9 (Sep 1960). pp. 490-499.
  3. [Cohen1990] Cohen, D. Introduction to Theory of Computing. John Wiley & Sons. 1990.
  4. [Johnson1975] Johnson, S. C. YACC-Yet another compiler-compiler. Bell Lab. NJ. Computing Science Technical Report 32. pp.1-34. 1975.
  5. [Aho+1985] Aho, A. V., Sethi, R., Ullman, J. D. Compilers : Principles, Techniques, and Tools. Addison-Wesley. 1985.
  6. [Aoe1989] Aoe, J. An Efficient Digital Search Algorithm by Using a Double-Array Structure. IEEE Transactions on Software Engineering. Vol. 15, 9 (Sep 1989). pp. 1066-1077.
  7. [Virach+1993] Virach Sornlertlamvanich, Apichit Pittayaratsophon, Kriangchai Chansaenwilai. Thai Dictionary Data Base Manipulation using Multi-indexed Double Array Trie. 5th Annual Conference. National Electronics and Computer Technology Center. Bangkok. 1993. pp 197-206. (in Thai)

Theppitak Karoonboonyanan
Created: 1999-06-13
Last Updated 2009-07-12

 

 

koala++ 附注:《双数组Trie树算法优化及其应用研究》这篇论文是同事推荐我看的,并且我看到也有人实现时采用了他的方法。因为一眼注意到文中英文摘要把processed写成了pracessed,因为中文论文一般用word排版,所以感觉这是不可原谅的,英文文法也有些问题。文中还有多处输入的错误,并且题目中的优化也太夸张了,trie本来是支持动态的插入,删除的,用他的方法虽然也可以动态插入,删除,但总感觉不是那么回事。但是文中解释的还是比较清楚,不同于一般的垃圾论文,也可以看看。

  评论这张
 
阅读(1964)| 评论(0)
推荐 转载

历史上的今天

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2017