注册 登录  
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

Koala++'s blog

计算广告学 RTB





2009-12-19 15:39:18|  分类: 搜索引擎 |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |

An Implementation of Double-Array Trie



  1. What is Trie?
  2. What Does It Take to Implement a Trie?
  3. Tripple-Array Trie
  4. Double-Array Trie
  5. Suffix Compression
  6. Key Insertion
  7. Key Deletion
  8. Double-Array Pool Allocation
  9. An Implementation
  10. Download
  11. References

What is Trie?

Trie is a kind of digital search tree. (See [Knuth1972] for the detail of digital search tree.) [Fredkin1960] introduced the trie terminology, which is abbreviated from "Retrieval".


       双数组Trie的一种实现 - quweiprotoss - Koala++s blog

Trie is an efficient indexing method. It is indeed also a kind of deterministic finite automaton (DFA) (See [Cohen1990], for example, for the definition of DFA). Within the tree structure, each node corresponds to a DFA state, each (directed) labeled edge from a parent node to a child node corresponds to a DFA transition. The traversal starts at the root node. Then, from head to tail, one by one character in the key string is taken to determine the next state to go. The edge labeled with the same character is chosen to walk. Notice that each step of such walking consumes one character from the key and descends one step down the tree. If the key is exhausted and a leaf node is reached, then we arrive at the exit for that key. If we get stuck at some node, either because there is no branch labeled with the current character we have or because the key is exhausted at an internal node, then it simply implies that the key is not recognized by the trie.


Notice that the time needed to traverse from the root to the leaf is not dependent on the size of the database, but is proportional to the length of the key. Therefore, it is usually much faster than B-tree or any comparison-based indexing method in general cases. Its time complexity is comparable with hashing techniques.


In addition to the efficiency, trie also provides flexibility in searching for the closest path in case that the key is misspelled. For example, by skipping a certain character in the key while walking, we can fix the insertion kind of typo. By walking toward all the immediate children of one node without consuming a character from the key, we can fix the deletion typo, or even substitution typo if we just drop the key character that has no branch to go and descend to all the immediate children of the current node.


What Does It Take to Implement a Trie?

In general, a DFA is represented with a transition table, in which the rows correspond to the states, and the columns correspond to the transition labels. The data kept in each cell is then the next state to go for a given state when the input is equal to the label.

通常,一个DFA是用一个transition table表示,它的行对应状态s,它的列对应转换标签s。每一个单元中的数据当输入与标记相同时,给定一个状态后要到达的状态。

This is an efficient method for the traversal, because every transition can be calculated by two-dimensional array indexing. However, in term of space usage, this is rather extravagant, because, in the case of trie, most nodes have only a few branches, leaving the majority of the table cells blanks.


Meanwhile, a more compact scheme is to use a linked list to store the transitions out of each state. But this results in slower access, due to the linear search.


Hence, table compression techniques which still allows fast access have been devised to solve the problem.


  1. [Johnson1975] (Also explained in [Aho+1985] pp. 144-146) represented DFA with four arrays, which can be simplified to three in case of trie. The transition table rows are allocated in overlapping manner, allowing the free cells to be used by other rows.
  2. [Aoe1989] proposed an improvement from the three-array structure by reducing the arrays to two.

Tripple-Array Trie

As explained in [Aho+1985] pp. 144-146, a DFA compression could be done using four linear arrays, namely default, base, next, and check. However, in a case simpler than the lexical analyzer, such as the mere trie for information retrieval, the default array could be omitted. Thus, a trie can be implemented using three arrays according to this scheme.

[Aho+1985]中解释到,一个DFA压缩可以用四个数据来完成,分别是default, base, next, check。但是与语法分析不同,比如用于信息检索的triedefault可以被忽略。所以,一个trie可以用三个数组来实现。


The tripple-array structure is composed of:

  1. base. Each element in base corresponds to a node of the trie. For a trie node s, base[s] is the starting index within the next and check pool (to be explained later) for the row of the node s in the transition table.
  2. next. This array, in coordination with check, provides a pool for the allocation of the sparse vectors for the rows in the trie transition table. The vector data, that is, the vector of transitions from every node, would be stored in this array.
  3. check. This array works in parallel to next. It marks the owner of every cell in next. This allows the cells next to one another to be allocated to different trie nodes. That means the sparse vectors of transitions from more than one node are allowed to be overlapped.


1.  base.base中的每个元素对应trie中的一个结点,对于一个trie结点sbase[s]nextcheck pool的起始索引,它们是在转移表中结点s的行。

2.  next. 这个数组,与check配合,提供一个pool来分配trie转移表中稀疏向量。这个向量数据是从每一个结点的转移向量,会被保存在这个数组中。

3.  check. 这个数组与next平行地工作。它标记在next单元的owner。它允许单元s转移到别一个单元可以分配到不同的trie结点s。这意味着转移s 的稀疏向量s可以从不同的结点转移到。

Definition 1. For a transition from state s to t which takes character c as the input, the condition maintained in the tripple-array trie is:

check[base[s] + c] = s
next[base[s] + c] = t


定义 1. 对于从状态sc做为输入,转移到t,保存在三数组中trie条件是:

check[base[s] + c] = s
next[base[s] + c] = t

         双数组Trie的一种实现 - quweiprotoss - Koala++s blog


According to definition 1, the walking algorithm for a given state s and the input character c is:


  t := base[s] + c;

  if check[t] = s then

      next state := next[t]






To insert a transition that takes character c to traverse from a state s to another state t, the cell next[base[s] + c]] must be managed to be available. If it is already vacant, we are lucky. Otherwise, either the entire transition vector for the current owner of the cell or that of the state s itself must be relocated. The estimated cost for each case could determine which one to move. After finding the free slots to place the vector, the transition vector must be recalculated as follows. Assuming the new place begins at b, the procedure for the relocation is:

插入一个从状态s到另一个状态t的转移,单元next[base[s]+c]必须是空闲的。如果它就是空的,那么我们很幸运。不然,或是将当前 owner的整个转移向量重新分配,或是状态s自己要重新分配。这两种情况的估计代价会决定采用哪用方式。在找到空闲的位置来放这个向量,转移向量必须重新如下计算,假设新的位置开始于b,重新分配的过程如下:

Procedure Relocate(s : state; b : base_index)

{ Move base for state s to a new place beginning at b }


    foreach input character c for the state s

    { i.e. foreach c such that check[base[s] + c]] = s }


        check[b + c] := s;     { mark owner }

        next[b + c] := next[base[s] + c];     { copy data }

        check[base[s] + c] := none     { free the cell }


    base[s] := b


      双数组Trie的一种实现 - quweiprotoss - Koala++s blog


阅读(1951)| 评论(1)
推荐 转载



<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->


网易公司版权所有 ©1997-2018