An Implementation of Double-Array Trie
双数组Trie的一种实现
Contents
Trie is a kind of digital search tree. (See [Knuth1972] for the detail of digital search tree.) [Fredkin1960] introduced the trie terminology, which is abbreviated from "Retrieval".
Trie是一种数字搜索树,[Fredkin]引入了trie这个术语,它是Retrieval的缩写。
Trie is an efficient indexing method. It is indeed also a kind of deterministic finite automaton (DFA) (See [Cohen1990], for example, for the definition of DFA). Within the tree structure, each node corresponds to a DFA state, each (directed) labeled edge from a parent node to a child node corresponds to a DFA transition. The traversal starts at the root node. Then, from head to tail, on
Trie是一种高效的索引方法,它实际上是一种确定有限自动机(DFA),在树的结构中,每一个结点对应一个DFA状态,每一个从父结点指向子结点(有向)标记的边对应一个DFA转换。遍历从根结点开始,然后从head到tail,由关键词(本想译成键字符串,感太别扭)的每个字符来决定下一个状态,标记有相同字符的边被选中做移动。注意每次这种移动会从关键词中消耗一个字符并走向树的下一层,如果这个关键字符串空了,并且走到了叶子结点,那么我们达到了这个关键词的出口。如果我们被困在了一点结点,比如因为没有分枝被标记为当前我们有的字符,或是因为关键字符串在中间结点就空了,这表示关键字符串没有被trie认出来。
Notice that the time needed to traverse from the root to the leaf is not dependent on the size of the database, but is proportional to the length of the key. Therefore, it is usually much faster than B-tree or any comparison-based indexing method in general cases. Its time complexity is comparable with hashing techniques.
注意从根遍历到叶子的时间不依赖于数据量的大小,而与关键词的长度成比例。也就是它通常比B-tree要快的多,或是比任何基于比较的索引方法在一般情况下都快的多。它的时间复杂度可以与hashing技术相比。
In addition to the efficiency, trie also provides flexibility in searching for the closest path in case that the key is misspelled. For example, by skipping a certain character in the key while walking, we can fix the insertion kind of typo. By walking toward all the immediate children of on
除了效率,trie可以当关键词被拼错了时,提供灵活的搜索。比如在遍历时忽略一个特定的字符,我们可以修正多输入的这种输入错误。直接遍历所有直接的子结点而不消耗一个字符,这样我可以修正少输的输入错误,或者甚至替换输入的错误,如果我们没有分支可以向下移动,就向当前结点的所有子结点直接向下移动。
What Does It Take to Implement a Trie?
In general, a DFA is represented with a transition table, in which the rows correspond to the states, and the columns correspond to the transition labels. The da
通常,一个DFA是用一个transition table表示,它的行对应状态s,它的列对应转换标签s。每一个单元中的数据当输入与标记相同时,给定一个状态后要到达的状态。
This is an efficient method for the traversal, because every transition can be calculated by two-dimensional array indexing. However, in term of space usage, this is rather extravagant, because, in the case of trie, most nodes have on
这是一个的遍历高效方法,因为每次转换可以通过计算二维的数组索引得到。但是,从空间使用的角度看,这是相当浪费的,因为,在trie这种情况中,大多数结点只有少量的分枝,表中大多数单元是空白的。
Meanwhile, a more compact scheme is to use a linked list to store the transitions out of each state. But this results in slower access, due to the linear search.
同时,一个更紧凑的方式是用链表保存每个状态的转移,但这种方式会比较慢,因为要进行线性搜索
Hence, table compression techniques which still allows fast access have been devised to solve the problem.
所以,建议使用可以快速访问的表压缩技术。
As explained in [Aho+1985] pp. 144-146, a DFA compression could be done using four linear arrays, namely default, base, next, and check. However, in a case simpler than the lexical analyzer, such as the mere trie for information retrieval, the default array could be omitted. Thus, a trie can be implemented using three arrays according to this scheme.
在[Aho+1985]中解释到,一个DFA压缩可以用四个数据来完成,分别是default, base, next, check。但是与语法分析不同,比如用于信息检索的trie,default可以被忽略。所以,一个trie可以用三个数组来实现。
Structure
The tripple-array structure is composed of:
三数组结构包括:
1. base.在base中的每个元素对应trie中的一个结点,对于一个trie结点s,base[s]是next和check pool的起始索引,它们是在转移表中结点s的行。
2. next. 这个数组,与check配合,提供一个pool来分配trie转移表中稀疏向量。这个向量数据是从每一个结点的转移向量,会被保存在这个数组中。
3. check. 这个数组与next平行地工作。它标记在next单元的owner。它允许单元s转移到别一个单元可以分配到不同的trie结点s。这意味着转移s 的稀疏向量s可以从不同的结点转移到。
Definition 1. For a transition from state s to t which takes character c as the input, the condition maintained in the tripple-array trie is:
check[base[s] + c] = s
next[base[s] + c] = t
定义 1. 对于从状态s将c做为输入,转移到t,保存在三数组中trie条件是:
check[base[s] + c] = s
next[base[s] + c] = t
Walking
According to definition 1, the walking algorithm for a given state s and the input character c is:
根据定义1,对于一个给定状态s和一个输入字符c,移动算法如下:
t := base[s] + c;
if check[t] = s then
next state := next[t]
else
fail
endif
Construction
To insert a transition that takes character c to traverse from a state s to another state t, the cell next[base[s] + c]] must be managed to be available. If it is already vacant, we are lucky. Otherwise, either the entire transition vector for the current owner of the cell or that of the state s itself must be relocated. The estimated cost for each case could determine which on
插入一个从状态s到另一个状态t的转移,单元next[base[s]+c]必须是空闲的。如果它就是空的,那么我们很幸运。不然,或是将当前 owner的整个转移向量重新分配,或是状态s自己要重新分配。这两种情况的估计代价会决定采用哪用方式。在找到空闲的位置来放这个向量,转移向量必须重新如下计算,假设新的位置开始于b,重新分配的过程如下:
Procedure Relocate(s : state; b : base_index)
{ Move base for state s to a new place beginning at b }
begin
foreach input character c for the state s
{ i.e. foreach c such that check[base[s] + c]] = s }
begin
check[b + c] := s; { mark owner }
next[b + c] := next[base[s] + c]; { copy da
check[base[s] + c] := none { free the cell }
end;
base[s] := b
end
评论