Enhancement of Ternary Search Tree

As we memtioned in the Application of Ternary Search Tree, a typical node structure of TST is

正如我们在Ternary Search Tree的应用中提到的那样,一个典型的TST结点的结构题如下

typedef struct tsnode {
    char c;
    struct tsnode * le, * eq, * gt;
} tsnode, * tstree;

Its insertion and searching are really fast when the string in given strings set is short (I'm not going to define the exactly length of a string that we could say it's short, but we will see the decrement of running time after we applying the improvement). When the string length increases, the insertion and searching time grow, of course. The question is, we only compare one char at a time.

当给定输入集的字符串都比较短的时候(我并不打算定义低于多少个字节长才叫短,但是在应用了改进之后,我们将会明显看见运行时间的减少)。当字符串长度增加时,理所当然的,插入和搜索的时间也会增加。但问题在于,我们一次只比较了一个字符。

If we can compare mutilple char in one operation, insertion and searching could be faster. But how to achieve this? We know that char is just a basic data type with one byte memory usage. If we change it to unsigned int, there would be 4 bytes! Which means we can compare 4 char in once. If unsigned long long is used, then we can compare 8 char in one operation. That almost 8 times faster than using char!

如果我们可以一次比较多个字符的话,插入和搜索就可以变得更快。但是要怎么做到呢?我们知道char只是一种占用1个字节的基本数据类型,如果我们把它换为unsigned int的话,就有4个字节了!这也就意味着我们可以同时比较4个字符。如果使用unsigned long long的话,我们就可以一次比较8个字符了。这基本是8倍快于使用char的时候。

But what if the length of string is not a mutilple number of 4 or 8 (when we use unsigned int and unsigned long long respectively)? The answer is simple, we just append one more unsigned int/unsigned long long at the end. That's equivalent of '\0' in string in some degree.

但要是字符串的长度不是4或8(当我们分别使用unsigned int和unsigned long long时)的整倍数呢?答案很简单,我们只需要在末尾多加上一个unsigned int或unsigned long long即可。这和字符串以'\0'结尾在某种程度上是等价的。

Let's apply the new thought to TST algorithm.

让我们来尝试一下这个新想法。

template <typename InternalValueType = unsigned long long>
class TernaryTree {
public:
    TernaryTree() {
        bit_shift = log2(sizeof(InternalValueType));
    }

    /**
     *  @brief PUT
     *
     *  @return Ternary Search Tree
     */
    TernaryTree& put(const std::string& key, const std::string& value, const std::string& value, const std::function<void(TernaryTree& self, const std::string& key, const std::string& value, int status)>& callback) {
        // calculate how many elements are needed to fit 
        size_t size_to_fit = ceill((key.length() + sizeof(InternalValueType))>>bit_shift) + 1;

        // allocate memory, clear and copy
        InternalValueType * data = (InternalValueType *)malloc(sizeof(InternalValueType) * size_to_fit);
        memset(data, 0, sizeof(InternalValueType) * size_to_fit);
        memcpy(data, key.c_str(), key.length());

        // insert
        this->tree = tstree_insert(this->tree, key.c_str(), value.c_str());

        // callback
        if (this->tree == NULL) {
            callback(*this, key, value, Failed);
        } else {
            callback(*this, key, value, OK);
        }
        return *this;
    }

    TernaryTree& get(const std::string& key, const std::function<void(TernaryTree& self, const std::string& key, void * userdata, int status)>& callback) {
        // calculate how many elements are needed to fit 
        size_t size_to_fit = ceill((key.length() + sizeof(InternalValueType))>>bit_shift) + 1;

        // allocate memory, clear and copy
        InternalValueType * data = (InternalValueType *)malloc(sizeof(InternalValueType) * size_to_fit);
        memset(data, 0, sizeof(InternalValueType) * size_to_fit);
        memcpy(data, key.c_str(), key.length());

        // search
        tstree result = tstree_search(this->tree, data);

        // callback
        if (result) {
            if (uncompressed.length()) {
                callback(*this, key, result->userdata, OK);
            } else {
                callback(*this, key, NULL, Failed);
            }
        } else {
            callback(*this, key, NULL, NotFound);
        }
        return *this;
    }
private:
    typedef struct tsnode {
        InternalValueType internal_value;
        struct tsnode * le, * eq, * gt;
        void * userdata;
    } tsnode, * tstree;
    unsigned char bit_shift;
    tstree tree = NULL;
    tstree tstree_insert(tstree tree, const char * original, InternalValueType * key, void * userdata) {
        // 判断传入的tree是否为NULL
        if (tree == NULL) {
            // 是的话,则malloc内存
            tree = (tstree)malloc(sizeof(tsnode));
            // 将这个结点储存的字符赋值为key的第一个字符
            tree->internal_value = *key;
            // 其余三个指针都为NULL
            tree->le = tree->eq = tree->gt = tree->userdata = NULL;
        }
        // 接下来比较大小
        // (即使传入的时候为NULL,我们也在上面的if中为它申请了内存,并赋了相应的值,这时自然是"相等"的状态)
        if (*key < tree->internal_value) {
            // 插入的字符小于当前结点储存的值,转向左子树继续插入
            tree->le = tstree_insert(tree->le, original, key, userdata);
        } else if (*key > tree->internal_value) {
            // 插入的字符大于当前结点储存的值,转向左子树继续插入
            tree->gt = tstree_insert(tree->gt, original, key, userdata);
        } else {
            // 插入的字符等于当前结点储存的值,判断情况
            if (*key != 0) {
                // 如果还没到达末尾,说明是有同样的前缀,向中间的子树插入下一个字符的值
                tree->eq = tstree_insert(tree->eq, original, ++key, userdata);
            } else {
                // 如果到达了末尾,那么就把key的地址赋值给当前结点的eq指针
                if (tree->eq == NULL) {
                    tree->eq = (tstree)strdup(original);
                }
                tree->userdata = userdata;
            }
        }
        return tree;
    }

    tstree tstree_search(const tstree tree, InternalValueType * key) {
        tstree root;
        root = tree;
        while (root) {
            if (*key < root->internal_value) {
                // 小于则转左树
                root = root->le;
            } else if (*key > root->internal_value) {
                // 大于则转右树
                root = root->gt;
            } else {
                // 等于的时候判断情况
                if (*key == 0) {
                    // 查询字符串已经递归走到末尾了
                    // 意味着找到了
                    return root;
                }
                key++;
                root = root->eq;
            }
        }
        return NULL;
    }
};

Now, how fast our new TST algorithm is? We did 3 tests (exactly as those tests in the Application of Ternary Search Tree)
那么现在的TST算法有多快呢? 我们做了三个测试(与Ternary Search Tree的应用中同样的测试)

Test 1, 测试一


  • 100万条数据, 1 million
  • 随机字符串 random strings
  • 固定长度 fixed length
  • 长度为4/8/16/24/32/48/64/128/192/256字节, 4/8/16/24/32/48/64/128/192/256 bytes

PUT:
1M-PUT-v2

GET:
1M-GET-v2

Memory usage:
1M-usage-v2

Test 2, 测试二


  • 1000万条数据, 10 million
  • 随机字符串 random strings
  • 随机长度,8/16/24/32字节, random length, upperbound are 8/16/24/32 bytes
  • 固定长度,4/8/16/24字节, random length, upperbound are 4/8/16/24 bytes

PUT:
10M-PUT-v2

GET:
10M-GET-v2

Memory usage:
10M-usage-v2

Test 3, 测试三


  • 3000万条数据, 30 million
  • 随机字符串 random strings
  • 随机长度,4/8/16字节, random length, upperbound are 4/8/16 bytes

PUT:
30M-PUT-v2

GET:
30M-GET-v2

Memory usage:
30M-usage-v2

As we can see from these figures, there is a obviously drop in PUT/GET time for almost all cases. Except for those key length is too short, because that's making it a linked list. And also, a gaint decrement in memory usage can be seen. Since we store mutilple char in one node, there are much less levels in TST. And the reason why the memory usage in test 3 is more than the first version is that key length is too short. If you know the distribution of your keys' length, you can always get the 'best' performance.

正如我们所见的,PUT/GET的用时都几乎呈现减少的状态,除了那些key长度太小的,因为它们使得这棵树退化为了链表。并且,这里我们也能看到内存使用率大幅下降,因为我们在一个结点中存放了多了字符数据,和原来相比现在的层数大量减少。最后,测试三中内存使用比第一版还要多得多的原因是key长度太小。如果你了解你的key长度的分布的话,你将会得到“最佳”的性能。

完整的代码可以在layerdb上找到。

Leave a Reply

Your email address will not be published. Required fields are marked *

5 − 1 =