DeuceDB: Smarter categorizing and scheduling to amortize background IO costs in LevelDB


Omkar Desai, Xiangqun Zhang, Ziyang Jiao

Abstract

We present an implementation of TRIAD, a set of optimizations for RocksDB developed by researchers at the Swiss Federal Institute of Technology Lausanne (EPFL) and Nutanix. The paper argues that the resources spent on background operations can be reduced through smarter categorization and better scheduling decisions. The work is divided into three individual improvements — TRIAD-MEM, TRIAD-DISK, and TRIAD-LOG — each addressing a different shortcoming of RocksDB.

We present an implementation of the two most impactful features of TRIAD: 1) TRIAD-MEM and 2) TRIAD-DISK.

We implement our improvements on LevelDB, in contrast to the authors of TRIAD, who implement them on RocksDB. Using profiling and flame graphs, we analyse the expected performance impact of each feature. Finally, we present the performance of our implementation, which we call DeuceDB.

1 Introduction

Key-value (KV) stores are a widely used approach for systems that need to manage large amounts of data. Unlike relational databases, they store data as objects and offer better scalability, throughput, and flexibility. Database systems based on KV stores, such as LevelDB at Google, RocksDB at Facebook, and BangDB at Cisco, are widely used in online shopping, advertising, and social networking applications.
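As a concrete illustration of the interface such stores expose, the short program below uses LevelDB's public C++ API (DB::Open, Put, Get) to write and read a single key; the database path and the key and value are arbitrary examples.

```cpp
#include <cassert>
#include <iostream>
#include <string>

#include "leveldb/db.h"

int main() {
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;

  // Open (or create) a database directory; the path is arbitrary.
  leveldb::Status status = leveldb::DB::Open(options, "/tmp/kv_example", &db);
  assert(status.ok());

  // Writes go to the in-memory layer first (and to the commit log for durability).
  status = db->Put(leveldb::WriteOptions(), "user:42", "alice");
  assert(status.ok());

  // Reads consult the memory layer first, then the on-disk levels.
  std::string value;
  status = db->Get(leveldb::ReadOptions(), "user:42", &value);
  assert(status.ok());
  std::cout << value << std::endl;  // prints "alice"

  delete db;
  return 0;
}
```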

Log-Structured Merge trees (LSMs) are one design option for KV store systems. They are currently used in a wide range of DBMSs, including LevelDB, RocksDB, cLSM, and bLSM. LSMs achieve high write throughput without hurting read performance too much. Figure 1 shows the high-level structure of a typical LSM KV store. An LSM consists mainly of two kinds of components: a memory layer and several disk layers. The memory layer absorbs writes; to avoid data loss, each write is also appended to a commit log stored on disk. When the memory layer is full, its contents are flushed to the disk layers, which are organized into levels, and within each level data are organized into SSTables. When level Li is full, some SSTables in Li are compacted into level Li+1. Flushing and compaction are two important LSM operations, and both happen in the background. However, they consume a large amount of system bandwidth and therefore hurt foreground performance. The problem is especially serious under highly skewed workloads: hot data may be flushed repeatedly, which is unnecessary; when the commit log fills up, a flush is triggered even though the memory layer likely holds only a small amount of data; and on the disk layers, hot keys end up being compacted over and over.
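To make the write path concrete, here is a deliberately simplified toy sketch (our own illustration, not LevelDB's actual code): writes land in an in-memory table, and once that table is full it is frozen into an L0 SSTable. The size limit is arbitrary, and compaction into deeper levels is only noted in comments.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct SSTable {
  std::map<std::string, std::string> entries;  // sorted and immutable once written
};

class ToyLsm {
 public:
  void Put(const std::string& key, const std::string& value) {
    // A real LSM also appends the write to an on-disk commit log for durability.
    memtable_[key] = value;
    if (memtable_.size() >= kMemtableLimit) Flush();
  }

 private:
  void Flush() {
    // Memory layer is full: freeze it into a new L0 SSTable.
    level0_.push_back(SSTable{memtable_});
    memtable_.clear();
    // When L0 accumulates too many SSTables, a real LSM compacts them into L1,
    // merging duplicate keys and discarding obsolete versions.
  }

  static constexpr std::size_t kMemtableLimit = 4;  // arbitrary toy limit
  std::map<std::string, std::string> memtable_;     // in-memory layer
  std::vector<SSTable> level0_;                     // disk layer L0; deeper levels omitted
};
```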

For TRIAD-MEM, we separate hot and cold data in the memory level and flush only cold data to disk, reducing background overhead. For TRIAD-DISK, we defer compaction until it can reclaim a fair amount of space, increasing compaction efficiency.

We also compare the performance of our TRIAD-enabled LevelDB with the original version. We find that our version performs better under highly skewed workloads: by reducing background overhead, our work frees up resources for foreground throughput. Finally, we identify some problems in the TRIAD paper and share our insights.

2 TRIAD Features

As the name “TRIAD” indicates, TRIAD is a collection of three parts: TRIAD-MEM, TRIAD-DISK, and TRIAD-LOG. We implement these features individually, which enables us to test and evaluate each feature independently, as explained in Section 4. We later merge the features and run tests on our combined implementation, “DeuceDB”.

We implemented the TRIAD-DISK and TRIAD-MEM features by adding 750 lines of code to the latest version of LevelDB.

2.1 TRIAD-MEM

TRIAD-MEM keeps hot data in RAM instead of flushing every written entry to persistent storage. Many workloads exhibit data skew: a small portion of the data is accessed frequently while the vast majority is not, so that small portion dominates the number of accesses. Due to the nature of LevelDB (and SSTable-based stores in general), every operation on data other than reading is expressed as a write: adding a new key-value mapping, updating the value of an existing key, and even deleting a key-value pair. This raises the question: how can a frequently accessed key-value pair be updated and read with decent performance?

One intuitive solution is to keep hot data in RAM. Since each update creates a new entry that is eventually written to an SSTable on disk, a hot key produces a huge number of entries, of which only the latest is the “valid” one; all the others are discarded during compaction. An analogy is writing a ten-page essay in which only the last sentence is useful: to read the essay, it is enough to read the last sentence, and the rest goes in the bin. This wastes both CPU and disk resources. TRIAD-MEM therefore performs in-place updates on hot keys, decreasing the amount of data written to disk. This also improves compaction time: because hot keys used to generate the most entries, of which only the latest is kept, keeping hot keys in RAM significantly decreases the number of overlapping keys, reducing compaction time and resource usage and yielding better foreground service to the user.
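The following is a rough sketch of the hot/cold split at flush time. It is our own simplification rather than the DeuceDB or TRIAD code; the access_count field and the threshold are hypothetical stand-ins for whatever hotness statistic an implementation actually tracks.

```cpp
#include <map>
#include <string>

// Hypothetical per-key record: access_count stands in for a hotness statistic
// (it is not a LevelDB field).
struct Entry {
  std::string value;
  int access_count = 0;
};

// At flush time, cold entries go to an L0 SSTable while hot entries are kept
// in the new memtable so they can continue to be updated in place in RAM.
void FlushColdOnly(const std::map<std::string, Entry>& old_memtable,
                   std::map<std::string, Entry>* new_memtable,
                   std::map<std::string, std::string>* l0_sstable,
                   int hot_threshold) {
  for (const auto& [key, entry] : old_memtable) {
    if (entry.access_count >= hot_threshold) {
      (*new_memtable)[key] = entry;      // hot: stays in RAM (still logged for durability)
    } else {
      (*l0_sstable)[key] = entry.value;  // cold: written to disk as usual
    }
  }
}
```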

2.2 TRIAD-DISK

TRIAD-DISK is the component that deals with compaction of data at level L0. Compaction is required in an LSM to get rid of old and obsolete data in the lower levels of the structure. The compaction process, even though it runs in the background, consumes significant resources, and these resource-hungry background processes slow down LevelDB's foreground operations. TRIAD-DISK delays compaction until there is enough key overlap in the files being compacted. To approximate the key overlap between files, we use the HyperLogLog (HLL) probabilistic cardinality estimator, an algorithm that approximates the number of distinct elements in a multiset. For every L0 file that is scheduled for compaction with its overlapping files at the next level (L1), an overlap ratio is calculated. Assuming we have n files on L0, the overlap ratio is defined as 1 − UniqueKeys(file-1, file-2, …, file-n) / Σ(i=1..n) Keys(file-i), where Keys(file-i) is the number of keys in the i-th SSTable and UniqueKeys is the number of unique keys after merging the n files.
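The sketch below illustrates this overlap-ratio computation. In a real implementation the key counts come from per-SSTable HyperLogLog sketches that are merged; here we count with exact std::set purely so the example is self-contained and runnable.

```cpp
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Overlap ratio across the files considered for compaction:
// 1 - UniqueKeys(file_1 ... file_n) / sum_i Keys(file_i).
double OverlapRatio(const std::vector<std::vector<std::string>>& files) {
  std::set<std::string> unique_keys;  // stand-in for the merged HLL estimate
  std::size_t total_keys = 0;
  for (const auto& file : files) {
    total_keys += file.size();                     // Keys(file_i)
    unique_keys.insert(file.begin(), file.end());  // accumulate distinct keys
  }
  return 1.0 - static_cast<double>(unique_keys.size()) /
                   static_cast<double>(total_keys);
}

int main() {
  // Two L0 files sharing keys "b" and "c": 6 keys total, 4 unique -> 1 - 4/6 ≈ 0.33.
  std::vector<std::vector<std::string>> l0_files = {{"a", "b", "c"}, {"b", "c", "d"}};
  std::cout << OverlapRatio(l0_files) << "\n";
  return 0;
}
```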

Currently, LevelDB triggers compaction of L0 into L1 as soon as the number of files on L0 reaches a certain threshold. This may not always be necessary when the workload is skewed: allowing multiple files to accumulate in L0 increases the probability of key overlap, and higher overlap amortizes the cost of compaction at L0.

Figure 2 illustrates the concept of deferring compaction. In the upper part of the figure, there is only one file on L0, and it overlaps with two files on L1. Since the overlap ratio (0.2) is smaller than the cutoff threshold (0.4), compaction is deferred. The lower part of the figure shows the system at a time when L0 contains two files. The overlap ratio is computed between all the files in L0 and their respective overlapping files on L1; the computed ratio is now higher than the threshold, so compaction is triggered.

For workloads that have little or no overlap, this method would result in a very large number of files at L0. Because L0 files may have overlapping key ranges, a lookup at L0 must search every file in that level, so deferring compaction indefinitely under such a workload would hurt get() performance. To combat this, there is an upper limit on the number of files that can exist at L0, after which compaction is triggered even if the overlap ratio is below the threshold.
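Putting the two conditions together, the compaction trigger reduces to a simple predicate such as the sketch below; the 0.4 threshold mirrors the Figure 2 example, and the L0 file cap is an arbitrary placeholder rather than the value we actually use.

```cpp
#include <cstddef>

// Hypothetical compaction trigger for L0: defer until the estimated key overlap
// is high enough, but never let L0 grow past a hard file cap (values are placeholders).
bool ShouldCompactL0(double overlap_ratio, std::size_t num_l0_files,
                     double overlap_threshold = 0.4,   // cutoff from the Figure 2 example
                     std::size_t max_l0_files = 12) {  // arbitrary cap, tuned in practice
  if (num_l0_files >= max_l0_files) {
    return true;  // too many L0 files hurts reads, so compact regardless of overlap
  }
  return overlap_ratio >= overlap_threshold;  // enough duplicates to amortize compaction
}
```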

2.3 TRIAD-LOG

In our work, we focus on TRIAD-MEM and TRIAD-DISK, because LevelDB lacks support for extracting log information directly from its current commit log. In principle, TRIAD-LOG could be implemented together with an interface that exposes this information, but in the interest of time we leave TRIAD-LOG as future work.

3 Evaluation Methodology

We run benchmarking tests using LevelDB's db_bench benchmark at two scales, 1 million and 5 million key-value pairs. This shows that our implementation of the features proposed by the TRIAD authors behaves well at scale, and it ensures that data is flushed from memory to disk and compactions are triggered. We run the tests on three versions of LevelDB to measure the improvement of each feature over the baseline: the original LevelDB as baseline, the implementation with TRIAD-MEM, and the implementation with both TRIAD-MEM and TRIAD-DISK. We do not change any configuration in options.h or db_bench. Balmau et al. implement these enhancements on RocksDB and compare against RocksDB; we do not compare our implementation with RocksDB or TRIAD, since ours is based on LevelDB.

The hardware configuration we used is OS X 10.15.6 running on an Intel i9-9980HK CPU with 64 GB of main memory. The storage device is a 4 TB Apple SSD.

We run a clean setup for every test to ensure that every version starts with the same initial configuration. It would also be valuable to evaluate on an already populated database, but we leave that for future work.

We describe the enhancements we implemented and the effects we expect them to have on performance, then compare those expectations with the actual results and look for reasons for any discrepancies.

We expect the implementation with TRIAD-MEM to be faster than the original LevelDB, and TRIAD-MEM plus TRIAD-DISK to be faster still; that is the goal of this project. Since LevelDB does not provide support for directly reading log information, we did not implement TRIAD-LOG and therefore set no improvement targets for it.

4 Evaluation Results

We want to set our performance targets based on how much impact each operation has on LevelDB's performance. To do this, we profile LevelDB using the DTrace profiler, which estimates how much time each function takes as a percentage of total execution time. For uniformity, we run the profiler with db_bench. Figure 3 shows a flame graph generated using the DTrace profiler; it shows the frequency of calls to different functions while db_bench runs. The more detailed function-level profiling is kept alongside our source code [5].

Figures 4 and 5 show flame graphs isolated for the foreground thread, which performs the client-facing operations, and the background thread, which performs background work such as compaction, respectively.

Cumulatively, these flame graphs tell us that the hot and cold key separation implemented in TRIAD-MEM should have a significantly larger impact on performance than the deferred compaction implemented in TRIAD-DISK. Although the principles in the original paper seem promising, our implementation did not show a performance improvement in most cases.

Figure 6 shows the results of running the db_bench benchmark with 1 million key-value pairs on three versions: Baseline (unmodified LevelDB), TRIAD-MEM (LevelDB with hot and cold key separation), and TRIAD-MEM+DISK (LevelDB with hot and cold key separation plus situation-aware scheduling of compaction). For the fill100K workload, we see a significant improvement on both TRIAD versions, roughly 10x. Based on the DTrace profiling shown earlier and the improvement already visible in TRIAD-MEM, we attribute this gain to separating hot and cold keys in the memtable. For the read-sequential workload, we observe a drop in performance for our TRIAD-MEM+DISK version; we believe this drop can be mitigated by better tuning of the maximum number of files allowed at L0. We also observe increased compaction time in TRIAD-MEM+DISK.

Figure 7 shows the results of running the db_bench benchmark with 5 million key-value pairs on the same three versions: Baseline, TRIAD-MEM, and TRIAD-MEM+DISK. For the fill100K workload, we notice a significant drop in performance on all three versions, including baseline LevelDB. This brings to light an interesting observation: LevelDB's write performance degrades at scale. We hypothesize that this is due to background operations, and we are studying the profiler results to back up this hypothesis. For the read-sequential workload, we observe a performance profile similar to the 1-million-pair benchmarks shown in Figure 6. The compaction performance observed in this benchmark shows that our implementation still has inefficiencies.

References

  1. Balmau, O., Didona, D., Guerraoui, R., Zwaenepoel, W., and Konka, P. TRIAD: Creating synergies between memory, disk and log in log-structured key-value stores. In Proceedings of the 2017 USENIX Annual Technical Conference (2017). https://www.usenix.org/sites/default/files/conference/protected-files/atc17_slides_balmau.pdf
  2. Facebook. RocksDB. https://github.com/facebook/rocksdb
  3. Google. LevelDB. https://github.com/google/leveldb
  4. DTrace. http://dtrace.org/blogs/about/
  5. Omkar Desai, Xiangqun Zhang, and Ziyang Jiao. DeuceDB. https://github.com/swiftomkar/leveldb-triad
