Documentation > Deployment > Anti-Caching

Introduction

The traditional wisdom for building disk-based relational database management systems (DBMSs) is to organize data in heavily-encoded blocks stored on disk, with a main memory block cache. In order to improve performance given high disk latency, these systems use a multi-threaded architecture with dynamic record-level locking that allows multiple transactions to access the database at the same time. Previous research has shown that this results in substantial overhead for on-line transaction processing (OLTP) applications.

The next generation of main memory DBMSs seeks to overcome these limitations with a different architecture based on main memory resident data. To overcome the restriction that all data fit in main memory, we propose a new technique that we call anti-caching. An anti-caching DBMS reverses the traditional hierarchy of disk-based systems: all data initially resides in memory, and when memory is exhausted, the least-recently accessed records are collected and written to disk.

We implemented a prototype of our anti-caching proposal in a high-performance, main memory DBMS and ran benchmarks across a range of database sizes and read/write mixes. We compared its performance with a well-tuned installation of an open-source, disk-based DBMS optionally fronted by a distributed main memory cache. Our results show that when all data fits in main memory, the anti-caching DBMS is 4x faster than the other systems. As the size of the database increases, both DBMSs degrade, but the anti-caching DBMS maintains a significant performance advantage over the disk-based systems. In fact, for larger data sizes, the performance advantage of anti-caching increases, up to 10x for a data size 2x larger than memory. Based on these results, we contend that for any OLTP application, regardless of size, our anti-caching architecture is preferable to traditional, disk-oriented systems.

Usage

Tables need to be marked as “evictable” when compiling the benchmark’s project catalog. This tells H-Store which tables are allowed to have their tuples evicted out to disk. To do this, use the evictable argument when compiling a project catalog. The expected value of this parameter is a comma-separated list of table names. Note that tables that do not have a primary key, materialized views, internal system tables, or mapreduce tables cannot be marked as evictable.

Example:

ant hstore-prepare \
  -Dproject=tpcc \
  -Devictable="HISTORY,CUSTOMER,ORDERS,ORDER_LINE"

When executing a benchmark, you must set the ${site.anticache_enable} configuration parameter to true.

ant hstore-benchmark -Dproject=tpcc \
  -Dsite.anticache_enable=true \
  -Dsite.anticache_dir=obj/anticache

Anti-Cache Eviction

Our description of eviction in the anti-cache is broken down into two parts. The first is the architecture, i.e. the components necessary to evict tuples to disk. The second is the process of eviction, i.e. the steps necessary to evict tuples to disk. These are described in detail below.

At the core of our anti-caching design is the AntiCacheManager. The AntiCacheManager is responsible for monitoring memory usage and sending eviction requests down to the tables in the EE. Tables then evict a block-sized chunk of tuples (the size of which can be configured) to disk, updating all appropriate indexes in the process. To determine which tuples should be evicted, we implement an LRU eviction policy, the details of which are maintained by the EvictionManager. Finally, an EvictionIterator class is used to iterate over the tuples in eviction-policy order.

Eviction Architecture

The front-end component of the anti-caching architecture is controlled by one main class, the AntiCacheManager. The AntiCacheManager is responsible for periodically polling the underlying tables to determine total memory usage. Because tables are partitioned, this functionality cannot be pushed down to the EE, as no single partition has a global view of the data at the site. The AntiCacheManager polls total memory usage by queuing a distributed transaction that queries each table’s internal stats. This process is run periodically (e.g., every 30 seconds). Upon completion of this transaction, the AntiCacheManager will determine whether the total current memory usage is above a tunable eviction threshold, and if so will initiate eviction.
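
The sketch below illustrates the threshold check described above. It is not the actual AntiCacheManager code: the class and member names (MemoryMonitor, PartitionStats, evictionThresholdBytes) are hypothetical stand-ins, and in the real system the per-partition numbers come from the distributed stats transaction.

// Illustrative sketch of the polling decision; names are hypothetical.
#include <cstdint>
#include <iostream>
#include <vector>

struct PartitionStats {
    uint64_t tableBytes;   // memory reported by one partition's table stats
};

class MemoryMonitor {
public:
    explicit MemoryMonitor(uint64_t thresholdBytes)
        : evictionThresholdBytes(thresholdBytes) {}

    // Called after the distributed stats transaction completes.
    bool shouldEvict(const std::vector<PartitionStats>& stats) const {
        uint64_t total = 0;
        for (const auto& s : stats) total += s.tableBytes;
        return total > evictionThresholdBytes;
    }

private:
    uint64_t evictionThresholdBytes;
};

int main() {
    MemoryMonitor monitor(1024ULL * 1024 * 1024);   // 1 GB eviction threshold
    std::vector<PartitionStats> stats = {{600u * 1024 * 1024}, {500u * 1024 * 1024}};
    std::cout << (monitor.shouldEvict(stats) ? "evict" : "ok") << "\n";   // prints "evict"
}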

Eviction is initiated by submitting a transaction that will inform each of the partitions to start eviction. Included in this transaction is exactly how many bytes each partition should evict. The total amount of data to evict across the entire site with each eviction invocation is a configurable parameter, called evictBlockSize. There are several different methods to consider when dividing up the total number of bytes to evict among the various partitions. In some workloads, it might make sense to evict different size blocks from each partition. This could result from a variety of factors, including differently sized partitions or partitions receiving a larger share of the transactions. Or, a uniform block size could be evicted from each partition, regardless of partition size or transaction frequency. This is how our current implementation works. The amount to evict from each partition is calculated by dividing evictBlockSize by the total number of partitions. The problem of choosing non-uniform eviction block sizes according to workload or data skew is beyond the scope of this paper. Once the AntiCacheManager has queued the eviction request transactions to each of the partitions, the rest of the eviction process is handled in the EE.
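
A minimal sketch of the uniform split described above. The helper name perPartitionEvictBytes is hypothetical; in the real system the per-partition amount travels inside the eviction request transaction.

// Uniform policy sketch: every partition evicts evictBlockSize / numPartitions bytes.
#include <cstdint>
#include <iostream>
#include <vector>

std::vector<uint64_t> perPartitionEvictBytes(uint64_t evictBlockSize, int numPartitions) {
    // Same share for every partition, regardless of partition size or
    // transaction frequency, as in the current implementation.
    return std::vector<uint64_t>(numPartitions, evictBlockSize / numPartitions);
}

int main() {
    auto shares = perPartitionEvictBytes(8 * 1024 * 1024, 4);   // 8 MB total, 4 partitions
    for (uint64_t bytes : shares) std::cout << bytes << " bytes\n";   // 2 MB each
}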

At each partition, there will be a subset of the tables that will be tagged as evictable by the user. In some cases, it may be undesirable to evict any portion of a table to disk. For example, consider a small lookup table that will be used by a majority of the transactions in the workload. If this table can fit comfortably in memory, it would not make sense to evict a portion of it and constantly be evicting and un-evicting it from disk. For this reason, we allow the user to specify which tables can be evicted based on the semantics of the tables. Every table that is designated as evictable will contain a special sub-table called the EvictionTable. The role of the EvictionTable is to track the location of each tuple that is evicted. This is accomplished by storing a pair containing the primary key of the evicted tuple as well as the block id of the evicted block that contains the tuple. Block IDs are unique IDs used to identify blocks evicted in the anti-cache. In our current implementation, the anti-cache is implemented using an instance of BerkeleyDB, an embedded database for storing key-value pairs. Our use of BerkeleyDB as a back-end key-value store is due largely to its wide support and ease of integration, but any lightweight key-value store would have sufficed.
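
The sketch below shows the bookkeeping relationship described above: the evicted-tuple table maps a tuple's primary key to the id of the anti-cache block that now holds it. A std::map stands in for the BerkeleyDB key-value store, and the struct names are simplified stand-ins rather than the actual EE classes.

// Simplified sketch of the EvictionTable bookkeeping and the block store.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

using PrimaryKey = int64_t;
using BlockId    = uint32_t;                  // 4-byte block id, as in the text

struct EvictionTableSketch {
    std::map<PrimaryKey, BlockId> evicted;    // primary key -> block id
};

struct AntiCacheStoreSketch {                 // stand-in for BerkeleyDB
    std::map<BlockId, std::string> blocks;    // block id -> serialized tuples
};

int main() {
    EvictionTableSketch evictionTable;
    AntiCacheStoreSketch store;

    BlockId block = 42;
    store.blocks[block] = "<serialized tuples>";
    evictionTable.evicted[1001] = block;      // tuple with key 1001 now lives in block 42

    std::cout << "tuple 1001 -> block " << evictionTable.evicted[1001] << "\n";
}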

In our implementation, the block id represents the "key" and the block of evicted tuples represents the "value". Because each entry in the EvictionTable stores both the tuple’s primary key and a 4-byte block id, there is a memory overhead associated with anti-caching. Given that the main goal of anti-caching is to free up memory, it is important that the memory overhead needed to implement the anti-cache not exceed the size of the tuple, and ideally it should be a good deal less than the total tuple size. Obviously, for this to be true, the table tuples must contain more data than just a primary key. We call the size of the data columns not included in the primary key the tuple payload. In practice, for many OLTP applications, tuples will contain a simple integer ID as the primary key and a payload several times larger than the primary key, as is the case for the benchmarks we tested with. Thus, a payload larger than the primary key is one of the main assumptions in our anti-cache implementation, and we leave a lower-memory design for cases where this payload assumption does not hold as an area of future work.
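
To make the overhead argument concrete, here is a small worked example with assumed sizes (an 8-byte integer primary key, a 4-byte block id, and a 128-byte tuple); the actual numbers depend entirely on the schema.

// Worked example of the memory-overhead argument, with hypothetical sizes.
#include <iostream>

int main() {
    const int primaryKeyBytes = 8;     // assumed integer primary key
    const int blockIdBytes    = 4;     // block id kept in the EvictionTable entry
    const int tupleBytes      = 128;   // assumed total tuple size (key + payload)

    const int overhead  = primaryKeyBytes + blockIdBytes;   // bytes that stay in memory
    const int reclaimed = tupleBytes - overhead;             // approximate bytes freed per tuple
    std::cout << "overhead: " << overhead << " bytes, reclaimed: " << reclaimed
              << " bytes (" << (100.0 * reclaimed / tupleBytes) << "% of the tuple)\n";
}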

At each evictable table, a list is kept containing the eviction order for tuples in that table. We evict tuples in least recently used (LRU) order. LRU is by far the most commonly used eviction policy for workloads with temporal locality, and fits well with the hot/cold workload skews investigated in our benchmarks. However, as discussed above, it is important to keep the memory overhead of our anti-caching architecture to a minimum, so as not to diminish the effects of evicting tuples to disk. With this in mind, we have implemented a simple LRU chain using the 4-byte internal table tuple IDs used in H-Store. These IDs are essentially just the tuple’s block offset in the memory blocks allocated for the table. Because this ID is per-table per-partition, 4 bytes provides plenty of unique IDs and saves 4 bytes over the approach of storing 64-bit memory pointers. The head of the chain represents the oldest tuple in the chain, i.e. the next tuple to be evicted. The tail of the chain represents the newest tuple in the chain, i.e. the tuple most recently inserted, read, or updated. The chain is implemented by storing in each tuple’s header the ID of the next tuple in the chain. Each table stores the ID of the oldest (head) and newest (tail) tuple in the chain. Eviction is done by simply iterating the chain starting from the head and removing tuples as they are evicted.
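
Below is an illustrative sketch of the chain layout described above: a singly-linked chain threaded through tuple headers using 4-byte tuple IDs, with the table tracking the head (oldest) and tail (newest) tuple. The struct and member names are simplifications, not the EE's actual data structures.

// Sketch of an LRU chain threaded through tuple headers via 4-byte ids.
#include <cstdint>
#include <iostream>
#include <vector>

constexpr uint32_t NO_TUPLE = UINT32_MAX;

struct TupleHeader {
    bool     evicted     = false;
    uint32_t nextTupleId = NO_TUPLE;   // id of the next (newer) tuple in the chain
};

struct TableSketch {
    std::vector<TupleHeader> headers;  // indexed by the 4-byte tuple id
    uint32_t head = NO_TUPLE;          // oldest tuple, next to be evicted
    uint32_t tail = NO_TUPLE;          // newest (most recently accessed) tuple

    // Append a tuple at the tail (newest position).
    void appendToTail(uint32_t id) {
        headers[id].nextTupleId = NO_TUPLE;
        if (tail != NO_TUPLE) headers[tail].nextTupleId = id;
        tail = id;
        if (head == NO_TUPLE) head = id;
    }

    // Pop the oldest tuple from the head for eviction.
    uint32_t popHead() {
        uint32_t victim = head;
        if (victim != NO_TUPLE) {
            head = headers[victim].nextTupleId;
            if (head == NO_TUPLE) tail = NO_TUPLE;
        }
        return victim;
    }
};

int main() {
    TableSketch t;
    t.headers.resize(3);
    t.appendToTail(0); t.appendToTail(1); t.appendToTail(2);
    std::cout << "first eviction victim: tuple " << t.popHead() << "\n";   // prints 0
}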

The chain is managed by the EvictionManager class. This class is responsible for adding, removing, and updating tuples in the LRU chain. If a new tuple is inserted, it is simply added to the back of the chain as the newest tuple. If a tuple already exists in the chain and is read or updated, it is necessary to "promote" the tuple to the back of the chain, since it is now the most recently used tuple. To do this we first remove the tuple from its original location in the chain. Then, it is simply added to the back of the chain just as in an insertion. Because the EvictionManager is the only component responsible for organizing tuples in eviction policy order, it would be trivial to implement eviction policies other than LRU. While this is not necessary for the temporally-skewed benchmarks tested in this paper, it remains an area of interest and future work as benchmarks with different skew characteristics are identified. Another component of the eviction chain is the EvictionIterator, which simply returns tuples in eviction policy order by iterating over the chain. Like any iterator, the EvictionIterator is used merely as an accessor to the eviction chain, and is not responsible for managing or updating any part of the chain.
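
The sketch below illustrates the promote-on-access behavior described above: an accessed tuple is unlinked from its current position and re-appended at the tail. For clarity the unlink step walks the chain from the head, which is a simplification of the real bookkeeping; all names are illustrative.

// Sketch of promote-on-access for a singly-linked eviction chain.
#include <cstdint>
#include <iostream>
#include <vector>

constexpr uint32_t NO_TUPLE = UINT32_MAX;

struct ChainSketch {
    std::vector<uint32_t> next;                 // next[id] = id of the next (newer) tuple
    uint32_t head = NO_TUPLE, tail = NO_TUPLE;

    void append(uint32_t id) {
        next[id] = NO_TUPLE;
        if (tail != NO_TUPLE) next[tail] = id; else head = id;
        tail = id;
    }

    // Remove the tuple from wherever it sits, then re-append it as the newest.
    void promote(uint32_t id) {
        if (id == tail) return;                            // already the newest tuple
        if (id == head) {
            head = next[id];
        } else {
            uint32_t prev = head;                          // linear scan: a simplification
            while (next[prev] != id) prev = next[prev];    // of the real bookkeeping
            next[prev] = next[id];
        }
        append(id);
    }
};

int main() {
    ChainSketch c;
    c.next.resize(3);
    c.append(0); c.append(1); c.append(2);
    c.promote(0);                                          // tuple 0 was just read
    std::cout << "next eviction candidate: " << c.head << "\n";   // prints 1
}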

Eviction Procedure

As described above, each table will receive an eviction request with the number of bytes to evict, and the table just needs to iterate over its tuples using the EvictionIterator. However, other bookkeeping must be done to ensure we can identify that a tuple has been evicted. In particular, we need to update the indexes and create an entry in the EvictionTable.

To create an entry in the EvictionTable, we copy in all the primary key values from the tuple to be evicted and add the current block id for the eviction process. This new tuple is marked as evicted using the evicted flag in the tuple header. Indexes in H-Store are simply pointers to the tuple data in the table. Thus, we just need to update all index pointers to point to this newly created evicted tuple, rather than the original tuple, which will soon be deleted. The evicted tuple shares the same header information as a normal non-evicted tuple. Because of this, when iterating an index, we can simply check whether the evicted flag is set, signifying the iterator has found an evicted tuple that will contain only the primary key fields and the block ID of the tuple on disk.
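
Here is an illustrative sketch of those per-tuple steps under simplified, hypothetical types: build an evicted stub containing only the primary key and block id, set the evicted flag, repoint the index at the stub, and reclaim the original tuple's storage.

// Simplified sketch of evicting one tuple; types and names are hypothetical.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

struct TupleSketch {
    bool        evicted    = false;
    int64_t     primaryKey = 0;
    uint32_t    blockId    = 0;        // only meaningful when evicted == true
    std::string payload;               // dropped from memory once evicted
};

int main() {
    std::map<int64_t, TupleSketch*> index;                  // primary-key index -> tuple
    TupleSketch* live = new TupleSketch{false, 1001, 0, "payload data"};
    index[live->primaryKey] = live;

    // Evict: create the stub, flag it, update the index, free the original storage.
    uint32_t currentEvictionBlock = 7;
    TupleSketch* stub = new TupleSketch{true, live->primaryKey, currentEvictionBlock, ""};
    index[stub->primaryKey] = stub;                         // index now points at the stub
    // ... serialize *live into the current eviction block buffer here ...
    delete live;                                            // reclaim the tuple's storage

    std::cout << "evicted flag: " << index[1001]->evicted
              << ", block id: " << index[1001]->blockId << "\n";
    delete stub;
}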

Also important to consider is the presence of non-inlined strings. The data for large variable-length strings are stored elsewhere in memory, and only a pointer is stored in the tuple. In order to fully evict the tuple and reclaim the memory, these strings must first be serialized. After the newly created tuple is inserted in the EvictedTable, the old tuple is serialized (if necessary) and copied to the current eviction block. The storage for this tuple is then deleted.
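
A small sketch of that serialization step, using an assumed length-prefixed format; the actual on-disk block layout may differ.

// Sketch: serialize a non-inlined string into the eviction block buffer so
// its separately allocated memory can be reclaimed along with the tuple.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

void serializeString(std::vector<uint8_t>& block, const std::string& value) {
    uint32_t len = static_cast<uint32_t>(value.size());
    const uint8_t* lenBytes = reinterpret_cast<const uint8_t*>(&len);
    block.insert(block.end(), lenBytes, lenBytes + sizeof(len));   // length prefix
    block.insert(block.end(), value.begin(), value.end());         // string bytes
}

int main() {
    std::vector<uint8_t> evictionBlock;
    serializeString(evictionBlock, "a non-inlined VARCHAR value");
    std::cout << "block size after serialization: " << evictionBlock.size() << " bytes\n";
}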

Anti-Cache Fetch

If a transaction needs a tuple that has been evicted to disk, it must be read from the anti-cache and merged with the original table in a transactionally consistent way. This process is handled by the AntiCacheManager. The AntiCacheManager will receive from the EE a list of blocks that need to be un-evicted. Note that the process of determining which blocks to un-evict is detailed below. The AntiCacheManager will then make an asynchronous call to the EE to retrieve the blocks, which is done via an interface to the back-end key-value store.

Once the requested blocks are read from disk, it is necessary to merge them back into the regular table. There are numerous strategies for how exactly this should be done. One strategy is to take the entire block read from disk and add its tuples to the head of the LRU chain, meaning they are the coldest tuples in the chain and will be next for eviction. Then, when the original transaction that requested the evicted tuple is re-executed, the tuples needed from the fetched block will be sent to the back of the chain with the hot data. The tuples fetched in the evicted block that were not needed by any transaction will remain at the front of the chain, next for eviction. The original block is deleted from the anti-cache, as these tuples now all reside in memory. We call this fetch policy block-fetch. In some sense, block-fetch is the purest anti-caching strategy, as it avoids all double buffering of data. However, if only a single tuple is needed from a block, there is clearly an overhead of merging all the tuples from the un-evicted block, only to potentially re-evict them shortly.

To avoid this un-eviction/re-eviction cycle, we also propose another un-eviction strategy, called tuple-fetch, that only merges the tuples that caused the block to be read from disk, i.e. those that were needed by a transaction. The block is then discarded, without deleting the block from the anti-cache on disk. The obvious benefit of this strategy is that only the tuples requested in a transaction are merged back into the tables. The downside is that we now have two copies of un-evicted tuples, one on disk in the anti-cache and one that has been re-merged with the regular table in memory. As transactions update the in-memory version, the disk-based version becomes out-of-date and must be marked as invalid. If such a tuple is once again evicted, the entry in the EvictionTable will contain the new block ID, not the old block ID with the invalid tuple. Thus, while the un-eviction process will only ever access the newest version (i.e., the correct version) of the tuple on disk, the old tuples create "holes" that take up space in a disk block. Over time, these holes can accumulate, and compaction may become necessary.
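
The sketch below contrasts the two fetch policies under simplified, illustrative types: block-fetch merges every tuple in the fetched block (requested tuples become hot, the rest stay cold, and the block is removed from the anti-cache), while tuple-fetch merges only the requested tuples and leaves the block on disk.

// Sketch contrasting block-fetch and tuple-fetch; names are illustrative.
#include <cstdint>
#include <iostream>
#include <set>
#include <vector>

struct FetchedBlock {
    uint32_t blockId;
    std::vector<int64_t> tupleKeys;        // keys of the tuples stored in the block
};

struct TableSketch {
    std::vector<int64_t> coldEnd, hotEnd;  // stand-ins for the two ends of the LRU chain
};

// block-fetch: merge every tuple; requested ones become hot, the rest stay cold.
void blockFetch(TableSketch& t, const FetchedBlock& b, const std::set<int64_t>& requested) {
    for (int64_t key : b.tupleKeys)
        (requested.count(key) ? t.hotEnd : t.coldEnd).push_back(key);
    // the block would be deleted from the anti-cache here
}

// tuple-fetch: merge only the requested tuples; the on-disk block is kept,
// and the merged tuples' stale on-disk copies become "holes".
void tupleFetch(TableSketch& t, const FetchedBlock& b, const std::set<int64_t>& requested) {
    for (int64_t key : b.tupleKeys)
        if (requested.count(key)) t.hotEnd.push_back(key);
}

int main() {
    FetchedBlock b{42, {1, 2, 3, 4}};
    TableSketch t1, t2;
    blockFetch(t1, b, {2});
    tupleFetch(t2, b, {2});
    std::cout << "block-fetch merged " << t1.hotEnd.size() + t1.coldEnd.size()
              << " tuples, tuple-fetch merged " << t2.hotEnd.size() << "\n";
}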

If we consider the spectrum of possible un-eviction strategies once we’ve read a block from disk, block-fetch and tuple-fetch represent the two extremes. Block-fetch keeps all the data, regardless of how cold it is, and relies on re-eviction of the cold data to disk. Tuple-fetch keeps only the tuples requested in a specific transaction. The middle ground between these two strategies would be to un-evict all the tuples from a given block that are likely to be needed by upcoming transactions. Unfortunately, exactly which tuples will be needed is difficult to predict. Instead, what we can do is delay the fetching of individual tuples by a constant time, called fetchTime. This allows fetch requests to queue, thereby increasing the likelihood that we are fetching multiple tuples per block that is un-evicted. By varying fetchTime, the system can, in theory, vary the number of tuples fetched per un-evicted block. The effect is that fetched tuples essentially share the overhead of reading the block from disk, which is by far the major bottleneck in the system. The trade-off is the added latency associated with this stalling. However, given that many of the OLTP workloads being explored are extremely high throughput, even a small fetchTime could result in shared fetching. Indeed, the added latency is likely to be eclipsed by the cost of the disk read that these transactions are already stalled on, resulting in increased throughput and lower overall average latency. We call this fetch policy group-fetch.
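
Below is an illustrative sketch of the group-fetch batching idea: individual fetch requests are queued, and only after fetchTime has elapsed is one disk read issued per distinct block. The class, its members, and the timing details are hypothetical.

// Sketch of delaying tuple fetches by fetchTime so requests for the same
// block can share a single disk read.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <map>
#include <set>

using Clock = std::chrono::steady_clock;

struct GroupFetchQueue {
    std::chrono::milliseconds fetchTime{5};                 // tunable stall
    std::map<uint32_t, std::set<int64_t>> pendingByBlock;   // block id -> requested tuple keys
    Clock::time_point firstRequest;

    void request(uint32_t blockId, int64_t key) {
        if (pendingByBlock.empty()) firstRequest = Clock::now();
        pendingByBlock[blockId].insert(key);
    }

    // Flush once fetchTime has elapsed: one disk read per block, shared by
    // every tuple request that queued up in the meantime.
    bool readyToFlush() const {
        return !pendingByBlock.empty() && (Clock::now() - firstRequest) >= fetchTime;
    }
};

int main() {
    GroupFetchQueue q;
    q.request(42, 1001);
    q.request(42, 1002);            // same block: shares the eventual disk read
    while (!q.readyToFlush()) {}    // busy-wait only for demonstration
    std::cout << "blocks to read: " << q.pendingByBlock.size() << "\n";   // prints 1
}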

When a transaction executes a query, the request is sent from the PartitionExecutor down to the EE. The EE will start processing the query plan nodes in execution order. If the EE is scanning an index, it will traverse the tree structure and iterate over the keys that match the query’s predicate. For sequential scans, the EE uses a special iterator that iterates over all of the non-evicted tuples followed by the EvictedTable tuples.

For each tuple, the EE checks whether the evicted flag is set to true in the tuple’s header. If it is, then it knows that the tuple is evicted and that it can get the tuple’s block id from this entry in the EvictedTable. When the EE encounters an evicted tuple, it does not evaluate additional filters on that tuple (e.g., predicates on non-primary-key columns) and continues processing the iterator. Once the processing is finished (i.e., there are no more tuples selected by the iterator), the EE will abort the transaction and send the list of block ids that the transaction attempted to access to the AntiCacheManager. The transaction will then be put on hold.
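
A simplified sketch of that scan behavior: evicted tuples are not filtered further, their block ids are collected, and if any were encountered the transaction is aborted so the blocks can be fetched. The types and the predicate are illustrative.

// Sketch of a scan that collects block ids for evicted tuples.
#include <cstdint>
#include <iostream>
#include <set>
#include <vector>

struct TupleSketch {
    bool     evicted;
    uint32_t blockId;   // valid only when evicted
    int      value;     // stand-in for the non-key columns
};

int main() {
    std::vector<TupleSketch> matches = {{false, 0, 10}, {true, 42, 0}, {true, 7, 0}};
    std::set<uint32_t> blocksToFetch;
    std::vector<const TupleSketch*> selected;

    for (const TupleSketch& t : matches) {
        if (t.evicted) {                                // skip further predicates,
            blocksToFetch.insert(t.blockId);            // remember the block id
            continue;
        }
        if (t.value > 5) selected.push_back(&t);        // normal predicate evaluation
    }

    if (!blocksToFetch.empty()) {
        std::cout << "abort: need " << blocksToFetch.size()
                  << " block(s) from the anti-cache\n";
    }
}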

A separate thread in the AntiCacheManager will then make an asynchronous call down into the EE to retrieve the blocks and store them in a side buffer. The transaction then gets re-queued, and right before it starts to execute we make a synchronous call down into the EE to move the fetched tuples from the side buffer into the table’s regular memory. All of the tuples from the recently fetched block are put at the head of the LRU chain. But since no other transaction will run between the time that we add the tuples to the chain and the time that we re-execute our original transaction, we know that another eviction will not take place. This means that when we re-execute the transaction, it will access the tuples that caused the fetch in the first place, and those tuples will get moved to the tail of the chain.

Query Execution

The process of executing a query on potentially evicted tuples involves a pre-execution phase in which all the tuples needed for the current query are brought into memory. The pre-execution phase is identical to the execution of a normal transaction, except that an evicted list is created to record any tuples needed by the current transaction that have been evicted to disk. The evicted list contains both the primary key of the evicted tuple as well as the block ID of where the tuple resides on disk. If, at the end of query execution, the evicted list is empty (i.e., no evicted tuples were needed), the transaction is allowed to commit just as it would during normal execution. However, if the evicted list is not empty, the tuples in the evicted list need to be read from disk before the transaction can continue. One strategy would be to have the current transaction stall while the needed tuples in the evicted list are read from the blocks on disk. However, because H-Store uses a single-threaded execution model, this would result in this transaction blocking all the transactions in this partition’s queue, some of which may not need any evicted data from disk. This contrasts with a concurrent execution model, in which transactions are serialized to produce a transactionally consistent ordering of operations. Because blocking would significantly decrease overall system throughput in a single-threaded execution model such as H-Store, the current transaction is instead aborted. Before aborting, the transaction will throw an EvictedTupleAccessException containing all of the information in the evicted list. This exception will be caught by the front-end, which will re-queue a separate transaction to asynchronously read the blocks specified in the exception from disk. Once this transaction has completed, the data from the un-evicted blocks will be merged according to one of the fetch policies discussed above (i.e. block-fetch, tuple-fetch or group-fetch) and the original transaction will be restarted. With all the needed evicted tuples now residing in memory, the transaction will be able to execute and commit as normal.
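
The overall control flow can be sketched as follows. The exception name EvictedTupleAccessException comes from the text above, but its constructor, the helper functions, and the toy in-memory state are hypothetical; in the real system the fetch is performed asynchronously by a separate transaction before the original transaction is restarted.

// Sketch of the abort / fetch / restart loop described above.
#include <cstdint>
#include <iostream>
#include <set>
#include <stdexcept>
#include <utility>

struct EvictedTupleAccessException : std::runtime_error {
    std::set<uint32_t> blockIds;     // blocks that must be read from the anti-cache
    explicit EvictedTupleAccessException(std::set<uint32_t> ids)
        : std::runtime_error("evicted tuples accessed"), blockIds(std::move(ids)) {}
};

std::set<uint32_t> stillEvicted = {42};              // toy state: block 42 is on disk

void executeTransaction() {
    if (!stillEvicted.empty())
        throw EvictedTupleAccessException(stillEvicted);   // pre-execution found evicted tuples
    std::cout << "transaction committed\n";
}

void fetchAndMerge(const std::set<uint32_t>& blocks) {
    for (uint32_t id : blocks) stillEvicted.erase(id);     // blocks are now in memory
}

int main() {
    while (true) {
        try {
            executeTransaction();                    // pre-execution / normal execution
            break;
        } catch (const EvictedTupleAccessException& e) {
            fetchAndMerge(e.blockIds);               // asynchronous in the real system
        }                                            // then the transaction is restarted
    }
}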