Documentation > Development > Detailed Architecture Overview
The HStoreSite is the main controller for the H-Store system within a single JVM instance. The usual deployment configuration is one HStoreSite per node in the cluster. An HStoreSite corresponds to a Site in a benchmark catalog. Each HStoreSite is assigned a unique SiteId by the BenchmarkController deployment program.
The HStoreSite will listen on the port defined in its Site's proc_port parameter for incoming transaction requests from clients. These requests are serialized StoredProcedureInvocation messages, using a wire format originally based on the VoltDB wire protocol. Each request is processed first by the VoltProcedureListener and then passed into HStoreSite.procedureInvocation().
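The handoff from the listener to the site can be pictured with a minimal sketch. The class and method names below mirror the prose (StoredProcedureInvocation, procedureInvocation) but the decoding step is a stand-in: the real VoltProcedureListener decodes the full VoltDB wire protocol from a socket, not a bare UTF-8 procedure name.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch of the listener-to-site handoff; not H-Store's actual code.
public class ListenerSketch {
    // Stand-in for a deserialized StoredProcedureInvocation
    static class StoredProcedureInvocation {
        final String procName;
        StoredProcedureInvocation(String procName) { this.procName = procName; }
    }

    // Stand-in for HStoreSite.procedureInvocation()
    static String procedureInvocation(StoredProcedureInvocation spi) {
        return "queued:" + spi.procName;
    }

    // Decode the raw request bytes (here just a UTF-8 procedure name)
    // and hand the resulting invocation to the site.
    static String handleRequest(byte[] serialized) {
        String procName = new String(serialized, StandardCharsets.UTF_8);
        return procedureInvocation(new StoredProcedureInvocation(procName));
    }
}
```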
Each StoredProcedureInvocation request is turned into a LocalTransaction handle. This object stores all of the state information about a transaction both before and while it executes. Each transaction is given a globally ordered transaction id. The four key elements that the system needs to determine before a transaction can execute are:
- The partition where the transaction's control code will execute (i.e., its base partition).
- The list of partitions that the DBMS thinks that the transaction will read/write data on.
- Whether the transaction is read-only.
- Whether the transaction could abort.
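The four properties above can be gathered into a single estimate object. This is a hypothetical container (the names are illustrative, not H-Store's), but it shows the derived property the rest of this page relies on: a transaction is single-partitioned exactly when its predicted read/write set contains only its base partition.

```java
import java.util.Set;

// Hypothetical pre-execution estimate for a transaction; field names are assumptions.
public class TxnEstimate {
    final int basePartition;              // where the control code runs
    final Set<Integer> touchedPartitions; // partitions predicted to be read/written
    final boolean readOnly;
    final boolean abortable;

    TxnEstimate(int base, Set<Integer> touched, boolean readOnly, boolean abortable) {
        this.basePartition = base;
        this.touchedPartitions = touched;
        this.readOnly = readOnly;
        this.abortable = abortable;
    }

    // Single-partitioned: the only partition touched is the base partition.
    boolean isSinglePartitioned() {
        return touchedPartitions.size() == 1
            && touchedPartitions.contains(basePartition);
    }
}
```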
If the StoredProcedureInvocation request is for a system procedure, then it is marked to execute on every partition in the cluster and its base partition is a random partition on the local HStoreSite. Otherwise, the HStoreSite uses one of several different methods to determine which partitions the transaction will need.
Once the HStoreSite determines this information, the transaction is either (1) passed directly to its base partition's PartitionExecutor if it is single-partitioned or (2) passed to the HStoreCoordinator if it is a distributed transaction (i.e., it will read/write data on more than one partition, or on a partition other than its base partition).
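The routing rule just described can be sketched as a small decision function. The method name and return values are assumptions for illustration; the condition is the one stated above.

```java
import java.util.Set;

// Illustrative routing rule: single-partition transactions go straight to
// their base partition's executor, everything else goes through the coordinator.
public class TxnRouter {
    static String route(int basePartition, Set<Integer> touched) {
        boolean singlePartitioned = touched.size() == 1
                                 && touched.contains(basePartition);
        return singlePartitioned ? "PartitionExecutor" : "HStoreCoordinator";
    }
}
```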
The HStoreSite also performs other high-level operations on transactions, such as restarting them or redirecting them to execute on another HStoreSite (based on their base partition). It acts as a go-between for the HStoreCoordinator and the PartitionExecutors.
The HStoreCoordinator is how each HStoreSite communicates with other HStoreSites in the database cluster. It uses Google's Protocol Buffers with H-Store's asynchronous ProtoRPC System event loop for sending and receiving messages. The HStoreCoordinator's RPC interface and message data structures are defined in the hstore.proto file. All of the operations are non-blocking.
The main operations of the HStoreCoordinator are:
- Send a request to all partitions to notify them that a transaction needs to access them as part of a distributed transaction. The HStoreCoordinator will not release the transaction back to the HStoreSite until all of those partitions acknowledge that they are blocked for that transaction.
- Request that a partition on another HStoreSite execute a PlanFragment of a query on behalf of the transaction. This is only invoked for remote partitions (i.e., partitions not managed by the same HStoreSite as the transaction's base partition).
- Perform the PREPARE phase of the two-phase commit protocol for a distributed transaction. This signals to a partition that the transaction is finished with it and will not return to execute further queries. Note that this step is skipped for aborted transactions.
- Perform the final FINISH phase of the two-phase commit protocol for a distributed transaction. This is always invoked for every distributed transaction, regardless of whether it committed successfully or aborted.
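The coordinator operations above trace out a per-transaction lifecycle, which the following sketch makes explicit. The phase names here are descriptive labels chosen to match the prose, not necessarily H-Store's RPC names; the key invariants are that PREPARE is skipped on abort and FINISH is always sent.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a distributed transaction's coordinator message sequence.
public class CoordinatorLifecycle {
    static List<String> phases(boolean aborted, int workRounds) {
        List<String> seq = new ArrayList<>();
        seq.add("INIT");                 // block all touched partitions first
        for (int i = 0; i < workRounds; i++) {
            seq.add("WORK");             // ship PlanFragments to remote partitions
        }
        if (!aborted) {
            seq.add("PREPARE");          // 2PC prepare; skipped for aborted txns
        }
        seq.add("FINISH");               // 2PC finish; always invoked
        return seq;
    }
}
```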
Each partition is assigned exactly one PartitionExecutor, which represents the non-blocking execution loop for a single partition. It processes incoming StoredProcedureInvocation and TransactionWork requests received over the network.
When the PartitionExecutor receives a new transaction invocation request, it first executes the transaction's stored procedure control code. This control code queues up one or more queries and then blocks when it dispatches the batch to the PartitionExecutor. The PartitionExecutor uses a cached BatchPlanner handle to process the query batch and calculate which partitions each query should be routed to. If the batch needs to execute on partitions that were not originally predicted for the transaction, then an internal MispredictionException is thrown; the transaction is aborted and restarted, with the mispredicted partitions added to its list of partitions that it will read/write.
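The misprediction-and-restart rule can be sketched as a pure function over partition sets (names are illustrative). If the prediction already covers the batch, execution continues; otherwise the transaction must restart with the union of the predicted and actually-touched partitions.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative misprediction check; not H-Store's actual API.
public class MispredictCheck {
    // Returns the partition set the transaction should run with.
    // If the result is larger than `predicted`, the transaction must be
    // aborted and restarted with the returned set.
    static Set<Integer> checkBatch(Set<Integer> predicted, Set<Integer> batchTouches) {
        if (predicted.containsAll(batchTouches)) {
            return predicted;                 // prediction holds; keep executing
        }
        Set<Integer> restartWith = new HashSet<>(predicted);
        restartWith.addAll(batchTouches);     // add the mispredicted partitions
        return restartWith;
    }
}
```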
If all of the queries in the batch only access data on the PartitionExecutor’s local partition, then the queries are quickly passed down into the underlying C++ ExecutionEngine for processing.
If one or more of the queries need to access data on remote partitions (either at the same HStoreSite or at a remote HStoreSite), then the PartitionExecutor will package up the query batch into Google Protocol Buffer messages and send the request to the HStoreCoordinator. The PartitionExecutor must wait until the results are returned from the remote partitions before it can proceed.
Once all of the results for the queries are collected at the transaction’s local PartitionExecutor, they are passed back to the procedure control code. The control code will then be unblocked and allowed to continue executing the rest of the transaction’s program logic.
When the transaction either finishes or aborts, the PartitionExecutor will determine whether it can send back the ClientResponse immediately. If the transaction was single-partitioned and was not executed under a speculative execution mode, then the results can be sent back immediately. If the transaction was single-partitioned but was executed speculatively, its result will be put into a queue that will be released once the distributed transaction that it is blocked on finishes. If the transaction is multi-partitioned, then the PartitionExecutor will invoke the HStoreCoordinator to begin the two-phase commit process.
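The three-way decision in the paragraph above amounts to a small policy table. The method and result labels below are assumptions chosen to match the prose, not H-Store identifiers.

```java
// Illustrative sketch of the PartitionExecutor's response decision.
public class ResponsePolicy {
    static String onTxnDone(boolean singlePartitioned, boolean speculative) {
        if (!singlePartitioned) {
            return "start-two-phase-commit";    // hand off to the HStoreCoordinator
        }
        if (speculative) {
            return "queue-until-dtxn-finishes"; // held behind the blocking dtxn
        }
        return "send-client-response";          // safe to reply to the client now
    }
}
```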
To be written…
To be written…