{"id":1185,"date":"2012-01-08T00:26:19","date_gmt":"2012-01-08T05:26:19","guid":{"rendered":"http:\/\/hstore.cs.brown.edu\/?page_id=1185"},"modified":"2012-03-14T11:50:04","modified_gmt":"2012-03-14T15:50:04","slug":"mapreduce","status":"publish","type":"page","link":"https:\/\/hstore.cs.brown.edu\/documentation\/deployment\/mapreduce\/","title":{"rendered":"MapReduce Transactions"},"content":{"rendered":"
This following document describes how to use new experimental MapReduce-style stored procedures to execute distributed, analytical (OLAP) queries in H-Store.<\/p>\n
<\/a><\/p>\n MapReduce-style transactions allow for H-Store to execute analytical queries that access the entire database without having to incur the cost of distributed transaction coordination. When executing normal distributed transactions (i.e., transactions that need to access multiple partitions), H-Store blocks other transactions that are running or waiting in the queue on other partitions and send all data it need to base partition. This has been shown to have a significant impact on the throughput of the whole system. With H-Store’s MapReduce transactions, the transaction is split into many single-partition transactions that will independently execute in parallel on all partitions. Although it is invoked as a distributed transaction that blocks all partitions, the PartitionExecutor<\/b> will continue to execute non-MapReduce single-partition transactions and better performance has been proven.<\/p>\n When a MapReduce transaction is invoked, the following sequence of phase occurs:<\/p>\n Note that H-Store will automatically partition the shuffle data using the first key of the Map’s emit table.<\/p>\n <\/a><\/p>\n Just as with regular transactions in H-Store, all MapReduce transactions must be pre-defined as stored procedures in the benchmarks. To create a new MapReduce stored procedure, one must create a new Java class that extends the VoltMapReduceProcedure<\/a> abstract class. Users must implement map and reduce these two abstract functions, these will be talked about next.Each implementation of VoltMapReduceProcedure<\/tt> needs to include the following six<\/b> components:<\/p>\nOverview<\/h2>\n
\n
\n The transaction’s running at the base partition notifies all other partitions that to start the Map phase. The ExecutionSite<\/b> at each partition will then invoke a single-partition transaction that executes the MapInputQuery<\/a> that retrieves data from its local storage. These transactions run separately and do not need to coordinate with each other. These records are then passed to the Map()<\/a> method of the MapReduce stored procedure.<\/p>\n
\n After the single-partition Map transaction finishes, it will automatically begin the shuffle phase in a separate, non-blocking thread. Data that has the same key which is defined in MapReduce stored procedure will be sent to the same destination (i.e. partition) from MapOutputTable<\/em> to the input of the Reduce()<\/a> method.<\/p>\n
\n After sending all the data to its destination, it will move to Reduce phase where the tranaction’s running at the base partition notifies all other partitions that to begin the Reduce phase. This is similar to Map phase when ReduceInputTable<\/em> is prepared ready. Tuples in ReduceInputTable will be sorted by the key which is the first column by default. Tuples with the same key are then passed to the Reduce<\/em> method of the MapReduce stored procedure. The output of the Reduce method at each partition is coalesced at the base partition for the transaction and sent back to the client.\n<\/ol>\nMapReduce API<\/h2>\n
\n <\/a><\/p>\n
\n Before the definition of the class, it is a must to define the MapInputQuery ProcInfo. Internal system will read this name and match with it to know the user defined query next. It should be defined in MapReduce Stored Procedure like next:<\/p>\n\n