Speaker: Le Zhao
When: Monday, Feb 15th, 4:30pm
Where: GHC 4303
Abstract:
MapReduce, with its open-source implementation Hadoop, is among the most
frequently used parallel computing models nowadays. This 45-minute
tutorial aims to be a quick start guide to equip you with the MapReduce
mindset, so that you can quickly transform (sequential) programs into
MapReduce. We first outline the advantages of the MapReduce framework,
comparing it against two other simple parallel computation models. We then
introduce two alternative ways of programming in Hadoop: Hadoop
Streaming and the Java API. The core of the tutorial is a set of classical
use cases of MapReduce, which can be key building blocks for complex
MapReduce procedures. Two more advanced use cases are database join
and secondary sort. We showcase two real-world applications,
distributed inverted indexing and PageRank calculation, both built on top of
the introduced use cases. We also discuss lessons learnt from
processing large datasets, and finally list a set of existing tools
built on top of Hadoop.
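
As a small taste of the MapReduce mindset, here is a minimal sketch of
word counting split into a map step and a reduce step. It runs in a
single Python process and does not use Hadoop; the function names
(map_words, reduce_counts, run_local) are hypothetical and chosen only
for illustration, and the in-memory sort merely stands in for the
shuffle-and-sort phase that a framework such as Hadoop would perform
between the two steps.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: sum all counts emitted for a single word."""
    return word, sum(counts)


def run_local(lines):
    """Simulate map -> sort by key -> reduce in one process (no Hadoop involved)."""
    # Map: apply map_words to every line independently.
    pairs = [pair for line in lines for pair in map_words(line)]
    # Sorting by key is an assumption standing in for the framework's shuffle.
    pairs.sort(key=itemgetter(0))
    # Reduce: each group holds all pairs that share the same word.
    return dict(
        reduce_counts(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )


if __name__ == "__main__":
    print(run_local(["a b a", "b c"]))
```

Because map_words looks at one line at a time and reduce_counts looks at
one word at a time, both steps could in principle be run on many
machines in parallel, which is the property the tutorial builds on.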