Speaker: Le Zhao
When: Feb 15th Monday 4:30pm
Where: GHC 4303
Abstract:
MapReduce, with its open source implementation Hadoop, is among the most
widely used parallel computing models today. This 45-minute
tutorial aims to be a quick start guide to equip you with the MapReduce
mindset, so that you can quickly transform (sequential) programs into
MapReduce. We first outline advantages of the MapReduce framework,
comparing against two other simple parallel computation models. We then
introduce two alternative ways of programming in Hadoop, Hadoop
streaming and Java API. The core of the tutorial is a set of classical
use cases of MapReduce, which can be key building blocks for complex
MapReduce procedures. Two more advanced use cases include database join
and secondary sort. We showcase two real world applications:
distributed inverted indexing and PageRank calculation, built on top of
the introduced use cases. We also discuss lessons learnt from
processing large datasets, and finally list a set of existing tools
built on top of Hadoop.
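As an illustration of the MapReduce mindset described in the abstract above, the following is a minimal sketch of inverted indexing written as a map step and a reduce step. It is not taken from the tutorial: it simulates the shuffle phase in a single Python process rather than running on Hadoop, and the function names (map_doc, reduce_postings, run_inverted_index) and the whitespace tokenization are assumptions made for this example.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_postings(term, doc_ids):
    """Reduce step: collapse all doc ids seen for a term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def run_inverted_index(docs):
    """Simulate map -> shuffle (group by key) -> reduce in one process.

    `docs` is a hypothetical in-memory corpus: {doc_id: text}.
    """
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for term, value in map_doc(doc_id, text):
            groups[term].append(value)  # the "shuffle": group values by key
    return dict(reduce_postings(term, ids) for term, ids in groups.items())


if __name__ == "__main__":
    corpus = {"d1": "hadoop runs mapreduce", "d2": "MapReduce on Hadoop"}
    for term, postings in sorted(run_inverted_index(corpus).items()):
        print(term, postings)
```

With Hadoop streaming, the map and reduce functions would typically live in separate scripts that read from standard input and write tab-separated key/value lines, and the framework would perform the grouping that run_inverted_index does here by hand.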
Speaker: U Kang
Date: Wednesday, November 18
Time: 12:00 noon - 1:00pm
Location: Gates Hillman Complex 5222
Abstract:
In this talk, we describe PEGASUS, an open source Peta Graph Mining library which performs typical
graph mining tasks such as computing the diameter of the graph, computing the radius of each node and
finding the connected components. As the size of graphs reaches several Giga-, Tera- or Peta-bytes,
the necessity for such a library grows too. To the best of our knowledge, PEGASUS is the first such library,
implemented on top of the Hadoop platform, the open source version of MapReduce.
Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components,
etc.) are essentially repeated matrix-vector multiplications. In this talk we describe a key primitive
for PEGASUS, called GIM-V (Generalized Iterated Matrix-Vector multiplication). GIM-V is highly optimized,
achieving (a) good scale-up with the number of available machines, (b) running time linear in the number of
edges, and (c) performance more than 5 times faster than the non-optimized version of GIM-V.
Our experiments ran on M45, one of the top 50 supercomputers in the world.
We report our findings on several real graphs, including one of the largest publicly available
Web Graphs, thanks to Yahoo!, with 6.7 billion edges.
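To make the idea of graph mining as repeated matrix-vector multiplication concrete, here is a minimal single-machine sketch of connected components written in that style. It is not the PEGASUS implementation and does not use Hadoop. The three user-supplied operations follow the combine2 / combineAll / assign decomposition commonly described for GIM-V, as best understood here; the Python function names, the edge-triple layout, and the use of integer node ids as component labels are assumptions for this example.

```python
def gim_v(edges, v, combine2, combine_all, assign):
    """One generalized matrix-vector step.

    `edges` is a sparse matrix given as (i, j, m_ij) triples; `v` maps node -> value.
    """
    partial = {}
    for i, j, m_ij in edges:
        # combine2: combine one matrix entry with one vector entry
        partial.setdefault(i, []).append(combine2(m_ij, v[j]))
    result = {}
    for i in v:
        if i in partial:
            # combine_all: merge the partial results for row i
            # assign: decide the new value from the old value and the merged result
            result[i] = assign(v[i], combine_all(partial[i]))
        else:
            result[i] = v[i]  # isolated node: value is unchanged
    return result


def connected_components(nodes, undirected_edges):
    """Iterate gim_v until the vector stops changing; each node ends up
    labeled with the smallest node id in its component."""
    edges = [(a, b, 1) for a, b in undirected_edges]
    edges += [(b, a, 1) for a, b in undirected_edges]
    v = {n: n for n in nodes}  # every node starts in its own component
    while True:
        nxt = gim_v(edges, v,
                    combine2=lambda m_ij, v_j: m_ij * v_j,
                    combine_all=min,
                    assign=min)
        if nxt == v:
            return v
        v = nxt


if __name__ == "__main__":
    print(connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
```

Swapping the three operations changes the algorithm without touching the iteration loop; for example, a multiply, sum, and replace combination would give an ordinary matrix-vector product of the kind used in PageRank-style iteration.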
Speaker: Soila Pertet
Date: Wednesday, October 21st
Time: 12:00 noon - 1:00pm
Location: Gates Hillman Complex 4405
Abstract:
Performance problems in software frameworks such as Hadoop, which
support long-running, parallelized, data-intensive computations, can
hamper cost-management efforts in cloud-computing environments. Manual
diagnosis does not scale in such environments because of the number of
nodes and the number of performance metrics to be analyzed on each
node. This talk provides an overview of our group’s research in
automated problem diagnosis (what we call "fingerpointing") for
Hadoop. We discuss three aspects of our research: (i) a
diagnosis approach that synthesizes resource usage data from the OS
and task-execution flows from the Hadoop logs to diagnose problems,
(ii) an automated, online diagnosis framework that transparently
extracts different time-varying data sources and implements our
diagnosis algorithms as plug-in modules, and (iii) visualization tools
for Hadoop that provide programmers insight into the execution
patterns of their jobs. Our visualization tools have been checked into
the Hadoop repository under the Chukwa project.
Speaker: Andy Carlson
Date: Friday, March 6th
Time: 12:00 noon
Location: NSH 1305
Abstract:
We will describe work on populating an ontology of categories and relations. This problem serves as a case study for semi-supervised learning of dozens of functions. Exploiting known constraints between the functions being learned allows us to couple the learning of these functions and achieve high levels of accuracy with very little human effort. We will discuss how M45 allows us to perform experiments on hundreds of millions of web pages, and present preliminary experimental results.
Speaker: Wittawat Tantisiriroj
Date: Friday, Feb 6, 2009
Time: 12:00 noon - 1:00pm
Location: NSH 1507
Abstract:
Data-intensive distributed file systems are emerging as a key component of large-scale Internet services and cloud computing platforms. They are designed from the ground up and are tuned for specific application workloads. Leading examples, such as the Google File System, the Hadoop Distributed File System (HDFS) and Amazon S3, are defining this new purpose-built paradigm. It is tempting to classify file systems for large clusters into two disjoint categories, those for Internet services and those for high performance computing.
In this work, we compare and contrast parallel file systems, developed for high performance computing, and data-intensive distributed file systems, developed for Internet services. Using PVFS as a representative for parallel file systems and HDFS as a representative for Internet services file systems, we configure a parallel file system into a data-intensive Internet services stack, Hadoop, and test performance with microbenchmarks and macrobenchmarks running on an Internet services cluster, Yahoo!'s M45.
Once a number of configuration issues such as stripe unit sizes and application buffering sizes are dealt with, issues of replication, data layout and data-guided function shipping are found to be different, but supportable in parallel file systems. The performance of Hadoop applications storing data in an appropriately configured PVFS is comparable to that of applications using a purpose-built HDFS.