Speaker: Kevin Gimpel
Date: Thursday, July 31st
Time: 12:00 noon
Location: NSH 3305
Abstract: The MapReduce framework was originally designed for tasks requiring relatively simple processing of large amounts of data, such as document indexing in large collections and data mining. Here, we consider using MapReduce to distribute an application that performs computationally expensive data processing on relatively small amounts of data, and does so iteratively, sweeping through the full data set possibly several hundred times. We study the problem of unsupervised natural language grammar induction; typical approaches to this problem involve running the EM algorithm on a probabilistic grammar using a few thousand unannotated sentences of length 10 words or fewer. We describe our approach to distributing the problem using Hadoop to scale up to more and longer sentences and report results in four languages. We also discuss the challenges we faced in doing so and propose an approximate form of EM that preserves runtime advantages in situations of node failure by allowing some training data to be dropped on each iteration.
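To make the idea of distributing EM concrete, here is a minimal sketch in plain Python, with no Hadoop involved. It is not the speaker's implementation: it swaps the probabilistic grammar for a toy two-coin mixture model, and every name in it (e_step_map, reduce_stats, m_step, em_iteration, the failed argument) is hypothetical. The map step computes expected counts per data shard, the reduce step sums them, and shards listed in failed are simply left out of that iteration, which is one way to read "allowing some training data to be dropped".

```python
from functools import reduce

def e_step_map(shard, params):
    """Map: expected sufficient statistics for one shard.
    Each datum is (heads, tails) produced by one of two coins of unknown bias."""
    pi, p = params  # pi: mixing weights, p: per-coin probability of heads
    stats = [[0.0, 0.0, 0.0] for _ in range(2)]  # [responsibility, heads, tails]
    for h, t in shard:
        lik = [pi[k] * p[k] ** h * (1 - p[k]) ** t for k in range(2)]
        z = sum(lik)
        for k in range(2):
            r = lik[k] / z  # posterior probability that coin k produced this datum
            stats[k][0] += r
            stats[k][1] += r * h
            stats[k][2] += r * t
    return stats

def reduce_stats(a, b):
    """Reduce: add two sets of expected counts element-wise."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def m_step(stats):
    """Re-estimate the parameters from the summed expected counts."""
    total = sum(s[0] for s in stats)
    pi = [s[0] / total for s in stats]
    p = [s[1] / (s[1] + s[2]) for s in stats]
    return pi, p

def em_iteration(shards, params, failed=frozenset()):
    """One EM sweep; shards whose index is in `failed` are dropped (assumption:
    this stands in for a node that did not report back in time)."""
    partials = [e_step_map(s, params) for i, s in enumerate(shards) if i not in failed]
    return m_step(reduce(reduce_stats, partials))

shards = [[(9, 1), (8, 2)], [(2, 8), (1, 9)], [(9, 1), (1, 9)]]
params = ([0.5, 0.5], [0.6, 0.4])
for _ in range(20):
    params = em_iteration(shards, params, failed={2})  # shard 2 is skipped
```

Because the M-step only needs summed counts, dropping a shard changes the estimate but not the structure of the computation, so an iteration can finish without waiting for the missing partial result.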
Speaker: Prof. Priya Narasimhan
Date: Thursday, August 28
Time: 12:00 noon - 13:00
Location: University Center - Rangos 1
Abstract: Localizing performance problems (what we call "fingerpointing") is essential for distributed systems such as Hadoop that support long-running, parallelized, data-intensive computations over a large cluster of nodes. Manual fingerpointing does not scale in such environments because of the number of nodes and the number of performance metrics to be analyzed on each node. ASDF is an automated, online fingerpointing framework that transparently extracts and parses different time-varying data sources (e.g., sysstat, Hadoop logs) on each node, and implements multiple techniques (e.g., log analysis, correlation, clustering) to analyze these data sources jointly or in isolation. ASDF is intended to run transparently to, and not require any modifications of, both the hosted applications and the middleware (e.g., Hadoop) itself. ASDF should be deployable in production environments, where administrators might not have the luxury of instrumenting applications but could instead leverage other (black-box) data or existing system logs. We describe ASDF's online fingerpointing for documented performance problems in Hadoop, under different workloads. Our preliminary results indicate that ASDF incurs an average monitoring overhead of 0.38% of CPU time, and exhibits average online fingerpointing latencies of less than 1 minute with false-positive rates of less than 1%. Publications related to our fingerpointing research are available at http://www.ece.cmu.edu/~
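As a rough illustration of what automated fingerpointing across nodes can look like, the sketch below flags nodes whose value for a single black-box metric deviates strongly from that of their peers. This is not ASDF's algorithm: the abstract does not specify its analysis in this detail, so the sketch substitutes a simple median-absolute-deviation comparison, and the function name flag_outlier_nodes, the threshold value, and the sample CPU figures are all hypothetical.

```python
from statistics import median

def flag_outlier_nodes(metric_by_node, threshold=3.0):
    """Return nodes whose metric lies more than `threshold` median absolute
    deviations from the cluster median (an assumed, simplified peer comparison)."""
    values = list(metric_by_node.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Most nodes agree exactly; any node that differs at all is suspect.
        return [n for n, v in metric_by_node.items() if v != med]
    return [n for n, v in metric_by_node.items() if abs(v - med) / mad > threshold]

# Hypothetical per-node CPU utilization (percent) for one time window.
cpu = {"node1": 21.0, "node2": 19.5, "node3": 20.5, "node4": 95.0, "node5": 20.0}
suspects = flag_outlier_nodes(cpu)
```

The comparison relies on the assumption that nodes running the same kind of work behave alike, so the cluster itself serves as the baseline and no application instrumentation is needed.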
Speaker: Andreas Zollmann
Date: Friday, October 3rd
Time: 12:00 noon - 13:00
Location: Wean Hall 4615A
Abstract: We present the new Hadoop version of the CMU Syntax Augmented Machine Translation (SAMT) System. We ported the SAMT toolkit to the Hadoop MapReduce parallel processing architecture, allowing us to efficiently run experiments evaluating a novel "wider pipelines" approach to integrating evidence from N-best alignments into our translation models. We describe each step of the MapReduce pipeline as it is implemented in the new toolkit, and show improvements in translation quality from using N-best alignments in both hierarchical and syntax-augmented translation systems.
Speaker: Joseph Gonzalez
Date: Thursday, November 13th
Time: 1:30pm - 2:30pm
Location: Newell-Simon Hall 3305
Abstract: As computer architectures move towards parallelism, we must build a new theoretical understanding of parallelism in machine learning. We will present our recent work on parallel inference in graphical models. In this ongoing work, we show the theoretical limits of parallelism in belief propagation and prove that Map-Reduce Belief Propagation is asymptotically slower than even the optimal parallel algorithm. We introduce our new Residual Splash asynchronous belief propagation algorithm, which achieves the lower bound on certain graphical models. In addition, we will show the results of applying our Residual Splash algorithm to the real-world task of 3D video from monocular video. Appropriate eyewear will be provided.