Upcoming Talks

The Large-Scale Lunch is a monthly lunchtime presentation and discussion about the use of cluster, distributed, parallel, and other large-scale computing methods to solve current applied computer science research problems. The LSL is intended to help spread knowledge about how to do research that involves massive amounts of data and/or computation on modern, multiprocessor computing equipment. The ideal talk provides two types of insight:
  • insight into the basic research problem, why it is important, and how it was addressed, and
  • insight into the large-scale implementation problem, and how it was addressed.
Talks are typically informal, and are followed (or perhaps interrupted) by questions and discussion. A typical talk has about 30 minutes of content, leaving time for discussion and for getting lunch. This series is sponsored by Yahoo! and open to all interested members of the SCS/ECE/etc. communities. We hope these events will be especially useful for those currently using or planning to use Hadoop on the M45 cluster. We will not be sending regular emails to these lists about future events, so if you'd like to be notified of upcoming events, please join our mailing list. To subscribe, just send mail to

large-scale-request@mailman.srv.cs.cmu.edu

with message body "subscribe". Alternatively, go to the list information page:

https://mailman.srv.cs.cmu.edu/mailman/listinfo/large-scale

and subscribe there. Also, all information about upcoming events will be displayed on our website.

Distributed Asynchronous Online Learning for Natural Language Processing

Speaker: Kevin Gimpel
Date: Monday, April 12, 2010
Time: Noon
Venue: GHC 6115

Abstract:
Considerable speed-ups to machine learning problems have been achieved by two developments: distributed computing (either on multi-core or "cloud" architectures) and rapidly converging online learning algorithms. In this talk, we combine these two. Distributed computing has largely been paired with "batch" algorithms like EM and L-BFGS, in which the entire training dataset is processed once per iterative update; our approach makes more frequent online updates asynchronously, either in a pure online or mini-batch setting. Asynchronous updates can introduce error, but the approach has similar convergence guarantees to other online learning algorithms in certain settings, such as the case of online gradient-based optimization for convex objectives. We first consider this setting, and present a series of experiments exploring practical issues for a structured prediction task in natural language processing, named-entity recognition.
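As a rough illustration of the convex setting described above, the following sketch simulates asynchronous mini-batch gradient updates in a single process. It is not the speaker's implementation: the toy least-squares problem, the function names, and the fixed `delay` parameter (the number of updates by which each gradient is stale) are all assumptions made for illustration. A real system would run workers in parallel, with staleness that varies from update to update.

```python
import random
from collections import deque


def make_data(n=200, seed=0):
    """Synthetic 1-D regression data (hypothetical): y = 3x + 1 + small noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.01)))
    return data


def minibatch_gradient(w, b, batch):
    """Gradient of the mean squared error over one mini-batch (a convex objective)."""
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    return gw / len(batch), gb / len(batch)


def async_minibatch_sgd(data, delay=3, batch_size=10, lr=0.05, epochs=50, seed=1):
    """Simulate asynchrony: each gradient is applied `delay` updates after
    the parameters it was computed from. delay=0 is ordinary mini-batch SGD."""
    rng = random.Random(seed)
    w = b = 0.0
    pending = deque()  # gradients computed but not yet applied
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for i in range(0, len(order), batch_size):
            batch = order[i:i + batch_size]
            # A "worker" computes a gradient from the current parameters...
            pending.append(minibatch_gradient(w, b, batch))
            # ...but the update that is applied now is an older, stale one.
            if len(pending) > delay:
                gw, gb = pending.popleft()
                w -= lr * gw
                b -= lr * gb
    return w, b


if __name__ == "__main__":
    data = make_data()
    print("synchronous :", async_minibatch_sgd(data, delay=0))
    print("asynchronous:", async_minibatch_sgd(data, delay=3))
```

On this toy problem, with a small step size, the stale updates should still approach the true slope and intercept. This only illustrates the idea; it does not stand in for the convergence analysis or the experiments in the talk.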

We also consider settings that are not yet supported by theoretical results. We apply an online version of EM (Cappe and Moulines, 2009) to two unsupervised structured learning tasks: (1) word alignment for machine translation, and (2) unsupervised part-of-speech tagging. For the former we use a model that actually has a concave log-likelihood function, while the latter fits the more common unsupervised learning scenario with a non-concave objective. In both cases we find significant speed-ups over batch algorithms with no observable problems arising from the use of asynchronous updates. In addition, we present experimental results when running asynchronous mini-batch algorithms on M45, a large cluster running the Hadoop MapReduce framework. We find that, while MapReduce is not an ideal fit for these algorithms, they do converge faster than batch algorithms on the same hardware and we expect that the MapReduce framework may become more appropriate for asynchronous learning as problem sizes continue to grow.

This is joint work with Dipanjan Das and Noah Smith.
