The UC Berkeley Reliable Adaptive Distributed Systems Lab (RAD Lab) was started in January 2006 with the backing of major industry partners Sun Microsystems, Google, Inc., and Microsoft Corp., and has enjoyed the support of many additional industrial partners including VMware. Fast virtualization has been a key technology enabler for Cloud Computing, which has brought cheap, large-scale datacenter capacity within the economic reach of millions of programmers. We discussed our enthusiasm for Cloud Computing as well as some serious challenges for its success in our widely-read white paper, "Above the Clouds: A Berkeley View of Cloud Computing".
Of course, inexpensive access to cloud computing is only one ingredient for accelerating innovation in Internet services. It's true that some of the most influential and paradigm-changing Internet applications started as a single inventor with a great new idea: Pierre Omidyar prototyped the initial version of eBay over a long weekend, and Mark Zuckerberg created Facebook as a way to stay in touch with his college friends after graduation. Yet such applications quickly become so popular that they end up being completely overhauled and rebuilt, sometimes multiple times, because the original architecture couldn't keep up with demand. The pace of innovation would be faster if every great new app didn't require a large company to be built in order to scale up the app!
So the RAD Lab's 5-year "moon shot" mission statement is to develop the technology that would allow a single person with a great idea to develop, deploy, and grow their new "killer app" to millions of users. Our original bet, which has been borne out, was that Statistical Machine Learning (SML) would be a key ingredient in attacking the problems of performance prediction, problem detection and identification, and resource provisioning that go along with operating large-scale services. SML is a vigorous research field that has received high-profile attention in the last decade because of its applications in Internet search, spam detection, intrusion detection and recommendation systems. Yet it seemed less attention was being paid to how SML could also improve the infrastructure and operations side of applications. Dave Patterson and I, two of the lab's founders, had seen early potential in this approach from our previous work on Recovery-Oriented Computing.
Four years after the lab's founding, we are on track to prototype an SML-enabled datacenter that demonstrates SML's success at providing actionable information to operators with no SML expertise. We're pleased at the opportunity to describe some of these technologies on GoVirtual. All the work listed below is described further on each individual's GoVirtual profile page, along with links to recent papers and publications. We expect to make all of our software available under BSD-style licenses before the project ends in January 2011. Thanks for your interest in our work -- you can read more about the lab at http://radlab.cs.berkeley.edu.
CONSOLE LOG ANALYSIS. Every developer puts "printf-style" logging statements in her code, and applications built from many developers' components generate millions of lines of console logs each day. Yet the logs are nearly unintelligible to humans, not just because of their size but because each component's log printing statements were chosen to make sense only to that component's developer. And because logs usually consist of unstructured text mixed in with variable values, they're not particularly machine-friendly either.
Wei Xu has developed a methodology that combines console logs and application source code, uses SML to analyze the logs, and can distill 20 million lines of logs into a one-page decision tree that helps a system operator with no SML expertise to identify the most important log messages that may contain a clue to the cause of an operational problem.
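To make the first step concrete, here is a minimal sketch of how format strings recovered from source code could be used to separate the fixed text of a log line from its variable values, and to turn a stream of lines into per-message-type counts that a learning algorithm could consume. This is an illustration only, not Wei's implementation: the format strings, function names, and the restriction to %s and %d are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical format strings, as they might be recovered from printf-style
# logging calls in the application source code.
FORMAT_STRINGS = [
    "Opening file %s for user %d",
    "Request %d took %d ms",
]

def template_to_regex(fmt):
    """Turn a printf-style format string into a regex that captures its variables."""
    pattern = ""
    for part in re.split(r"(%[sd])", fmt):
        if part == "%d":
            pattern += r"(-?\d+)"      # integer variable
        elif part == "%s":
            pattern += r"(\S+)"        # whitespace-free string variable
        else:
            pattern += re.escape(part)  # fixed message text
    return re.compile("^" + pattern + "$")

TEMPLATES = [(fmt, template_to_regex(fmt)) for fmt in FORMAT_STRINGS]

def parse_line(line):
    """Return (message_type, variables), or (None, ()) if no template matches."""
    for fmt, regex in TEMPLATES:
        match = regex.match(line.strip())
        if match:
            return fmt, match.groups()
    return None, ()

def message_counts(lines):
    """Count how often each message type occurs: a simple numeric feature vector."""
    counts = Counter()
    for line in lines:
        msg_type, _ = parse_line(line)
        if msg_type is not None:
            counts[msg_type] += 1
    return counts
```

Once every line is reduced to a message type plus its variables, the free-form text becomes structured data, and count vectors like these are the kind of input that standard SML tools can work with.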
PERFORMANCE PREDICTION AND RESOURCE PROVISIONING. A game-changing aspect of cloud computing is the ability to release unused resources and simply pay as you go for the resources you need. But this raises the question of knowing what those resources are. If you add a new feature to your app, how will that change the mix of resources (CPU, memory, disk) needed to maintain your service-level agreement targets? If you run expensive analytics on a larger parallel database, how much speedup can you expect?
Archana Ganapathi has shown that many questions like these can be answered by recasting them as problems in correlation analysis, a subfield of SML with a rich literature. She has explored the use of a state-of-the-art correlation analysis algorithm, Kernel Canonical Correlation Analysis (KCCA), to predict the performance and resource needs of database queries and Map/Reduce (Hadoop) cloud computing jobs as well as to help automatically tune high-performance code to perform better on multicore parallel processors.
SCALABLE STORAGE. All the cloud computing in the world doesn't simplify the problem of scaling up database storage; it's the proverbial brick wall that all prototype Internet apps eventually run up against. A well known trick is to relax strict data consistency in exchange for better scaling, but this usually has to be done in an ad-hoc manner since most conventional database systems don't give the developer much flexibility in expressing this tradeoff directly.
Michael Armbrust has a working prototype for Scalable Consistency-Adjustable Data Storage (SCADS) that lets programmers specify allowable tradeoffs of consistency/scalability up front. Its SQL-like query language, PIQL (Performance Introspective Query Language), allows specifying powerful yet "performance-safe" queries that won't become scalability bottlenecks. Kristal Curtis is working to combine PIQL's query planner with Archana's performance prediction techniques to produce runtime predictions of whether a given query will meet an SLA.
THE DIRECTOR. Scalable storage is tricky enough given the consistency/scalability tradeoffs, but cloud computing adds another twist---a direct economic incentive to scale your database _down_ as well as up, to save money or energy during periods of less-than-peak usage. But which database shards or replicas should be consolidated when, and as demand starts to increase, which shards should be split or replicated to keep up with demand while minimizing total costs? Conventional wisdom calls for "load balancing" your database evenly across available machines, but sometimes optimum usage results from identifying only the specific "hot" data items and selectively replicating those, or from consolidating items that are no longer "hot" onto a subset of available servers and then releasing the unused servers.
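As a rough illustration of the "hot item" idea, the sketch below picks out the keys that account for a disproportionate share of recent accesses; those would be the candidates for selective replication, while everything else could stay consolidated. The function name and the 20% threshold are assumptions chosen for the example, not values taken from SCADS.

```python
from collections import Counter

def hot_keys(accesses, share_threshold=0.2):
    """Return the keys whose share of all accesses meets the threshold.

    `accesses` is a list of keys, one entry per observed request.
    The threshold is a hypothetical tuning knob.
    """
    counts = Counter(accesses)
    total = sum(counts.values())
    return sorted(key for key, n in counts.items() if n / total >= share_threshold)
```

For example, if key "a" receives 6 of 10 requests, "b" receives 3, and "c" receives 1, only "a" and "b" are reported as hot at the default threshold.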
Making those decisions is the job of Peter Bodik's Director, the "nerve center" that monitors performance and resource availability and uses SML models to make short- and medium-term predictions about the workload. These models are used to make decisions about scaling the database both up and down in response to changing workload, deciding when and where to replicate or consolidate and accounting for the costs of data copying. The Director allows SCADS to become self-managing and deployable on Cloud Computing.
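The control loop described here can be sketched in a few lines. The version below is a deliberately simplified stand-in, not the Director itself: it replaces the SML workload models with plain exponential smoothing, ignores the cost of data copying, and assumes a fixed per-server capacity. All names and parameter values are hypothetical.

```python
import math

def forecast_next(rates, alpha=0.5):
    """Exponentially smoothed forecast of the next request rate.

    A simple stand-in for the Director's SML workload models.
    """
    estimate = rates[0]
    for rate in rates[1:]:
        estimate = alpha * rate + (1 - alpha) * estimate
    return estimate

def servers_needed(predicted_rate, per_server_capacity, headroom=0.2):
    """Servers required to serve the predicted rate with spare headroom."""
    return max(1, math.ceil(predicted_rate * (1 + headroom) / per_server_capacity))

def plan(rates, current_servers, per_server_capacity):
    """Decide whether to add servers, release servers, or do nothing."""
    target = servers_needed(forecast_next(rates), per_server_capacity)
    if target > current_servers:
        return ("scale_up", target - current_servers)
    if target < current_servers:
        return ("scale_down", current_servers - target)
    return ("hold", 0)
```

With a steady observed rate of 400 requests per second, a capacity of 100 requests per second per server, and 20% headroom, the target is 5 servers, so a 3-server deployment would be told to add 2. A realistic policy would also weigh the cost of moving data before acting on a prediction.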