Wednesday, July 3, 2013

The Phone Call That Changed The Face of Big Data

Source - http://www.wired.com/
By - Klint Finley

Arun C. Murthy awoke to a phone call. It was 3 a.m., and an ad-targeting application at Yahoo, where he was an engineer, was running at painfully slow speeds. The culprit: a piece of software code that tapped into the open source number-crunching platform Hadoop. Someone else had written the code, but it was Murthy’s job to fix it.

It was a nuisance, but years later, that call would result in an entirely new path for Hadoop, a software system that’s practically synonymous with the notion of “Big Data.”

Today, Hadoop underpins Facebook, Twitter, eBay, Yahoo, and countless other companies. But in 2007, when Murthy took that early-morning call, it was still obscure. A year earlier, Doug Cutting and Michael Cafarella had created the platform, on their own time, inspired by white papers published by Google in 2004, and eventually Yahoo got behind the project, putting Cutting on the payroll. The company’s search architect, Eric Baldeschwieler, had asked Murthy to work on Hadoop because he had experience with both systems software — such as operating systems and other low-level software components — and open source.

“My journey with Hadoop almost didn’t happen,” Murthy remembers. “I looked at it and said: ‘Who the hell writes systems software in Java?’”

But he joined the effort anyway, and that night in 2007, he was cursing the decision. “Why the hell was I debugging other people’s Hadoop code?” he asked himself. And then he realized the problem was bigger than that: he was dealing with an application that wasn’t really meant to run on Hadoop.

Hadoop is actually a pair of software platforms: a storage system called Hadoop Distributed File System, or HDFS, and a processing system called MapReduce. You can dump massive amounts of data into the storage system, which can be distributed across dozens, hundreds, even thousands of servers. Then you use MapReduce to break a large problem into smaller problems distributed across your cluster. That’s the power of Hadoop: you can save money using lots of cheap commodity servers instead of a few expensive supercomputers.
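The split-then-combine idea can be illustrated with a toy word count, the canonical MapReduce example. This is a plain-Python sketch of the programming model only, not Hadoop's actual Java API; the function names and data are illustrative.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn each line of a chunk into (word, 1) pairs."""
    for line in chunk:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Two "chunks" standing in for data spread across many servers.
chunks = [["the quick brown fox"], ["the lazy dog"]]

# Each chunk is mapped independently (in Hadoop, on the node that
# holds that slice of data), then the intermediate pairs are shuffled
# to reducers and combined into a final answer.
pairs = [p for chunk in chunks for p in map_phase(chunk)]
counts = reduce_phase(pairs)
```

Because the map step needs nothing beyond its own chunk, it can run on every server in parallel; that independence is what lets Hadoop scale out across cheap machines.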

The problem is that sometimes developers just want to pull data out of one of those clusters without running a MapReduce job. That was the case with Yahoo’s ad-targeting system, and the realization gave Murthy his first inkling that Hadoop needed another system.

He found a quick work-around to the problem at hand, then began thinking about how to solve the larger issue. He even wrote about it in Hadoop’s bug tracking system. But from 2008 to 2010, the Hadoop team decided to focus on making Hadoop more “enterprise ready” by improving security and stability. Many other systems — such as Pig and Hive, which are included with all major distributions of Hadoop — were created to make it possible to query Hadoop without writing MapReduce jobs. But they still need to go through the MapReduce system in order to run. The queries are merely translated into MapReduce jobs.
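A rough illustration of what layers like Hive do, assuming a simplified view of the compilation: a declarative query is not executed directly but rewritten into map and reduce steps. The query, records, and field names below are hypothetical.

```python
from collections import defaultdict

# A Hive-style query such as:
#   SELECT country, COUNT(*) FROM visits GROUP BY country;
# is compiled into a MapReduce job roughly equivalent to the
# two functions below -- the query never bypasses MapReduce.

def mapper(record):
    # Emit the GROUP BY key with a count of one.
    yield (record["country"], 1)

def reducer(key, values):
    # Sum the counts for one key.
    return (key, sum(values))

visits = [{"country": "US"}, {"country": "BR"}, {"country": "US"}]

# Shuffle: group intermediate pairs by key before reducing.
groups = defaultdict(list)
for record in visits:
    for key, value in mapper(record):
        groups[key].append(value)

result = dict(reducer(k, vs) for k, vs in groups.items())
```

The convenience is real, but so is the cost: every query, however small, pays the startup and scheduling overhead of a full MapReduce job.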

By mid-2010, the Hadoop team thought the system was in good enough shape to start its next evolution. So Murthy and developers from across the Hadoop community finally started on the issue he had raised years before. The fruits of their labor will arrive with Hadoop 2.0, which introduces a new component known as YARN.

YARN is a system that sits atop HDFS. It lets developers create applications that interact with HDFS without the need to route through MapReduce. In fact, MapReduce itself will actually use YARN. “Hadoop 2.0 isn’t an arbitrary number,” says Murthy, who, in 2011, co-founded the Yahoo spinoff Hortonworks, a company that sells support and services for Hadoop. “It’s the 2nd architecture for Hadoop.”
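The architectural shift can be sketched as a toy model: a central resource manager hands out "containers" (slices of cluster CPU and memory), and each framework, MapReduce included, is just another application asking for them. This models the scheduling concept only; the class and method names are illustrative, not YARN's real API.

```python
class ResourceManager:
    """Toy stand-in for YARN's central scheduler."""

    def __init__(self, total_containers):
        self.free = total_containers

    def request(self, app_name, wanted):
        # Grant as many containers as are available, up to the request.
        granted = min(wanted, self.free)
        self.free -= granted
        return [f"{app_name}-container-{i}" for i in range(granted)]

rm = ResourceManager(total_containers=10)

# Under YARN, a batch MapReduce job and a real-time engine are peers:
# both negotiate with the same resource manager for cluster capacity.
mr_containers = rm.request("mapreduce", 6)
rt_containers = rm.request("realtime", 6)  # only 4 containers remain
```

The point of the sketch is the inversion: in Hadoop 1.x, MapReduce *was* the cluster's execution layer; under YARN, it becomes one tenant among many.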

Since Murthy first identified the need for YARN in 2007, many new software systems have been created to complement Hadoop. Twitter uses Storm, a system for processing data in real-time. Yahoo recently started using Spark, a Hadoop-style distributed system that holds data in-memory. Cloudera, one of Hortonworks’ main competitors, built Impala, which significantly improves the speed of Hadoop queries.

Today, these types of systems must either use MapReduce to interact with data stored in Hadoop clusters, or build their own solution for routing around MapReduce. But Murthy says all of these projects will be able to use YARN to interact with Hadoop, if their developers so desire. This could make both Hadoop and this ecosystem of complementary big data tools more useful.

For example, the IT monitoring company Nodeable built its own integration between Storm and Hadoop called StreamReduce before being acquired by Appcelerator last year. “[YARN] is exactly the kind of software we’ll be evaluating in the near future to bridge — ease — the gap between our batch and real-time processing,” says Appcelerator’s vice president of engineering Mark Griffin.

Spark runs on HDFS, though it discards MapReduce, veering away from the official Hadoop project. But YARN would allow the two to connect. “It’s possible to run Spark without YARN if you just want a simple deployment where a fixed set of resources is given to Spark, but we also want to support YARN for users who will install that,” explains Matei Zaharia, one of Spark’s developers at the University of California at Berkeley.

YARN is already available in some distributions of Hadoop, including the Cloudera distribution. The official Hadoop 2.0 open source project is in alpha, and the beta is expected soon. It will take a while to permeate the market, but when it does, it could make a very big difference. All thanks to a 3 a.m. phone call.
