Scalding is an in-house MapReduce framework that Twitter recently open-sourced. Like Pig, it provides an abstraction on top of MapReduce that makes it easy to write big data jobs in a syntax that's simple and concise. Unlike Pig, Scalding is written in pure Scala -- which means all the power of Scala and the JVM is already built-in. No more UDFs, folks!
At Twitter, our mission is to instantly connect people everywhere to what's most meaningful to them. With over a hundred million active users creating more than 250 million tweets every day, this means we need to quickly analyze massive amounts of data at scale.
That's why we recently open-sourced Scalding, an in-house MapReduce framework built on top of Scala and Cascading.
In 140: Instead of forcing you to write raw
reduce functions, Scalding allows you to write natural code like:
Simple to read, and just as easily run over a 10 line test file as a 10 terabyte data source in Hadoop!
Like Twitter, Scalding has a powerful simplicity that we love, and in this post we'll use the example of building a basic recommendation engine to show you why. A couple of notes before we begin:
- Scalding is open-source and lives here on Github.
- For a longer, tutorial-based version of this post (which goes more in-depth into the code and mathematics), see the original blog entry.
We use Scalding hard and we use it often, for everything from custom ad targeting algorithms to PageRank on the Twitter graph, and we hope you will too. Let's dive in!
Imagine you run an online movie business. You have a rating system in place (people can rate movies with 1 to 5 stars) and you want to calculate similarities between pairs of movies, so that if someone watches The Lion King, you can recommend films like Toy Story.
One way to define the similarity between two movies is to use their correlation:
- For every pair of movies A and B, find all the people who rated both A and B.
- Use these ratings to form a Movie A vector and a Movie B vector.
- Calculate the correlation between these two vectors.
- Whenever someone watches a movie, you can then recommend the movies most correlated with it.
Here's a snippet illustrating the code.
Notice that Scalding provides higher-level functions like
group for you (and many others, too, like
filter), so that you don't have to continually rewrite these patterns yourself. What's more, if there are other abstractions you'd like to add, go ahead! It's easy to add new functions.
Let's run this code over some real data. What dataset of movie ratings should we use?
My review for 'How to Train Your Dragon' on Rotten Tomatoes: 4 1/2 stars >bit.ly/xtw3d3— Benjamin West (@BenTheWest) February 21, 2012
People love to tweet whenever they rate a movie on Rotten Tomatoes, so let's use these ratings to generate our recommendations!
After grabbing and parsing these tweets, we can run a quick command using the handy scald.rb script that Scalding provides.
And minutes later, we're done!
As we'd expect, we see that
- Lord of the Rings, Harry Potter, and Star Wars movies are similar to other Lord of the Rings, Harry Potter, and Star Wars movies
- Big science fiction blockbusters (Avatar) are similar to big science fiction blockbusters (Inception)
- People who like one Justin Timberlake movie (Bad Teacher) also like other Justin Timberlake Movies (In Time). Similarly with Michael Fassbender (A Dangerous Method, Shame)
- Art house movies (The Tree of Life) stick together (Tinker Tailor Soldier Spy)
Just for fun, let's also look at the movies with the most negative correlation:
The more you like loud and dirty popcorn movies (Thor) and vamp romance (Twilight), the less you like arthouse? Sounds good to me.
Check-in similarities with Foursquare
Scalding also makes it easy to abstract away our input format, so that we can grab data from wherever we want. Tweets, TSVs, MySQL tables, HDFS -- no problem! And there's no reason our code needs to be tied to movie recommendations in particular, so let's switch it up.
For example, let's say we want to generate restaurant or tourist recommendations, and we have a bunch of information on who visits each location.
I'm at Empire State Building (350 5th Ave., btwn 33rd & 34th St., New York) 4sq.com/zZ5xGd— Simon Ackerman (@SimonAckerman) February 8, 2012
Here, we simply create a new class that scrapes tweets for Foursquare check-in information...
...and bam! Here are locations similar to the Empire State Building:
Here are places you might want to check out, if you check-in at Bergdorf Goodman:
And here's where to go after the Statue of Liberty:
Learn more about Scalding
Hopefully this post gave you a taste of the awesomeness of Scalding. To learn more:
- Check out Scalding on Github.
- Read this Getting Started Guide on the Scalding wiki.
- See a longer version of this post here.
- Follow @Scalding on Twitter!
-Edwin Chen(@edchedch). A huge shoutout to Argyris Zymnis (@argyris), Avi Bryant (@avibryant), and Oscar Boykin (@posco), the mastermind hackers who have spent (and continue spending) unimaginable hours making Scalding a joy to use.