Hey guys,
Amazon recently announced a new service - Amazon Elastic MapReduce. This new service is basically an environment where Hadoop - the popular grid computing framework - is installed and ready to run jobs.
All you really need to do is write your job and choose how many instances you want in your grid.
I wrote an involved sample and write-up for this service based on the Freebase data set, but this service is so easy to use I wanted to get another sample written as well.
A common use for grid computing frameworks like Hadoop is to use them to index web sites. It is commonly believed that Google uses Hadoop (or a framework like Hadoop) to create their massive search indexes.
I decided to do something similar on a much smaller scale. Like many people in technology, I like to read TechCrunch. Part of the fun in reading that site is going through the comments; it is a very lively site, so there are literally dozens of comments per article.
I wanted to get an idea of who is posting the most comments; this kind of question is perfect for Amazon Elastic MapReduce.
The first thing I did was to use HTTrack to download TechCrunch.com to my local file system. I hope they didn’t mind the extra traffic - as far as I can tell HTTrack is pretty good about following robot guidance, so hopefully it wasn’t too much of a nuisance.
I actually started up a Windows EC2 instance and ran HTTrack from there; EC2 instances get to use Amazon’s network insfrastructure so the download was pretty fast. TechCrunch is a huge site, so it took about 8 hours to get the entire site down.
HTTrack downloads everything and replicates directory structures so that each link will work. This is great, but Amazon Elastic MapReduce (actually Hadoop) cannot traverse directories, so I needed a flat layout.
I wrote a small program to traverse into each directory created by HTTrack and look for HTML files. Each file that was found was given a unique name (a GUID) and copied into a flat directory.
Amazon Elastic MapReduce uses S3 for its input and output. So, my next step was to take my flat list of uniquely named HTML files and copy them into an S3 bucket. I used S3 Organizer to do that. I ran this from my Windows EC2 instance as well since there were so many files - it took a few hours for the upload to finish.
With my files in S3, the fun with Amazon Elastic MapReduce starts. There are 2 ways to write jobs in Amazon Elastic MapReduce (and in Hadoop) - you can use streaming, or you can write a Java JAR.
I find streaming to be much easier. All you do is read from standard input and write to standard output; as a friendly reader pointed out, Amazon Elastic MapReduce supports a number of different languages.
Develop your data processing application authored in your choice of Java, Ruby, Perl, Python, PHP, R, or C++. There are several code samples available in the Getting Started Guide that will help you get up and running quickly.
In my experience, I’ve only used Ruby and Python for streaming jobs and they both worked well. I had some trouble with PHP during the private beta and have not had a chance to try it again.
I’m going to use Ruby for this example. The basic structure of a streaming job looks as shown below:

Amazon ElasticMap Reduce is going to traverse the S3 bucket location we give it and give our code the contents of each file. Amazon ElasticMap Reduce will decide how many instances of our code it needs. We don’t have to worry about that in our code.
The code we are writing is called the mapper - essentially, we are taking raw input and outputting a structured format. Another chunk of code - called a reducer - will handle this structured output.
So what will our input look like? Since we are just handling the HTML files, our input will literally be the HTML pages that make up TechCrunch.com. As an example, let’s have a look at the page that announced Amazon Elastic MapReduce (http://www.techcrunch.com/2009/04/02/with-hadoop-amazon-adds-a-web-scale-file-system-to-its-cloud-computer/)
We are interested in comment authors, so scroll down to the Comments section. I am using FireBug - a really handy tool for analyzing web pages - and it lets me right-click on an element and view its source.

You can of course do a complete view source for the page, but that does not take into account dynamically generated (a la AJAX) source. FireBug works from the document object model, so it is able to show you AJAX-generated code.
Anyways, in our particular case I don’t think there is any dynamically generated HTML so it does not matter; but FireBug is a good tool to keep in mind.
The result of our inspection is 
We can see that our comment author value is contained in span tag that has a CSS class of comment_author vcard. In this particular case, there is a link to the author’s blog.
There are actually 3 variations in how the comment author is shown:
There name can just be shown with no link
Our mapper code will handle each of these cases. Ruby has pretty convenient Regular Expressions, so we can define the patterns we are looking for ahead of time.

The first pattern finds a comment author; the second pattern helps us narrow down whether we have a link in the name.
I’m not a regular expressions expert, so there may be a more efficient way to describe the pattern; with regular expressions there always seems to be a be
The first part of our code is quite simple. We will take each line of input that we are given and try to match it with our regular expression.
If we get a match, we have a comment author.

If we have a comment author, then we want to check to see if there is a link to the name (case #1 and #2) or if it is just the name (case #3).
We are going to test for case #3, that is the easiest to handle

If this if statement evaluates to true, then we are just going to print out some structured output

We will talk more about why the output looks that way. For now, our mapper code looks as shown below.

Now let’s handle case #1 and #2. All we are doing is using our second regular expression to pull out the value behind the link. You’ll notice a little bit of a hack where I am removing the ending fb:name element. I couldn’t find an easy way to express that with a regular expression, so it seemed easy enough to just remove it afterwards.

So, in both cases, we end up printing out a value like LongValueSum:Eric Lee 1. This is the output of the mapper.
In Amazon Elastic MapReduce (and in Hadoop in general), we process our data by using mappers and reducers. A simple explanation of how these work together is that the mapper makes some sense of the raw data, but it does not do any correlation. The reducer takes the output of the mapper and does the correlation.
In our case, our mapper is ‘making some sense of the data’ by looking for the comment author values. There are lots of pages in TechCrunch that either do not have comments, or are simply not blog posts. This input gets filtered out by our mapper since nothing in it matches our search terms.
We can write a reducer, but Amazon Elastic MapReduce has a few standard reducers - one of which is called aggregate. The aggregate reducer takes output like LongValueSum:Eric Lee 1 and correlates it for us.
For example, if we were to parse through all of TechCrunch.com, we will have authors who have made several comments. So, our mapper will produce output like:
LongValueSum:Eric Lee 1
LongValueSum:Eric Lee 1
LongValueSum:Eric Lee 1
LongValueSum:Eric Lee 1
Remember, it is not the job of the mapper to do any correlation; just output structured data. If we use the default aggregate reducer, it will take this output and correlate it for us to produce:
LongValueSum:Eric Lee 4
I’m pretty lazy so I try to design my map reduce usage to use the default aggregate reducer as much as possible. It seems like most of the time, that type of correlation is really what you want anyways.
At this point we have our raw data from HTTrack, we have our mapper and we know we are going to use the default reducer. Let’s do a quick test before we send all of this to Amazon Elastic MapReduce.
Of course, we can test our code by installing Hadoop on our local system and running a job. But, one of the reasons that I like streaming jobs so much is that they are easy to test; to do a simple sanity test, we don’t need a local installation of Hadoop.
For testing purposes, I downloaded a few of the HTTrack files to my local system.

Amazon Elastic MapReduce is going to take each of these files and send them to my mapper. I can mimic this by using cat (on Mac/Linux) or type (Windows).
![]()
If I run this command, all of my files will be outputted to standard output; now we just have to point this to our mapper. We can do that by using the pipe (|) command.
![]()
This simulation is good enough for a quick sanity test of our mapper - if we run it, we should see our structured output.

In the little snippet above, we can see a few cases where our reducer will do some correlation for us. Your results will vary based on your test data.
At this point we are ready to use Amazon Elastic MapReduce.
The first step is to upload our code into S3 - I use S3 Organizer to create some sub-directories to help me organize my code.

This is where I ran into a little confusion. As you might know, in S3, there is no such thing as hierarchies. All buckets are essentially peers and in each bucket there are objects. There is no such thing as buckets within buckets.
But, S3 Organizer allows you to use a / in your object names to mimic a directory structure in the UI. This is purely a UI trick - in reality, your S3 data is all flat.
What can be a little confusing is that all of the samples in Amazon Elastic MapReduce use this nomenclature. It’s actually pretty handy once you get used to it, but it is a little confusing at first.
Creating a job in Amazon Elastic MapReduce is really easy. First, you go to the console https://console.aws.amazon.com/elasticmapreduce/home
Sign-up for the Amazon Elastic MapReduce service if you haven’t already done so.
Once you’ve signed in, create a new job by clicking the Create New Job Flow button.

This brings up a dialog box that will define your new job.
The name of the job does not really matter, pick anything you want and press the Continue button.
The next page is where you configure your job; take a deep breath at this point. The data you are entering is not difficult, but you can’t make any typos here. There is limited checking in the dialog, you really just have to get all of the values right.
The first piece of data is the input location - this is the S3 bucket where your raw HTTrack flat files are. I use S3 Organizer to navigate to the location and then cut and paste the value.

Amazon Elastic MapReduce does not want the leading /, so the value I put into the input location is learnaws-mapreduce/data/techcrunch/
Make sure to enter your value exactly as shown in S3 Organizer (minus the leading slash); this includes the trailing slash.

I know I’m probably beating you over the head about not making any typos - the reason is that when you start your job, there is a fair bit of overhead that Amazon Elastic MapReduce will go through. This overhead usually takes about 5 minutes to complete. There are a lot of typo-related errors that are not caught until after this overhead is complete - so you might have to wait 5 minutes before your job fails because of a simple typing mistake.
The second location is your output location. This is just a bucket location in S3 that you have created. The trick here is that it must be a location that does not exist yet. Amazon Elastic MapReduce will not overwrite your output, so if you give it a location that already exists, it will fail.
Using S3 Organizer again, here is how I organize my output.

As you can see, I have a few output locations already (i.e. tc_comments_3, tc_comments_2, etc). The value I specify in the dialog will be a new location (i.e. tc_comments_4).
It makes sense to double-check your values at this point. The input location has to exist, the output location cannot exist yet. Both location paths should have a trailing /.
OK, enough nagging about typos
now we specify the location of our mapper code. It is just a direct path to our S3 object.

Our value in the dialog box should be:
Lastly, we specify our reducer. If we wrote our own reducer, then we would just give it a S3 path to the code. But since we are using a default one, we just specify the name - aggregate

That’s all the data we need for our job - one more time, double check your values and then press the Continue button.
The next page lets you choose how big of a cluster you want to process your data; feel free to keep the default or choose your own value. I used 8 instances of the default small variety to process the TechCrunch data. The processing took about 2 hours to complete.
Press the Continue button when you have chosen your setup. This should bring you to the last page of the dialog box; press the Create Job Flow dialog to create and run your job.
If all goes well, your job should complete in a couple of hours.
Amazon Elastic MapReduce and Hadoop are hiding a fair bit of complexity from us. Our small mapper Ruby script will be instantiated a number of times across the number of machines we chose. Hadoop figures out the optimal number of mappers to run in parallel based on the input we give it. Amazon Elastic MapReduce handles the work of instantiating the number of machines (EC2 instances) behind the scenes for us. Since there are a number of mapper instances writing to standard output, it helps to have a low-latency file system to handle all this output. Hadoop can use a few different distributed file systems (including its own). But, Amazon Elastic MapReduce configures Hadoop to use a S3-based distributed file system.
Again, all of this complexity is hidden from us. We can just sit back (or walk away) and let the processing complete.

For some funny reason, I felt some odd sense of accomplishment in keeping 8 computers busy for about 2 hours
The data that will be generated is stored at your output location in S3.

I did some experimentation with running another job to process this output and store it into SimpleDB. It was an interesting exercise, but ultimately I decided it was wrong for this example.
I downloaded these results to my local file system - there really isn’t all that much data once it has been correlated. Here is a snippet from one of those files.

You can see that our reducer has done the correlation for us - there are some authors with 100+ comments!
Now the question becomes - how to visualize this data? I’m sure there are many different ways of visualizing a large data set. I chose a relatively simple one.
I decided to use a tag cloud type of interface.

I found one to use at http://www.phpclasses.org/browse/package/4158.html written by a gentleman named Er. Rochak Chauhan.
This code can be configured to read from a CSV file; the format of the CSV is tag, url, count
This is really close to the format that I already have from Amazon Elastic MapReduce; basically all we need to do is replace the TAB with a comma and put in a URL.
I used another Ruby script to do this.

I didn’t feel like doing anything too fancy - all this script does is take our Amazon Elastic MapReduce output from standard input, convert the tabs to commas, put in a blank URL and spit the result to standard output.
I ran this script as:
![]()
This produces 1 giant CSV file that we can use with the tag generator.
The tag generator is well written and easy to understand. It is PHP-based, so we just need to run it on a web server that supports PHP. I use XAMPP on the Mac.
The only changes we need to make are to copy the CSV file to the same location as example.php and change the name of the CSV file being referenced.

Then, we can request this page (i.e. http://localhost/comments/example.php) to produce our tag cloud (click on the image below to see it full size).
To summarize our application, we did the following:
That’s about it!
Amazon Elastic MapReduce makes doing this type of processing really easy - if you are using a streaming job, you are really just reading from standard input and printing to standard output. You can request as big of a cluster as you need, so no job is literally too big.
Everything we did from the web-based UI can also be done from an API. Like all of the Amazon Web Services, this new service has an XML web service based API. A Ruby client is included in the resources to get you started right away. Check out http://aws.amazon.com/elasticmapreduce for more details.
Thanks!
Eric.
April 8th, 2009 at 6:21 pm
FANTASTIC!
April 9th, 2009 at 6:56 am
Not that I’m impressed a lot, but this is more than I expected when I found a link on SU telling that the info is awesome. Thanks.
April 9th, 2009 at 7:22 am
Thanks! I’m still exploring MapReduce and wanted to share what I’ve found so far.
April 9th, 2009 at 7:25 am
Thanks!
April 15th, 2009 at 6:49 am
Great Tut, thanks!
Couple of questions: You don’t seem to explain (unless I somehow missed it!) what the LongValueSum part of the map function output is for. I assume its for the built in aggregate function? Can you shed some light on what this is.
Also, I’ve been trying to find some more detail about this built in aggregate function on the net and seem to come up dry. (I chalk this up to my poor Googling skills - or the newness of the AWS mapreduce offering!) Do you have any good links you can post?
Thanks!
-Rob
April 15th, 2009 at 6:56 am
Oh - one other thing: You mention:
“Amazon Elastic MapReduce supports Ruby and Python for streaming jobs.”
but AWS says:
“Amazon Elastic MapReduce allows you to implement data processing applications in many languages including Java, Perl, Ruby, Python, PHP, R, or C++. You can test these applications on different instance types and job flow sizes to pick the optimal performance settings for your specific case.”
Can I assume that I can write a Perl script as my mapper/reducer in addition to the two you mention??
Thanks,
-Rob
April 15th, 2009 at 9:32 am
Yup, LongValueSum is part of the Aggregate package in Hadoop.
I haven’t found a lot of great links for it either, the best one is the Java documentation: http://hadoop.apache.org/core/docs/r0.15.2/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html.
When I get a chance, I’ll do another post and go through these different options.
April 15th, 2009 at 9:33 am
Ah you’re right, let me edit that post to say ‘from my experience…’
I tried PHP during the private beta and had some problems; I don’t know if they fixed it since. I’m sure C++ would work. The MapReduce images seem to be pretty specific in their configuration, so it’s worth doing a little experiment with Perl before going to far in your implementation.
April 22nd, 2009 at 11:07 am
[...] I ran across a blog post over at the Learn AWS blog where Eric Idontgivemylastname gives a great little tutorial of how to use Hadoop on AWS (aka Amazon [...]