Python MapReduce framework
.. image:: https://github.com/Yelp/mrjob/raw/master/docs/logos/logo_medium.png
mrjob is a Python 2.7/3.4+ package that helps you write and run Hadoop Streaming jobs.

* `Stable version (v0.7.4) documentation <http://mrjob.readthedocs.org/en/stable/>`_
* `Development version documentation <http://mrjob.readthedocs.org/en/latest/>`_

.. image:: https://travis-ci.org/Yelp/mrjob.png
   :target: https://travis-ci.org/Yelp/mrjob

mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you to buy time on a Hadoop cluster on an hourly basis. mrjob has basic support for Google Cloud Dataproc (Dataproc), which allows you to buy time on a Hadoop cluster on a minute-by-minute basis. It also works with your own Hadoop cluster.

Some important features:

* Run jobs on EMR, Google Cloud Dataproc, your own Hadoop cluster, or locally (for testing)
* Write multi-step jobs (one map-reduce step feeds into the next)
* Easily launch Spark jobs on EMR or your own Hadoop cluster
* Duplicate your production environment inside Hadoop

  * Upload your source tree and put it in your job's ``$PYTHONPATH``
  * Set environment variables (e.g. ``$TZ``)
  * Setup handled transparently by the ``mrjob.conf`` config file

* Automatically interpret error logs
* SSH tunnel to hadoop job tracker (EMR only)
* Minimal setup

  * To run on EMR, set ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY``
  * To run on Dataproc, set ``$GOOGLE_APPLICATION_CREDENTIALS``

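To make the multi-step idea concrete, here is a minimal in-memory sketch in plain Python. It does not use mrjob's API: ``run_step`` and the mapper/reducer functions are hypothetical names chosen for illustration. It only shows how the output of one map-reduce step can become the input of the next.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def run_step(records, mapper, reducer):
    """Run one map-reduce step in memory: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    output = []
    for out_key in sorted(grouped):
        output.extend(reducer(out_key, grouped[out_key]))
    return output


# Step 1: count how often each word occurs.
def map_words(_, line):
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reduce_counts(word, counts):
    yield word, sum(counts)


# Step 2: send every (word, count) pair to one key and keep the largest.
def map_to_single_key(word, count):
    yield None, (count, word)


def reduce_max(_, count_word_pairs):
    yield max(count_word_pairs)


lines = [(None, "the quick brown fox"), (None, "the lazy dog and the fox")]
word_counts = run_step(lines, map_words, reduce_counts)    # output of step 1
most_common = run_step(word_counts, map_to_single_key, reduce_max)  # step 2
print(most_common)
```

A real framework distributes the grouping across machines; this sketch keeps everything in one process so the data flow between steps is easy to follow.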

``pip install mrjob``

As of v0.7.0, Amazon Web Services and Google Cloud Services are optional
dependencies. To use these, install with the ``aws`` and ``google`` targets,
respectively. For example:

``pip install mrjob[aws]``

Code for this example and more lives in ``mrjob/examples``.

.. code-block:: python

   """The classic MapReduce job: count the frequency of words.
   """
   from mrjob.job import MRJob
   import re

   WORD_RE = re.compile(r"[\w']+")


   class MRWordFreqCount(MRJob):

       def mapper(self, _, line):
           # emit (word, 1) for every word on the input line
           for word in WORD_RE.findall(line):
               yield (word.lower(), 1)

       def combiner(self, word, counts):
           # pre-sum counts before they reach the reducer
           yield (word, sum(counts))

       def reducer(self, word, counts):
           yield (word, sum(counts))


   if __name__ == '__main__':
        MRWordFreqCount.run()

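The combiner and reducer above share the same body, which can look redundant. The following plain-Python sketch (no mrjob involved; ``map_line``, ``sum_by_key`` and the two hard-coded input splits are assumptions made for illustration) shows the idea: partial sums are computed per input split first, and the final pass only has to add those partial sums together.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def map_line(line):
    """Mimic the mapper: one (word, 1) pair per word on the line."""
    return [(word.lower(), 1) for word in WORD_RE.findall(line)]


def sum_by_key(pairs):
    """Add up the values for each key; used for both combining and reducing."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Two hypothetical input splits, as if handled by two separate map tasks.
splits = [["It's a word", "a word"], ["A WORD again"]]

# Combine: partial counts per split, computed where the data was mapped.
combined = [
    sum_by_key(pair for line in split for pair in map_line(line))
    for split in splits
]

# Reduce: add the partial counts from every split.
final = sum_by_key(pair for partial in combined for pair in partial.items())
print(final)
```

Because addition is associative, summing partial sums gives the same totals as summing the raw ``1`` values, while moving fewer pairs between the map and reduce phases.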

::

    # locally
    python mrjob/examples/mr_word_freq_count.py README.rst > counts
    # on EMR
    python mrjob/examples/mr_word_freq_count.py README.rst -r emr > counts
    # on Dataproc
    python mrjob/examples/mr_word_freq_count.py README.rst -r dataproc > counts
    # on your Hadoop cluster
    python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts

To set up EMR on Amazon:

* Create an `Amazon Web Services account <http://aws.amazon.com/>`_
* Get your access and secret keys (see `your account page <http://aws.amazon.com/account/>`_)
* Set the environment variables ``$AWS_ACCESS_KEY_ID`` and
  ``$AWS_SECRET_ACCESS_KEY`` accordingly

To set up Dataproc on Google:

* `Create a Google Cloud Platform account <http://cloud.google.com/>`_, see top-right
* `Learn about Google Cloud Platform "projects" <https://cloud.google.com/docs/overview/#projects>`_
* `Select or create a Cloud Platform Console project <https://console.cloud.google.com/project>`_
* `Enable billing for your project <https://console.cloud.google.com/billing>`_
* Go to the `API Manager <https://console.cloud.google.com/apis>`_ and search for / enable the following APIs...
* Under Credentials, Create Credentials and select Service account key. Then, select New service account, enter a Name and select Key type JSON.
* Install the `Google Cloud SDK <https://cloud.google.com/sdk/>`_

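Before launching a job on a cloud runner, it can help to confirm that the environment variables named in this README are actually set. The helper below is a hypothetical convenience sketch, not part of mrjob: ``REQUIRED_ENV`` and ``missing_credentials`` are illustrative names, and the mapping only covers the variables mentioned here.

```python
import os

# Environment variables this README asks for, keyed by the -r runner name.
REQUIRED_ENV = {
    "emr": ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"),
    "dataproc": ("GOOGLE_APPLICATION_CREDENTIALS",),
}


def missing_credentials(runner, environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [
        name for name in REQUIRED_ENV.get(runner, ())
        if not environ.get(name)
    ]


if __name__ == "__main__":
    for runner in ("emr", "dataproc"):
        missing = missing_credentials(runner)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(runner + ": " + status)
```

Runners that are not listed in the mapping, such as ``hadoop``, are treated as needing no variables.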
To run in other AWS regions, upload your source tree, run ``make``, and use
other advanced mrjob features, you'll need to set up ``mrjob.conf``. mrjob looks
for its conf file in:

* ``$MRJOB_CONF``
* ``~/.mrjob.conf``
* ``/etc/mrjob.conf``

See `the mrjob.conf documentation <https://mrjob.readthedocs.io/en/latest/guides/configs-basics.html>`_ for more
information.

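The lookup order above can be sketched in a few lines of Python. This is an illustration of the documented order only, assuming the first existing file wins; ``candidate_conf_paths`` and ``find_conf`` are hypothetical names, and mrjob's own implementation may differ in details.

```python
import os


def candidate_conf_paths(environ=None):
    """List the conf locations in the order given above."""
    environ = os.environ if environ is None else environ
    paths = []
    if environ.get("MRJOB_CONF"):
        paths.append(environ["MRJOB_CONF"])
    paths.append(os.path.join(os.path.expanduser("~"), ".mrjob.conf"))
    paths.append("/etc/mrjob.conf")
    return paths


def find_conf(environ=None, exists=os.path.exists):
    """Return the first candidate path that exists, or None if none do."""
    for path in candidate_conf_paths(environ):
        if exists(path):
            return path
    return None


if __name__ == "__main__":
    print(find_conf())
```

The ``exists`` parameter is injectable so the search order can be checked without touching the real filesystem.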
* `Source code <http://github.com/Yelp/mrjob>`__
* `Documentation <https://mrjob.readthedocs.io/en/latest/>`_
* `Discussion group <http://groups.google.com/group/mrjob>`_
* `Hadoop Streaming <http://hadoop.apache.org/docs/stable1/streaming.html>`_
* `Elastic MapReduce <http://aws.amazon.com/documentation/elasticmapreduce/>`_
* `Google Cloud Dataproc <https://cloud.google.com/dataproc/overview>`_
* `PyCon 2011 mrjob overview <http://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2011-mrjob-distributed-computing-for-everyone-4898987/>`_
* `Introduction to Recommendations and MapReduce with mrjob <http://aimotion.blogspot.com/2012/08/introduction-to-recommendations-with.html>`_
  (`source code <https://github.com/marcelcaraciolo/recsys-mapreduce-mrjob>`__)
* `Social Graph Analysis Using Elastic MapReduce and PyPy <http://postneo.com/2011/05/04/social-graph-analysis-using-elastic-mapreduce-and-pypy>`_

Thanks to `Greg Killion <mailto:greg@blind-works.net>`_
(`ROMEO ECHO_DELTA <http://www.romeoechodelta.net/>`_) for the logo.