Hadoop Map Reduce Python


Hadoop MapReduce is a programming model and framework for processing large-scale data in a distributed and parallel manner, primarily associated with the Hadoop ecosystem. While MapReduce traditionally uses Java as its primary programming language, you can also write MapReduce programs in Python, thanks to projects like Hadoop Streaming and MRJob.

Here’s an overview of how you can use Python with Hadoop MapReduce:

  1. Hadoop Streaming:

    • Hadoop Streaming is a utility that allows you to create and run MapReduce jobs with any programming language that can read from standard input and write to standard output.
    • To write a MapReduce program in Python using Hadoop Streaming, you need to create two Python scripts: one for the mapper and another for the reducer.
    • The mapper script reads input data, processes it, and emits key-value pairs to standard output. The reducer script takes the output of the mapper and performs aggregation or further processing.
    • You can use command-line tools to submit Hadoop Streaming jobs, specifying the Python scripts as the mapper and reducer.
  2. MRJob:

    • MRJob is a Python library developed by Yelp that simplifies the creation and running of MapReduce jobs on Hadoop clusters. It abstracts many of the complexities of Hadoop and provides a Pythonic way to define and run MapReduce jobs.
    • With MRJob, you can write your MapReduce logic in Python classes, making it more readable and maintainable.
    • MRJob can run jobs on Hadoop clusters, on Amazon EMR, or locally for development and testing.
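As a sketch of what this looks like in practice (assuming the mrjob package is installed via `pip install mrjob`; `MRWordCount` is a class name chosen here for illustration), a word count job in MRJob is a single class:

python

from mrjob.job import MRJob

class MRWordCount(MRJob):
    # mapper receives (key, line); the key is ignored for plain text input.
    def mapper(self, _, line):
        for word in line.strip().split():
            yield word, 1

    # reducer receives a word and an iterator over all counts emitted for it.
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Saved as, say, wordcount.py, this runs locally with `python wordcount.py input.txt` and on a cluster with `python wordcount.py -r hadoop hdfs:///path/to/input` — MRJob handles the shipping and wiring that Hadoop Streaming makes you do by hand.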

Here’s a basic example of a Python MapReduce program using Hadoop Streaming:

Mapper Script (mapper.py):

python

#!/usr/bin/env python3

import sys

# Read lines from standard input, split each into words,
# and emit one tab-separated (word, 1) pair per word.
for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        # Emit key-value pair (word, 1)
        print(f"{word}\t1")

Reducer Script (reducer.py):

python

#!/usr/bin/env python3

import sys

current_word = None
word_count = 0

# Hadoop sorts mapper output by key before the reduce phase,
# so all counts for a given word arrive on consecutive lines.
for line in sys.stdin:
    word, count = line.strip().split("\t")
    count = int(count)

    if current_word == word:
        word_count += count
    else:
        if current_word is not None:
            # Emit result (word, total_count)
            print(f"{current_word}\t{word_count}")
        current_word = word
        word_count = count

# Don't forget the last word
if current_word is not None:
    print(f"{current_word}\t{word_count}")
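The reducer above only works because Hadoop sorts the mapper's output by key between the map and reduce phases. That map → shuffle/sort → reduce flow can be simulated in plain Python (standard library only; `run_wordcount` is a helper name invented here for illustration):

python

import itertools
from operator import itemgetter

def run_wordcount(lines):
    """Simulate map -> shuffle/sort -> reduce for word count."""
    # Map phase: emit (word, 1) for every word, like mapper.py.
    pairs = [(word, 1) for line in lines for word in line.strip().split()]
    # Shuffle/sort phase: Hadoop groups mapper output by key.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: sum counts over each consecutive run of one key,
    # which is exactly what reducer.py's current_word bookkeeping does.
    return {word: sum(c for _, c in group)
            for word, group in itertools.groupby(pairs, key=itemgetter(0))}

print(run_wordcount(["the quick fox", "the lazy dog"]))
# → {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
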

To run this MapReduce job using Hadoop Streaming, you can use a command like:

bash
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input input.txt \
    -output output

The -files option ships the two scripts to the cluster nodes, and the output directory must not already exist when the job is submitted.
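Before submitting to a cluster, you can dry-run the same logic locally with a shell pipeline, using sort to stand in for Hadoop's shuffle phase (this assumes mapper.py and reducer.py from above are in the current directory; the sample input is invented for the dry run):

bash

# Create a tiny sample input file.
printf 'the quick fox\nthe lazy dog\n' > input.txt

# cat | map | sort | reduce mirrors what Hadoop Streaming does on the cluster.
cat input.txt | python3 mapper.py | sort -k1,1 | python3 reducer.py

This local pipeline is a quick way to catch bugs in the scripts before paying the cost of a cluster round trip.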

 

Hadoop Training Demo Day 1 Video:

 
You can find more information about Hadoop Training in this Hadoop Docs Link

 

Conclusion:

Unogeeks is the No.1 IT Training Institute for Hadoop Training. Anyone Disagree? Please drop in a comment

You can check out our other latest blogs on Hadoop Training here – Hadoop Blogs

Please check out our Best In Class Hadoop Training Details here – Hadoop Training

💬 Follow & Connect with us:

———————————-

For Training inquiries:

Call/Whatsapp: +91 73960 33555

Mail us at: info@unogeeks.com

Our Website ➜ https://unogeeks.com

Follow us:

Instagram: https://www.instagram.com/unogeeks

Facebook: https://www.facebook.com/UnogeeksSoftwareTrainingInstitute

Twitter: https://twitter.com/unogeeks


