Overview
- Focus on large-scale computing for data mining.
- Challenges:
- How to distribute computation efficiently.
- Handling huge data volumes beyond single-machine capacity.
- Tolerating node failures in distributed systems.
Motivation
Example: The Web
- 20+ billion web pages × 20 KB/page ≈ 400 TB of data.
- A single machine, reading sequentially from disk, would take roughly 4 months just to read this data.
- Requires clusters: commodity Linux nodes connected via Ethernet.
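The reading-time claim above can be checked with back-of-the-envelope arithmetic. The ~40 MB/s sustained disk read speed below is an assumption (a plausible figure for commodity drives of that era), not from the source:

```python
# Back-of-the-envelope check of the "4 months to read the web" claim.
total_bytes = 20e9 * 20e3           # 20 billion pages x 20 KB/page = 400 TB
read_speed = 40e6                   # bytes/second; assumed sequential disk speed
seconds = total_bytes / read_speed
months = seconds / (30 * 24 * 3600)
print(f"{total_bytes/1e12:.0f} TB, ~{months:.1f} months to read")
```

At the assumed speed this works out to about 3.9 months, consistent with the estimate in the notes.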
Distributed Systems
- Google and Hadoop leverage distributed file systems (GFS, HDFS).
- Computation is brought to the data.
- Files are split into chunks (16–64MB) that are replicated for reliability.
- A master node (the NameNode in HDFS) manages metadata and chunk locations.
MapReduce Model
- Distributes work across multiple nodes.
- Works on key–value pairs:
( \text{Map}: (k, v) \rightarrow [(k', v')] )
( \text{Reduce}: (k', [v']) \rightarrow [(k'', v'')] )
- Robust against failures—automatically reschedules tasks.
Word Count Example
- Input: Collection of text documents.
- Map step: Produce (word, 1) pairs.
- Group by key / Shuffle: Collect all emitted counts for each word.
- Reduce step: Sum counts → (word, total_count).
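The three steps above can be sketched as a minimal single-process Python program (an in-memory illustration of the data flow, not a distributed implementation; the function names are illustrative):

```python
from collections import defaultdict

def map_fn(doc):
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Group by key: collect all values emitted for each word."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: sum the counts for one word."""
    return (key, sum(values))

docs = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_fn(doc)]
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())
print(counts["the"])  # -> 2
```

In a real MapReduce system the shuffle happens over the network between mapper and reducer machines; here it is simulated by a dictionary.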
System Components
- Master node coordinates tasks.
- Workers execute Map/Reduce tasks.
- Intermediate results stored on local file systems.
Refinements
- Combiner: Local aggregation before Reduce for efficiency (must be associative and commutative).
- Partition Function: Custom network distribution, e.g.,
hash(hostname(URL)) mod R.
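Both refinements can be sketched in a few lines, reusing word count's (word, count) pairs for the combiner and the URL-hashing rule above for the partitioner. This is a hedged illustration under assumed names (`combine`, `partition`, `R`), not the API of any particular framework:

```python
from collections import defaultdict
from urllib.parse import urlparse

R = 4  # number of reduce tasks (assumed)

def combine(pairs):
    """Combiner: pre-sum counts on each mapper before the shuffle.
    Valid here because addition is associative and commutative."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def partition(url):
    """Partition function: hash(hostname(URL)) mod R, so all pages
    from one host are routed to the same reduce task."""
    return hash(urlparse(url).hostname) % R
```

The combiner shrinks network traffic (many (word, 1) pairs become one (word, n) pair per mapper); the custom partitioner controls which reducer receives which keys.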
Suitable Problems
- Large-scale processing tasks like:
- Log analysis
- Graph/link analysis
- Machine learning on big datasets
- Not ideal for: frequent updates, transactional processing (e.g., e-commerce).
Summary
- MapReduce simplifies distributed data processing.
- Separates logic (Map, Reduce) from system complexity (data shuffling, fault recovery, task scheduling).