Overview
- Focus on large-scale computing for data mining.
- Challenges:
- How to distribute computation efficiently.
- Handling huge data volumes beyond single-machine capacity.
- Tolerating node failures in distributed systems.
Motivation
Example: The Web
- 20+ billion web pages × 20 KB/page ≈ 400 TB of data.
- A single machine, reading sequentially from disk, would take roughly 4 months just to read this data.
- Requires clusters: commodity Linux nodes connected via Ethernet.
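The reading-time claim above can be checked with back-of-the-envelope arithmetic. The ~40 MB/s sustained disk read speed below is an assumption (a plausible figure for commodity drives of that era), not from the source:

```python
# Back-of-the-envelope check of the "4 months to read the web" claim.
total_bytes = 20e9 * 20e3           # 20 billion pages x 20 KB/page = 400 TB
read_speed = 40e6                   # bytes/second; assumed sequential disk speed
seconds = total_bytes / read_speed
months = seconds / (30 * 24 * 3600)
print(f"{total_bytes/1e12:.0f} TB, ~{months:.1f} months to read")
```

At the assumed speed this works out to about 3.9 months, consistent with the estimate in the notes.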
Distributed Systems
- Google and Hadoop leverage distributed file systems (GFS, HDFS).
- Computation is brought to the data.
- Files are split into chunks (16–64MB) that are replicated for reliability.
- A master node (the NameNode in HDFS) manages metadata and chunk locations.
MapReduce Model
- Distributes work across multiple nodes.
- Works on key–value pairs:
( \text{Map}: (k, v) \rightarrow [(k', v')] )
( \text{Reduce}: (k', [v']) \rightarrow [(k'', v'')] )
- Robust against failures—automatically reschedules tasks.
Word Count Example
- Input: Collection of text documents.
- Map step: Produce (word, 1) pairs.
- Group by key / Shuffle: Collect all emitted counts for each word.
- Reduce step: Sum counts → (word, total_count).
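The three steps above can be sketched as a minimal single-process Python program (an in-memory illustration of the data flow, not a distributed implementation; the function names are illustrative):

```python
from collections import defaultdict

def map_fn(doc):
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Group by key: collect all values emitted for each word."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: sum the counts for one word."""
    return (key, sum(values))

docs = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_fn(doc)]
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())
print(counts["the"])  # -> 2
```

In a real MapReduce system the shuffle happens over the network between mapper and reducer machines; here it is simulated by a dictionary.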
System Components
- Master node coordinates tasks.
- Workers execute Map/Reduce tasks.
- Intermediate results stored on local file systems.
Refinements
- Combiner: Local aggregation before Reduce for efficiency (must be associative and commutative).
- Partition Function: Custom network distribution, e.g.,
hash(hostname(URL)) mod R.
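Both refinements can be sketched in a few lines, reusing word count's (word, count) pairs for the combiner and the URL-hashing rule above for the partitioner. This is a hedged illustration under assumed names (`combine`, `partition`, `R`), not the API of any particular framework:

```python
from collections import defaultdict
from urllib.parse import urlparse

R = 4  # number of reduce tasks (assumed)

def combine(pairs):
    """Combiner: pre-sum counts on each mapper before the shuffle.
    Valid here because addition is associative and commutative."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def partition(url):
    """Partition function: hash(hostname(URL)) mod R, so all pages
    from one host are routed to the same reduce task."""
    return hash(urlparse(url).hostname) % R
```

The combiner shrinks network traffic (many (word, 1) pairs become one (word, n) pair per mapper); the custom partitioner controls which reducer receives which keys.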
Suitable Problems
- Large-scale processing tasks like:
- Log analysis
- Graph/link analysis
- Machine learning on big datasets
- Not ideal for: frequent updates, transactional processing (e.g., e-commerce).
Summary
- MapReduce simplifies distributed data processing.
- Separates logic (Map, Reduce) from system complexity (data shuffling, fault recovery, task scheduling).