Overview

  • Focus on large-scale computing for data mining.
  • Challenges:
    • How to distribute computation efficiently.
    • Handling huge data volumes beyond single-machine capacity.
    • Tolerating node failures in distributed systems.

Motivation

Example: The Web

  • 20+ billion web pages × 20KB/page ≈ 400TB of data.
  • At typical sequential disk read speeds (~35 MB/s), a single machine would take ~4 months just to read this data.
  • Requires clusters: commodity Linux nodes connected via Ethernet.
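The "4 months" figure follows from a quick back-of-the-envelope calculation (the ~35 MB/s read rate is an assumed typical disk speed, not from the original notes):

```python
# Time to read ~400 TB of web data on one machine at a
# sequential disk rate of ~35 MB/s (assumed; actual rates vary).
data_bytes = 20e9 * 20e3          # 20 billion pages x 20 KB/page = 4e14 bytes
read_rate = 35e6                  # bytes per second
seconds = data_bytes / read_rate
days = seconds / 86400
print(round(days))                # roughly 130 days, i.e. ~4 months
```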

Distributed Systems

  • Google and Hadoop leverage distributed file systems (GFS, HDFS).
  • Computation is brought to the data.
  • Files are split into chunks (16–64MB) that are replicated for reliability.
  • A master node (the NameNode in HDFS) manages file metadata and chunk locations.

MapReduce Model

  • Distributes work across multiple nodes.
  • Works on key–value pairs:
    Map: (k, v) → [(k', v')]
    Reduce: (k', [v']) → [(k'', v'')]
  • Robust against failures—automatically reschedules tasks.
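The two signatures above can be sketched as a single-process simulation (illustrative only; a real runtime distributes each phase across many workers and handles failures):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Simulate MapReduce: Map, group by key (shuffle), then Reduce."""
    # Map: each input record yields zero or more (k', v') pairs.
    intermediate = [pair for record in inputs for pair in map_fn(record)]
    # Shuffle: sort and group all intermediate pairs by key k'.
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]
        # Reduce: (k', [v']) -> zero or more (k'', v'') output pairs.
        output.extend(reduce_fn(key, values))
    return output
```

For example, `run_mapreduce(docs, map_fn, reduce_fn)` with a word-splitting `map_fn` and a summing `reduce_fn` reproduces the word-count computation described in the next section.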

Word Count Example

  1. Input: Collection of text documents.
  2. Map step: Produce (word, 1) pairs.
  3. Group by key / Shuffle: Collect all (word, 1) pairs sharing the same word.
  4. Reduce step: Sum counts → (word, total_count).
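The four steps above can be sketched end-to-end in a few lines of Python (a self-contained, in-memory sketch; a real job would run Map and Reduce tasks on separate workers):

```python
from collections import defaultdict

def map_fn(doc):
    # Map step: emit (word, 1) for every word in the document.
    for word in doc.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce step: sum the counts for one word.
    return (word, sum(counts))

def word_count(docs):
    # Group by key ("shuffle"): collect all values emitted for each word.
    groups = defaultdict(list)
    for doc in docs:
        for word, one in map_fn(doc):
            groups[word].append(one)
    # Reduce each (word, [1, 1, ...]) group to (word, total_count).
    return dict(reduce_fn(w, counts) for w, counts in groups.items())

print(word_count(["the quick fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```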

System Components

  • Master node coordinates tasks.
  • Workers execute Map/Reduce tasks.
  • Intermediate results stored on local file systems.

Refinements

  • Combiner: local pre-aggregation on each Map node before the shuffle; the combining function must be associative and commutative (e.g., sum).
  • Partition Function: controls which Reduce task receives each key, e.g.,
    hash(hostname(URL)) mod R, so all URLs from the same host reach the same task.
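A custom partition function like the one above might look as follows (a minimal sketch; the helper names are illustrative, and a deterministic hash is used because Python's built-in `hash()` is salted per process):

```python
import hashlib
from urllib.parse import urlparse

def stable_hash(s):
    # Deterministic string hash; built-in hash() varies between runs.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def host_partition(url, R):
    # hash(hostname(URL)) mod R: route every URL from the same
    # host to the same one of R Reduce tasks (and output file).
    return stable_hash(urlparse(url).hostname) % R

R = 4  # number of Reduce tasks (illustrative)
a = host_partition("http://example.com/a", R)
b = host_partition("http://example.com/b", R)
assert a == b  # same host -> same partition
```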

Suitable Problems

  • Large-scale processing tasks like:
    • Log analysis
    • Graph/link analysis
    • Machine learning on big datasets
  • Not ideal for: workloads with frequent small updates or transactional processing (e.g., e-commerce).

Summary

  • MapReduce simplifies distributed data processing.
  • Separates logic (Map, Reduce) from system complexity (data shuffling, fault recovery, task scheduling).