Last updated: 2021-07-29 16:52:10
Cover page
Hadoop Beginner's Guide
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Time for action – heading
Reader feedback
Customer support
Chapter 1. What It's All About
Big data processing
Cloud computing with Amazon Web Services
Summary
Chapter 2. Getting Hadoop Up and Running
Hadoop on a local Ubuntu host
Time for action – checking the prerequisites
Time for action – downloading Hadoop
Time for action – setting up SSH
Time for action – using Hadoop to calculate Pi
Time for action – configuring the pseudo-distributed mode
Time for action – changing the base HDFS directory
Time for action – formatting the NameNode
Time for action – starting Hadoop
Time for action – using HDFS
Time for action – WordCount, the Hello World of MapReduce
Using Elastic MapReduce
Time for action – WordCount on EMR using the management console
Comparison of local versus EMR Hadoop
Chapter 3. Understanding MapReduce
Key/value pairs
The Hadoop Java API for MapReduce
Writing MapReduce programs
Time for action – setting up the classpath
Time for action – implementing WordCount
Time for action – building a JAR file
Time for action – running WordCount on a local Hadoop cluster
Time for action – running WordCount on EMR
Time for action – WordCount the easy way
Walking through a run of WordCount
Time for action – WordCount with a combiner
Time for action – fixing WordCount to work with a combiner
Hadoop-specific data types
Time for action – using the Writable wrapper classes
Input/output
Chapter 4. Developing MapReduce Programs
Using languages other than Java with Hadoop
Time for action – implementing WordCount using Streaming
Analyzing a large dataset
Time for action – summarizing the UFO data
Time for action – summarizing the shape data
Time for action – correlating sighting duration to UFO shape
Time for action – performing the shape/time analysis from the command line
Time for action – using ChainMapper for field validation/analysis
Time for action – using the Distributed Cache to improve location output
Counters, status, and other output
Time for action – creating counters, task states, and writing log output
Chapter 5. Advanced MapReduce Techniques
Simple, advanced, and in-between
Joins
Time for action – reduce-side join using MultipleInputs
Graph algorithms
Time for action – representing the graph
Time for action – creating the source code
Time for action – the first run
Time for action – the second run
Time for action – the third run
Time for action – the fourth and last run
Using language-independent data structures
Time for action – getting and installing Avro
Time for action – defining the schema
Time for action – creating the source Avro data with Ruby
Time for action – consuming the Avro data with Java
Time for action – generating shape summaries in MapReduce
Time for action – examining the output data with Ruby
Time for action – examining the output data with Java
Chapter 6. When Things Break
Failure
Time for action – killing a DataNode process
Time for action – the replication factor in action
Time for action – intentionally causing missing blocks
Time for action – killing a TaskTracker process
Time for action – killing the JobTracker
Time for action – killing the NameNode process
Time for action – causing task failure
Time for action – handling dirty data by using skip mode
Chapter 7. Keeping Things Running
A note on EMR
Hadoop configuration properties
Time for action – browsing default properties