Apache Hadoop

Cloudera Administrator Training for Apache Hadoop

Course Outline
  1. The Case for Apache Hadoop
    • Why Hadoop?
    • Core Hadoop Components
    • Fundamental Concepts
  2. HDFS
    • HDFS Features
    • Writing and Reading Files
    • NameNode Memory Considerations
    • Overview of HDFS Security
    • Using the NameNode Web UI
    • Using the Hadoop File Shell
  3. Getting Data into HDFS
    • Ingesting Data from External Sources with Flume
    • Ingesting Data from Relational Databases with Sqoop
    • REST Interfaces
    • Best Practices for Importing Data
  4. YARN and MapReduce
    • What Is MapReduce?
    • Basic MapReduce Concepts
    • YARN Cluster Architecture
    • Resource Allocation
    • Failure Recovery
    • Using the YARN Web UI
    • MapReduce Version 1
  5. Planning Your Hadoop Cluster
    • General Planning Considerations
    • Choosing the Right Hardware
    • Network Considerations
    • Configuring Nodes
    • Planning for Cluster Management
  6. Hadoop Installation and Initial Configuration
    • Deployment Types
    • Installing Hadoop
    • Specifying the Hadoop Configuration
    • Performing Initial HDFS Configuration
    • Performing Initial YARN and MapReduce Configuration
    • Hadoop Logging
  7. Installing and Configuring Hive, Impala, and Pig
    • Hive
    • Impala
    • Pig
  8. Hadoop Clients
    • What Is a Hadoop Client?
    • Installing and Configuring Hadoop Clients
    • Installing and Configuring Hue
    • Hue Authentication and Authorization
  9. Cloudera Manager
    • The Motivation for Cloudera Manager
    • Cloudera Manager Features
    • Express and Enterprise Versions
    • Cloudera Manager Topology
    • Installing Cloudera Manager
    • Installing Hadoop Using Cloudera Manager
    • Performing Basic Administration Tasks Using Cloudera Manager
  10. Advanced Cluster Configuration
    • Advanced Configuration Parameters
    • Configuring Hadoop Ports
    • Explicitly Including and Excluding Hosts
    • Configuring HDFS for Rack Awareness
    • Configuring HDFS High Availability
  11. Hadoop Security
    • Why Hadoop Security Is Important
    • Hadoop’s Security System Concepts
    • What Kerberos Is and How It Works
    • Securing a Hadoop Cluster with Kerberos
  12. Managing and Scheduling Jobs
    • Managing Running Jobs
    • Scheduling Hadoop Jobs
    • Configuring the FairScheduler
    • Impala Query Scheduling
  13. Cluster Maintenance
    • Checking HDFS Status
    • Copying Data Between Clusters
    • Adding and Removing Cluster Nodes
    • Rebalancing the Cluster
    • Cluster Upgrading
  14. Cluster Monitoring and Troubleshooting
    • General System Monitoring
    • Monitoring Hadoop Clusters
    • Troubleshooting Hadoop Clusters
    • Common Misconfigurations

Duration: 32 Hours
Course Fee: INR. 1,25,000 + Tax


Cloudera Developer Training for Apache Hadoop

Course Outline
  1. Motivation for Hadoop
    • Problems with Traditional Large-Scale Systems
    • Requirements for a New Approach
  2. Hadoop: Basic Concepts
    • Hadoop Distributed File System (HDFS)
    • MapReduce
    • Anatomy of a Hadoop Cluster
    • Other Hadoop Ecosystem Components
  3. Writing a MapReduce Program
    • MapReduce Flow
    • Examining a Sample MapReduce Program
    • Basic MapReduce API Concepts
    • Driver Code
    • Mapper
    • Reducer
    • Streaming API
    • Using Eclipse for Rapid Development
    • New MapReduce API
  4. Integrating Hadoop into the Workflow
    • Relational Database Management Systems
    • Storage Systems
    • Importing Data from a Relational Database Management System with Sqoop
    • Importing Real-Time Data with Flume
    • Accessing HDFS Using FuseDFS and Hoop
  5. Delving Deeper into the Hadoop API
    • ToolRunner
    • Testing with MRUnit
    • Reducing Intermediate Data with Combiners
    • Configuration and Close Methods for Map/Reduce Setup and Teardown
    • Writing Partitioners for Better Load Balancing
    • Directly Accessing HDFS
    • Using the Distributed Cache
  6. Common MapReduce Algorithms
    • Sorting and Searching
    • Indexing
    • Machine Learning with Mahout
    • Term Frequency
    • Inverse Document Frequency
    • Word Co-Occurrence
  7. Using Hive and Pig
    • Hive Basics
    • Pig Basics
  8. Practical Development Tips and Techniques
    • Debugging MapReduce Code
    • Using LocalJobRunner Mode for Easier Debugging
    • Retrieving Job Information with Counters
    • Logging
    • Splittable File Formats
    • Determining the Optimal Number of Reducers
    • Map-Only MapReduce Jobs
  9. Advanced MapReduce Programming
    • Custom Writables and WritableComparables
    • Saving Binary Data Using SequenceFiles and Avro Files
    • Creating InputFormats and OutputFormats
  10. Joining Data Sets in MapReduce
    • Map-Side Joins
    • Secondary Sort
    • Reduce-Side Joins
  11. Graph Manipulation in Hadoop
    • Graph Techniques
    • Representing Graphs in Hadoop
    • Implementing a Sample Algorithm: Single Source Shortest Path
  12. Creating Workflows with Oozie
    • Motivation for Oozie
    • Workflow Definition Format

Duration: 32 Hours
Course Fee: INR. 1,00,000 + Tax


Cloudera Essentials for Apache Hadoop

Course Outline
  1. Why Hadoop?
    • Motivation for Hadoop
    • Use Cases and Case Studies about Hadoop
  2. Hadoop Ecosystem
    • MapReduce and HDFS
    • Hive
    • Pig
    • HBase
    • Sqoop
    • Flume
    • Hue
    • Cloudera’s Distribution of Hadoop (CDH)
  3. Integrating Hadoop into Your Architecture
    • Augment Your Existing Environment
    • Relational Databases
    • SANs
    • OLAP Systems and More
  4. Managing a Hadoop Cluster
    • People Resources Required
    • Physical Resources Required
    • Cost to Organization
    • Scale for Growth
  5. Apache Open-Source Model and Cloudera’s Role

Duration: 8 Hours
Course Fee: INR. 25,000 + Tax


Data Science and Big Data Analytics

Course Outline
  1. Big Data Analytics
    • Big Data
    • State of the Practice in Analytics
    • Data Scientist
    • Big Data Analytics in Industry Verticals
  2. Data Analytics Lifecycle
    • Discovery
    • Data Preparation
    • Model Planning
    • Model Building
    • Communicating Results
    • Operationalizing
  3. Basic Data Analytic Methods Using R
    • Using R to Look at Data
    • Analyzing and Exploring the Data
    • Statistics for Model Building and Evaluation
  4. Advanced Analytics: Theory and Methods
    • K-Means Clustering
    • Association Rules
    • Linear Regression
    • Logistic Regression
    • Naïve Bayesian Classifier
    • Decision Trees
    • Time Series Analysis
    • Text Analysis
  5. Advanced Analytics: Technologies and Tools
    • Analytics for Unstructured Data
    • MapReduce and Hadoop
    • Hadoop Ecosystem
    • In-Database Analytics: SQL Essentials
    • Advanced SQL and MADlib for In-Database Analytics
  6. Putting It All Together
    • Operationalizing an Analytics Project
    • Creating the Final Deliverables
    • Data Visualization Techniques
    • Final Lab Exercise on Big Data Analytics

Duration: 42 Hours
Course Fee: INR. 1,00,000 + Tax