Cloudera

Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop

Course Outline
  1. Hadoop Fundamentals
    • The Motivation for Hadoop
    • Hadoop Overview
    • Data Storage: HDFS
    • Distributed Data Processing: YARN, MapReduce, and Spark
    • Data Processing and Analysis: Pig, Hive, and Impala
    • Data Integration: Sqoop
    • Other Hadoop Data Tools
    • Exercise Scenarios Explanation
  2. Introduction to Pig
    • What Is Pig?
    • Pig’s Features
    • Pig Use Cases
    • Interacting with Pig
  3. Basic Data Analysis with Pig
    • Pig Latin Syntax
    • Loading Data
    • Simple Data Types
    • Field Definitions
    • Data Output
    • Viewing the Schema
    • Filtering and Sorting Data
    • Commonly Used Functions
  4. Processing Complex Data with Pig
    • Storage Formats
    • Complex/Nested Data Types
    • Grouping
    • Built-In Functions for Complex Data
    • Iterating Grouped Data
  5. Multi-Dataset Operations with Pig
    • Techniques for Combining Data Sets
    • Joining Data Sets in Pig
    • Set Operations
    • Splitting Data Sets
  6. Pig Troubleshooting and Optimization
    • Troubleshooting Pig
    • Logging
    • Using Hadoop’s Web UI
    • Data Sampling and Debugging
    • Performance Overview
    • Understanding the Execution Plan
    • Tips for Improving the Performance of Your Pig Jobs
  7. Introduction to Hive and Impala
    • What Is Hive?
    • What Is Impala?
    • Schema and Data Storage
    • Comparing Hive to Traditional Databases
    • Hive Use Cases
  8. Querying with Hive and Impala
    • Databases and Tables
    • Basic Hive and Impala Query Language Syntax
    • Data Types
    • Differences Between Hive and Impala Query Syntax
    • Using Hue to Execute Queries
    • Using the Impala Shell
  9. Data Management
    • Data Storage
    • Creating Databases and Tables
    • Loading Data
    • Altering Databases and Tables
    • Simplifying Queries with Views
    • Storing Query Results
  10. Data Storage and Performance
    • Partitioning Tables
    • Choosing a File Format
    • Managing Metadata
    • Controlling Access to Data
  11. Relational Data Analysis with Hive and Impala
    • Joining Datasets
    • Common Built-In Functions
    • Aggregation and Windowing
  12. Working with Impala
    • How Impala Executes Queries
    • Extending Impala with User-Defined Functions
    • Improving Impala Performance
  13. Analyzing Text and Complex Data with Hive
    • Complex Values in Hive
    • Using Regular Expressions in Hive
    • Sentiment Analysis and N-Grams
    • Conclusion
  14. Hive Optimization
    • Understanding Query Performance
    • Controlling the Job Execution Plan
    • Bucketing
    • Indexing Data
  15. Extending Hive
    • SerDes
    • Data Transformation with Custom Scripts
    • User-Defined Functions
    • Parameterized Queries
  16. Choosing the Best Tool for the Job
    • Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
    • Which to Choose?

Duration: 32 Hours
Course Fee: INR 75,000 + Tax