Big Data Hadoop Training Course Curriculum

Module 1: Introduction to Big Data and Hadoop

Big Data Overview
Big Data Analytics
What is Big Data?
Challenges of Traditional Systems
Distributed systems
Introduction to Hadoop
Components of Hadoop Ecosystem
Commercial Hadoop Distributions

Module 2: Understanding HDFS and MapReduce

Introduction to MapReduce
Introduction to HDFS
Hadoop Distributed File System – Replication, Block Size, Secondary NameNode, High Availability
YARN – resource manager and node manager

Module 3: Hadoop Installation and Setup

Architecture of Hadoop cluster
What is High Availability and Federation?
How to set up a production cluster?
Various shell commands in Hadoop
Understanding configuration files in Hadoop
Installing a single node cluster with Cloudera Manager
Understanding Spark, Scala, Sqoop, Pig, and Flume

Module 4: Deep Dive in MapReduce

Learning the working mechanism of MapReduce
Understanding the mapping and reducing stages in MR
Various terms in MR like Input & Output Format, Partitioners, Combiners, Shuffle, and Sort
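
As a rough illustration of how the stages named in this module fit together, the sketch below simulates a word-count job in plain Python. It is a toy, in-process model rather than Hadoop code: the function names (map_phase, combine, partition, run_job) and the use of a CRC32 hash for partitioning are assumptions made for this example only.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
import zlib


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-sum counts on the map side to shrink shuffle traffic."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partition(key, num_reducers):
    """Partitioner: pick a reducer for a key using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, num_reducers=2):
    """Run map, combine, partition, shuffle/sort, and reduce in one process."""
    buckets = [[] for _ in range(num_reducers)]
    # Map and combine each input split independently.
    for split in splits:
        mapped = [pair for line in split for pair in map_phase(line)]
        for key, value in combine(mapped):
            buckets[partition(key, num_reducers)].append((key, value))
    result = {}
    for bucket in buckets:
        # Shuffle and sort: each reducer sees its keys sorted and grouped.
        bucket.sort(key=itemgetter(0))
        for key, group in groupby(bucket, key=itemgetter(0)):
            # Reduce: sum all partial counts for one key.
            result[key] = sum(value for _, value in group)
    return result


# Two input splits, each holding one line of text.
splits = [["big data big"], ["data hadoop big"]]
counts = run_job(splits)
```

Because every occurrence of a key is routed to the same reducer, the final counts do not depend on how many reducers are used; only the grouping of keys into output partitions changes.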

Module 5: Introduction to Hive

Introducing Hadoop Hive
Detailed architecture of Hive
Comparing Hive with Pig and RDBMS
Working with Hive Query Language
Creating databases and tables; GROUP BY and other clauses
Various types of Hive tables, HCatalog
Storing the Hive Results, Hive partitioning, and Buckets

Module 6: Advanced Hive and Impala

Indexing in Hive
The Map Side Join in Hive
Working with complex data types
The Hive user-defined functions
Introduction to Impala
Comparing Hive with Impala
The detailed architecture of Impala

Module 7: Introduction to Pig

Apache Pig introduction and its various features
Various data types and schema in Pig
The available functions in Pig; Pig Bags, Tuples, and Fields

Module 8: Flume, Sqoop and HBase

Apache Sqoop introduction
Importing and exporting data
Performance improvement with Sqoop
Sqoop limitations
Introduction to Flume and understanding the architecture of Flume
What is HBase and the CAP theorem?

Module 9: Writing Spark Applications Using Scala

Using Scala for writing Apache Spark applications
Detailed study of Scala
The need for Scala
The concept of object-oriented programming
Executing the Scala code
Scala Classes - Getters, Setters, & Constructors
Scala Classes - Abstract, extending objects & Overriding

Module 10: Project Use Case

Introduction to Scala packages and imports
The selective imports
The Scala test classes
Introduction to JUnit test class
JUnit interface via JUnit 3 suite for Scala test
Packaging of Scala applications in the directory structure
Examples of Spark Split and Spark Scala

Module 11: Introduction to Spark

Introduction to Spark
How Spark overcomes the drawbacks of working on MapReduce
Understanding in-memory MapReduce
Interactive operations on MapReduce
Spark stack, fine vs. coarse-grained update
Spark on Hadoop YARN, HDFS revision, and YARN revision
The overview of Spark and how it is better than Hadoop
Deploying Spark without Hadoop
Spark history server and Cloudera distribution

Module 12: Spark Basics

Spark installation guide
Spark configuration
Memory management
Executor memory vs. driver memory
Working with Spark Shell
The concept of resilient distributed datasets (RDD)
Learning to do functional programming in Spark
The architecture of Spark

Module 13: Working with RDDs in Spark

Spark RDD
Creating RDDs
RDD partitioning
Operations and transformation in RDD
Deep dive into Spark RDDs
The RDD general operations
Read-only partitioned collection of records
Using the concept of RDD for faster and efficient data processing
RDD actions such as collect, count, collectAsMap, and saveAsTextFile, and pair RDD functions

Module 14: Aggregating Data with Pair RDDs

Understanding the concept of key-value pair in RDDs
Learning how Spark makes MapReduce operations faster
Various operations of RDD
MapReduce interactive operations
Fine and coarse-grained update
Spark stack

Module 15: Writing and Deploying Spark Applications

Comparing the Spark applications with Spark Shell
Creating a Spark application using Scala or Java
Deploying a Spark application
Scala built application
Creating mutable lists, sets and set operations, lists, and tuples, and concatenating lists
Creating an application using SBT
Deploying an application using Maven
The web user interface of Spark application
A real-world example of Spark
Configuring Spark

Module 16: Parallel Processing

Learning about Spark parallel processing
Deploying on a cluster
Introduction to Spark partitions
File-based partitioning of RDDs
Understanding HDFS and data locality
Mastering the technique of parallel operations
Comparing repartition and coalesce
RDD actions

Module 17: Spark RDD Persistence

The execution flow in Spark
Overview of RDD persistence
Spark terminology
Distributed shared memory vs. RDD
RDD limitations
Spark shell arguments
Distributed persistence
RDD lineage
Key-value pair operations available through implicit conversions, such as countByKey, reduceByKey, and sortByKey

Module 18: Spark MLlib

Introduction to Machine Learning
Types of Machine Learning
Introduction to MLlib
Various ML algorithms supported by MLlib
Linear & logistic regression, decision tree, random forest, and K-means clustering techniques

Module 19: Integrating Apache Flume and Apache Kafka

Why Kafka and what is Kafka?
Kafka architecture
Kafka workflow
Configuring Kafka cluster
Operations
Kafka monitoring tools
Integrating Apache Flume and Apache Kafka

Module 20: Spark Streaming

Introduction to Spark Streaming
Features of Spark Streaming
Spark Streaming workflow
Initializing StreamingContext, discretized Streams (DStreams), input DStreams and Receivers
Transformations & output operations on DStreams, windowed operators and why they are useful
Important windowed operators and stateful operators

Module 21: Improving Spark Performance

Introduction to shared variables in Spark, such as broadcast variables
Learning about accumulators
The common performance issues
Troubleshooting the performance problems

Module 22: Spark SQL and Data Frames

Learning about Spark SQL
The context of SQL in Spark for providing structured data processing
JSON support in Spark SQL
Working with XML data
Parquet files
Creating Hive context
Writing data frame to Hive
Reading from JDBC data sources
Understanding the data frames in Spark
Creating Data Frames
Inferring a schema or specifying it manually
Working with CSV files
Reading JDBC tables
Data frame to JDBC
User-defined functions in Spark SQL
Shared variables and accumulators
Learning to query and transform data in data frames
How data frames provide the benefits of both Spark RDDs and Spark SQL
Deploying Hive on Spark as the execution engine

Module 23: Scheduling/Partitioning

Learning about the scheduling and partitioning in Spark
Hash & Range partition
Scheduling within and around applications
Static partitioning, dynamic sharing, and fair scheduling
mapPartitionsWithIndex, zip, and groupByKey
Spark master high availability, standby masters with ZooKeeper, single-node recovery with the local file system, and higher-order functions
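
To make the difference between hash and range partitioning concrete, here is a small plain-Python sketch. It does not use Spark: the helper names (hash_partition, range_partition, assign) and the hard-coded boundaries are assumptions for illustration. Spark's own HashPartitioner is based on the key's hash code, and its RangePartitioner chooses boundaries by sampling the data rather than taking them as input.

```python
import bisect
import zlib


def hash_partition(key, num_partitions):
    """Hash partitioning: spread keys by a stable hash, ignoring key order."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def range_partition(key, boundaries):
    """Range partitioning: boundaries is a sorted list of upper bounds.

    Keys <= boundaries[0] go to partition 0, keys in
    (boundaries[0], boundaries[1]] go to partition 1, and so on;
    keys above the last boundary go to the final partition.
    """
    return bisect.bisect_left(boundaries, key)


def assign(keys, partitioner):
    """Group keys into a dict of partition index -> list of keys."""
    partitions = {}
    for key in keys:
        partitions.setdefault(partitioner(key), []).append(key)
    return partitions


keys = [3, 15, 27, 8, 22]
by_range = assign(keys, lambda k: range_partition(k, [10, 20]))
by_hash = assign(keys, lambda k: hash_partition(k, 3))
```

With range partitioning, each partition holds a contiguous slice of the key space, which is what makes a globally sorted output possible. Hash partitioning gives no ordering guarantee but needs no knowledge of the key distribution.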

Module 24: Hadoop Administration – Multi-node Cluster Setup Using Amazon EC2

Create a 4-node Hadoop cluster setup
Running the MapReduce Jobs on the Hadoop cluster
Successfully running the MapReduce code
Working with the Cloudera Manager setup

Module 25: Hadoop Administration – Cluster Configuration

Overview of Hadoop configuration
The importance of Hadoop configuration files
The various parameters and values of configuration
The HDFS parameters and MapReduce parameters
Setting up the Hadoop environment
The Include and Exclude configuration files
The administration and maintenance of name node, data node directory structures, and files
What is a File system image?
Understanding Edit log

Module 26: Hadoop Administration – Maintenance, Monitoring and Troubleshooting

Introduction to the checkpoint procedure, name node failure
How to ensure the recovery procedure, Safe Mode, metadata and data backup
Various potential problems and solutions, what to look for and how to add and remove nodes

Module 27: ETL Connectivity with Hadoop Ecosystem

How do ETL tools work in the Big Data industry?
Introduction to ETL and data warehousing
Working with prominent use cases of Big Data in the ETL industry
End-to-end ETL PoC showing Big Data integration with an ETL tool

Module 28: Hadoop Application Testing

Importance of testing
Unit testing, Integration testing, Performance testing
Diagnostics, Nightly QA test, Benchmark and end-to-end tests
Functional testing, Release certification testing, Security testing
Scalability testing, Commissioning and Decommissioning of data nodes testing
Reliability testing, and Release testing

Module 29: Roles and Responsibilities of Hadoop Testing Professional

Understanding the Requirement
Preparation of the Testing Estimation
Test Cases, Test Data, Test Bed Creation, Test Execution
Defect Reporting, Defect Retest, Daily Status report delivery, Test completion
ETL testing at every stage (HDFS, Hive and HBase) while loading the input (logs, files, records, etc.) using Sqoop/Flume
Data verification, Reconciliation, User Authorization & Authentication testing (Groups, Users, Privileges, etc.)
Reporting defects to the development team or manager and driving them to closure
Consolidating all the defects and creating defect reports
Validating new features and issues in Core Hadoop

Module 30: Framework Called MRUnit for Testing of MapReduce Programs

Reporting defects to the development team or manager and driving them to closure
Consolidating all the defects and creating defect reports
Creating a testing framework called MRUnit for testing MapReduce programs

Module 31: Unit Testing

Automation testing using Oozie
Data validation using the QuerySurge tool

Module 32: Test Execution

Test plan for HDFS upgrade
Test automation and results

Module 33: CCA175 Spark and Hadoop Developer Certification Exam Prep

Explain CCA175 Spark and Hadoop Developer Certification Options
Discuss 50+ Important CCA175 Certification Questions
Practice CCA175 Certification questions

Module 34: Resume Preparation, Interview and Job Assistance

Prepare a crisp resume as a Big Data Hadoop Developer
Discuss common interview questions in Hadoop
Explain to students which jobs they should target and how