Apache Hadoop is an open-source, Java-based
programming framework widely used for the large-scale storage and processing of
Big Data in a distributed computing environment. The Hadoop Distributed File
System (HDFS) is the primary storage solution used by Hadoop applications.
Starting Date | Timings | Ending Date
---|---|---
Wed Oct 2020 | 2 PM - 4 PM | Mon Nov 2020
Tue Oct 2020 | 2 PM - 4 PM | Sat Oct 2020
Module 1 – Introduction to Big Data & Hadoop (1.5 hours)
· What is Big Data?
· Sources of Big Data
· Categories of Big Data
· Characteristics of Big Data
· Use cases of Big Data
· Traditional RDBMS vs. Hadoop
· What is Hadoop?
· History of Hadoop
· Understanding Hadoop Architecture
· Fundamentals of HDFS (Blocks, Name Node, Data Node, Secondary Name Node)
· Block Placement & Rack Awareness
· HDFS Read/Write
· Drawbacks of Hadoop 1.x
· Introduction to Hadoop 2.x
· High Availability
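To make the HDFS block fundamentals above concrete, here is a small illustrative Python sketch. The 128 MB block size and replication factor 3 are the standard Hadoop 2.x defaults; the function names are invented for illustration.

```python
import math

def hdfs_block_count(file_size_mb, block_size_mb=128):
    """Number of HDFS blocks a file is split into (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)

def raw_storage_mb(file_size_mb, replication=3):
    """Total cluster storage consumed, given the replication factor."""
    return file_size_mb * replication

# A 500 MB file with the default 128 MB block size:
print(hdfs_block_count(500))  # 4 blocks (3 full blocks + 1 partial block of 116 MB)
print(raw_storage_mb(500))    # 1500 MB of raw storage with replication factor 3
```

Note that HDFS does not pad the last block: a 500 MB file occupies three full 128 MB blocks plus one 116 MB block.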
Module 2 – Linux (Complete Hands-on) (1 hour)
· Making/creating directories
· Removing/deleting directories
· Print working directory
· Change directory
· Manual pages
· Help
· Vi editor
· Creating empty files
· Creating file contents
· Copying files
· Renaming files
· Removing files
· Moving files
· Listing files and directories
· Displaying file contents
Module 3 – HDFS (1 hour)
· Understanding Hadoop configuration files
· Hadoop Components – HDFS, MapReduce
· Overview of Hadoop Processes
· Overview of Hadoop Distributed File System
· The building blocks of Hadoop
· Hands-On Exercise: Using HDFS commands
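The hands-on exercise typically walks through commands such as the following. These are standard `hdfs dfs` subcommands, but they require a running Hadoop cluster, and the paths and file names shown are placeholders:

```
hdfs dfs -mkdir /user/demo                         # create a directory in HDFS
hdfs dfs -put localfile.txt /user/demo             # copy a local file into HDFS
hdfs dfs -ls /user/demo                            # list an HDFS directory
hdfs dfs -cat /user/demo/localfile.txt             # print an HDFS file's contents
hdfs dfs -get /user/demo/localfile.txt ./copy.txt  # copy a file from HDFS to local disk
hdfs dfs -rm /user/demo/localfile.txt              # delete an HDFS file
```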
Module 4 – MapReduce (1.5 hours)
· MapReduce 1 (MRv1)
o MapReduce Introduction
o How MapReduce works
o Communication between Job Tracker and Task Tracker
o Anatomy of a MapReduce Job Submission
· MapReduce 2 (YARN)
o Limitations of the MRv1 Architecture
o YARN Architecture
o Node Manager & Resource Manager
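The MapReduce flow (map, then shuffle/sort, then reduce) can be sketched in plain Python with the classic word-count example. This is an illustrative simulation of the programming model, not the Hadoop API:

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in an input line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle/sort phase: group values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    """Reduce phase: sum the counts emitted for one word."""
    return (word, sum(counts))

lines = ["hadoop is a framework", "hadoop uses mapreduce"]
pairs = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(w, c) for w, c in shuffle(pairs).items())
print(result)
# {'hadoop': 2, 'is': 1, 'a': 1, 'framework': 1, 'uses': 1, 'mapreduce': 1}
```

In real Hadoop the mappers and reducers run as tasks on different nodes and the shuffle happens over the network; the logic per record, however, is exactly this.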
Module 5 – SQL (Complete Hands-on) (5 hours)
· DDL Commands
o Create DB
o Create table
o Alter table
o Drop table
o Truncate table
o Rename table
· DML Commands
o Insert command
o Update command
o Delete command
· SQL Constraints
o NOT NULL
o UNIQUE
o PRIMARY KEY
o FOREIGN KEY
o CHECK
· Aggregate functions
o AVG()
o COUNT()
o FIRST()
o LAST()
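Several of the topics above can be tried out in one sitting with Python's built-in `sqlite3` module. The table and data are invented for illustration; note that SQLite has no `TRUNCATE` (a `DELETE` without a `WHERE` clause plays that role) and no `FIRST()`/`LAST()` aggregates (`MIN()`/`MAX()` are the usual substitutes):

```python
import sqlite3

# In-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: create a table using several of the constraints listed above.
cur.execute("""
    CREATE TABLE employees (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL UNIQUE,
        salary REAL CHECK (salary > 0)
    )
""")

# DML: insert, update, delete.
cur.executemany("INSERT INTO employees (name, salary) VALUES (?, ?)",
                [("asha", 50000.0), ("ravi", 60000.0), ("meena", 70000.0)])
cur.execute("UPDATE employees SET salary = 65000.0 WHERE name = 'ravi'")
cur.execute("DELETE FROM employees WHERE name = 'meena'")

# Aggregate functions.
cur.execute("SELECT COUNT(*), AVG(salary) FROM employees")
print(cur.fetchone())  # (2, 57500.0)
```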
QA: Here are the key differences between an RDBMS and Hadoop:

 | RDBMS | Hadoop
---|---|---
Data Types | Relies on structured data; the schema is always known. | Any kind of data can be stored: structured, unstructured, or semi-structured.
Processing | Provides limited or no processing capabilities. | Allows data distributed across the cluster to be processed in parallel.
Schema on Read vs. Write | Based on "schema on write": schema validation is done before loading the data. | Follows a "schema on read" policy.
Read/Write Speed | Reads are fast because the schema of the data is already known. | Writes are fast because no schema validation happens during an HDFS write.
Cost | Licensed software, so you have to pay for it. | Open-source framework, so there is nothing to pay for the software.
Best Fit Use Case | OLTP (Online Transactional Processing) systems. | Data discovery, data analytics, or OLAP systems.

"Big data" is the term
for a collection of large and complex data sets that are difficult to
process using relational database management tools or traditional data
processing applications. Big Data is difficult to capture, curate, store, search,
share, transfer, analyze, and visualize. Big Data has also emerged as
an opportunity for companies: they can now derive value from their
data and gain a distinct advantage over their competitors through enhanced
decision-making capabilities. ♣ Tip: It will be a good
idea to talk about the 5 Vs in such questions, whether they are asked about specifically
or not! As Big Data grows at an accelerating rate, the factors associated with it
also keep evolving. To go through them in detail, I recommend the
Big Data Tutorial blog. When "Big Data" emerged
as a problem, Apache Hadoop evolved as a solution to it. Apache Hadoop is a
framework which provides us various services or tools to store and process Big
Data. It helps in analyzing Big Data and making business decisions out of it,
which can’t be done efficiently and effectively using traditional systems. ♣ Tip: Now, while
explaining Hadoop, you should also explain its main components, i.e.: HDFS (Hadoop Distributed File System) is the storage unit
of Hadoop. It is responsible for storing different kinds of data as blocks in a
distributed environment. It follows a master-slave topology. ♣ Tip: It is recommended
to explain the HDFS components too, i.e. NameNode and DataNode. YARN (Yet Another Resource Negotiator) is the processing
framework in Hadoop, which manages resources and provides an execution
environment for processes. ♣ Tip: Similarly, as we did for
HDFS, we should also explain the two components of YARN, i.e. ResourceManager and NodeManager. If you want to learn about
HDFS & YARN in detail, go through the Hadoop Tutorial blog. Generally,
approach this question by first explaining the HDFS daemons, i.e. NameNode,
DataNode and Secondary NameNode, then moving on to the YARN daemons, i.e.
ResourceManager and NodeManager, and lastly explaining
the JobHistoryServer. In this question, first
explain NAS and HDFS, and then compare their features as follows: This is an important
question, and while answering it, we have to focus mainly on two
points, i.e. the Passive NameNode and the YARN architecture.

 | Hadoop 1.x | Hadoop 2.x
---|---|---
Passive NameNode | NameNode is a Single Point of Failure | Active & Passive NameNode
Processing | MRv1 (Job Tracker & Task Tracker) | MRv2/YARN (ResourceManager & NodeManager)

In HA (High Availability)
architecture, we have two NameNodes – Active “NameNode” and Passive “NameNode”. When the active
“NameNode” fails, the passive “NameNode” replaces the active “NameNode” in the
cluster. Hence, the cluster is never without a “NameNode” and so it never
fails. One of the most
attractive features of the Hadoop framework is its use of commodity hardware. However, this leads to frequent "DataNode" crashes in a
Hadoop cluster. Another striking feature of the Hadoop framework is the ease with which it scales in accordance with rapid growth in data volume.
Because of these two reasons, one of the most common tasks of a Hadoop administrator
is to commission (add) and decommission (remove) "DataNodes" in a Hadoop
cluster. Read this blog to get a
detailed understanding of commissioning and
decommissioning nodes in a Hadoop cluster. HDFS supports
exclusive writes only.
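This exclusivity is enforced through per-file write leases granted by the NameNode; a toy Python simulation of the idea follows (this is an illustration only, not the actual NameNode implementation — the class and method names are invented):

```python
class ToyNameNode:
    """Toy model of HDFS lease management for exclusive writes."""

    def __init__(self):
        self.leases = {}  # file path -> client currently holding the write lease

    def open_for_write(self, path, client):
        # Grant the lease only if no other client already holds it for this file.
        if path in self.leases and self.leases[path] != client:
            return False  # reject: lease already granted to another client
        self.leases[path] = client
        return True

nn = ToyNameNode()
print(nn.open_for_write("/data/log.txt", "client-1"))  # True: lease granted
print(nn.open_for_write("/data/log.txt", "client-2"))  # False: open request rejected
```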
When the first client
contacts the “NameNode” to open the file for writing, the “NameNode” grants a
lease to the client to create this file. When the second client tries to open
the same file for writing, the “NameNode” will notice that the lease for the
file is already granted to another client, and will reject the open request for
the second client.
All Graduates and Postgraduates
Prerequisites
What are the objectives?
1. What are the basic differences between a relational database and HDFS? (RDBMS vs. Hadoop)
2. Explain "Big Data". What are the five V's of Big Data?
3. What are Hadoop and its components?
4. What are HDFS and YARN?
5. Tell me about the various Hadoop daemons and their roles in a Hadoop cluster.
Hadoop HDFS Interview Questions
6. Compare HDFS with Network Attached Storage (NAS).
7. List the differences between Hadoop 1 and Hadoop 2. (Hadoop 1.x vs. Hadoop 2.x)
8. What are active and passive "NameNodes"?
9. Why does one remove or add nodes in a Hadoop cluster frequently?
10. What happens when two clients try to access the same file in HDFS?
Why learn this course?
Who should go for this training?
Additional Data