Hadoop Developer Course

Apache Hadoop is an open-source, Java-based programming framework widely used for large-scale storage and processing of Big Data in distributed computing environments. The Hadoop Distributed File System (HDFS) is the primary storage layer used by Hadoop applications.

Batch details of live online classes

Starting Date      Timings        Ending Date
Wed, Oct 2020      2 PM - 4 PM    Mon, Nov 2020
Tue, Oct 2020      2 PM - 4 PM    Sat, Oct 2020

Course Price

200.00


Topics Covered


Module 1 – Introduction to Big Data & Hadoop (1.5 hours)

·        What is Big Data?

·        Sources of Big Data

·        Categories of Big Data

·        Characteristics of Big Data

·        Use cases of Big Data

·        Traditional RDBMS vs. Hadoop

·        What is Hadoop?

·        History of Hadoop

·        Understanding Hadoop Architecture

·        Fundamentals of HDFS (Blocks, NameNode, DataNode, Secondary NameNode)

·        Block Placement & Rack Awareness (illustrated in the sketch after this list)

·        HDFS Read/Write

·        Drawbacks of Hadoop 1.x

·        Introduction to Hadoop 2.x

·        High Availability
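As a first look at these HDFS concepts, the sketch below uses the standard hdfs command-line tools to inspect cluster health, block placement, and rack topology. Output depends entirely on your cluster, and the /user/demo/data.csv path is a made-up example.

    # Report overall cluster health: live DataNodes, capacity, under-replicated blocks
    hdfs dfsadmin -report

    # Show how one file is split into blocks and where each replica is placed
    hdfs fsck /user/demo/data.csv -files -blocks -locations

    # Print the rack topology as seen by the NameNode (relevant to rack awareness)
    hdfs dfsadmin -printTopology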

Module 2 – Linux (Complete Hands-on) (1 hour; sample session after the topic list)

·        Making/creating directories

·        Removing/deleting directories

·        Print working directory

·        Change directory

·        Manual pages

·        Help

·        Vi editor

·        Creating empty files

·        Creating file contents

·        Copying file

·        Renaming files

·        Removing files

·        Moving files

·        Listing files and directories

·        Displaying file contents
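The following is a minimal sample session covering the commands listed above; all file and directory names (demo, notes.txt, and so on) are illustrative.

    mkdir demo                    # make/create a directory
    cd demo                       # change directory
    pwd                           # print the working directory
    man ls                        # manual page for a command
    ls --help                     # built-in help for a command

    touch notes.txt               # create an empty file
    cat > notes.txt               # type file contents, finish with Ctrl+D
    vi notes.txt                  # edit the file in the vi editor
    cp notes.txt copy.txt         # copy a file
    mv copy.txt backup.txt        # rename a file
    mv backup.txt /tmp/           # move a file to another directory
    ls -l                         # list files and directories
    cat notes.txt                 # display file contents
    rm notes.txt /tmp/backup.txt  # remove/delete files
    cd .. && rmdir demo           # remove/delete the now-empty directory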

 

Module 3 – HDFS (1 hour)

·        Understanding Hadoop configuration files

·        Hadoop Components – HDFS, MapReduce

·        Overview of Hadoop Processes

·        Overview of Hadoop Distributed File System

·        The building blocks of Hadoop

·        Hands-On Exercise: Using HDFS commands (see the sample session after this list)
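A short sample session for the hands-on exercise; the /user/demo path and notes.txt file are illustrative.

    hdfs dfs -mkdir -p /user/demo           # create a directory in HDFS
    hdfs dfs -put notes.txt /user/demo/     # copy a local file into HDFS
    hdfs dfs -ls /user/demo                 # list the directory
    hdfs dfs -cat /user/demo/notes.txt      # display the file contents
    hdfs dfs -rm /user/demo/notes.txt       # remove the file
    hdfs dfs -rmdir /user/demo              # remove the now-empty directory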

Module 4 – MapReduce (1.5 hours)

·        MapReduce 1 (MRv1)

o   MapReduce Introduction

o   How does MapReduce work?

o   Communication between the JobTracker and TaskTracker

o   Anatomy of a MapReduce Job Submission

·        MapReduce 2 (YARN)

o   Limitations of the MRv1 Architecture

o   YARN Architecture

o   NodeManager & ResourceManager (a sample job submission follows this list)
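To see the submission path end to end, a job can be handed to YARN with the hadoop jar command. The sketch below runs the WordCount example that ships with Hadoop; the jar location varies by installation, and the input/output paths are made up.

    # Submit the bundled WordCount example to the cluster
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        wordcount /user/demo/input /user/demo/output

    # Track it alongside the ResourceManager, then read the result
    yarn application -list
    hdfs dfs -cat /user/demo/output/part-r-00000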

Module 5 – SQL (Complete Hands-on) (5 hours; worked example after the topic list)

·        DDL Commands

o   Create DB

o   Create table

o   Alter table

o   Drop table

o   Truncate table

o   Rename table

·        DML Commands

o   Insert command

o   Update command

o   Delete command

·        SQL Constraints

o   NOT NULL

o   UNIQUE

o   PRIMARY KEY

o   FOREIGN KEY

o   CHECK

·        Aggregate functions

o   AVG()

o   COUNT()

o   FIRST()

o   LAST()
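A compact worked example tying the DDL, DML, constraint, and aggregate topics together; the training database and employees table are invented for illustration (note that FIRST() and LAST() are only available in some SQL dialects, e.g. MS Access).

    -- DDL: create a database and a constrained table
    CREATE DATABASE training;

    CREATE TABLE employees (
        emp_id  INT           PRIMARY KEY,
        name    VARCHAR(50)   NOT NULL,
        email   VARCHAR(100)  UNIQUE,
        salary  DECIMAL(10,2) CHECK (salary > 0)
    );

    -- DML: insert, update, delete
    INSERT INTO employees VALUES (1, 'Asha', 'asha@example.com', 55000.00);
    UPDATE employees SET salary = 60000.00 WHERE emp_id = 1;
    DELETE FROM employees WHERE emp_id = 1;

    -- Aggregate functions
    SELECT COUNT(*), AVG(salary) FROM employees;

    -- DDL: truncate and drop
    TRUNCATE TABLE employees;
    DROP TABLE employees;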

Prerequisites


Coming soon

What are the objectives?


Coming soon

Q&A

1. What are the basic differences between a relational database and HDFS?

Here are the key differences between HDFS and relational databases:

RDBMS vs. Hadoop

Data Types
RDBMS: Relies on structured data; the schema of the data is always known.
Hadoop: Any kind of data can be stored, be it structured, semi-structured, or unstructured.

Processing
RDBMS: Provides limited or no processing capabilities.
Hadoop: Allows us to process data distributed across the cluster in parallel.

Schema on Read vs. Schema on Write
RDBMS: Based on ‘schema on write’, where schema validation is done before loading the data.
Hadoop: On the contrary, follows a ‘schema on read’ policy.

Read/Write Speed
RDBMS: Reads are fast because the schema of the data is already known.
Hadoop: Writes are fast in HDFS because no schema validation happens during an HDFS write.

Cost
RDBMS: Licensed software, so you have to pay for it.
Hadoop: An open-source framework, so there is no software cost.

Best-Fit Use Case
RDBMS: Used for OLTP (Online Transaction Processing) systems.
Hadoop: Used for data discovery, data analytics, and OLAP systems.

2. Explain “Big Data”. What are the five V’s of Big Data?

“Big Data” is the term for a collection of data sets so large and complex that they are difficult to process using relational database management tools or traditional data processing applications. Big Data is difficult to capture, curate, store, search, share, transfer, analyze, and visualize. It has also emerged as an opportunity for companies: those that successfully derive value from their data gain a distinct advantage over their competitors through enhanced business decision-making.

Tip: It is a good idea to talk about the five V’s in such questions, whether asked specifically or not!

  • Volume: The amount of data, which is growing at an exponential rate, i.e. into petabytes and exabytes.
  • Velocity: The rate at which data grows, which is very fast; today, yesterday’s data is considered old. Nowadays, social media is a major contributor to the velocity of growing data.
  • Variety: The heterogeneity of data types. In other words, the data gathered comes in a variety of formats, such as video, audio, and CSV; these various formats represent the variety of data.
  • Veracity: The doubt or uncertainty about available data due to inconsistency and incompleteness. Data can get messy and may be difficult to trust; with many forms of Big Data, quality and accuracy are difficult to control. Volume is often the reason behind the lack of quality and accuracy in the data.
  • Value: It is all well and good to have access to Big Data, but it is useless unless we can turn it into value. Is it adding to the benefits of the organization? Is the organization working on Big Data achieving a high ROI (Return on Investment)? Unless it adds to profits, working on Big Data is pointless.

As we know, Big Data is growing at an accelerating rate, so the factors associated with it are also evolving. To go through them and understand them in detail, I recommend the Big Data Tutorial blog.

3. What is Hadoop, and what are its components?

When “Big Data” emerged as a problem, Apache Hadoop evolved as its solution. Apache Hadoop is a framework that provides various services and tools to store and process Big Data. It helps in analyzing Big Data and making business decisions from it, which cannot be done efficiently and effectively using traditional systems.

Tip: Now, while explaining Hadoop, you should also explain the main components of Hadoop, i.e.:

  • Storage unit – HDFS (NameNode, DataNode)
  • Processing framework – YARN (ResourceManager, NodeManager)

4. What are HDFS and YARN?

HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment. It follows a master-slave topology.

Tip: It is recommended to explain the HDFS components too, i.e.

  • NameNode: The NameNode is the master node in the distributed environment; it maintains the metadata for the blocks of data stored in HDFS, such as block locations, replication factor, etc.
  • DataNode: DataNodes are the slave nodes responsible for storing data in HDFS. The NameNode manages all the DataNodes.

YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop, which manages resources and provides an execution environment to the processes.

Tip: Similarly, as we did for HDFS, we should also explain the two components of YARN:

  • ResourceManager: It receives the processing requests and then passes the parts of the requests to the corresponding NodeManagers, where the actual processing takes place. It allocates resources to applications based on their needs.
  • NodeManager: NodeManager is installed on every DataNode and is responsible for executing the tasks on that DataNode.

If you want to learn about HDFS & YARN in detail, go through the Hadoop Tutorial blog.

5. Tell me about the various Hadoop daemons and their roles in a Hadoop cluster.

Generally, approach this question by first explaining the HDFS daemons, i.e. NameNode, DataNode, and Secondary NameNode, then moving on to the YARN daemons, i.e. ResourceManager and NodeManager, and lastly explaining the JobHistoryServer. A quick way to verify these daemons on a running cluster is shown after the list.

  • NameNode: It is the master node, responsible for storing the metadata of all files and directories. It knows the blocks that make up a file and where those blocks are located in the cluster.
  • DataNode: It is the slave node that contains the actual data.
  • Secondary NameNode: It periodically merges the changes (edit log) with the FsImage (filesystem image) present in the NameNode, and stores the modified FsImage in persistent storage, which can be used in case of a NameNode failure.
  • ResourceManager: It is the central authority that manages resources and schedules applications running on top of YARN.
  • NodeManager: It runs on slave machines and is responsible for launching the applications’ containers (where applications execute their parts), monitoring their resource usage (CPU, memory, disk, network), and reporting these to the ResourceManager.
  • JobHistoryServer: It maintains information about MapReduce jobs after the ApplicationMaster terminates.
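On a running cluster, these daemons can be verified with a few standard commands (output naturally varies by installation):

    jps                      # lists running Java daemons, e.g. NameNode, DataNode, ResourceManager, NodeManager
    hdfs dfsadmin -report    # DataNodes known to the NameNode, with capacity and usage
    yarn node -list          # NodeManagers registered with the ResourceManager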

Hadoop HDFS Interview Questions

6. Compare HDFS with Network Attached Storage (NAS).

In this question, first explain NAS and HDFS, and then compare their features as follows:

  • Network-attached storage (NAS) is a file-level computer data storage server connected to a computer network, providing data access to a heterogeneous group of clients. NAS can be either hardware or software that provides services for storing and accessing files. The Hadoop Distributed File System (HDFS), by contrast, is a distributed filesystem that stores data on commodity hardware.
  • In HDFS, data blocks are distributed across all the machines in the cluster, whereas in NAS, data is stored on dedicated hardware.
  • HDFS is designed to work with the MapReduce paradigm, where computation is moved to the data. NAS is not suitable for MapReduce, since data is stored separately from the computation.
  • HDFS uses commodity hardware, which is cost-effective, whereas NAS uses high-end storage devices, which are expensive.

7. List the differences between Hadoop 1 and Hadoop 2.

This is an important question, and while answering it we have to focus mainly on two points: the passive NameNode and the YARN architecture.

  • In Hadoop 1.x, the “NameNode” is a single point of failure. In Hadoop 2.x, we have active and passive “NameNodes”. If the active “NameNode” fails, the passive “NameNode” takes charge. Because of this, high availability can be achieved in Hadoop 2.x.
  • Also, in Hadoop 2.x, YARN provides a central resource manager. With YARN, you can run multiple applications in Hadoop, all sharing a common pool of resources. MRv2 is a particular type of distributed application that runs the MapReduce framework on top of YARN. Other tools can also perform data processing via YARN, which was not possible in Hadoop 1.x.

Hadoop 1.x vs. Hadoop 2.x

              Hadoop 1.x                         Hadoop 2.x
NameNode      Single point of failure            Active & passive NameNodes
Processing    MRv1 (JobTracker & TaskTracker)    MRv2/YARN (ResourceManager & NodeManager)

8. What are active and passive “NameNodes”?

In an HA (High Availability) architecture, we have two NameNodes – an active “NameNode” and a passive “NameNode”.

  • The active “NameNode” is the one that works and runs in the cluster.
  • The passive “NameNode” is a standby “NameNode” that holds the same data as the active one.

When the active “NameNode” fails, the passive “NameNode” replaces it in the cluster. Hence, the cluster is never without a “NameNode”, and so it never fails.

9. Why does one remove or add nodes in a Hadoop cluster frequently?

One of the most attractive features of the Hadoop framework is its use of commodity hardware. However, this leads to frequent “DataNode” crashes in a Hadoop cluster. Another striking feature of the Hadoop framework is its ease of scaling with rapid growth in data volume. Because of these two reasons, one of the most common tasks of a Hadoop administrator is to commission (add) and decommission (remove) “DataNodes” in a Hadoop cluster.

Read this blog to get a detailed understanding of commissioning and decommissioning nodes in a Hadoop cluster; a minimal decommissioning sketch follows.
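As a rough sketch, decommissioning usually means listing the host in the exclude file that the dfs.hosts.exclude property (in hdfs-site.xml) points to, then asking the NameNode to re-read it. The host name and file path below are assumptions and differ per cluster.

    # Add the host to the exclude file (path and host name are illustrative)
    echo "datanode-07.example.com" >> /etc/hadoop/conf/dfs.exclude

    # Ask the NameNode to re-read its include/exclude lists; decommissioning begins
    hdfs dfsadmin -refreshNodes

    # Watch until the node reports "Decommissioned", after which it can be removed
    hdfs dfsadmin -report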

10. What happens when two clients try to access the same file in HDFS?

HDFS supports exclusive writes only.

When the first client contacts the “NameNode” to open the file for writing, the “NameNode” grants a lease to the client to create this file. When the second client tries to open the same file for writing, the “NameNode” will notice that the lease for the file is already granted to another client, and will reject the open request for the second client.

Why learn this course?


Coming soon

Who should go for this training?


All Graduates and Postgraduates

Additional Data


Coming soon

© 2020 eitcafe. All Rights Reserved.