Top 43 Hadoop Interview Questions And Answers with PDF

Top 43 Hadoop Interview Questions And Answers with PDF: Here are the top Hadoop interview questions to help you prepare. You can download the entire set of questions as a PDF. All the best for your interview.

1) What is Hadoop?

Hadoop is a distributed computing platform. It is written in Java. It incorporates features similar to those of the Google File System (GFS) and MapReduce.


2) What platform and Java version are required to run Hadoop?

Java 1.6.x or higher is required for Hadoop, preferably from Sun. Linux and Windows are the supported operating systems, but BSD, Mac OS X, and Solaris are also known to work.


3) What kind of Hardware is best for Hadoop?

Hadoop can run on dual-processor/dual-core machines with 4-8 GB of ECC RAM. The exact hardware depends on the workflow needs.


4) What are the most common input formats defined in Hadoop?

These are the most common input formats defined in Hadoop:

  1. TextInputFormat
  2. KeyValueInputFormat
  3. SequenceFileInputFormat

TextInputFormat is the default input format.
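
For illustration, here is a short driver fragment (the job name and input path are made-up example values) showing how an input format is selected with the org.apache.hadoop.mapreduce API. Since TextInputFormat is the default, the explicit call is only needed when a different format is wanted:

    // Sketch of a job driver fragment; "wordcount" and "/input" are example values.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

    Job job = Job.getInstance(new Configuration(), "wordcount");
    // TextInputFormat is used when no input format is set; here we switch to
    // KeyValueTextInputFormat, which splits each line into a key and a value at the first tab.
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/input"));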


5) How do you categorize big data?

Big data can be categorized using the following features:

  • Volume
  • Velocity
  • Variety

6) What is JobTracker in Hadoop? What actions does the JobTracker perform?

In Hadoop, the JobTracker is used for submitting and tracking MapReduce jobs. The JobTracker runs in its own JVM process.

The JobTracker performs the following actions in Hadoop:

  • Client applications submit jobs to the JobTracker.
  • The JobTracker communicates with the NameNode to determine the data location.
  • The JobTracker locates TaskTracker nodes near the data or with available slots.
  • It submits the work to the chosen TaskTracker nodes.
  • When a task fails, the JobTracker is notified and decides how to proceed.
  • The JobTracker monitors the TaskTracker nodes.

7) Explain what is heartbeat in HDFS?

A heartbeat is a signal sent between a DataNode and the NameNode, and between a TaskTracker and the JobTracker. If the NameNode or JobTracker does not receive the signal, it is assumed that there is an issue with the DataNode or TaskTracker.

8) Explain what combiners are and when you should use a combiner in a MapReduce Job?

Combiners are used to increase the efficiency of a MapReduce program. A combiner reduces the amount of data that needs to be transferred across to the reducers. If the operation performed is commutative and associative, you can use your reducer code as the combiner. Note that the execution of the combiner is not guaranteed in Hadoop.
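
For example, in the classic word-count job the reducer only sums integer counts, which is both commutative and associative, so the same class can double as the combiner. A minimal driver fragment (the class names follow the standard WordCount example and are otherwise assumptions):

    // Driver fragment; assumes a Job instance named "job" and word-count style classes.
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // runs on map-side output before the shuffle
    job.setReducerClass(IntSumReducer.class);  // the same summing logic runs again on the reduce side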

9) What happens when a data node fails?

When a data node fails:

  • The JobTracker and NameNode detect the failure.
  • All tasks that were running on the failed node are re-scheduled.
  • The NameNode replicates the user's data to another node.

10) Explain what is Speculative Execution?

During speculative execution in Hadoop, a certain number of duplicate tasks are launched. Multiple copies of the same map or reduce task can be executed on different slave nodes. In simple terms, if a particular node is taking a long time to complete a task, Hadoop will create a duplicate of that task on another node. The copy that finishes first is retained, and the copies that do not finish first are killed.

11) What is InputSplit in Hadoop? Explain.

When a Hadoop job runs, it splits the input files into chunks and assigns each chunk to a mapper for processing. Each of these chunks is called an InputSplit.


12) What is TextInputFormat?

In TextInputFormat, each line of the text file is a record. The value is the content of the line, while the key is the byte offset of the line. For instance, Key: LongWritable, Value: Text.
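
A minimal mapper sketch (the class name is illustrative) showing the types that TextInputFormat hands to the map function:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Key = byte offset of the line (LongWritable), Value = the line itself (Text).
    public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, offset); // emit each line keyed by its content, with its offset
        }
    }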


13) What is the SequenceFileInputFormat in Hadoop?

In Hadoop, SequenceFileInputFormat is used to read files in sequence. It is a specific compressed binary file format that passes data from the output of one MapReduce job to the input of another MapReduce job.


14) How many InputSplits are made by the Hadoop framework?

Assuming the default 64 MB block size and input files of 64 KB, 65 MB, and 127 MB, Hadoop makes 5 splits as follows:

  • 1 split for the 64 KB file
  • 2 splits for the 65 MB file, and
  • 2 splits for the 127 MB file

15) What is the use of RecordReader in Hadoop?

An InputSplit is assigned work but does not know how to access it. The RecordReader class is responsible for loading the data from its source and converting it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
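
As a sketch (the class name is hypothetical), a custom InputFormat often just returns an existing RecordReader such as LineRecordReader, which converts the bytes of each split into (key, value) pairs for the Mapper:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class MyTextInputFormat extends FileInputFormat<LongWritable, Text> {
        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                                   TaskAttemptContext context) {
            // The RecordReader reads the split's bytes and turns them into (key, value) pairs.
            return new LineRecordReader();
        }
    }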

16) What is a “Distributed Cache” in Apache Hadoop?

In Hadoop, data chunks are processed independently and in parallel across DataNodes, using a program written by the user. If we want to access a particular file from all the DataNodes, then we put that file into the distributed cache.


The MapReduce framework provides the distributed cache to cache files needed by applications. It can cache read-only text files, archives, JAR files, etc.
Once we have cached a file for our job, Hadoop makes it available on every DataNode where map/reduce tasks are running, so the file can be accessed from all the DataNodes in our map and reduce tasks.
An application that needs to use the distributed cache should make sure that the files are available at URLs, which can be either hdfs:// or http://. If a file is present at the mentioned URL, the user registers it as a cache file with the distributed cache. The framework then copies the cache file to all the nodes before starting tasks on those nodes. By default, the size of the distributed cache is 10 GB; it can be adjusted using the local.cache.size property.
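
As an illustration, a driver fragment (the HDFS path is a made-up example) that registers a file with the distributed cache through the org.apache.hadoop.mapreduce.Job API:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Job job = Job.getInstance(new Configuration(), "cache-example");
    // The framework copies this read-only file to every node before tasks start there.
    job.addCacheFile(new URI("hdfs:///user/hadoop/lookup.txt"));
    // Inside a Mapper or Reducer, context.getCacheFiles() lists the cached URIs,
    // and the files are also available locally in the task's working directory.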

17) How is security achieved in Hadoop?

Apache Hadoop achieves security by using Kerberos.
At a high level, there are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server.

  • Authentication – The client authenticates itself to the authentication server and receives a timestamped Ticket-Granting Ticket (TGT).
  • Authorization – The client uses the TGT to request a service ticket from the Ticket Granting Server.
  • Service Request – The client uses the service ticket to authenticate itself to the server.

18) Why does one remove or add nodes in a Hadoop cluster frequently?

One of the most important features of Hadoop is its use of commodity hardware. However, this leads to frequent DataNode crashes in a Hadoop cluster.
Another striking feature of Hadoop is the ease of scaling in response to rapid growth in data volume.
For these reasons, administrators frequently add and remove DataNodes in a Hadoop cluster.

19) What is throughput in Hadoop?

Throughput is the amount of work done in a unit of time. HDFS provides good throughput for the reasons below:

  • HDFS follows a write-once, read-many model. This simplifies data coherency issues, because data written once cannot be modified, and thus provides high-throughput data access.
  • Hadoop works on the data locality principle: computation is moved to the data instead of data to the computation. This reduces network congestion and therefore enhances the overall system throughput.

20) How to restart NameNode or all the daemons in Hadoop?

We can restart the NameNode using the following methods:

  • You can stop the NameNode individually using the /sbin/hadoop-daemon.sh stop namenode command, and then start it again using /sbin/hadoop-daemon.sh start namenode.
  • Use /sbin/stop-all.sh to stop all the daemons first, and then use /sbin/start-all.sh to start them all again.
    The sbin directory inside the Hadoop directory stores these script files.

21) Explain what is Hadoop?

It is an open-source software framework for storing data and running applications on clusters of commodity hardware.  It provides enormous processing power and massive storage for any type of data.

22) Mention what is the difference between an RDBMS and Hadoop?

  • RDBMS is a relational database management system, whereas Hadoop is a node-based flat structure.
  • An RDBMS is used for OLTP processing, whereas Hadoop is currently used for analytical and big data processing.
  • In an RDBMS, the database cluster uses the same data files stored in shared storage, whereas in Hadoop the data can be stored independently on each processing node.
  • With an RDBMS you need to preprocess data before storing it, whereas with Hadoop you don't need to preprocess data before storing it.

23) Mention Hadoop core components?

Hadoop core components include,

  • HDFS
  • MapReduce

24) What is NameNode in Hadoop?

The NameNode in Hadoop is where Hadoop stores all the file location information for HDFS. It is the master node on which the JobTracker runs, and it holds the metadata.

25) Mention what are the data components used by Hadoop?

Data components used by Hadoop are

  • Pig
  • Hive

26) How is indexing done in HDFS?

Hadoop has its own unique way of indexing. Once the data is stored as per the block size, HDFS keeps storing the last part of the data, which specifies the location of the next part of the data.


27) What happens when a data node fails?

If a DataNode fails, the JobTracker and NameNode detect the failure. After that, all the tasks that were running on the failed node are re-scheduled, and the NameNode replicates the user data to another node.


28) What is Hadoop Streaming?

Hadoop Streaming is a utility that allows you to create and run map/reduce jobs. It is a generic API that allows programs written in any language to be used as the Hadoop mapper (or reducer).


29) What is a combiner in Hadoop?

A combiner is a mini-reduce process that operates only on data generated by a mapper. When the mapper emits its data, the combiner receives it as input and sends its output to the reducer.


30) What are Hadoop's three configuration files?

Following are the three configuration files in Hadoop:

  • core-site.xml
  • mapred-site.xml
  • hdfs-site.xml

31) Explain what is a sequence file in Hadoop?

A sequence file is used to store binary key/value pairs. Unlike a regular compressed file, a sequence file supports splitting even when the data inside the file is compressed.
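
A small sketch (the output path and records are made up) of writing binary key/value pairs to a sequence file with SequenceFile.createWriter:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/tmp/example.seq");
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class))) {
                // Each append writes one binary key/value record.
                writer.append(new IntWritable(1), new Text("first record"));
                writer.append(new IntWritable(2), new Text("second record"));
            }
        }
    }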

32) When Namenode is down what happens to job tracker?

The NameNode is the single point of failure in HDFS, so when the NameNode is down your cluster will go down.

33) Explain how indexing in HDFS is done?

Hadoop has a unique way of indexing. Once the data is stored as per the block size, HDFS keeps storing the last part of the data, which says where the next part of the data will be.

34) Is it possible to search for files using wildcards?

Yes, it is possible to search for files using wildcards.
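
For example, FileSystem.globStatus accepts glob patterns such as * and ? (the paths below are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GlobExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Match every .log file under any month directory of 2024.
            FileStatus[] matches = fs.globStatus(new Path("/logs/2024/*/*.log"));
            for (FileStatus status : matches) {
                System.out.println(status.getPath());
            }
        }
    }

The hadoop fs -ls command accepts the same glob patterns on the command line.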

35) List out Hadoop’s three configuration files?

The three configuration files are

  • core-site.xml
  • mapred-site.xml
  • hdfs-site.xml

36) How can you check whether the NameNode is working, besides using the jps command?

Besides the jps command, you can also check whether the NameNode is working with

/etc/init.d/hadoop-0.20-namenode status.

37) Explain what is “map” and what is “reducer” in Hadoop?

In Hadoop, a map is a phase of solving a query in HDFS. A map reads data from an input location and outputs key/value pairs according to the input type.

In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.
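
A minimal sketch of both phases, based on the classic word-count example (class names are illustrative): the mapper emits a (word, 1) pair for every word it reads, and the reducer sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE); // emit one (word, 1) pair per word
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : values) {
                    sum += count.get(); // add up all counts emitted for this word
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }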

38) Which file controls reporting in Hadoop?

In Hadoop, the hadoop-metrics.properties file controls reporting.

39) What are the network requirements for using Hadoop?

The network requirements for using Hadoop are:

  • Password-less SSH connection
  • Secure Shell (SSH) for launching server processes

40) Mention what is rack awareness?

Rack awareness is the way in which the NameNode decides how to place blocks based on the rack definitions.

41) What is distributed cache in Hadoop?

A distributed cache is a facility provided by the MapReduce framework to cache files (text, archives, etc.) at the time of execution of the job. The framework copies the necessary files to the slave node before any task is executed on that node.


42) What is the functionality of JobTracker in Hadoop? How many instances of a JobTracker run on Hadoop cluster?

JobTracker is a service used to submit and track MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster, and it runs within its own JVM process.

Functionalities of JobTracker in Hadoop:

  • When a client application submits a job to the JobTracker, the JobTracker talks to the NameNode to find the location of the data.
  • It locates TaskTracker nodes with available slots at or near the data.
  • It assigns the work to the chosen TaskTracker nodes.
  • The TaskTracker nodes are responsible for notifying the JobTracker when a task fails; the JobTracker then decides what to do next. It may resubmit the task on another node, or it may mark that task as one to avoid.

43) How does the JobTracker assign tasks to the TaskTracker?

The TaskTracker periodically sends heartbeat messages to the JobTracker to confirm that it is alive. These messages also inform the JobTracker about the number of available slots, so the JobTracker knows where tasks can be scheduled.


44) Is it necessary to write jobs for Hadoop in the Java language?

No. There are many ways to deal with non-Java code. Hadoop Streaming allows any executable or shell command to be used as a map or reduce function.
