Hadoop Interview Questions

This set of interview questions focuses on Hadoop, covering basic to advanced concepts. Whether you’re just starting your preparation or need a refresher, these questions will boost your confidence. They’re suitable for campus and company interviews, as well as competitive examinations, spanning positions from entry to mid-level experience.

Hadoop Interview Questions with Answers

1. What is big data?

Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. It describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis.

2. What is the purpose of Hadoop?

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part called MapReduce.

3. What is HDFS?

HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.

4. What is the purpose of MapReduce?

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.
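The map → sort → reduce flow described above can be sketched in plain Python. This is a toy, single-process simulation of a hypothetical word-count job, not real Hadoop code; the `mapper` and `reducer` functions and the sample input are made up for illustration:

```python
from itertools import groupby
from operator import itemgetter

def mapper(_, line):
    """Map phase: emit an intermediate (word, 1) pair per token."""
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    """Reduce phase: collapse all values that share a key."""
    return word, sum(counts)

lines = ["big data is big", "hadoop processes big data"]

# Map: run the mapper over every input record.
intermediate = [pair for i, line in enumerate(lines) for pair in mapper(i, line)]

# Sort: the framework sorts the map outputs by key before reducing.
intermediate.sort(key=itemgetter(0))

# Reduce: one call per distinct key, with all of that key's values.
result = dict(reducer(k, (v for _, v in g))
              for k, g in groupby(intermediate, key=itemgetter(0)))

print(result)  # {'big': 3, 'data': 2, 'hadoop': 1, 'is': 1, 'processes': 1}
```

In real Hadoop the map and reduce calls run as separate tasks on different nodes, and the sort/shuffle step moves data across the network; the data flow, however, is the same.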

5. What is the difference between Hadoop and RDBMS?

A traditional RDBMS is used to handle relational data. Hadoop works well with structured as well as unstructured data, and supports various serialization and data formats. Unlike an RDBMS, which you can query in real time, the MapReduce process takes time and doesn't produce immediate results.

6. What is the purpose of the mapper in MapReduce?

The Mapper maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.


7. What is the purpose of the reducer in MapReduce?

Reducer reduces a set of intermediate values which share a key to a smaller set of values. Reducer has 3 primary phases:

  • Shuffle
  • Sort
  • Reduce

8. What is rack awareness?

A Hadoop Cluster is a collection of racks. Hadoop components are rack-aware. For example, HDFS block placement will use rack awareness for fault tolerance by placing one block replica on a different rack. This provides data availability in the event of a network switch failure or partition within the cluster.
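Rack-aware replica placement can be illustrated with a small sketch. The topology map and node names below are hypothetical, and the function only mimics the spirit of HDFS's default policy (keep one replica local, push the others onto a different rack so a single switch failure cannot take out every copy); the real placement logic in HDFS is considerably more involved:

```python
# Hypothetical cluster topology: node name -> rack id.
topology = {
    "node1": "rack1", "node2": "rack1",
    "node3": "rack2", "node4": "rack2",
}

def place_replicas(writer, topology, replication=3):
    """Toy placement: first replica on the writer's own node, the
    remaining replicas on nodes in a *different* rack, so that losing
    one rack (e.g. a switch failure) still leaves a live copy."""
    local_rack = topology[writer]
    remote_nodes = [n for n, r in topology.items() if r != local_rack]
    return [writer] + remote_nodes[:replication - 1]

replicas = place_replicas("node1", topology)
racks_used = {topology[n] for n in replicas}
print(replicas, racks_used)  # replicas span at least two racks
```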

9. What is the purpose of the NameNode?

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. The NameNode is a Single Point of Failure for the HDFS Cluster.

10. What is the purpose of the JobTracker?

The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack. The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.

11. What is the purpose of the TaskTracker?

A TaskTracker is a node in the cluster that accepts tasks – Map, Reduce and Shuffle operations – from a JobTracker. Every TaskTracker is configured with a set of slots, which indicate the number of tasks it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data; if none is found, it looks for an empty slot on a machine in the same rack.

12. What is the Checkpoint node?

The Checkpoint node periodically creates checkpoints of the namespace. It downloads fsimage and edits from the active NameNode, merges them locally, and uploads the new image back to the active NameNode. The Checkpoint node usually runs on a different machine than the NameNode since its memory requirements are on the same order as the NameNode. The Checkpoint node is started by bin/hdfs namenode -checkpoint on the node specified in the configuration file.

13. What is the purpose of block scanner in HDFS?

The Block Scanner tracks the list of blocks present on a DataNode and verifies them to find checksum errors. Block Scanners use a throttling mechanism to limit the disk bandwidth they consume on the DataNode.

14. What is a task instance?

Task instances are the actual MapReduce tasks that run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this ensures that a process failure does not take down the TaskTracker itself. Each Task Instance runs in its own JVM process, and there can be multiple task instances running on a slave node, based on the number of slots configured on the TaskTracker. By default, a new JVM process is spawned for each task.

15. What is fault tolerance in Hadoop?

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components.
The primary way that Hadoop achieves fault tolerance is through restarting tasks. Individual task nodes (TaskTrackers) are in constant communication with the head node of the system, called the JobTracker. If a TaskTracker fails to communicate with the JobTracker for a period of time (by default, 1 minute), the JobTracker will assume that the TaskTracker in question has crashed. The JobTracker knows which map and reduce tasks were assigned to each TaskTracker.


16. What is the purpose of a DataNode?

A DataNode stores data in the Hadoop File System. A functional filesystem has more than one DataNode, with data replicated across them. On startup, a DataNode connects to the NameNode, spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations.

17. What is commodity hardware?

Commodity hardware is a term for affordable devices that are generally compatible with other such devices. In a process called commodity computing or commodity cluster computing, these devices are often networked to provide more processing power when those who own them cannot afford to purchase more elaborate supercomputers, or want to maximize savings in IT design. Hadoop is designed to run on commodity hardware.

18. What is the purpose of the Secondary NameNode?

The Secondary NameNode's job is not to act as a standby for the NameNode, but to periodically read the filesystem edit log and apply the changes to the fsimage file, bringing it up to date. This allows the NameNode to start up faster the next time.

19. What is the purpose of combiner?

The combiner function is used as an optimization for a MapReduce job. It runs on the output of the map phase and acts as a filtering or aggregating step to lessen the number of intermediate key/value pairs passed to the reducer. In many cases the reducer class is also set as the combiner class. The difference lies in the output: the output of the combiner is intermediate data that is passed on to the reducer, whereas the output of the reducer is written to the output file on disk.
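The saving a combiner provides can be shown with a small sketch. The sample map output below is invented; the point is that a local, reducer-like aggregation on one mapper's output shrinks the number of pairs that must be shuffled across the network:

```python
from collections import Counter

# Hypothetical intermediate output of a single map task: (word, 1) pairs.
map_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1), ("data", 1)]

def combine(pairs):
    """Combiner: aggregate one mapper's output locally before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

combined = combine(map_output)
print(len(map_output), "pairs shuffled without a combiner")  # 5
print(len(combined), "pairs shuffled with a combiner")       # 2
print(combined)  # [('big', 3), ('data', 2)]
```

Note that this only works when the reduce operation is associative and commutative (like summation), which is why the reducer class can often double as the combiner class.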

20. What is the purpose of backup node?

The Backup node provides the same checkpointing functionality as the Checkpoint node, as well as maintaining an in-memory, up-to-date copy of the file system namespace that is always synchronized with the active NameNode state. Along with accepting a journal stream of file system edits from the NameNode and persisting this to disk, the Backup node also applies those edits into its own copy of the namespace in memory, thus creating a backup of the namespace.


21. What is block in HDFS?

The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default block size in HDFS is 64 MB in Hadoop 1.x (Hadoop 2.x raised the default to 128 MB).
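The arithmetic is straightforward: a file is split into full-size blocks plus, usually, one final partial block. A quick sketch, assuming the 64 MB default:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the Hadoop 1.x default

def blocks_for(file_bytes, block_size=BLOCK_SIZE):
    """Number of HDFS blocks a file occupies. The last block may be
    partial, and on disk it only consumes as much space as it holds."""
    return max(1, math.ceil(file_bytes / block_size))

# A 200 MB file spans 4 blocks: three full 64 MB blocks plus one 8 MB block.
print(blocks_for(200 * 1024 * 1024))  # 4
```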

22. What is shuffling in MapReduce?

The shuffle phase is handled by the framework. Shuffling is the process by which intermediate data from the mappers is transferred to 0, 1 or more reducers. Each reducer receives one or more keys and their associated values, depending on the number of reducers (for a balanced load). The values associated with each key are then locally sorted.
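The routing of keys to reducers can be sketched in the style of Hadoop's default hash partitioner. The intermediate pairs below are made up, and Python's `hash()` stands in for the Java `key.hashCode()`; the property to notice is that every value for a given key always lands on the same reducer:

```python
from collections import defaultdict

def partition(key, num_reducers):
    """Hash-partitioner-style assignment of a key to a reducer."""
    return hash(key) % num_reducers

intermediate = [("big", 1), ("data", 1), ("big", 1), ("hadoop", 1)]
num_reducers = 2

# Shuffle: route each (key, value) pair to its reducer, grouping by key.
reducer_input = [defaultdict(list) for _ in range(num_reducers)]
for key, value in intermediate:
    reducer_input[partition(key, num_reducers)][key].append(value)

for r, groups in enumerate(reducer_input):
    print(f"reducer {r}:", dict(sorted(groups.items())))
```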

23. What is InputSplit in Hadoop?

An InputSplit represents the data to be processed by an individual Mapper. Typically, it presents a byte-oriented view of the input, and it is the responsibility of the job's RecordReader to process this and present a record-oriented view.

24. What is the purpose of distributed cache in Hadoop?

DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications. DistributedCache can be used to distribute simple, read-only data/text files and/or more complex types such as archives, jars etc. DistributedCache tracks modification timestamps of the cache files.

25. What is Hadoop Streaming?

Hadoop Streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. Hadoop Streaming mappers and reducers read input from stdin and write output to stdout. Each mapper/reducer invocation is fed a stream of records (not just one record), whereas the standard Java mapper/reducer interfaces are called once for each record.
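A classic streaming job is word count, with the mapper and reducer as small scripts reading stdin line by line. The sketch below puts both roles into one file and simulates the `cat input | mapper | sort | reducer` pipeline in-process so it can run without a cluster; in a real job each function would be its own script passed via the `-mapper` and `-reducer` options:

```python
import io
from itertools import groupby

def streaming_mapper(stdin, stdout):
    """What a mapper script does: read raw lines, emit tab-separated pairs."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def streaming_reducer(stdin, stdout):
    """What a reducer script does: its input arrives sorted by key, so
    consecutive lines with the same word can be summed in a single pass."""
    pairs = (line.rstrip("\n").split("\t") for line in stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        stdout.write(f"{word}\t{sum(int(v) for _, v in group)}\n")

# Simulate `cat input | mapper | sort | reducer` without a cluster.
mapped = io.StringIO()
streaming_mapper(io.StringIO("big data\nbig hadoop\n"), mapped)
shuffled = io.StringIO("".join(sorted(mapped.getvalue().splitlines(keepends=True))))
reduced = io.StringIO()
streaming_reducer(shuffled, reduced)
print(reduced.getvalue())  # big 2, data 1, hadoop 1 (tab-separated lines)
```

On a cluster, the same logic would be submitted along the lines of `hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py` (exact jar path varies by distribution).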

Manish Bhojasia - Founder & CTO at Sanfoundry
Manish Bhojasia, a technology veteran with 20+ years @ Cisco & Wipro, is Founder and CTO at Sanfoundry. He lives in Bangalore, and focuses on development of Linux Kernel, SAN Technologies, Advanced C, Data Structures & Algorithms.