HDFS System


The Hadoop Distributed File System (HDFS) is a distributed, scalable file system designed to store large volumes of data across a cluster of commodity hardware. It is a core component of Apache Hadoop and provides the storage layer for distributed processing in big data applications. Its key characteristics and components are described below.

Characteristics of HDFS:

  1. Distributed and Scalable: HDFS distributes data across multiple nodes in a cluster, allowing it to scale horizontally as the data volume grows. New nodes can be added to the cluster to accommodate more data.

  2. Fault Tolerance: HDFS achieves fault tolerance through data replication. It stores multiple copies of each data block (three by default, controlled by dfs.replication) on different nodes in the cluster. If a node or block becomes unavailable, clients read from one of the remaining replicas, and the NameNode schedules new copies until the block is back at its target replication.

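  3. High Throughput: HDFS is optimized for high-throughput, streaming access to large files rather than low-latency access to individual records. It is well suited to batch processing and data-intensive workloads that read large files sequentially.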

  4. Write-Once, Read-Many Model: HDFS is optimized for a write-once, read-many access pattern. A file is written once and then read many times; it can be appended to or truncated, but not modified in place. This makes HDFS suitable for large, accumulating datasets such as log files and sensor data.

  5. Block-Based Storage: HDFS divides large files into fixed-size blocks (128 MB by default in Hadoop 2.x and later, often raised to 256 MB), which are distributed across the cluster. Only the last block of a file can be smaller than the block size. This block-based storage simplifies data distribution and replication.
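
To make the block and replication arithmetic concrete, here is a minimal Python sketch. It is not part of HDFS; the function name block_layout is hypothetical, and the 128 MB block size and replication factor of 3 are assumed defaults that a real cluster may override.

```python
MB = 1024 * 1024


def block_layout(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (block_count, last_block_bytes, raw_bytes_stored) for one file.

    Hypothetical helper for illustration only. It assumes a fixed block
    size and that every block reaches its full replication factor.
    """
    if file_size_bytes == 0:
        return 0, 0, 0
    # Ceiling division without floating point.
    block_count = -(-file_size_bytes // block_size_bytes)
    # The last block holds only the remaining bytes; it is not padded.
    last_block_bytes = file_size_bytes - (block_count - 1) * block_size_bytes
    # Each byte of the file is stored once per replica.
    raw_bytes_stored = file_size_bytes * replication
    return block_count, last_block_bytes, raw_bytes_stored


blocks, last_block, raw = block_layout(1000 * MB)
print(blocks, last_block // MB, raw // MB)  # prints: 8 104 3000
```

Under these assumptions, a 1000 MB file is stored as seven full 128 MB blocks plus one 104 MB block, and it consumes 3000 MB of raw disk space across the cluster.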

Components of HDFS:

  1. NameNode:

    • The NameNode is the central metadata and namespace management server in HDFS. It keeps track of file and directory structures, permissions, and the mapping of data blocks to their locations.
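    • In a basic deployment the NameNode is a single point of failure. To address this, Hadoop 2.x introduced HDFS High Availability (HA), which runs an active NameNode alongside one or more standby NameNodes that can take over on failure. The Secondary NameNode, described below, predates HA and does not provide failover.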
  2. DataNode:

    • DataNodes are responsible for storing and managing the actual data blocks. They serve read and write requests directly from clients, and they create, delete, and replicate blocks on instruction from the NameNode.
    • DataNodes periodically send heartbeats and block reports to the NameNode to confirm their status and report available data blocks.
  3. Block Reports and Heartbeats:

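    • A heartbeat is a short, frequent message (every 3 seconds by default) that tells the NameNode a DataNode is alive; the NameNode uses its replies to hand back commands such as replicating or deleting a block. A block report is a less frequent, full listing of the blocks a DataNode stores, which the NameNode uses to keep its block-to-location map accurate.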
  4. Secondary NameNode:

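    • The Secondary NameNode periodically creates a checkpoint by merging the NameNode's edit log into the file system image (fsimage). This keeps the edit log small and shortens NameNode restart time. It is not a backup or standby NameNode and cannot take over if the NameNode fails.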
  5. HDFS Clients:

    • Applications and users interact with HDFS through HDFS clients. A client asks the NameNode for metadata and block locations, then reads from or writes to the DataNodes directly, so file data never passes through the NameNode.
  6. Rack Awareness:

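    • HDFS supports rack awareness, which improves data locality and reduces cross-rack network traffic. Once the cluster's rack topology is configured (without it, all nodes are treated as belonging to a single default rack), the NameNode places the replicas of each block on more than one rack, so that the loss of an entire rack does not make the block unavailable.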
  7. Web User Interfaces:

    • HDFS provides web-based user interfaces, such as the NameNode and DataNode web UIs, for monitoring the health and status of the cluster, including live and dead DataNodes, storage capacity, and under-replicated blocks.
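
The rack-aware placement described above can be sketched in a few lines of Python. This is a simplified illustration of the default policy for a replication factor of three (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack), not Hadoop's actual implementation. The function place_replicas, the rack names, and the node names are all hypothetical, and the sketch assumes the writer is itself a DataNode and ignores node load and free space.

```python
import random


def place_replicas(topology, writer_node, rng=None):
    """Choose three nodes for one block under a simplified rack-aware policy.

    topology: dict mapping rack name -> list of node names (hypothetical).
    writer_node: node the client writes from; assumed to be a DataNode.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # Replica 1: the writer's own node, so the first copy needs no network hop.
    first = writer_node

    # Replica 2: a node on a different rack, to survive a whole-rack failure.
    remote_racks = [
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("need a second rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)
    second = rng.choice(topology[remote_rack])

    # Replica 3: a different node on the same rack as replica 2, which
    # avoids a second cross-rack transfer in the write pipeline.
    third = rng.choice([n for n in topology[remote_rack] if n != second])
    return [first, second, third]


topology = {
    "/rack1": ["dn1", "dn2", "dn3"],
    "/rack2": ["dn4", "dn5", "dn6"],
}
replicas = place_replicas(topology, "dn1", random.Random(0))
print(replicas)
```

With this layout the block ends up on two racks: one copy stays local to the writer, and two copies sit on different nodes of the other rack.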

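Heartbeat-based failure detection can be illustrated in the same way. The sketch below assumes the stock defaults of a 3-second heartbeat interval (dfs.heartbeat.interval) and a 5-minute recheck interval (dfs.namenode.heartbeat.recheck-interval), and the commonly documented rule that a DataNode is declared dead after 2 × recheck interval + 10 × heartbeat interval. Check hdfs-default.xml for your Hadoop version before relying on these numbers. The function names and the sample timestamps are hypothetical.

```python
def dead_node_timeout_seconds(heartbeat_interval_s=3, recheck_interval_ms=300_000):
    """Seconds of silence after which a DataNode is treated as dead.

    Assumed rule: 2 * recheck interval + 10 * heartbeat interval.
    """
    return 2 * recheck_interval_ms // 1000 + 10 * heartbeat_interval_s


def find_dead_nodes(last_heartbeat, now, timeout_s):
    """Return the nodes whose last heartbeat is older than timeout_s.

    last_heartbeat: dict mapping node name -> time of last heartbeat (seconds).
    """
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > timeout_s
    )


timeout = dead_node_timeout_seconds()
print(timeout)  # prints: 630

# Hypothetical snapshot: dn2 has been silent for 700 seconds.
last_heartbeat = {"dn1": 1000, "dn2": 300}
print(find_dead_nodes(last_heartbeat, now=1000, timeout_s=timeout))  # prints: ['dn2']
```

Under these defaults a DataNode must be silent for 10 minutes 30 seconds before the NameNode treats it as dead and starts re-replicating its blocks, which avoids reacting to short network interruptions.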