
Protecting the filesystem metadata
Because the fsimage file is so critical to the filesystem, its loss is a catastrophic failure. In Hadoop 1, where the NameNode was a single point of failure, the best practice was to configure the NameNode to synchronously write the fsimage and edits files both to local storage and to at least one other location on a remote filesystem (often NFS). In the event of NameNode failure, a replacement NameNode could be started using this up-to-date copy of the filesystem metadata. The process would require non-trivial manual intervention, however, and would result in a period of complete cluster unavailability.
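As an illustration, a minimal Hadoop 1 hdfs-site.xml sketch of this practice might look like the following; the directory paths are hypothetical and should be replaced with your own local and NFS-mounted locations:

```xml
<!-- hdfs-site.xml (Hadoop 1): a sketch of writing NameNode metadata to
     multiple locations. The paths shown are examples only. -->
<property>
  <name>dfs.name.dir</name>
  <!-- Comma-separated list; fsimage and edits are written synchronously to each -->
  <value>/data/1/dfs/nn,/mnt/remote-nfs/dfs/nn</value>
</property>
```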
Secondary NameNode not to the rescue
The most unfortunately named component in all of Hadoop 1 was the Secondary NameNode, which, not unreasonably, many people expected to be some sort of backup or standby NameNode. It was not; instead, the Secondary NameNode was responsible only for periodically reading the latest versions of the fsimage and edits files and creating a new, up-to-date fsimage with the outstanding edits applied. On a busy cluster, this checkpoint could significantly speed up the restart of the NameNode by reducing the number of edits it had to apply before being able to service clients.
In Hadoop 2, the naming is clearer; there are Checkpoint nodes, which perform the role previously filled by the Secondary NameNode, plus Backup NameNodes, which keep a local, up-to-date copy of the filesystem metadata, even though the process of promoting a Backup node to be the primary NameNode is still a multistage manual one.
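As a rough guide, the frequency of this checkpointing in Hadoop 2 is controlled by a couple of properties in hdfs-site.xml; the values below are the commonly cited defaults, but verify them against your distribution:

```xml
<!-- hdfs-site.xml: checkpoint tuning (values shown are the usual defaults) -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <!-- Maximum time, in seconds, between two consecutive checkpoints -->
  <value>3600</value>
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <!-- Force a checkpoint once this many uncheckpointed transactions accumulate -->
  <value>1000000</value>
</property>
```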
Hadoop 2 NameNode HA
In most production Hadoop 2 clusters, however, it makes more sense to use the full High Availability (HA) solution instead of relying on Checkpoint and Backup nodes. It is actually an error to try to combine NameNode HA with the Checkpoint and Backup node mechanisms.
The core idea is to run a pair of NameNodes (currently no more than two are supported) in an active/passive configuration. One NameNode acts as the live master that services all client requests, and the second remains ready to take over should the primary fail. In particular, HDFS in Hadoop 2 enables this HA through two mechanisms:
- Providing a means for both NameNodes to have consistent views of the filesystem
- Providing a means for clients to always connect to the master NameNode
There are actually two mechanisms by which the active and standby NameNodes keep their views of the filesystem consistent: the use of an NFS share or the Quorum Journal Manager (QJM).
In the NFS case, there is an obvious requirement for an external remote NFS file share; note that because the use of NFS was best practice in Hadoop 1 for keeping a second copy of the filesystem metadata, many clusters already have one. If high availability is a concern, though, it should be borne in mind that making NFS itself highly available often requires high-end and expensive hardware. When Hadoop 2 HA uses NFS, the NFS location becomes the primary location for the filesystem metadata: as the active NameNode writes all filesystem changes to the NFS share, the standby node detects these changes and updates its copy of the filesystem metadata accordingly.
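As a sketch, NFS-based HA is configured by pointing both NameNodes at the same shared directory in hdfs-site.xml; the mount point shown here is hypothetical:

```xml
<!-- hdfs-site.xml: shared edits directory on an NFS mount (example path only) -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/filer/hdfs/ha-shared-edits</value>
</property>
```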
The QJM mechanism uses an external service (the Journal Managers) instead of a shared filesystem. The Journal Manager cluster is an odd number of services (3, 5, or 7 are the most common), each running on its own host. All changes to the filesystem are submitted to the QJM service, and a change is treated as committed only when a majority of the QJM nodes have committed it. The standby NameNode receives change updates from the QJM service and uses this information to keep its copy of the filesystem metadata up to date.
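Configuring QJM-based HA instead means pointing the same shared edits property at the Journal Manager quorum; the hostnames and the "mycluster" nameservice name below are assumptions for illustration:

```xml
<!-- hdfs-site.xml: shared edits provided by a three-node JournalNode quorum
     (hostnames and the "mycluster" nameservice are examples only) -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
<property>
  <name>dfs.journalnode.edits.dir</name>
  <!-- Local directory on each JournalNode host where the edits are stored -->
  <value>/data/1/dfs/jn</value>
</property>
```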
The QJM mechanism does not require additional hardware, as the Journal Manager processes are lightweight and can be co-located with other services. Nor is there a single point of failure in this model. Consequently, QJM-based HA is usually the preferred option.
With both NFS-based and QJM-based HA, the DataNodes send block status reports to both NameNodes to ensure that both have up-to-date information about the mapping of blocks to DataNodes. Remember that this block assignment information is not held in the fsimage/edits data.
Client configuration
Clients of the HDFS cluster remain mostly unaware that NameNode HA is being used. The configuration files need to include the details of both NameNodes, but the mechanisms for determining which is the active NameNode, and when to switch to the standby, are fully encapsulated in the client libraries. The fundamental concept, though, is that instead of referring to an explicit NameNode host as in Hadoop 1, HDFS in Hadoop 2 identifies a nameservice ID for the NameNode, within which multiple individual NameNodes (each with its own NameNode ID) are defined for HA. Note that the concept of a nameservice ID is also used by NameNode federation, which we briefly mentioned earlier.
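A minimal client-side sketch, assuming a nameservice called mycluster with two NameNodes named nn1 and nn2 on hypothetical hosts, might look like this:

```xml
<!-- hdfs-site.xml: define the nameservice and its member NameNodes
     (the nameservice ID, NameNode IDs, and hostnames are examples only) -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <!-- The class the client uses to determine which NameNode is currently active -->
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

In core-site.xml, fs.defaultFS then refers to the nameservice (hdfs://mycluster) rather than to any single NameNode host.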
How a failover works
Failover can be either manual or automatic. A manual failover requires an administrator to trigger the switch that promotes the standby to the currently active NameNode. Though automatic failover has the greatest impact on maintaining system availability, there might be conditions in which it is not desirable. Triggering a manual failover requires running only a few commands, so even in this mode, failover is significantly easier than in Hadoop 1 or with Hadoop 2 Backup nodes, where the transition to a new NameNode requires substantial manual effort.
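As an illustration of how little is involved, a manual failover is typically driven through the hdfs haadmin tool; the NameNode IDs nn1 and nn2 are assumptions carried over from the earlier sketch:

```sh
# Check which NameNode is currently active (nn1/nn2 are example NameNode IDs)
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Promote nn2 to active, demoting nn1, with fencing applied as configured
hdfs haadmin -failover nn1 nn2
```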
Regardless of whether the failover is triggered manually or automatically, it has two main phases: confirmation that the previous master is no longer serving requests and the promotion of the standby to be the master.
The greatest risk in a failover is a period in which both NameNodes are servicing requests. In such a situation, conflicting changes might be made to the filesystem by the two NameNodes, or their views of it might fall out of sync. Even though this should not be possible when the QJM is being used (it only ever accepts connections from a single client), out-of-date information might still be served to clients, who might then make incorrect decisions based on this stale metadata. This is, of course, particularly likely if the previous master NameNode is behaving incorrectly in some way, which is often why the need for the failover was identified in the first place.
To ensure that only one NameNode is active at any time, a fencing mechanism is used to validate that the existing NameNode master has been shut down. The simplest included mechanism will try to ssh into the NameNode host and actively kill the process, though a custom script can also be executed, so the mechanism is flexible. The failover will not continue until the fencing is successful and the system has confirmed that the previous master NameNode is now dead and has released any required resources.
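A sketch of the fencing configuration in hdfs-site.xml, assuming ssh-based fencing with an example key path; the shell(...) entry shows where a custom script (the script name here is hypothetical) could be plugged in:

```xml
<!-- hdfs-site.xml: fencing methods, tried in order during a failover
     (the key path and script name are examples only) -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/usr/local/bin/fence-namenode.sh)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>
```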
Once fencing succeeds, the standby NameNode becomes the master and will start writing to the NFS-mounted fsimage and edits logs if NFS is being used for HA, or will become the single client of the QJM if that is the HA mechanism.
Before discussing automatic failover, we need a slight segue to introduce another Apache project that is used to enable this feature.