Unable to start Cluster Service on One Node
A Windows Failover Cluster demo I gave at work failed horribly, even though the same demo had gone perfectly the previous week. A case of Sod’s Law. For some reason one of the nodes wouldn’t join the cluster, so I was unable to demo the failover process.
I first tried starting the cluster service on the dead node.
Server Manager > Features > Failover Cluster Manager > ClusterName > Nodes > Node1
Right Click > More Actions > Start Cluster Service. This failed with an error.
Checking the cluster event log, I found the following errors:
Event Id: 1069
Cluster resource ‘Cluster Disk 4’ in clustered service or application ‘Cluster Group’ failed.
Event Id: 1570
Node ‘Node1’ failed to establish a communication session while joining the cluster. This was due to an authentication failure. Please verify that the nodes are running compatible versions of the cluster service software.
Event Id: 1573
Node ‘Node1’ failed to form a cluster. This was because the witness was not accessible. Please ensure that the witness resource is online and available.
Next I thought I’d validate the node using the validation wizard. When I tried to add the other node to the wizard’s list of servers to check, it immediately failed with the error “OpenService RemoteRegistry failed”. Google suggestions included checking that the RemoteRegistry service was running, checking for any AD user accounts matching computer names, and checking for firewall issues. All of these checked out so, running out of ideas, I rebooted both cluster nodes.
Once both nodes were back up, the cluster was running on the other node! So clearly there was nothing wrong with the nodes themselves. Rather, the issue was due to something between them. I attempted to access the other node with a UNC path in Explorer.
OK! Interesting! I checked the date and time on both server nodes and my domain controller, and all were perfectly in sync. Just to be doubly sure, I synchronised the cluster nodes’ clocks with the following command.
net time /domain:YourDomain /set
This appeared to have no effect on my problem at all. Finally I had the sense to check the time zone of my domain controller, which revealed it was set to Eastern United States time (I’m in the UK) while my cluster nodes were set to GMT.
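To see why identical clock faces were not enough, here is a minimal Python sketch. It assumes the authentication failure came from Kerberos clock-skew checking, which compares UTC timestamps and by default tolerates a difference of five minutes; I never confirmed that this was the mechanism. The function name, the sample date, and the fixed offsets are made up for illustration, and real time zones also observe daylight saving.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets keep the sketch self-contained (assumption: no daylight saving).
GMT = timezone(timedelta(hours=0), "GMT")
US_EASTERN = timezone(timedelta(hours=-5), "EST")

# Assumed default Kerberos tolerance for clock differences.
MAX_SKEW = timedelta(minutes=5)


def utc_skew(wall_clock, zone_a, zone_b):
    """Absolute UTC difference between two machines that display the same
    wall-clock time but are configured with different time zones."""
    a = wall_clock.replace(tzinfo=zone_a)
    b = wall_clock.replace(tzinfo=zone_b)
    return abs(a - b)


# Hypothetical moment when every server showed 09:30 on its clock.
wall = datetime(2012, 1, 16, 9, 30)
skew = utc_skew(wall, GMT, US_EASTERN)

print(skew)             # 5:00:00
print(skew > MAX_SKEW)  # True
```

Under those assumptions the domain controller and the nodes disagreed about the real time by five hours, even though the clocks looked identical.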
After I changed the domain controller’s time zone to GMT and rebooted both nodes and the domain controller (it wouldn’t work before doing this), my cluster was finally running happily again. I was able to fail over clustered services and applications between nodes. Time to get ready for another successful demo next week!