
Tuesday, September 9, 2025

MSSQL Failover Cluster Build Failure

MSSQL Failover Cluster Build Failure: An error occurred while creating the cluster and the nodes will be cleaned up.

Recently I was working on building an MSSQL Failover cluster and encountered an issue.

To give a little background: in my 16+ years of Windows system administration I have built more than 50 clusters. Most of them were built on a pair of physical servers with shared storage from a SAN and a Node and Disk Majority setup, so I know at least the basics of how a cluster operates. I apologize in advance if you can think of additional troubleshooting that I should have performed.

The current setup was:

  • Two Dell PowerEdge 750 servers with RAID 1 for the OS (BOSS disks)
  • Heartbeat cabling done, and both nodes communicating on their private IPs.
  • A CNO pre-staged and kept in a disabled state.
  • Storage presented from the SAN over dual-port HBAs, including one 1 GB LDEV for quorum.

I started the cluster build process and added both eligible nodes for validation.

Validation passed without any errors, but as soon as I moved on to create the cluster, it waited for a couple of seconds and then failed with the following error:

Error : An error occurred while creating the cluster and the nodes will be cleaned up. Please wait...

After that I was presented with the error below:

There was an error cleaning up the cluster nodes. Use Clear-ClusterNode to manually clean up the nodes.

An error occurred while creating the cluster.
An error occurred creating cluster 'MSSQLCLXXXXXXXX'.

The specified server cannot perform the requested operation

 

I tried using PowerShell cmdlets to create the cluster but, as expected, that failed with the same error.

Unfortunately I was not able to capture the error message or screenshots, but here is what I have from the PowerShell transcript, which is enabled by default on our servers and captures every command executed.

 

I could not figure out what was going wrong or why the cluster was not forming, so I performed the following pre-checks again to make sure everything was in order.

  • Checked that a DC is reachable from both nodes ==> PASS
  • Checked communication on the private and public NICs between both nodes ==> PASS
  • Checked that the necessary DNS resolution works from both nodes ==> PASS
  • Checked that the SAN storage can be brought online and offline on both nodes ==> PASS
  • Checked that both computer objects have Full Control permissions on the CNO in AD ==> PASS

Even with everything set up and working properly, the build still failed. I started thinking about taking a network capture from the node during cluster creation and having it analyzed by some of my networking friends, to see whether something expected at the network layer was not working.

That step depended on someone with networking knowledge (which I don't have), so at the back of my mind I kept wondering what could be wrong.

I then went back to basics and launched a Command Prompt as Administrator.

I typed in the command netstat -ano and started the cluster creation wizard again.

I filtered the output of netstat -ano to exclude every connection in the ESTABLISHED or LISTENING state, so that only the remaining states were left.
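The filtering step can be scripted. Below is a minimal Python sketch, assuming the usual five-column TCP rows that netstat -ano prints on Windows (Proto, Local Address, Foreign Address, State, PID). The function names are my own for illustration and are not part of any tool.

```python
import subprocess

# States that are filtered out; anything else is worth a closer look.
IGNORED_STATES = {"ESTABLISHED", "LISTENING"}


def pending_tcp_connections(netstat_output):
    """Return (local, remote, state, pid) for TCP rows in any other state."""
    rows = []
    for line in netstat_output.splitlines():
        parts = line.split()
        # TCP rows have five columns: Proto, Local, Foreign, State, PID.
        # UDP rows (no State column) and header lines do not match.
        if len(parts) == 5 and parts[0] == "TCP" and parts[3] not in IGNORED_STATES:
            rows.append((parts[1], parts[2], parts[3], parts[4]))
    return rows


def capture_netstat():
    """Run netstat -ano and return its text output (call this on the node)."""
    result = subprocess.run(
        ["netstat", "-ano"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On a node, calling pending_tcp_connections(capture_netstat()) while the wizard is running would list only the connections that are neither ESTABLISHED nor LISTENING, which is where a blocked outbound attempt tends to show up.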

I found a connection attempt to an IP address on TCP port 3268 that was sitting in the SYN_SENT state. SYN_SENT means the node had sent its connection request but had not received a reply.

I know that 3268 is the port for the Global Catalog, and the IP address belonged to one of the DCs in my environment.

That was the moment I figured out that cluster formation depends on the Global Catalog: unless a Global Catalog is reachable on TCP port 3268 during cluster creation, the cluster will not form and the process fails.
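A reachability test for this port could be added to the pre-checks. The following is a minimal Python sketch using only the standard library. The function name and the default timeout are my own choices, and the host name in the usage note is a placeholder.

```python
import socket

GLOBAL_CATALOG_PORT = 3268  # Global Catalog port named in this post


def tcp_port_reachable(host, port=GLOBAL_CATALOG_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (a connection that never
        # leaves SYN_SENT) and name resolution failures.
        return False
```

For example, tcp_port_reachable("dc01.example.com") returns False when a firewall silently drops the traffic, after waiting for the timeout. It only proves that the TCP handshake completes, not that the Global Catalog answers queries.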

I worked with the Active Directory folks and learned that the Global Catalog ports are restricted for the whole environment, and connectivity is whitelisted only on request and with a valid justification. (I did not try to dig into the reason.)

After Global Catalog connectivity was allowed, I attempted to create the cluster again and it worked without any issues.

So the summary is:

Connectivity to a Global Catalog on TCP port 3268 is also required when building an MSCS failover cluster.


Happy troubleshooting!

Thursday, August 14, 2025

Windows Server 2025 Domain Join Issue – Fix for “Secure Channel Broken” Error

Microsoft released Windows Server 2025 to the public back in November 2024. Following our organization's policy of introducing the latest available operating systems, I was tasked with introducing Windows Server 2025 to the environment.

I started by automating the installation using Packer, which is an open source tool for automating OS image creation. After the OS installation, when I tried to join the server to an Active Directory (AD) domain, I ran into login and trust issues after the reboot.

As a standard system hardening procedure, we have a set of checks based on the CIS Benchmarks, and according to those we modify several OS settings to harden the OS image.

When I joined the server to the domain from System Properties, the join was reported as successful. After the reboot, however, logging on with a domain ID from the console was not allowed and gave the following error:

"The Sign-in method you're trying to use isn't allowed"

I got into the server using the local Administrator credentials, launched a Command Prompt as Administrator, and checked the local group membership using:

net localgroup administrators

No domain IDs or groups were listed as members (Domain Admins should be part of the group by default).

When I tried adding a group manually, it failed with an error saying that the secure channel is broken or not working properly.

 

 

Symptoms of the Problem

After installing Windows Server 2025 and attempting a domain join from System Properties:

  • The join process completes successfully with no errors.
  • After reboot:
    • Cannot log in with any domain account.
    • Only the local Administrator account works.
  • Trying to add a domain group (e.g., Domain Admins) to the local Administrators group fails with:
    • The trust relationship between this workstation and the primary domain failed
    • or The secure channel between this workstation and the domain controller is broken
  • Re-joining the domain does not fix the problem.

 

Why This Happens

The latest CIS Benchmark checks include one that restricts NTLM authentication. In our hardened image this was set to Deny all, which blocks NTLM completely. I changed it to Audit all (allow, but log audit events).

Error Messages You Might See

These are the common errors linked to this problem:

  • The trust relationship between this workstation and the primary domain failed
  • The secure channel between this workstation and the domain controller is broken

Solution – Update Group Policy Settings

The fix is to update a specific Group Policy setting so that Windows Server 2025 can establish a compatible secure channel with your Domain Controllers.

  1. Open the Group Policy Management Console (GPMC) on a Domain Controller.
  2. Navigate to: Computer Configuration → Windows Settings → Security Settings → Local Policies → Security Options
  3. Set 'Network security: Restrict NTLM: Outgoing NTLM traffic to remote servers' to Audit all.
  4. Apply the updated Group Policy.
  5. Run gpupdate /force
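To confirm which mode actually landed on a server after the policy refresh, the effective value can be read from the registry. The sketch below is in Python and rests on an assumption worth verifying against Microsoft's documentation: that this policy is stored as RestrictSendingNTLMTraffic under HKLM\SYSTEM\CurrentControlSet\Control\Lsa\MSV1_0, with 0 = Allow all, 1 = Audit all and 2 = Deny all. The function names are illustrative only.

```python
# Assumed data values for the RestrictSendingNTLMTraffic registry value.
NTLM_OUTGOING_MODES = {0: "Allow all", 1: "Audit all", 2: "Deny all"}

LSA_MSV1_0_KEY = r"SYSTEM\CurrentControlSet\Control\Lsa\MSV1_0"
VALUE_NAME = "RestrictSendingNTLMTraffic"


def describe_ntlm_outgoing(value):
    """Translate the raw registry value into the policy's display name."""
    if value is None:
        return "Not configured"
    return NTLM_OUTGOING_MODES.get(value, f"Unknown ({value})")


def read_ntlm_outgoing():
    """Read the value from the local registry (Windows only)."""
    import winreg  # imported lazily so the helper above works on any OS

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, LSA_MSV1_0_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
    except FileNotFoundError:
        # The value does not exist when the policy was never configured.
        return None
```

On a hardened server, print(describe_ntlm_outgoing(read_ntlm_outgoing())) should report Audit all once the updated policy has applied, if the assumed mapping holds.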

 

After this change:

  • New Windows Server 2025 machines should join the domain without trust issues.
  • Already joined machines should show domain groups in the local Administrators group automatically.
  • Domain logins should work normally.
  • The secure channel with the Domain Controllers should stay healthy.