Eliminating the impossible

How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?”  Sherlock Holmes

Since I haven’t found this issue elsewhere, and since it’s been awhile since I’ve blogged, I figured I’d post.

So, the scenario is:

Windows 2008 R2 Cluster that was pre-existing before I arrived on the scene.  2-nodes.  setup to run SQL Server.  SQL Server would run fine on Node A, but a failover to Node B would fail.

Some back history to the setup that wasn’t complete nor detailed.  But the problem was suspected to be with DNS or Active Directory.

I arrive on the scene and one of my jobs is to setup additional clustered SQL instances on this Windows Cluster. I do so, expecting to have the exact same issue. Nope. Things work fine once I figure out the rights my user needed but didn’t have in order to ADD the second node (Logon as a service btw). (For the time being I built a 1-node cluster, yes, you can do that, and then once I had the rights, simply added the 2nd node.)

So, now I’m in the situation with a 2-node cluster and 3 SQL instances.  Two fail over as expected.  One (the original) does not.

Time to put on my debugging hat on.

I won’t bore you with the details.  Suffice to say I tried a lot.

Compared ipconfig /all results – Everything the same (what wasn’t the same, I made the same where it made sense to.  Still no joy.)

Pinged the WINS and DNS servers from both boxes. OK, here was a difference.  Node A could ping both its primary and its secondary WINS server.  Node B could NOT ping its secondary WINS server.  Interesting. But, didn’t really seem like the issue since it couldn’t explain why the other 2 instances would fail over just fine.

Checked out the registry.  Same in both cases.

Start to look at error logs.  At first nothing.  Then realize that according to the timestamps, a SQLError Log IS being created on Node B.  I look even more closely. The service is actually STARTING!  But then it’s stopping. And in between there’s a bunch of errors about not being able to log in.  Very strange.

So now I try to tackle the problem from a different angle. I fail over the disk and IP resources but don’t tell the cluster service to startup SQL Server.

Then, I go to the command line and start the service manually.

Works fine. Connections can be made, etc.  Of course the cluster service doesn’t think it’s up, but that’s to be expected and ok at this point.

But, this is only a partial test.  Since maybe it’s my user that can do this, but not the service account.

So, go to the services screen, change SQL Server to startup using my account and confirm that works.  Great.

Change it back to the designated service account and start it manually from there.  Starts just fine.

BUT, no login errors.

Finally that part clicks.  The thing trying to login and do a query is the CLUSTER Service itself. It’s simply the heartbeat Cluster Service uses to make sure the node started. No wonder, it is attempting to start the node and then failing. It never hears the heartbeat.

Since it takes about a minute for the startup to actually fail, I confirm that I can connect to SQL Server in that minute window.  Sure enough, no problem, at least until the Cluster Service fails it.

So basically SQL Server is in fact running properly and starting up properly. It’s simply that the Cluster Service can’t confirm it is running so it shuts SQL Server down.

I started to try several various things that all ended up in a blind alley.

Then as I was poking around the SQL Server Configuration Manager on Node B it dawned on me to look at the SQL Native Client and compare it to Node A. The one critical difference was that Node B had some aliases setup.  They looked correct, but following a troubleshooting axiom of mine “One of these things is not like the other” I decided to rename them (not delete them, since another axiom is “don’t break anything you can’t fix”) so they wouldn’t be used.

I then tested the failover, fully not expecting this to solve the problem. The failover worked just fine. Wow. That surprised me.Of course I never trust anything I can’t replicate.  Changed the aliases back to their original form. Test failover. It fails. Change them back to the updated names and things work again.

I had my solution.

Now, my blog is intended to be more about thinking than actual technical issues, but for this I’ll make an exception.  So for future reference, Google and more:

The error I received in the SQL Error logs was:

2013-02-20 08:36:47.74 Logon Login failed for user ”. The user is not associated with a trusted SQL Server connection. [CLIENT:]
2013-02-20 08:36:47.74 Logon Error: 17806, Severity: 20, State: 2.

No Googling for this helped.(It’s a common error in other contexts, none were helpful here that I found.)

But otherwise, this was your basic troubleshooting.

  • Eliminate possibilities
  • Try variations
  • When you think you’ve solved it, replicate it.

And, no matter how improbable it is (I never would have guessed Aliases) if you’ve eliminated everything else, it must be that.


One thought on “Eliminating the impossible

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s