Fixing SQL AG Routing "Cannot find host" using packet analysis (and a spell checker)
I was setting up an availability group listener recently and when I tried to connect to the listener I got an error message to say "The host cannot be found" after the usual 15 seconds connection delay.
I checked the usual thing like TCP was enabled on the replicas and that routing was configured, but every time I tried I kept getting the "The host cannot be found" error. Now, I don't know about you, but I find error messages like this completely infuriating especially when I knew the hostname of the listener resolved because I could ping the damn thing (and then connect straight to the IP address.
Now with a listener, you connect and then either connected or you are routed to one of the other hosts. I knew I could connect so guessed that I was being redirected to somewhere else but I didn't know where. Of course, if the error had just said "The host (hostname) cannot be found" I wouldn't be writing this today :)
So I decided what any self-respecting lover of network packets would do, I opened a command prompt google'd "netsh trace start command line not rubbish thingy" to get to this blog: https://blogs.msdn.microsoft.com/canberrapfe/2012/03/30/capture-a-networ... and ran "netsh trace start capture=yes tracefile=.\path\to\trace.etl".
When the trace was running, I quickly reproduced the error and then ran "netsh trace stop" and copied the etl file to my laptop where I had "Microsoft Packet Analyzer" installed. A quick hop skip and a jump over to the etl file in the packet analyzer tool and I wanted to see if I could see a failed DNS request so I went to the "Filter" window and added the word "dns" then clicked "Apply":
This did show a request for the wrong machine in the wrong domain, but how did it get there? When I saw this I went to SQL and the read-only routing URL in sys.availability_read_only_routing_lists and of course it was totally wrong. That was the fix, but not the end of this story dear reader.
Now I had a network trace of routing in action I thought to myself, as I am sure everyone who has ever worked with SQL Server has thought at one stage or another: "I wonder if the routing is handled via a TDS ENVChange token?".
I changed my filter from "dns" to "tds" and then I could see all the tds traffic:
I also changed the columns to include source and destination ports so I could see which packets were from the client and which were from the server. Traffic to the server will have a destination port of 1433 (unless you are mental and change it or aren't mental but are using an instance), the source will be different for every connection.
Looking at the source / destination pairs I could see there was more than one set of connections intermingled so I guessed mine was the one with the source port of 52264 as 62084 was sending SQLBatch tds packets and I wasn't getting that far, so a change of the filter from "tds" to "tcp.port == 52264" and we have a trace with just our traffic:
So what is all this? Well, a SQL Server connection goes like this (forget about silly things like local connections and named pipes, this is just TCP connections obviously):
1 - Create a TCP connection:
Client - Hi Server, I want to create a TCP connection to port (probably) 1433
Server - OK, go on then
Client - Thanks, I will then
The "S" in the "Flags" part of the summary for each packet is short for SYNCHRONIZE and it is TCP's way of saying Hello, the client sends a S, the server responds to the S with an A for ACKNOWLEDGEMENT and sends its own S for SYNCHRONIZE, the client then hilariously replies to the server's S with its own A. To put it another way:
"The SYN, SYN ACK, ACK" TCP handshake.
2 - Create a TDS connection
Now this is where life jumps into the fact lane, go back to google grab the [MS-TDS] spec (https://msdn.microsoft.com/en-us/library/dd304523.aspx) and read it :). Done? Good so you know the sequence to create a TDS connection over the previous TCP connection is:
First the client sends a PRELOGIN packet, this is unencrypted and works with basically any decent version of TDS from 4.2 upwards.
After the PRELOGIN packet, we get some TLS traffic so SQL can encrypt the LOGIN7 packet. This can be "man in the middle'd" unless you have "Trust server certificate" set to false, so if your not on a LAN make sure you always do that.
We'll skip the TLS setup stuff and can't see the response to the LOGIN7 packet as it is encrypted but we do see the response to the LOGNI7 packet which includes some NTLM setup guff which we don't care about here.
Once the connection is validated and the user logged on, we get the response to the login:
Now what happens in a login response is SQL says lots of nice things to us like "your database is now XXX" and "your language is XXX". It does this by sending a series of tokens that are the ENVCHANGE (environment change) token and also an INFO token which is the message you see written to the "messages" window in SSMS.
If we look at some of the tokens we see our first token is an ENVCHANGE with a token type of 1 = DATABASE which is a change database context token:
I'll leave you to go through any other TOKENS you like and jump onto the last token in the packet which is again an ENVCHANGE but packet analyzer doesn't recognise the TOKEN which is of type 20 (0x14). Jump back to the [MS-TDS] docs for ENVCHANGE TOKEN 20 and we see:
HOOPLA!, token 20 is an ENVCHANGE token that routes the client from one host to another, if we examine it we see the name of the server we will be redirected to:
Anyway, now to put the correct domain name in the routing list!
PS, Just to finish off the connection we can see the F flag sent from the client, the server responding with its own F and an A and then the client sends its own A and then nothing else happens. I will leave you to figure what this last bit was about and for bonus points have a look at why people don't use R or RESET (or RST) to close connections!