Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

The above error message is common when working with SQL Server, however the generic nature of the error message means that inexperienced developers go down a rabbit hole trying to figure out the root cause. Yesterday one such event occurred in one of our production databases where a stored procedure executed via Entity Framework threw the below error.

Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding

After a day of troubleshooting, I was asked to review a fix of adding NOLOCK hint to a query that performs select operation on a table. The developers felt the root cause was a deadlock being caused when an import process was being run. This raised alarm bells for me since I generally prefer not using Hints of possible. But on the internet more often than not you will find advice on hints being recommended even as the first step. In this post I cover how to troubleshoot the issue and fix it.

Why I DO NOT prefer using Hints to fix issues?

Because it’s in a bright red box of the documentation itself.

First let’s understand Timeouts

A timeout can occur in SQL Server as well as any other software such as IIS etc. It simply means the software tried to perform an operation and got stuck waiting for it to complete. After waiting a long enough time, the software decided to timeout rather than wait indefinitely. An example of a timeout in real life would be when you are waiting to checkout items at the grocery store. You wait for the queue to move forward, but after a wait if the queue doesn’t move forward you decide to walk out without buying anything.

Difference between Connection timeout and Query time out

A connection time out occurs when SQL Server is trying to establish a connection to the server and isn’t able to. You may often find this issue as the first indicator when there is memory pressure. Every new connection to SQL has to have a small amount of memory allocated to it in order to perform its tasks. When SQL in unable to provide the minimum memory required you will get a connection time out before you get a out of memory exception. A connection timeout usually means there is a server wide issue such as network connectivity is broken or there is memory pressure etc.

On the other hand, a Query timeout as its name suggest occurs when SQL Server is executing a query and the server is taking too long to grant resources required for this query. The error message mentioned in the title of this post is related to query timeouts. Keep in mind if a query takes a long time to execute but it’s still returning rows to the output then it won’t timeout. Basically, there is progress being made; it is just slow. In such cases there is no timeout and the query may keep executing for two hours if needed as long as it is not stuck. A timeout occurs when the query is trying to access resources like memory, locks, threads etc. Basically, in order to execute the query, it requires a Lock on a table to be released or some memory to be allocated or there is a shortage of CPU threads. In such cases the query is waiting for the resource to be granted but the process is taking time. The amount of time spent waiting is based on the complexity of the query. The more complex the query the longer the wait time. This ensures that users do not wait excessive amounts of time for even small queries that should have finished quickly. 

How to find out the reason for the query time out?

Logging and Monitoring. I cannot stress enough how important it is to ensure you have proper logging and monitoring setup on the database servers. This itself can be a whole series of posts. In this case the database is hosted in Azure and therefore we have a comprehensive suite of products to log / audit and monitor the database.  The one developers are most familiar with would be Log Analytics. This is a feature where Azure stores Usage metrics and Additional data such as every query and error message generated on the database server. These logs are then saved in a NOSQL database called Kusto which has its own query language called KQL, it is a very easy query language to learn and you can refer the end of this post for some basic pointers on learning the language.

By writing the appropriate KQL query I get the below details for the time period when the error occurred and we can see clearly the reason for the timeout was a TRY CATCH Block that didn’t rollback a transaction properly after a delete operation failed.

The above clearly indicates that the issue was caused by the table being unavailable as result of a failed transaction rollback. At this point it should be clear that adding a HINT would not have helped at all in preventing the issue. It is easy to assume the query time out is the result of poor indexing, deadlocks etc but identifying the root cause correctly will save you a ton of trouble. 

How to be sure?

If you cannot reproduce the issue based on your assumption you need to dig deeper. Implement Monitoring and Logging. At the bare minimum you need to have query store and Usage Metric enabled when using Azure, when using on Prem, use Query Store + XE + default trace to narrow down the root cause. Enabled more advanced troubleshooting tools later to gather more details. I was immediately suspicious because of the NOLOCK hint but often it may not be so obvious. Which is why it’s important to recreate the issue. It is the only way to validate the solution works and more importantly that the correct root cause has been identified.

Once we identify that the timeout error is a symptom of another error (Failed rollback) it’s simply a matter of fixing the underlying query. In this case the order of events for the try catch block and the correct spelling of the transaction name are the first solution that needs to be implemented.

REFERENCES

https://learn.microsoft.com/en-us/troubleshoot/sql/database-engine/performance/troubleshoot-query-timeouts