Friday, August 9, 2013

A Debugging Nightmare

A few weeks ago, I ran into the most complicated bug I think I have ever had to solve in my entire programming career.

We are developing a fairly complex system that involves 4 WCF services, a SQL database, and an integration component to a Microsoft Access front-end application.  The bulk of the system involves synchronizing data between the Microsoft Access database and the SQL database via XML files.  The system was largely developed by a 3rd-party contractor who came in-house on a particular day so that we could work together to try to resolve the issue.

The basic problem was that initiating the sync would work fine when we manually started it via a stored procedure in SQL server, but when doing an end-to-end process from Microsoft Access, it failed every time.  The two calls should have been 100% identical, because we were manually calling the stored procedure that eventually gets called by Microsoft Access.  We could even demonstrate through our logging that the exact same stored procedure was being called in both cases with the same parameters, but it would only work when manually run.

We traced the calls through using a combination of database logging, text file logging, and Fiddler tracing to try and see what was going on.  There was nothing we could see different about the two requests and no clear reason why it would fail, until suddenly we stumbled on a clue.  When running the end-to-end test from Microsoft Access, it would fail after 30 seconds with a timeout error message.

At first, we thought one of the WCF services was timing out because the error message looked exactly like a web service timeout.  But eventually with the help of Fiddler (which I *highly* recommend btw!) it was clear the error message came from the server-side via a FaultException, not from the client-side.  So the error was occurring inside the service.  Eventually, I pinpointed it down to a single database call that was generating a timeout error, but only when done through the end-to-end client call.

It wasn't until I pulled out the stacktrace for the FaultException and tracked down the exact line with the error that I had the "aha!" moment.  Turns out, the real problem was a timeout caused by the process running in a transaction.  There is a record modified right before the stored procedure is called, which then calls .NET code that tries to modify the same record, but the table has been locked.  Apparently, the .NET code called by a stored procedure is actually a separate session from the caller, and so they get into deadlock situation.  Once I saw the error message, it was immediately obvious what the problem was.  I simply removed the conflicting update statements and now it works perfectly.

Happy programming!

No comments:

Post a Comment