Friday, August 23, 2013

An Error? Carry on anyway!

Lately I've had to do a lot of development in our core internal application, which is responsible for managing data flow between our clients' systems and our network.  This system also coordinates data for some of our 3rd-party partner integrations, such as Pay-as-you-Go Worker's Compensation.  Multiple times throughout the day, a number of scheduled tasks run which import data into databases, download and process XML files, generate and upload more XML files, and perform other similar large-scale data-processing tasks.

So you'd think a system at this core level should be well-designed, right?  Wrong!  There are a number of faults in the architecture of this system, but by far the most blatant is a pattern called "Exception? Won't happen!"  It's the most pernicious of them all because it results in a number of complex, hard-to-debug data defects:

  • Data that is partially processed, or processed incorrectly, and has to be rolled back manually.
  • Files that are generated, but not uploaded properly.
  • Multiple days' worth of data just "missing" from generated files.

Here's what the pattern looks like in a nutshell:

Try
  PrepareData()
Catch ex As Exception
  LogException(ex)
End Try

Try
  GenerateFile()
Catch ex As Exception
  LogException(ex)
End Try

Try
  ArchiveData()
Catch ex As Exception
  LogException(ex)
End Try

Try
  UploadFile()
Catch ex As Exception
  LogException(ex)
End Try

At first glance you might think there's nothing wrong - the developer has wisely added exception handling and logging to the system to make sure errors can be detected and reviewed.  The problem comes if something does indeed fail.  For example, what happens if the "UploadFile()" step fails?  Well, the first three steps will have already finished and been committed.  The data has been archived permanently, but the generated file never got uploaded to the 3rd-party network.  That means they will never receive the data, and we will never send it again, because it's been marked "complete"!  Apparently the developer assumed that "nothing could go wrong".

Resolving this defect can be a little time-consuming, but it's definitely worth it. I generally approach it this way (a rough sketch follows the list):

  1. Wrap all sub-steps in some sort of transaction.
  2. Pass the same transaction to each method.
  3. If the entire process (work unit) is successful, commit the transaction at the end.
  4. If anything fails on any step, roll the entire process back, including deleting any temp files that were partially created.
  5. Everything is then left in a consistent state for another try in the next processing cycle.
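
Here's a rough sketch of that shape in C#. The step methods are the same placeholders as in the snippet above, just now sharing a single transaction; this is illustrative only, not the actual production code:

using System;
using System.Data.SqlClient;
using System.IO;

// Hypothetical sketch: one transaction spans the whole work unit, so a failure
// in any step undoes everything the earlier steps did.
public void RunExportCycle(string connectionString)
{
    string tempFile = null;
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var transaction = connection.BeginTransaction())
        {
            try
            {
                PrepareData(transaction);
                tempFile = GenerateFile(transaction);
                ArchiveData(transaction);
                UploadFile(tempFile);

                // Only mark the work unit complete once every step has succeeded.
                transaction.Commit();
            }
            catch (Exception ex)
            {
                LogException(ex);
                transaction.Rollback();

                // Clean up partially created artifacts so the next cycle starts clean.
                if (tempFile != null && File.Exists(tempFile))
                    File.Delete(tempFile);

                throw; // surface the failure instead of swallowing it
            }
        }
    }
}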

Just for fun, here's another gem from the same code-base that I ran into while stepping through with the debugger in Visual Studio (the highlighted line is the next statement to execute):



Happy programming!

Friday, August 16, 2013

Windows Azure RDP

So the other day I was looking for a good solution for managing multiple RDP windows. On a normal day I'll be working with up to 10 different RDP sessions at once, and it gets annoying managing all those icons on the taskbar.  After hunting around and comparing a few different products, I settled on Terminals, an open-source RDP session manager hosted on CodePlex.  My ideal RDP manager would have the following core properties:

  • Has good connection management - groups, saved credentials, etc.
  • Captures special keys correctly (Windows key, Alt+Tab, Ctrl+Shift+Esc, etc.)
  • Supports Windows Azure

Terminals comes close because it handles the first two very well, but unfortunately it does not support Windows Azure.  However, since it is open-source and written in C#.Net, I thought I would just modify it myself to add support.

If you've worked with Windows Azure, you'll be familiar with the Connect link at the bottom of the Azure Portal when viewing information about a service instance.  That link gives you an .RDP file to download which can be used to connect to the instance.  If you type the public instance name (name.cloudapp.net) into an RDP session without going through the .RDP file download, you'll quickly discover that it doesn't work.  So what is different about the .RDP file generated by the portal?

When you open the file, there is a special property value unique to Windows Azure that is required for the connection.  Without it, you get no response whatsoever, and no idea why your connection is not working.  I guess that's a security feature by design in Azure.  Here's an example file (with values redacted):

full address:s:subdomain.cloudapp.net
username:s:bobbytest
LoadBalanceInfo:s:Cookie: mstshash=subdomain#subdomain_IN_0

You'll notice there is a special value called "LoadBalanceInfo" which is not present in typical RDP sessions.  So I added the property to the UI of the Terminals program and modified all the places that generate RDP sessions to make use of it as well.  However, every time I tried to connect, still no response.  After doing a little research, I became convinced that I was just missing a small detail, and that Wireshark (network capture software) would provide the answer.

With Wireshark I quickly discovered exactly what the issue was - my strings were being sent over the wire as UTF-16, which means every character takes 2 bytes.  Here is an example of what the TCP conversation dump looks like:

C.o.o.k.i.e.:. .m.s.t.s.h.a.s.h.=.s.u.b.d.o.m.a.i.n.#.s.u.b.d.o.m.a.i.n._.I.N._.0.

The dots in between each character are zero bytes - the other half of each 2-byte Unicode character.  Since the class used for the RDP connections is an ActiveX component provided by Microsoft, actually modifying the class was out of the question.  But there is a way around it - re-encode the string as UTF-8 and pack two of those single-byte characters into each 2-byte character of a new string.  It becomes a really weird string if you try to interpret it as UTF-16 (the default in .NET), but the ActiveX class, which reads it as 1-byte characters, handles it beautifully:

var b = Encoding.UTF8.GetBytes(temp);
var newLBI = Encoding.Unicode.GetString(b);

The only catch to this solution is that if the original string has an odd number of characters, the last character will still end up with a 0-byte and be rejected.  The simple solution is to pad with an extra space at the end, which seems to work okay.  Add to that the requirement that LoadBalanceInfo must always be suffixed with "\r\n", and the full working solution is below:

var temp = LoadBalanceInfo;
if (temp.Length % 2 == 1) temp += " ";        // pad to an even length so no stray 0-byte is left over
temp += "\r\n";                                // LoadBalanceInfo must end with CR/LF
var b = Encoding.UTF8.GetBytes(temp);          // single-byte representation of the cookie
var newLBI = Encoding.Unicode.GetString(b);    // pack each pair of bytes into one 2-byte character
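
For context, the re-encoded value then just gets assigned to the RDP ActiveX control's advanced settings before connecting. Exactly where that happens in Terminals depends on which version of the MSTSC wrapper it uses, so treat this as an illustrative guess rather than the actual patch:

// Hypothetical wiring - LoadBalanceInfo lives on the control's
// IMsRdpClientAdvancedSettings interface, but wrapper member names vary by version.
rdpClient.AdvancedSettings2.LoadBalanceInfo = newLBI;
rdpClient.Connect();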

Magically, it works!  I created a patch that adds Windows Azure support and uploaded it to the Terminals project site, so you can try it for yourself.  Happy programming!

Friday, August 9, 2013

Basic Auth > Anonymous Auth

Had a very strange issue today.  We have a WCF-service-based system we're deploying on a client's network.  One of the services uses Basic Authentication for normal calls, handled via a custom HttpModule.  However, we wanted one specific subfolder of the WCF service (/downloads/) to use only anonymous auth so that files could be downloaded without a password.

It seemed like it should be relatively straightforward.  I modified the logic in the Basic Auth module to skip the authentication step for any path starting with /downloads/.  It worked beautifully in our testing environment.  However, the problems began when we moved the code onto our client's network.  Every time I tried to access a URL containing /downloads/, I would incorrectly get the Basic Auth prompt, even though that path was supposed to be exempt.
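
For reference, the skip logic in the module looked roughly like this. This is a simplified sketch, not the actual module, and the type name and credential check are placeholders:

using System;
using System.Web;

// Simplified sketch of a Basic Auth HttpModule that exempts /downloads/.
public class BasicAuthModule : IHttpModule
{
    public void Init(HttpApplication application)
    {
        application.AuthenticateRequest += OnAuthenticateRequest;
    }

    private void OnAuthenticateRequest(object sender, EventArgs e)
    {
        var context = ((HttpApplication)sender).Context;

        // Anything under /downloads/ should fall through to anonymous access.
        if (context.Request.AppRelativeCurrentExecutionFilePath
                   .StartsWith("~/downloads/", StringComparison.OrdinalIgnoreCase))
            return;

        var header = context.Request.Headers["Authorization"];
        if (header == null
            || !header.StartsWith("Basic ", StringComparison.OrdinalIgnoreCase)
            || !ValidateCredentials(header))
        {
            context.Response.StatusCode = 401;
            context.Response.AddHeader("WWW-Authenticate", "Basic realm=\"service\"");
            context.ApplicationInstance.CompleteRequest();
        }
    }

    private bool ValidateCredentials(string authorizationHeader)
    {
        // Placeholder - the real module validates the decoded username/password
        // against the application's credential store.
        return false;
    }

    public void Dispose() { }
}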

In an attempt to debug the issue, I commented the Basic Auth module out of web.config completely so the website would use Anonymous Auth globally.  However, when I tried to access any path in the service from a web browser, it generated a 401.3 error, which is a physical ACL access-denied error.  That made no sense, because the application pool identity for the IIS website had full permissions to the folder containing the service files.

After doing a little research I discovered that the account used by default for Anonymous Auth is specified separately from the application pool identity.  Even if you specify in the website's Basic Settings that "Connect As" should be Pass-through (Application Pool Identity), that is separate from the Anonymous Auth setting.  It turns out that if you right-click the Anonymous Authentication setting in an IIS site and choose Edit, you can specify the account used for anonymous requests, and by default that account is IUSR.  We changed it to use the application pool identity and it started working beautifully.
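
The same change can be expressed in configuration as well. A minimal sketch, assuming the anonymousAuthentication section is unlocked at the site level (it is often locked, and has to be set in applicationHost.config or through IIS Manager instead):

<system.webServer>
  <security>
    <authentication>
      <!-- An empty userName tells IIS to run anonymous requests as the application pool identity -->
      <anonymousAuthentication enabled="true" userName="" />
    </authentication>
  </security>
</system.webServer>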

However, this leaves me somewhat puzzled as to how the Basic Auth accounts were working when Anonymous Auth was not.  The Basic Auth accounts are tied to a database, which is entirely segregated from the ACL-level permissions in Windows, which are tied to Active Directory on the network.  Apparently, just by virtue of using Basic Auth with any account, the request runs under the application pool identity; but if no username is supplied at all, it falls back to the Anonymous Auth default user - regardless of whether the Basic Auth username has anything to do with the network.  Very unexpected behavior, and very frustrating to debug.

Happy programming!

A Debugging Nightmare

A few weeks ago, I ran into the most complicated bug I think I have ever had to solve in my entire programming career.

We are developing a fairly complex system that involves 4 WCF services, a SQL database, and an integration component for a Microsoft Access front-end application.  The bulk of the system involves synchronizing data between the Microsoft Access database and the SQL database via XML files.  The system was largely developed by a 3rd-party contractor, who came in-house for the day so that we could work together to try to resolve the issue.

The basic problem was that the sync would work fine when we started it manually via a stored procedure in SQL Server, but when run end-to-end from Microsoft Access, it failed every time.  The two calls should have been 100% identical, because we were manually calling the same stored procedure that eventually gets called by Microsoft Access.  We could even demonstrate through our logging that the exact same stored procedure was being called in both cases with the same parameters, yet it would only work when run manually.

We traced the calls through using a combination of database logging, text-file logging, and Fiddler tracing to try to see what was going on.  There was nothing visibly different about the two requests and no clear reason why one would fail, until we suddenly stumbled on a clue: when running the end-to-end test from Microsoft Access, it would fail after 30 seconds with a timeout error message.

At first, we thought one of the WCF services was timing out, because the error message looked exactly like a web service timeout.  But eventually, with the help of Fiddler (which I *highly* recommend, by the way!), it was clear the error message came from the server side via a FaultException, not from the client side.  So the error was occurring inside the service.  From there, I pinpointed it to a single database call that was generating a timeout error, but only when invoked through the end-to-end client call.
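
Pulling the server-side stack trace out of that fault is straightforward once includeExceptionDetailInFaults is turned on for the service. A rough sketch - the proxy variable and operation name here are made up, not the real service contract:

// FaultException<T> and ExceptionDetail live in System.ServiceModel.
try
{
    client.StartSync(batchId);   // hypothetical proxy call
}
catch (FaultException<ExceptionDetail> fault)
{
    // With <serviceDebug includeExceptionDetailInFaults="true" /> on the service,
    // the original server-side exception rides along in the fault detail.
    Console.WriteLine(fault.Detail.Message);
    Console.WriteLine(fault.Detail.StackTrace);
}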

It wasn't until I pulled out the stack trace from the FaultException and tracked down the exact line with the error that I had the "aha!" moment.  It turns out the real problem was a lock timeout caused by the process running in a transaction.  A record is modified right before the stored procedure is called; the stored procedure then calls .NET code that tries to modify the same record, but the row is still locked.  The .NET code called by a stored procedure runs as a separate session from the caller, so the second session blocks on the first session's lock until the command times out.  Once I saw the error message, it was immediately obvious what the problem was.  I simply removed the conflicting update statements and now it works perfectly.
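
To illustrate the shape of the problem (the table and column names are made up, not the actual schema): the outer transaction locks a row, then the inner call opens its own connection and waits on that same row until the command times out.

using System;
using System.Data.SqlClient;

// Hypothetical illustration of the lock conflict - not the production code.
public static void ReproduceLockTimeout(string connectionString, int jobId)
{
    using (var outer = new SqlConnection(connectionString))
    {
        outer.Open();
        using (var tx = outer.BeginTransaction())
        {
            // Outer session: update the row, taking an exclusive lock that is
            // held until the transaction commits or rolls back.
            var mark = new SqlCommand(
                "UPDATE SyncJobs SET Status = 'Running' WHERE Id = @id", outer, tx);
            mark.Parameters.AddWithValue("@id", jobId);
            mark.ExecuteNonQuery();

            // Inner session: the .NET code invoked by the stored procedure opens
            // its own connection, so it is a separate session that simply waits
            // on the outer session's lock.
            using (var inner = new SqlConnection(connectionString))
            {
                inner.Open();
                var touch = new SqlCommand(
                    "UPDATE SyncJobs SET LastRun = GETDATE() WHERE Id = @id", inner);
                touch.Parameters.AddWithValue("@id", jobId);
                touch.CommandTimeout = 30;
                touch.ExecuteNonQuery();   // blocks on the outer lock, then times out
            }

            tx.Commit();                   // never reached once the inner call fails
        }
    }
}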

Happy programming!