Sunday, September 19, 2010

An IIS Crash Analysis Story

Last week I attempted to upgrade a high-traffic production ASP.NET application to ASP.NET 4.  Despite having tested everything thoroughly in a staging environment, I was surprised to find that under production load the system was erratic and slow to respond to requests.  Further inspection revealed that IIS was actually crashing under load, leaving a cryptic message in the event log but no other clues.  This post is an account of what I did and how I ultimately resolved the issue, written for my own future reference and to help others who may face similar situations.
In The Beginning
The application in question is a fairly high-traffic site, running on several servers in a Network Load Balanced web farm, with each server typically handling 50-100 ASP.NET requests per second.  The application makes heavy use of ASP.NET caching, and my post about monitoring and tuning ASP.NET caching includes some charts showing its behavior prior to the move to .NET 4.  Note in particular the Windows Task Manager behavior: a steadily climbing memory footprint, eventually followed by a “cliff” when cache trims occur and memory is freed.  While not ideal, this was the status quo for the application, and I expected to see much the same behavior after upgrading to .NET 4.
Preparing for the Upgrade
The application has few individual ASP.NET pages, and in any event the upgrade from ASP.NET 3.5 to ASP.NET 4.0 was fairly painless.  The biggest issue was replacing instances of my own Tuple class with the new Tuple type that is now part of the .NET Framework.  With that done, I was able to test the application locally, and all appeared well.  Tests passed.  Pages loaded.  I checked everything in, and the build server confirmed everything looked good.  I deployed to stage, switched the application there to .NET 4 in IIS, and tested again.  Still good; no worries.  This was starting to look like a walk in the park.
Performing Updates to a Web Farm Using Windows NLB
I have another post detailing how to perform a rolling upgrade of a web farm using Windows Network Load Balancing (NLB).  These are pretty much the steps I followed to test the move to .NET 4.  I pulled one server out of the cluster, copied the stage site to the production site, switched the application to .NET 4 in IIS, and tested on localhost.  Everything looked great (as it should have, since the server was now configured exactly like the stage site, which had also tested OK).  Using NLB Manager, I did a Resume and Start on the node I’d pulled out, then watched Perfmon and Task Manager to see how it fared.
Surprise!  Finding the Problem
Read more: Steve Smith