Thursday, November 18, 2010

The Black Art of P/Invoke and Marshaling in .NET

Last week I finally managed to hunt down and resolve a bug that I had been chasing for quite some time. A couple of years ago I built an ASP.NET web service that makes use of a native library to provide most of its functionality. This native DLL exposes a C API, much like the Win32 API, that we’ve been using for integrating a highly-expensive product from a certain vendor into our system.
I didn’t notice any issues during the initial development of this web service, in fact, I was very pleased with how easy it was to use and integrate this API. Until this very day, I still consider this as one of the nicest and most stable C API’s I’ve ever come across. After I finished most of the features for the web service, I ran a couple of load tests in order to determine whether the respective product could withstand a fair amount of requests and also to detect any glitches that might come up in the interop layer. Nothing major turned up, so we decided to bring this service into production and go on with out merry lives.
Aft first we didn’t receive any complaints. Everything worked out as expected, until earlier this year, we noticed some failures of the application pool for our web service in IIS. The event log showed some random crashes of the CLR (= very bad) with some highly obscure error message. These issues occurred completely random. One day, there were about five to six application pool crashes after which it behaved very stable again, sometimes for months in a row.
After doing some investigation on my part, which also involved stretching my shallow knowledge of WinDbg, I found out that .NET runtime was doing a complete shutdown after an AccessViolationException being thrown. This exception reported the following issue:
Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
The first thing I considered was a memory leak turning up somewhere. It’s unmanaged code after all, right? After reviewing the code that uses the imported native functions, I discovered two potential memory leaks that might come up during some edge case scenarios. So I fixed those, but unfortunately it didn’t resolve the problem.

Read more: <elegantc*de>