Last week I finally managed to hunt down and resolve a bug that I had been chasing for quite some time. A couple of years ago I built an ASP.NET web service that makes use of a native library to provide most of its functionality. This native DLL exposes a C API, much like the Win32 API, that we’ve been using for integrating a highly-expensive product from a certain vendor into our system.
I didn’t notice any issues during the initial development of this web service, in fact, I was very pleased with how easy it was to use and integrate this API. Until this very day, I still consider this as one of the nicest and most stable C API’s I’ve ever come across. After I finished most of the features for the web service, I ran a couple of load tests in order to determine whether the respective product could withstand a fair amount of requests and also to detect any glitches that might come up in the interop layer. Nothing major turned up, so we decided to bring this service into production and go on with out merry lives.
Aft first we didn’t receive any complaints. Everything worked out as expected, until earlier this year, we noticed some failures of the application pool for our web service in IIS. The event log showed some random crashes of the CLR (= very bad) with some highly obscure error message. These issues occurred completely random. One day, there were about five to six application pool crashes after which it behaved very stable again, sometimes for months in a row.
After doing some investigation on my part, which also involved stretching my shallow knowledge of WinDbg, I found out that .NET runtime was doing a complete shutdown after an AccessViolationException being thrown. This exception reported the following issue:
Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
The first thing I considered was a memory leak turning up somewhere. It’s unmanaged code after all, right? After reviewing the code that uses the imported native functions, I discovered two potential memory leaks that might come up during some edge case scenarios. So I fixed those, but unfortunately it didn’t resolve the problem.
I tried a couple of other things and also approached the issue from a couple of different angles. I’m not going to bore you with the details here, but it clearly didn’t solve anything. I started to become a bit frustrated with this problem. At some point, I was even convinced that this was all caused by the native library itself and not in our code, which is a classic mistake in most long debugging sessions. The good thing was that I was able to reproduce the problem on my machine, by running the load tests with a particular high load. I noticed that there was also a certain pattern in the number of requests between each crash of the application pool.
Vastly determined to fix this, I decided to read up on some information about P/Invoke and marshaling in .NET. I ended up reading this blog post on P/Invoke and memory related issues. While it didn’t provide a clear solution, reading that blog post certainly guided me in the right direction. I started to turn off feature by feature until I was able to isolate the cause of the crash to a single function call. The native function in question has the following signature:
As it turned out, I created the following P/Invoke prototype for this native function in C#:
At first sight (and even after a number of subsequent reviews), there seems to be nothing wrong with the way the char * return value is marshaled to a String. I started reading the excellent documentation for this function again, and it explicitly mentioned that the caller should not cleanup the memory for the return value. The allocated memory for the return value is always cleaned up by the native function itself.
Doing some more research on marshaling strings, I found out that in case the native function does its own cleanup, an IntPtr must be returned by the C# prototype declaration instead of a String.
Most c-style strings returned by native functions must be cleaned up by the calling code. For this common scenario, when the return value is marshaled to a String, the interop marshaler assumes that it has to free the unmanaged memory returned by the function. In our case, this actually means that the interop marshaler tried to cleanup memory that at some point was already freed up by the native function or vice versa.
So I changed the return value for the P/Invoke prototype declaration to an IntPtr. This way, the interop marshaler does not automatically free the memory that is referenced by the IntPtr.
char* FunctionThatReturnsAString();
Marshaling the data referenced by the IntPtr to a string is quite easy. This can be done using the Marshal.PtrToStringAuto method.
The lesson I learned here is that we need to watch out for these kind of issues when using interop with unmanaged code. On the surface everything seems to work out just fine, but one can still run into some nasty issues afterwards. Carefully considering how to correctly marshal data from and to unmanaged code is an essential technique and sometimes feels like a black art that is not suited for the faint of heart, like myself :-).