This is the story of a wicked bug. Just like in a good Agatha Christie story, the criminal has been hiding, leaving behind false evidence, framing innocents, and resisting almost any logic and attempt to be identified.
If you want to play along, dear reader, read this little blog post as you would a crime novel. Don’t flip to the end to se who the killer is. Try to guess along with the detective, and see at what point you can pinpoint the guilty one. If you only want to learn from our mistakes, you can skip to the end (after the coffee).
Here is how it starts. We are working on a new Silverlight 4 application – we’ve been working on it for quite a while in fact. It was coming along nicely. We’ve been showing it to the customer for a while, and he was pleased. But as Silverlight 4 has been released, the client decided to give it a whirl on his own computer. And he came back to us with very weird error reports and screenshots. Statuses did not get transitioned, lists did not get populated. But only some. Most of the app worked, but at some places, it did not. Of course, everything was working fine on our computers.
Maybe his computer has a virus? He tries it on a Mac, same results. Probably not a virus then. We double check the he is using the same URL. We clean his entire browser cache, try two other browsers. Works on our machines (tried several ones in fact), and behaves very strangely on his.
What is the big difference between his machines (two different operating systems), and ours? Well, we as developers have the developer runtime of Silverlight 4, and he has the end user runtime. So, I make him install the developer runtime. Suddenly everything starts to work. Ouch. This means that there is a difference (and a big difference!) in how the developer and the end user runtime behaves. No good. Microsoft shouldn’t allow this to happen! We install a fresh new Windows 7 on a virtual machine, install only the Silverlight end user runtime on it – and surely enough, we see the same strange behavior.
So, we know where to look for the little bugger, but have no idea what it is. The project is pretty complex, and with some things working and some not, it is very hard to find. In addition to that, you cannot debug an end-user runtime – the debugging support is missing from it in order to reduce the size of the download. How to get to the end of this?
I contacted Microsoft, and after I showed them the url that behaves differently in the two runtimes, they jumped onto the problem after confirming the issue themselves. However, before we could get clearance from our customer to send source code to Microsoft, another catastrophe stroke.
In another project, we had a similar error. Displaying of a number from a server side xml file worked for us, and it did not work for those with end-user runtime (always showed 0). Fortunately, this part of the application was a lot simpler, and although the problem did not affect states (as in the other app), we hoped that we just caught a different manifestation of the same bug.
As we could not attach a debugger to the end-user runtime (which was the only one with the bad behavior), we started to add trace logs. We added debug lines to every statement in the code path from the xml file to the viewmodel, from which the number was displayed via binding. We speculated that it may be an xml parsing bug – that would explain both the problems with displaying the number from the xml document, and some strange xml parsing bug could also cause wrong interpretation of states in the first app.
The trace lines came back with good results. Meaning, it did not show any signs of xml parsing errors, and the number to be displayed was passed along very nicely to the viewmodel. The prime suspect now is the binding mechanism. But there is another number displayed next to the wrong one, which also comes from the xml, which is also integer, and which works as it should. We were clueless again, but determined to help the Silverlight team by narrowing down the issue as much as possible.
Then we noticed, that the display was working well again on the end-user runtime! WTF? We only added a dozen trace lines to the app, none of which could have any effect on the number we were displaying. But still, after removing the trace calls, things are not working anymore. This situation looks like a living proof of Schroedinger’s theory in modern computer science. Well, not really, but I wanted to have this picture in the post:
OK, so now we start removing trace lines. The plan is to find the one line that fixes the problem when it is there, and without which the number does not get displayed. We ended up with the trace line in the property setter for the number that is displayed. The code for the property looked like this:
public int TotalCount
{
get { return _totalCount; }
set { PropertySetter<int>(ref _totalCount, value); }
}
PropertySetter is a method from the ViewModelBase class, and it is how the project handles the INotifyPropertyChanged calls. Here is the code:
protected bool PropertySetter<T>(ref T p, T newValue)
{
StackFrame frame = new StackFrame(1);
StackTrace trace = new StackTrace(frame);
string name = frame.GetMethod().Name.Substring(4);
return PropertySetter(ref p, name, newValue);
}
protected bool PropertySetter<T>(ref T p, string propertyName, T newValue)
{
if (CompareValues(ref p, newValue))
{
p = newValue;
RaisePropertyChanged(propertyName);
return true;
}
else
{
return false;
}
}
What happens here is that we have achieved a one line way of implementing property setters, including raising the PropertyChanged event, even with optimizations to only raise it when the value actually changes. It goes up in the stack to get the name of calling method (which is the property setter), finds out the property name from it, and even sets the value of the property’s backing field. Pretty clever solution to an old problem.
So, dear reader, at this point you have every information we had to find the bug. Again, the symptom was that the application only displayed 0 for the TotalCount. Where is the bug? Why is everything OK, when we add a trace line to the property setter of TotalCount? Do we blame Microsoft? Do we blame us? Do we blame the stars? And most importantly, can we fix the bug without waiting for Microsoft to release a fixed end-user runtime?
Here, have a coffee to help you think, before I tell you who the killer was.
If you guessed that the problem is with the StackTrace approach, you were correct. The thing is, the Just-in-time compiler has some serious optimizations. For example, it inlines methods under certain circumstances. When the TotalCount property was set, the property setter method got inlined. Therefore, walking up the call stack by one did not result in the _setTotalCount method that the property setter was expecting. It returned the name of the method that called the setter. Because of this, the property name for the NotifyPropertyChanged method was wrong, and thus the Silverlight runtime never got notified that TotalCount has been set. Silverlight kept displaying the default (0) value for TotalCount. But for other properties, where inlining did not happen because the calling method was more complex, all worked perfectly.
And why did the adding of the trace line fix the behavior? Well, it seems like inlining does not happen when there are more than one statements in the method that is being called.
We quickly changed the NotifyPropertyChanged implementation to another one which did not rely on call stack, and voilá, everything was back to normal, on both projects. Call stack has been found guilty, and is now behind bars.
The moral of the story? Don’t rely on call stacks, unless you know what you are doing. And even then, think twice.
Do you think this is a bug that Microsoft should fix? Did you find the bug without peeking? Do you think we were stupid to use this approach in the first place? Let me know!
Posted
May 06 2010, 07:33 PM
by
vbandi