Troubleshooting multi-threading problems related to Android’s onResume()

This post is a continuation of a previous post I wrote about best practices for using onResume(). I found a particularly testy bug that caused me 2 hours of pain time to track down. The tricky part was it would only show up when there was no debugger attached. Right away this told me it was a threading problem. I suspected that the debugger slowed things down just enough that all the threads could complete in the expected order, but not the actual order that occurred when running the device in stand-alone mode.

The test case. This is actually a very common workflow, and perhaps so common that we just don’t think about it much:

  • Cold start the application without a debugger attached. By cold start I mean that the app was in a completely stopped, non-cached state.
  • Minimize the app like you are going to do some other task.
  • Open the app again to ensure that onResume() gets called.

Now, fortunately I already had good error handling built-in. I kept seeing in logcat and a toast message that a java.lang.NullPointerException was occuring. What happened next was troubleshooting a multi-threaded app without the benefit of a debugger. Not fun. I knew I had to do it because of the visibility of the use case. I couldn’t let this one go.

How to narrow down the problem. The pattern I used to hunt down the bug was to wrap each line of code or code block with Log messages like this.

setLocationListener(true, true);

Then I used the following methodology starting inside the method were the NullPointerException was occurring. I did this step-by-step, app rebuild by app rebuild, through the next 250 lines of related code:

  1. Click debug in Eclipse to build the new version of the app that included any new logging code as shown above, and load it on the device.
  2. Wait until the application was running, then shutdown the debug session through Eclipse.
  3. Restart the app on device. Note: debugger was shutdown so it wouldn’t re-attach.
  4. Watch the messages in Logcat.
  5. If I saw one message , such as Test1, followed by the NullPointerException with no test message after it, then I knew it was the offending code block, method or property. If it was a method, then I followed the same pattern through the individual lines of code inside that method. This looked very much like you would do with step-thru debugging, except this was done manually. Ugh.

What caused the problem? As time went on, and I was surprised that I had to keep going and going deeper in the code, I became very curious.  It turned out to be a multi-threading bug in a third party library that wasn’t fully initialized even though it had a method to check if initialization was complete. The boolean state property was plainly wrong. This one portion of the library wasn’t taken into account when declaring initialization was complete. And I was trying to access a property that wasn’t initialized. Oops…now that’s a bug.

The workaround? To work around the problem  I simply wrapped the offending property in a try/catch block. Then using the pattern I described in the previous blog post I was able to keep running verification checks until this property was either correctly initialized, or fail after a certain number of attempts. This isn’t 100% ideal, yet it let me keep going forward with the project until the vendor fixes the bug.

Lessons Learned. I’ve done kernel level debugging on Windows applications, but I really didn’t feel like learning how to do it with one or multiple Android devices. I was determined to try and narrow down the bug using the rather primitive tools at hand. The good news is it only took two hours. For me, it reaffirmed my own practice of implementing good error handling because I knew immediately where to start looking. I had multiple libraries and several thousand lines of code to work with. And, as I’ve seen before there are some bugs in Android that simply fail with little meaningful information. By doubling down and taking it step-by-step I was able to mitigate a very visible bug.