troubleshoot – The Page Not Found Blog

Everyone’s internet connectivity experience is unique, and it can vary from minute to minute. Most internet users can sense slowdowns, and everyone can identify when a connection fails. Web developers absolutely rely on a working connection to build web pages, so when our internet connection goes down, our productivity comes to a halt.

I’ve lost count of the number of times I’ve reported to various tech support organizations that I wasn’t able to reach a particular website or web service, only to be told by the tech: “I was able to reach it just fine.” This happened again today when I called my DSL provider to tell them our internet service had gone down completely and then come back degraded to about 1/10 of what we were paying for (~1.12 Mbps on a 12 Mbps service). They told me that the line was stable, although I’m not really sure what “stable” means. The speed then gradually climbed back to normal over the next hour and a half. This has happened about half a dozen times over the last three months.

As a web developer, you load web pages up to several hundred times per day. I almost always have monitoring tools hooked up that give the exact time to download a page and its associated elements, so I have a good idea of when the internet is performing well and when it isn’t. Because of this, I’ve become sensitized to small, millisecond-level changes in download times.

I also gained extensive knowledge of internet connections when working on high availability systems with up to five-nines uptime. We deployed systems that monitored web traffic all over the U.S. 24×7. I was amazed to see that internet traffic was very much like our roadways. Sometimes traffic is moving fast, other times it’s slow in spots, and sometimes it’s completely stopped or even re-routed.

In many cases, a modem (or router; I’m using the terms interchangeably) simply locked up. This is quite common, as these devices often run a small Linux-based operating system that can occasionally flake out. But in the cases where my DSL modem/wireless router hadn’t died and there was still no internet connection, I can say with certainty that 9 times out of 10 the problem was upstream with the carrier.

Guidelines. So, here are some guidelines for helping you narrow down where the problem might be:

– Check the modem connectivity lights. Usually if a modem is connected to the internet, the connectivity light will be a steady or flickering green. Red or no connectivity light almost always means no connection. It should be a matter of reflex to simply restart the modem and see if that fixes the problem.

– If the internet connectivity light doesn’t come back after restarting the modem, then call tech support.

– On rare occasions (1 out of 10), restarting the server plus the modem restored connectivity.

– Still no service? You can go get a cup of coffee then come back later and recheck.

– Or, if the internet connection light is green, try blowing away the browser cache and reloading the page. Sometimes old versions of pages can stick in the cache.

– Can you load any other websites? If you can, then your particular server or service is most likely down.

– Can you ping the server (for servers that allow ping)? This tells you whether the server has basic connectivity.

– Can you run a tracert (traceroute on macOS and Linux)? It lets you look at the connectivity, hop by hop, between you and the remote server.

– Document the problems so you have a record for future reference.

– If you need continuous monitoring with alert thresholds, then evaluate continuous monitoring tools such as those from Paessler.

– If you know how to get the basic troubleshooting out of the way, or if you’ve already done it, then insist on escalation when you call tech support. You need to get back to coding as fast as possible.
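The checklist above can be roughed out in code. Here is a minimal Java sketch, not a definitive tool: the class name ConnectivityTriage and both method names are hypothetical, and the wording of each diagnosis is my own paraphrase of the guidelines. It uses a plain TCP connect as a stand-in for "can I reach it?", since many servers don't answer ping. The modem light still has to be checked by eye, so main() simply assumes it is green.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper; the names are illustrative and not from any library. */
final class ConnectivityTriage {

    /** Tries a plain TCP connect. Returns false on timeout, refusal, or DNS failure. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** Maps the checklist answers to the most likely next step. */
    static String diagnose(boolean modemLightGreen, boolean otherSitesLoad, boolean targetReachable) {
        if (!modemLightGreen) {
            return "Restart the modem; if the light stays off, call tech support";
        }
        if (targetReachable) {
            return "Connection looks fine; clear the browser cache and reload";
        }
        if (otherSitesLoad) {
            return "The target server or service is most likely down";
        }
        return "Probably an upstream carrier problem; run a traceroute and escalate";
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: ConnectivityTriage <target-host> <known-good-host>");
            return;
        }
        // Assumption: both hosts serve HTTPS on port 443, and the modem light is green.
        boolean targetReachable = canConnect(args[0], 443, 3000);
        boolean otherSitesLoad = canConnect(args[1], 443, 3000);
        System.out.println(diagnose(true, otherSitesLoad, targetReachable));
    }
}
```

Run it with the host you can't reach and one you normally can, and paste the result into your notes so you have a record when you call tech support.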

This post is a continuation of a previous post I wrote about best practices for using onResume(). I found a particularly testy bug that cost me two painful hours to track down. The tricky part was that it would only show up when there was no debugger attached. Right away this told me it was a threading problem. I suspected that the debugger slowed things down just enough for all the threads to complete in the expected order, which was not the order they actually completed in when the device ran in stand-alone mode.

The test case. This is actually a very common workflow, and perhaps so common that we just don’t think about it much:

1. Cold start the application without a debugger attached. By cold start I mean that the app was in a completely stopped, non-cached state.
2. Minimize the app like you are going to do some other task.
3. Open the app again to ensure that onResume() gets called.

Now, fortunately I already had good error handling built in. I kept seeing, both in logcat and in a toast message, that a java.lang.NullPointerException was occurring. What happened next was troubleshooting a multi-threaded app without the benefit of a debugger. Not fun. I knew I had to do it because of the visibility of the use case. I couldn’t let this one go.

How to narrow down the problem. The pattern I used to hunt down the bug was to wrap each line of code or code block with Log messages like this:

Log.d("Test","Test1");
setLocationListener(true, true);
Log.d("Test","Test2");

Then I used the following methodology, starting inside the method where the NullPointerException was occurring. I did this step-by-step, app rebuild by app rebuild, through the next 250 lines of related code:

1. Click debug in Eclipse to build the new version of the app, including any new logging code as shown above, and load it on the device.
2. Wait until the application is running, then shut down the debug session through Eclipse.
3. Restart the app on the device. Note: the debugger was shut down, so it wouldn’t re-attach.
4. Watch the messages in Logcat.
5. If I saw one message, such as Test1, followed by the NullPointerException with no test message after it, then I knew the code between those two log calls was the offending code block, method or property. If it was a method, then I followed the same pattern through the individual lines of code inside that method. This was very much like step-through debugging, except done manually. Ugh.
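If you save the Logcat output to a file, the "which marker came last?" check can be automated. This is only a sketch in plain Java: the class LastMarkerFinder and its method are hypothetical names, the sample log lines in main() are invented for illustration, and it assumes each marker line contains the tag text and that the crash line contains the exception's class name.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not part of the Android SDK. */
final class LastMarkerFinder {

    /**
     * Returns the last line mentioning the tag that appears before the first
     * NullPointerException line, or null if there is no such marker or no crash.
     */
    static String lastMarkerBeforeCrash(List<String> logLines, String tag) {
        String lastMarker = null;
        for (String line : logLines) {
            if (line.contains("java.lang.NullPointerException")) {
                return lastMarker; // the code right after this marker is the suspect
            }
            if (line.contains(tag)) {
                lastMarker = line;
            }
        }
        return null; // no crash found in these lines
    }

    public static void main(String[] args) {
        // Invented sample lines, only to show the shape of the input.
        List<String> lines = Arrays.asList(
                "D/Test: Test1",
                "W/System.err: java.lang.NullPointerException");
        System.out.println(lastMarkerBeforeCrash(lines, "D/Test"));
    }
}
```

With these sample lines, Test1 is the last marker logged before the exception, so the call that sits between the Test1 and Test2 log statements is the place to dig in next.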

What caused the problem? As time went on, I was surprised that I had to keep going deeper and deeper into the code, and I became very curious. It turned out to be a multi-threading bug in a third-party library that wasn’t fully initialized, even though its method for checking whether initialization was complete said it was. The boolean state property was plainly wrong: one portion of the library wasn’t taken into account when declaring initialization complete, and I was trying to access a property that wasn’t initialized yet. Oops…now that’s a bug.

The workaround? To work around the problem I simply wrapped the access to the offending property in a try/catch block. Then, using the pattern I described in the previous blog post, I kept running verification checks until this property was correctly initialized, or gave up after a certain number of attempts. This isn’t 100% ideal, yet it let me keep going forward with the project until the vendor fixes the bug.
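In rough form, the workaround looks something like the sketch below. It is not the actual project code: GuardedRead and readWithRetry are hypothetical names, and the "library" in main() is simulated with a counter. It is written as plain Java with Thread.sleep so it runs anywhere; in an Android app you would more likely schedule each retry with something like Handler.postDelayed rather than sleeping on the main thread.

```java
import java.util.function.Supplier;

/** Hypothetical helper showing a try/catch plus bounded-retry pattern. */
final class GuardedRead {

    /**
     * Calls read up to maxAttempts times. Returns the first non-null value,
     * or null if every attempt threw a NullPointerException or returned null.
     */
    static <T> T readWithRetry(Supplier<T> read, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                T value = read.get();
                if (value != null) {
                    return value;
                }
            } catch (NullPointerException e) {
                // The library is not fully initialized yet; fall through and retry.
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulated library property that only becomes readable on the third access.
        int[] calls = {0};
        String value = readWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new NullPointerException("not initialized");
            }
            return "ready";
        }, 5, 10);
        System.out.println(value + " after " + calls[0] + " attempts");
    }
}
```

Capping the number of attempts matters: if the property never initializes, the caller gets a null it can handle instead of a loop that never ends.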

Lessons Learned. I’ve done kernel-level debugging on Windows applications, but I really didn’t feel like learning how to do it with one or multiple Android devices. I was determined to try and narrow down the bug using the rather primitive tools at hand. The good news is it only took two hours. For me, it reaffirmed my own practice of implementing good error handling, because I knew immediately where to start looking. I had multiple libraries and several thousand lines of code to work with. And, as I’ve seen before, there are some bugs in Android that simply fail with little meaningful information. By doubling down and taking it step by step, I was able to mitigate a very visible bug.

Tag: troubleshoot

The Art of Internet Connectivity

Troubleshooting multi-threading problems related to Android’s onResume()