10 hard-earned hints on how to increase your website uptime

Here are some tips on how to keep your website up and running from a hardware perspective. It’s the often overlooked or underfunded pieces that bring a website down. I spend all day working with software, but hardware is often a mystery, especially to software developers.

Twitter’s recent outage brought to light an important fact about any website: it can crash. Of course, Twitter’s outage affected perhaps a hundred million people globally, but the point is that your website can go down as well.

Let’s take a look at how uptime is typically calculated on a monthly basis: 30 days × 24 hours = 720 hours. If the website is down for 8 hours during that period, that leaves 712 hours of uptime, and 712 divided by 720 is an uptime of about 98.89%.
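
If you prefer code to arithmetic, here’s the same calculation as a throwaway Python snippet (the function name is just mine, for illustration):

```python
def uptime_percent(hours_in_period: float, hours_down: float) -> float:
    """Return uptime as a percentage of the period."""
    return (hours_in_period - hours_down) / hours_in_period * 100

# A 30-day month is 30 * 24 = 720 hours; 8 hours of downtime:
print(f"{uptime_percent(720, 8):.2f}%")  # prints 98.89%
```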

Now generally that doesn’t sound too bad, and most outages are intentional maintenance windows scheduled for the middle of the night, or for whenever web traffic is expected to be lowest. You start noticing outages when they occur during normal business hours, because that’s when your employees’ or visitors’ productivity takes a hit.

Oftentimes, typical businesses place very little focus on the items below. It’s my contention that those of us who aren’t Twitter can get close to perfect uptime by following these guidelines. Even if you only have a handful of customers who hit your website a few times a week, an outage will reflect negatively on you even if you are contractually covered. With a little work, you can significantly reduce downtime.

  1. Use real-time monitoring tools or build your own (a bare-bones example appears after this list). At a minimum you need to monitor: CPU, available memory, available hard disk space, current inbound bandwidth usage, and the time to complete an HTTP request to at least your home page. Assign staff members who will get emails and text messages if the monitor detects a problem, and expect them to respond within a certain time period, even on holidays and weekends.
  2. If you use a system monitor, host it somewhere other than where your server farm is located. This seems obvious, but I’ve seen the monitor go down with the server because they were on the same machine, or at the same site.
  3. Always, always, always take an image backup of the previous version of your website machine. Yes, I mean an image of the entire machine and not just a copy of the website directory. It’s easy to do, it doesn’t take long, and a snapshot image doesn’t take much storage space these days. With modern software, you should be able to do this in a couple of mouse clicks.
  4. If you don’t use the cloud, keep a cheapo spare server that can be swapped in for your more expensive super, ultra-redundant machine. You can buy a decent quad-core, 16GB RAM server for roughly $400–$500 as an insurance policy.
  5. Have a cheapo spare even if you use virtual machines. I’ve personally seen two instances where every virtual instance shut down completely when the master server blew out its memory. Computer memory is like the engine in your car: when it dies, forward progress comes to a complete halt. I can tell you with certainty that most servers you and I use don’t have redundant memory capabilities.
  6. If your primary server is down for some unknown reason, you have several immediate choices: restart it and hope it comes back up, spin up a copy in the cloud, or direct traffic to your backup server. All of these are great strategies as long as your site isn’t under a denial of service (DoS) attack or getting overwhelming amounts of standard, non-vengeful traffic; you can find that out with the monitoring tools mentioned in #1. Here’s one possible approach (sketched in code after this list). Whatever approach you use, test it and document it:

    – Restart the primary server. If it comes back up, you’re golden.
    – If the primary crashes again, either plug in a hot spare or spin up the standby copy.
    – If the backup copy crashes, roll back to the snapshot image you made.
    – If the rollback snapshot crashes too, you have no choice but to start a detailed investigation of log files and other forensics.

  7. Do your investigation and forensics after you’ve restored service. This should probably be rule number one! Your website visitors will appreciate it. Over the years I’ve been in several situations where the focus was on figuring out why a system went down while users languished with no service. Do your best to get things running again, and then worry about what happened.
  8. I’m going to mention ISP redundancy because it is definitely overlooked. Major hosting providers often contract with multiple internet backbone providers, the primary carriers of all internet traffic, and bring trunks from different carriers into the facility on different sides of the building in case one trunk is accidentally cut or has an outage. Check with your hosting provider for options on this.
  9. Just about everyone will tell you to cache your static content on a CDN. I agree, although to me this is more of a performance issue than an uptime issue, which is why it’s down here at number 9. If you serve huge amounts of static content, a CDN could keep your server from crashing, but for most of us it’s a performance boost.
  10. If you are getting overwhelming amounts of non-vengeful traffic, one technique I’ve used to great success is to serve a simple static HTML page, saying we were experiencing high traffic volumes, to every so many visitors (see the sketch after this list). The simple file took very little overhead to serve, and it immediately and significantly reduced the load on the server. Naturally, when the event was over we took the message down.
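
To make hint #1 concrete, here’s a bare-bones monitor sketch in Python. Everything in it is a placeholder of mine: the URL, the addresses, the thresholds, and the assumption of a local mail relay. A real monitor would also watch CPU, memory, disk, and bandwidth, and would send text messages as well as email.

```python
import smtplib
import time
import urllib.request
from email.message import EmailMessage

URL = "https://example.com/"     # placeholder: your home page
ALERT_TO = "oncall@example.com"  # placeholder: your on-call staff

def check_site() -> float:
    """Time a full HTTP GET of the home page; raises if it fails."""
    start = time.monotonic()
    with urllib.request.urlopen(URL, timeout=10) as resp:
        resp.read()
    return time.monotonic() - start

def send_alert(reason: str) -> None:
    """Email the on-call staff; a real monitor would text them too."""
    msg = EmailMessage()
    msg["Subject"] = "Website monitor alert"
    msg["From"] = "monitor@example.com"
    msg["To"] = ALERT_TO
    msg.set_content(reason)
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
        smtp.send_message(msg)

while True:
    try:
        elapsed = check_site()
        if elapsed > 5.0:                    # arbitrary slow-page threshold
            send_alert(f"Home page took {elapsed:.1f}s to load")
    except Exception as exc:                 # timeout, HTTP error, DNS, etc.
        send_alert(f"Home page check failed: {exc}")
    time.sleep(60)                           # poll once a minute
```

And remember hint #2: run this on a machine that lives somewhere other than your server farm.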
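
The escalation order in hint #6 is easy to express as code. The sketch below is a dry run only: each step function just prints what it would do and returns a made-up result, because the real actions (a remote restart, a load-balancer or DNS change, a hypervisor snapshot restore) depend entirely on your environment.

```python
def restart_primary() -> bool:
    print("Restarting primary server...")  # stand-in for the real action
    return False                           # pretend it didn't come back

def promote_hot_spare() -> bool:
    print("Directing traffic to the hot spare or cloud copy...")
    return True                            # pretend this one worked

def restore_snapshot() -> bool:
    print("Restoring the last-known-good snapshot image...")
    return True

def recover() -> None:
    """Walk the escalation ladder; stop as soon as service is back."""
    for step in (restart_primary, promote_hot_spare, restore_snapshot):
        if step():
            print("Service restored; save the forensics for later (hint #7).")
            return
    print("All recovery steps failed; time for log files and forensics.")

recover()
```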
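
Finally, here’s one hypothetical way to implement hint #10, as WSGI middleware in Python. The every-Nth counter, the 503 status, and the canned message are all my choices for illustration; in practice you might do this at the web-server layer rather than in application code.

```python
import itertools

# Canned page served to every Nth visitor during a traffic spike.
SORRY_PAGE = (b"<html><body><p>We're experiencing unusually high traffic. "
              b"Please try again in a few minutes.</p></body></html>")

def shed_load(app, every_n=5):
    """Wrap a WSGI app so every Nth request gets the cheap static page."""
    counter = itertools.count(1)

    def middleware(environ, start_response):
        if next(counter) % every_n == 0:
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/html"),
                            ("Retry-After", "120")])
            return [SORRY_PAGE]
        return app(environ, start_response)

    return middleware

# Usage: application = shed_load(application, every_n=5)
# Remove the wrapper once the traffic event is over.
```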

I’m not going to address denial of service attacks in depth since I’m not a security expert; there are plenty of papers on the web and experts you can call. I can say I have averted limited DoS attacks in the past by filtering IP addresses and working with an experienced network service provider. But if you are under a widespread DoS attack, get expert help immediately.

[Edited 6/23 – added a tenth hint! Fixed minor typos.]

What do you read for technical news?

It was just a few years ago that I regularly scanned a list of mainstream developer and IT rags on a weekly basis: ComputerWorld, Visual Studio Magazine, SD Times, JDJ, and the list goes on. Then one day early last year it hit me, and I was a bit stunned to realize it, that I’d stopped reading online magazines. The vast majority of my technical info was now coming from blogs, online help docs, and a dizzying number of internet searches. So, what happened? I have some theories I’ve been mulling over for the last year, but I can’t really nail down anything for certain. Most likely it’s a combination of the five factors below.

Rapid Change. My first thought is that technology has been changing so rapidly that simply digesting and understanding the changes takes a huge chunk of time. Huge. I’ve blogged about this a number of times, and the trend cuts across the entire tech industry. The upside is that innovation happens overnight, and fixes as well as new features come out quickly. The downside is that it’s harder for everyone to stay on top of all the changes across features, libraries, SDKs, smartphone operating systems, and browsers.

Super busy. My second thought is that, in addition to staying on top of the über release cycle of web and mobile technologies, I’ve been so busy with project work that I simply had to narrow the scope of what I was reading. It’s a balancing act, and there’s only so much time in the day. Superfluous information seemed to slow me down, or worse, it felt like a distraction from the day’s objectives. And in today’s online environment there is such a huge flow of information that there has to be some mechanism for focusing and filtering the fire hose of inbound data.

Irrelevant Info. My third thought is that every time I try to go back and read the mainstream rags, I find myself sifting through a bunch of stuff that isn’t relevant to my immediate or near-term needs. Like I mentioned above, a good portion of it often seemed superfluous. Don’t get me wrong: online magazines offer well-written and well-thought-out information. But I felt the extra information, or perhaps even information overload in some cases, slowed me down. If it takes time to sift through article after article looking for a specific topic, my inclination is to go back out to a search engine and narrow my search parameters.

Online Search Engines. Search engines have done an excellent job of (rapidly) indexing online technical content; I don’t need to mention them by name because you know all the players. At work we’ve often joked about a pattern we call “coding by search engine.” It goes something like this: copy a class name or error message, paste it into the search bar, and skim through the results. If you have to go more than one page deep in the results, stop and redo the search. Mostly gone are the days of sifting through reams of paper documentation or digging around in some esoteric corner of a vendor’s website; I don’t think most customers will stand for that anymore. More information is instantaneously available at our fingertips now than at any other time in history. It is astonishing, really.

Forums. My final thought is that the voice of the developer community has never been more important. Online forums such as Stack Overflow have come to be perceived as definitive sources on technical questions of all kinds, across all sorts of programming languages. I’ve been in many conversations where, right or wrong, someone interjected with something learned on Stack Overflow or a similar site. These sites are well indexed by search engines, the community can vote answers up or down, and many brilliant and knowledgeable players contribute their knowledge. They are excellent, speed-of-light resources that are freely available.

So, there you have it. That’s my two cents on what I’m reading these days and why I think my reading changed. Leave a comment or email me about how you get your technical info injection. I’m really curious to hear about your experience.


The Largest Conference For Mapping and Geospatial Developers – Esri DevSummit 2012

I’ll be presenting at the Esri DevSummit next week, so if you are attending please swing by my sessions and say “hi”. If you aren’t familiar with Esri or the conference: every spring, about 1,400 developers and other technical experts converge on Palm Springs, California to learn all things technical about building commercial and enterprise geographic information systems. There will be everything from introductory Dojo workshops to deep dives into the heart of our APIs.

If you’re around, here’s my schedule. I’d be very interested to hear about what you are working on:

Monday, March 26

Getting Started with the ArcGIS Web APIs – 8:30am – 11:45am, Pasadena Room. I’ll be presenting the portion related to our ArcGIS API for JavaScript.

Getting Started with Smartphone and Tablet ArcGIS Runtime SDKs – 1:15pm – 4:45pm, Pasadena Room. In this session, I’ll be presenting on our ArcGIS Runtime SDK for Android.

Wednesday, March 28

Flex the World – 10:30am, Demo Theater 2. I’ll be presenting with my esteemed colleague Sajit Thomas on Apache Flex.