Procedures for Web Server Troubleshooting

Written by Thomas Tongue. Last Modified August, 2001

1.0 Introduction

This document is a guide for Imagiware employees who are learning system administration and pager duties. It describes how to determine what to do when the pager goes off, and how to restore the system to proper working order.

For the purposes of this document, a "Web Server" generally refers to the combination of hardware, software and infrastructure involved in the live publishing of client Internet content.

2.0 Classifying the problem

There are many different levels of failures that can occur to impair or disable the proper functioning of the web server. These levels can be roughly broken into the following categories:

Minor Service Failure (MSF): The web server is still online, but one or more supporting services (such as Imagistat, billing, Doctor HTML, administration tools, logging, tape backup, SCSI backup, etc.) are not working properly.
Secondary Service Failure (SSF): Web pages are being served, but one or more secondary services (such as email, Real Audio, Secure Server, Chat Server, DNS, etc.) are not working properly.
Primary Service Failure (PSF): Web pages are not being served, but some services are still online, such as ssh, ping, etc.
Total Service Failure (TSF): The machine does not respond to any network or console requests.
Gremlins: A loose collection of ailments that happen beyond the server machine, such as power outages, network connectivity failures, DNS failures and routing problems (to name a few).

The categories above are not perfect, but should guide you to the proper section for troubleshooting procedures and tips. To use the table above, follow the steps below to determine the proper category:

1.) Can you load a web page from the server? All machines are configured to respond to http://hostname.imagiware.com/ so it should not be difficult to determine this. Be sure to clear your web cache before you use this test, to ensure you get a fresh page. If you can load a page, then it's one of the first two categories. MSF and SSF problems are usually pretty subtle, and you will have been alerted to the problem through varying means.

2.) If you can't load a web page, try to ping the server. If you can't ping the server, try to ping another server at the cage. If you are able to reach the other server, but not the one that is giving you trouble, it's probably TSF. If you can ping the troublesome server, try to ssh to the machine and diagnose the problem as a PSF. If you can't ssh into the machine, or can't reach it (or its neighbors) with ping, it's probably a Gremlins condition.
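The two steps above can be sketched as a small shell function. This is only an illustration under some assumptions: the function names are made up for this example, the ssh probe assumes key-based (non-interactive) login is set up, and the neighbor is any other host at the cage that you pass in. The decision table follows the text as written, including treating "pingable but no ssh" as Gremlins.

```shell
# classify_failure HTTP PING NEIGHBOR SSH
# Each argument is "yes" or "no": did that probe succeed?
classify_failure() {
    if [ "$1" = yes ]; then
        echo "MSF or SSF"      # pages load, so a supporting service is at fault
    elif [ "$2" = yes ]; then
        if [ "$4" = yes ]; then
            echo "PSF"         # ping and ssh work, but no web pages
        else
            echo "Gremlins"    # pingable, but ssh fails
        fi
    elif [ "$3" = yes ]; then
        echo "TSF"             # a neighbor answers, this server does not
    else
        echo "Gremlins"        # nothing at the cage answers
    fi
}

# probe_ok CMD...: print "yes" if the command succeeds, "no" otherwise.
probe_ok() {
    if "$@" >/dev/null 2>&1; then echo yes; else echo no; fi
}

# probe HOST NEIGHBOR: run all four probes (hypothetical wrapper).
probe() {
    classify_failure \
        "$(probe_ok curl -fsS --max-time 10 "http://$1.imagiware.com/")" \
        "$(probe_ok ping -c 1 "$1.imagiware.com")" \
        "$(probe_ok ping -c 1 "$2.imagiware.com")" \
        "$(probe_ok ssh -o BatchMode=yes -o ConnectTimeout=10 "$1.imagiware.com" true)"
}
```

For example, "probe frodo pippin" would classify frodo using pippin as the neighbor. The wrapper always runs all four probes, which is slower than stopping early but keeps the decision logic in one place.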

2.1 Secondary and Minor Service Failure (SSF) (MSF)

Unfortunately, Secondary and Minor Service Failures can be the hardest to fix, because they require an examination of the failed program's operation. There are numerous places where one or more things can go wrong and prevent a program from working properly. The Guide to Server Programs provides an excellent resource for configuration files and common problems with key software on the server. Here are some helpful questions to guide your investigation and repair efforts:

1.) What has changed (aside from the fact that the program doesn't work anymore)? If the program was working yesterday, what has changed in the past 24 hours that might precipitate a failure? This can be something as subtle as file permissions and ownership, but is more likely due to an upgrade or configuration change.

2.) What does this program require to work? Consider the inputs and outputs of the program, including template files. You will have to examine the code and any documentation on the program to determine this.

In some instances, the failure is indicative of a larger problem (such as long-term memory leaks). If the program has been running for more than one to two weeks without being restarted, you should try to halt the program, then start it from scratch. For example, the chat server (which is written in Java) has been known to malfunction after running for several weeks. Killing the chat server process and then starting it fresh often fixes whatever problem has cropped up.

If the machine itself has been up for more than a month, an overall memory leak or library corruption problem may be the cause. If the machine has been up for more than 30 to 40 days, a two-minute reboot may be in order.
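A quick way to check the uptime rule of thumb is sketched below. It assumes a Linux /proc/uptime file; the function names and the default limit of 30 days are choices made for this example, not an existing tool.

```shell
# uptime_days [FILE]: whole days since boot. The first field of
# /proc/uptime is the uptime in seconds.
uptime_days() {
    awk '{ print int($1 / 86400) }' "${1:-/proc/uptime}"
}

# reboot_suggested DAYS [LIMIT]: succeed if DAYS exceeds LIMIT
# (LIMIT defaults to 30, the lower bound mentioned above).
reboot_suggested() {
    [ "$1" -gt "${2:-30}" ]
}

# Usage:
#   if reboot_suggested "$(uptime_days)"; then
#       echo "up more than 30 days, consider a reboot"
#   fi
#
# For a single program, ps can report how long it has been running:
#   ps -o etime= -p PID
```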

2.2 Primary Service Failure (PSF)

This is often a problem with the functioning of Apache (httpd) itself. Below are some of the common problems and fixes that we have encountered so far:

Symptom Solutions/Explanations
Symptom: Apache is hung while trying to start or restart.
Explanation: Apache is unable to resolve one or more domains in the httpd.conf. You literally have to wait about a minute or so for Apache to time out its DNS request and inform you what domain has failed to resolve. If there is more than one of them, then it's a minute or so for each domain (it does not appear to run the requests in parallel). The solution is to place the domain and its www.domainname.(com|org|...) counterpart in /etc/hosts, where it should have been all along.

Symptom: Apache crashes or exits while trying to start or restart.
Explanation: Usually when this occurs, you will get an error message that might indicate where the problem is. Normally it's because the configuration files for the server have changed recently and now something is busted. The configuration files are located in /etc/httpd/conf. The most commonly changed files are vh.conf and httpd.conf, but you can check the modification time of the files using ls -la.

The solution is usually to fix the configuration, or comment out the offending area in the configuration file.
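Both startup problems above can be checked before you restart Apache. The sketch below lists the ServerName and ServerAlias hosts in a configuration file that have no entry in a hosts file. The function names are invented for this example, and it assumes the domains are declared with those two directives.

```shell
# server_names CONF: print each ServerName/ServerAlias host in CONF,
# with any trailing :port removed.
server_names() {
    awk 'tolower($1) == "servername" || tolower($1) == "serveralias" {
             for (i = 2; i <= NF; i++) print $i
         }' "$1" | sed 's/:[0-9]*$//' | sort -u
}

# missing_from_hosts CONF HOSTS: print the names from CONF that do not
# appear as a host name on any non-comment line of HOSTS.
missing_from_hosts() {
    server_names "$1" | while read -r name; do
        awk -v n="$name" '
            $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$2" || echo "$name"
    done
}

# Usage:
#   missing_from_hosts /etc/httpd/conf/vh.conf /etc/hosts
#
# Configuration files changed in the last 24 hours:
#   find /etc/httpd/conf -type f -mtime -1
#
# If apachectl is installed, this checks the syntax without restarting:
#   apachectl configtest
```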

Symptom: Apache dumps core (segmentation fault).
Explanation: Before proceeding too far, check to see if any of the reboot criteria discussed in the MSF/SSF section are met. If the machine (or Apache) has been up for a long time, it may be necessary to reboot. Clearly, if this problem shows up right after a reboot, you have to look elsewhere.

This is very bad. One possibility is that Apache has recently been upgraded or otherwise recompiled (and is broken), in which case you must fall back to an older version (which should be available as an RPM, or a date-indexed compiled binary).

The other possibility is that a library which Apache needs has been upgraded or recompiled (you can find out what libraries a binary requires using ldd filename, where filename is the binary file that you want to check). This is also very bad, since falling back to an older library can be more difficult. Confirm that the dependent libraries really have changed recently, and if they have, then you have two options:

1.) Recompile Apache with the new libraries.
2.) Fall back to older versions of the libraries, and figure out what's going on.

Ultimately, you have to re-create the conditions under which Apache used to work. If that means older binaries, libraries, etc., so be it.
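To confirm whether the libraries Apache depends on really have changed recently, something like the following may help. It is a sketch: the function names are made up here, and the path to the httpd binary in the usage note is an assumption.

```shell
# required_libs BINARY: print the paths of the shared libraries that
# ldd resolves for BINARY (lines of the form "name => /path (addr)").
required_libs() {
    ldd "$1" | awk '$2 == "=>" && $3 ~ /^\// { print $3 }'
}

# recently_changed DAYS FILE...: print each FILE modified within the
# last DAYS days. -L follows symlinks, because library names are
# usually symlinks to the real file.
recently_changed() {
    days=$1
    shift
    for lib in "$@"; do
        find -L "$lib" -prune -mtime "-$days" -print
    done
}

# Usage (the httpd path is an assumption):
#   recently_changed 7 $(required_libs /usr/sbin/httpd)
```

An empty result suggests the libraries have not changed in that window, which points back at Apache itself.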
Symptom: Disk partition is full.
Explanation: If you get a disk-full page, you'll need to log into the complaining server. A quick df will show you the mounted partitions and how full they are. The pager only complains if a disk is more than 95% full. If the full partition is /backup or /trash, the problem is with the (external) backup disk filling when the backup program runs. Typically this happens when a large number of files are deleted in one day (e.g. if a client is deleted) and /trash fills with the backups of all of that data. We usually end up deleting an old day's trash folder (see /trash for dated folders). If /backup is full, that's harder, because that's supposed to be a complete mirror. Fixing that may require reconfiguring the backup program to use the backup partitions differently. Killing the metaRsync or metaBackup program and unmounting the partitions will stop the pager from going off.

If a partition other than /trash or /backup is filling, the situation is a bit more serious. This may be /tmp filling due to a runaway CGI, /var filling due to a denial of service attack filling the system log or something similar. Use df to find the partition and experience plus du to find the problem and fix as necessary.
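The df check can be narrowed to just the partitions over the pager's threshold. This is a sketch with an invented function name; it reads the POSIX output format of df -P, where the fifth column is the percentage used and the sixth is the mount point, and it assumes mount points contain no spaces.

```shell
# full_partitions [LIMIT]: read `df -P` output on stdin and print the
# mount point and usage of every partition more than LIMIT percent
# full (LIMIT defaults to 95, the pager threshold).
full_partitions() {
    awk -v limit="${1:-95}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 > limit + 0) print $6, $5
    }'
}

# Usage:
#   df -P | full_partitions
#
# Then look for the largest directories on the full partition, e.g.:
#   du -sk /trash/* | sort -n | tail -5
```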

2.3 Total Service Failure (TSF)

If you've reached this phase, it means you've tried to remotely communicate with the machine (through ping, ssh, HTTP, etc) and failed, even though neighboring machines are working perfectly. This means that an immediate trip to the cage space in downtown Madison is required to fix the problem.

If you cannot get the console to respond, then you will need to reboot. Shut the power off on the machine, wait a few seconds, then turn it back on again. Follow usual booting procedures outlined below. The disks were not cleanly unmounted, so you can expect a full file system check.

If you can get the console to respond, log in and try to ping the outside world (or perhaps just a neighboring machine). If you can't ping anything, check the network cable between the server and the switch, and restart the network using /etc/init.d/network restart. If you can ping anything outside the local IP block (205.254.196) from the server itself, it's a networking problem; if no other problems present themselves, consult the Network Troubleshooting Guide. If you still can't ping anything, reboot the machine by hitting CTRL-ALT-DEL. Then follow usual booting procedures.
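The console ping test can be sketched as below. The function names are invented for this example, and the sketch assumes the local block is every address beginning with 205.254.196. You supply the two addresses to ping; none are built in.

```shell
# in_local_block IP: succeed if IP is inside the local 205.254.196 block.
in_local_block() {
    case "$1" in
        205.254.196.*) return 0 ;;
        *)             return 1 ;;
    esac
}

# console_check NEIGHBOR_IP OUTSIDE_IP: run from the server console.
console_check() {
    if ping -c 1 "$2" >/dev/null 2>&1; then
        echo "outside world reachable: see the Network Troubleshooting Guide"
    elif ping -c 1 "$1" >/dev/null 2>&1; then
        echo "only the neighbor answers (this case is not covered above)"
    else
        echo "nothing reachable: check the cable, restart the network, then reboot"
    fi
}

# Usage:
#   in_local_block 205.254.196.10 && echo "that address is a neighbor"
#   console_check NEIGHBOR_IP OUTSIDE_IP
```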

3.0 Booting Procedures

This section covers the basics of booting (and re-booting) the web server. In most cases, you will need to be at the cage to reboot the server, though for Minor and Secondary Service Failures, you can often reboot remotely.

3.1 Rebooting A Web Server

To reboot a machine through remote ssh access or at the console, login as root and type:
/sbin/shutdown -r now

If you are unable to do the above, you need to be at the console. If the machine is still active, but for some reason you can't login as root, hit CTRL-ALT-DEL and that will attempt a clean reboot process, though some lingering processes might prevent the disks from being cleanly unmounted. Allow at least 20 seconds for it to complete shutting down before considering other measures.

If the console is inactive, and CTRL-ALT-DEL seems to have no effect, hit the reset switch (or shut off the power for 10 seconds), and follow usual booting procedures. This is the worst of all reboots, because the filesystems are now unclean, and must be checked by fsck before the boot process is completed.

3.2 Booting A Web Server

Depending on what state the machine was in when it was shut down, booting the web server can be an easy or difficult task. When the system first powers up, it will run through a token memory test, then handle several BIOS and hardware functions (like initializing the SCSI devices and identifying hard drives, CPU and memory). Finally, it should reach a LILO prompt. You then have a few standard options:

* For a standard boot, simply hit return and the system will follow the default settings. For remote reboots, the prompt will expire after 5 to 10 seconds, and the default image and settings will be used.

* To boot into Single User Mode, hit TAB to get a list of the kernels available. The first one on the list (usually called zImage or bzImage) is the default. Type the name of the kernel image that you wish to use, then a space, and then a capital S. For example:

LILO: zImage S
You really shouldn't need this, and if you do, you must have some idea what else you need to do.

If the filesystems were clean, then the system will take roughly 2 - 3 minutes to boot, most of that time spent initializing the nameserver and setting up IPs.

If the filesystems are not clean, then e2fsck will be run automatically. Normally it can fix any problems it encounters without your intervention, so even if it happens during a remote reboot, it will take a little longer, but still finish without a problem. If there is a serious problem with the filesystem, it will exit and put you at a prompt to run e2fsck manually. Just run:

/sbin/e2fsck /dev/device

Where /dev/device is something like /dev/hda1 or /dev/sda1, etc. Whatever e2fsck was checking when it exited is the device you want to use. While e2fsck is running manually, it may ask you if it should repair or make certain changes to the filesystem. You should answer yes to all its questions (only black-belt file system administrators would be able to fix the problem without saying "yes").

When e2fsck is finished running manually, type "exit" to reboot the machine and go back to the beginning of the boot procedure.
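If you would rather not answer each question by hand, e2fsck accepts -y to assume "yes" to everything, and its exit status tells you how the check went. The status is a sum of flag values, which the sketch below decodes. The function name is made up for this example, and /dev/device is the same placeholder as above.

```shell
# describe_fsck_status CODE: explain an e2fsck exit status, which is a
# bitwise sum of the conditions listed in the e2fsck manual page.
describe_fsck_status() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    if [ $((code & 1)) -ne 0 ]; then echo "errors corrected"; fi
    if [ $((code & 2)) -ne 0 ]; then echo "errors corrected, reboot needed"; fi
    if [ $((code & 4)) -ne 0 ]; then echo "errors left uncorrected"; fi
    if [ $((code & 8)) -ne 0 ]; then echo "operational error"; fi
    if [ $((code & 16)) -ne 0 ]; then echo "usage or syntax error"; fi
    if [ $((code & 32)) -ne 0 ]; then echo "canceled by user"; fi
    if [ $((code & 128)) -ne 0 ]; then echo "shared library error"; fi
    return 0
}

# Usage:
#   /sbin/e2fsck -y /dev/device
#   describe_fsck_status $?
```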

3.3 Tests Once Booted

Though there are automatic scripts which monitor the system status, it's reassuring to make a few checks before considering the boot process complete.

Check Apache: You can do this by loading a web page that is hosted on the server, or by using 'tail -f' on one of the live log files. Try:

tail -f /www2/global_logs/global_log

On titan or janus, use:

tail -f /www1/logs/cyberdiet/access_log

If hits are rolling in, then Apache is (probably) running properly.

Check Kernel Log: By using 'tail -f /var/log/kernel' you can see any kernel level error messages, including file system errors. You can also run dmesg to see this information. If there are any file system or GPF (general protection fault) errors, it's necessary to reboot the machine.

Check SCSI Backup Disk: If you groped around the machines at all, it's possible one of the SCSI cables came loose. On all machines except janus, make sure the external backup disk is still accessible with:

/var/adm/www/bin/maintenance/backup/metaFsck

If this has been run recently, the program will just zip through. If it's been a long time since it was run, you may have to wait for the disks to have a full fsck done (this doesn't take too long). If there are any errors, find out why the disks are unreachable.
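The kernel log check can be reduced to a filter that prints only the worrying lines. This is a sketch: the function name is invented, and the two patterns are assumptions about how ext2 file system errors and general protection faults are usually worded in kernel messages, so adjust them to what your kernel actually prints.

```shell
# kernel_trouble: read kernel messages on stdin and print the lines
# that look like file system errors or general protection faults.
# Succeeds only if at least one such line was found.
kernel_trouble() {
    grep -i -E 'EXT2-fs error|general protection fault'
}

# Usage:
#   if dmesg | kernel_trouble; then
#       echo "kernel errors found, a reboot is needed"
#   fi
```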

4.0 Pager Code Quick-Reference

The pager code is supposed to give you some idea of the source of the problem. You should have a wallet card with the following information, but we include it here just in case.

The pager code will look something like 271-2711XXX, indicating the page is from an Imagiware server. The last three digits indicate, in order, the host number of the machine sending the page, the host number of the machine it's complaining about, and the error code. For example, a pager code of 121 means the pippin server cannot reach HTTP on frodo. This may mean a network problem on pippin, or a problem with Apache on frodo. The best place to start is to try loading a Web page on frodo. A code of 332 means venus thinks one of its disk partitions is dangerously full. In that case, you'll need to log into venus and see which partition is filling.

Code  Host    Error
0     ---     Net
1     Pippin  HTTP
2     Frodo   Disk
3     Venus   Frontpage HTTP
4     Zephyr  ----
5     Titan   ----
6     Janus   ----
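The table can be turned into a small decoder for when the wallet card is not handy. This is a sketch with invented function names; it accepts either the full pager number or just the last three digits.

```shell
# host_name N: host for host number N, from the table above.
host_name() {
    case "$1" in
        0) echo "---" ;;
        1) echo pippin ;;
        2) echo frodo ;;
        3) echo venus ;;
        4) echo zephyr ;;
        5) echo titan ;;
        6) echo janus ;;
        *) echo "unknown host $1" ;;
    esac
}

# error_name N: error for error code N, from the table above.
error_name() {
    case "$1" in
        0) echo "Net" ;;
        1) echo "HTTP" ;;
        2) echo "Disk" ;;
        3) echo "Frontpage HTTP" ;;
        *) echo "unknown error $1" ;;
    esac
}

# decode_pager_code CODE: decode the last three digits of CODE as
# sending host, host complained about, and error code.
decode_pager_code() {
    digits=$(printf '%s' "$1" | tr -cd '0-9')
    if [ "${#digits}" -lt 3 ]; then
        echo "need at least three digits" >&2
        return 1
    fi
    last3=${digits#"${digits%???}"}   # keep only the final three digits
    sender=${last3%??}
    rest=${last3#?}
    target=${rest%?}
    error=${last3#??}
    echo "$(host_name "$sender") reports $(host_name "$target"): $(error_name "$error")"
}

# Usage:
#   decode_pager_code 271-2711121
```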