When things go wrong on Linux

When something goes wrong on a Linux box, we often need to get it fixed as soon as possible.

Some people might know the "free" command, or perhaps "top", but there are many useful tools that will help you find out exactly what's going on.

Before I list them, it's important to mention that decent logging and server metrics matter too.
I use Munin for my server stats (as well as some of the more common daemon stats such as Apache, MySQL, and Postgres). After it's been running for a few days, you start to get a good idea of what "normal" is on your server. Then, if you see a sudden jump in disk wait times, for instance, you might know that it occurred when you updated package x to version y. It's also very useful for when you have a problem, you start investigating, and you wonder "what is our usual value of x".
I often configure syslog to go to another server. This is useful in case something bad happens, and logs can't be written to local disk.
If you're particularly concerned, you can also configure kernel dumps to be written to a remote server, but that's not something I do as standard (only when trying to track down a particular issue).

Here are some of the tools I use on a fairly regular basis.

Sysstat. Sysstat is actually a collection of utils, including sar, which records the stats of the system over time (the last 30 days here; how much history is kept is configurable).

iostat. This gives details about the system IO. This is very helpful in finding out if your problem is IO based (i.e. the system is suffering because the disks aren't able to service reads or writes quickly enough).
I commonly run it as iostat -xd /dev/sda 5. Remember that with all of the sysstat commands, the first output is the average since the system was booted, and not the live results. You need to wait for the second result to appear to see the current stats.
If the %util is up around 100, your disks are struggling.
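As a rough sketch of that check, the filter below reads iostat -x output and prints any device whose %util is at or above a threshold. It skips the first report, since that one is the since-boot average. The function name busy_disks and the default threshold of 90 are made up for illustration, and it assumes the header row starts with "Device" and has a %util column, as sysstat's iostat prints.

```shell
#!/bin/sh
# busy_disks [THRESHOLD]: read `iostat -x` output on stdin and print
# "<device> <%util>" for devices at or above THRESHOLD (default 90).
# Hypothetical helper, not part of sysstat.
busy_disks() {
    awk -v limit="${1:-90}" '
        # Header row: count the report and find the %util column.
        /^Device/ {
            reports++
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        # Data rows: ignore the first report (average since boot).
        col && reports >= 2 && NF >= col && $col + 0 >= limit + 0 {
            print $1, $col
        }
    '
}

# Example (left commented out because it runs until interrupted):
#   iostat -xd 5 | busy_disks 90
```

One busy interval isn't proof of a problem; it's the devices that keep showing up that deserve a closer look.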

mpstat. This gives CPU info. The same numbers are available in lots of other tools, so it's less essential.
Ideally, you don't want lots of %sys CPU, as that means Linux is using a lot, and there isn't as much left for %usr (your stuff). Fragmented files, for instance, will cause Linux to use a lot of CPU seeking all the bits of the file. BitTorrent is bad for creating fragmented files, unless you configure your client to allocate the whole file at the start. If you want to see if a file is fragmented, run mpstat 3 while copying that file to /dev/null - if the %sys goes high, the file is likely fragmented, and you'll have to burn processor cycles every time you read it. Solution? Copy it to a new file, and then delete the original.

vmstat. This gives virtual memory (swap) stats. If your box is low on memory, it might start using swap as "extra" memory. This might cause your disks to have to work hard, which might cause disk requests from other programs to be really slow. Look for the si/so (swap in, swap out) numbers.
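To avoid staring at the si/so columns, something like the following could flag the samples where swapping actually happened. It's only a sketch: swap_activity is a hypothetical name, and it assumes the usual vmstat layout, where the column-name row starts with "r b" and contains si and so. As with the sysstat tools, the first sample is the since-boot average, so it gets skipped.

```shell
#!/bin/sh
# swap_activity: read vmstat output on stdin and report samples with
# non-zero swap-in or swap-out. Hypothetical helper for illustration.
swap_activity() {
    awk '
        # Column-name row (vmstat reprints it now and then): find si and so.
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) {
                if ($i == "si") si = i
                if ($i == "so") so = i
            }
            next
        }
        # Skip the banner row and anything seen before the column names.
        !si || $1 !~ /^[0-9]+$/ { next }
        # The first sample is the average since boot, so ignore it.
        ++samples == 1 { next }
        $si + $so > 0 { printf "sample %d: si=%s so=%s\n", samples, $si, $so }
    '
}

# Example (commented out because it runs until interrupted):
#   vmstat 5 | swap_activity
```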

A good "get an overview of your system" app is atop. Well worth a look.

Memory

Basic stats come from free -m.
Because free memory is wasted memory, Linux "borrows" any free memory to cache files with, to avoid slow disk accesses. So if you see that you only have 93MB free, with 2.1GB being "used" by cache, this isn't an issue, because you actually have 2.2GB available.
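The arithmetic above can be sketched against /proc/meminfo, which is where free gets its numbers. The function name available_mb is invented for this example. It prefers the kernel's own MemAvailable estimate when that line is present (newer kernels provide it) and otherwise falls back to free + buffers + cache, which is an approximation rather than an exact figure.

```shell
#!/bin/sh
# available_mb: read /proc/meminfo-style text on stdin and print, in MB,
# roughly how much memory applications could still use.
# Hypothetical helper for illustration.
available_mb() {
    awk '
        $1 == "MemFree:"      { free = $2 }
        $1 == "Buffers:"      { buffers = $2 }
        $1 == "Cached:"       { cached = $2 }
        $1 == "MemAvailable:" { avail = $2; have_avail = 1 }
        END {
            # Values in /proc/meminfo are in kB.
            kb = have_avail ? avail : free + buffers + cached
            printf "%d\n", kb / 1024
        }
    '
}

# Example:
#   available_mb < /proc/meminfo
```

Fed the numbers from the example above (93MB free, about 2.1GB cached), the fallback path gives a figure close to 2.2GB.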
In my opinion, there's no point using swap if you have free (or cache) memory available. I configure my boxes to not use swap unless they have to by setting the swappiness to 5.
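One way to set that, as a sketch: change the running value with sysctl -w, then write the same setting to a file so it survives a reboot. The function name set_swappiness and the default path /etc/sysctl.d/99-swappiness.conf are assumptions for this example; some systems keep such settings in /etc/sysctl.conf instead. It needs root to do anything useful.

```shell
#!/bin/sh
# set_swappiness VALUE [CONF_FILE]: apply vm.swappiness now and record it
# for the next boot. Hypothetical helper; the default CONF_FILE is an
# assumption and may not suit every distribution.
set_swappiness() {
    value=$1
    conf=${2:-/etc/sysctl.d/99-swappiness.conf}
    case $value in
        '' | *[!0-9]*)
            echo "usage: set_swappiness VALUE [CONF_FILE]" >&2
            return 1
            ;;
    esac
    # Change the running kernel first, then persist the setting.
    sysctl -w "vm.swappiness=$value" &&
        printf 'vm.swappiness = %s\n' "$value" > "$conf"
}

# Example (as root):
#   set_swappiness 5
# Check the live value afterwards with:
#   cat /proc/sys/vm/swappiness
```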

Networking

tcpdump/tshark
I use tcpdump a lot. tcpdump -npi any not tcp port 22 is a common incantation (excluding port 22 keeps my own SSH session out of the capture). I often add -w file.pcap to write the output to a file, and view it in Wireshark on my laptop, as it's more pleasant to view. Make sure you capture packets with -s 65535, as some older versions would only capture the first part of each packet (68 or 96 bytes by default), which was annoying.
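Put together, the capture described above might look like this small wrapper. The name capture and the default filename capture.pcap are made up for the example; the tcpdump flags are the ones from the text. It needs root (or equivalent capture privileges) and runs until you hit Ctrl-C.

```shell
#!/bin/sh
# capture [FILE]: write full packets from all interfaces to FILE,
# leaving out SSH traffic. Hypothetical wrapper for illustration.
capture() {
    # -n        don't resolve names
    # -p        don't put the interface into promiscuous mode
    # -i any    listen on every interface
    # -s 65535  keep whole packets rather than just the first bytes
    # -w FILE   write raw packets for Wireshark instead of printing them
    tcpdump -n -p -i any -s 65535 -w "${1:-capture.pcap}" not tcp port 22
}

# Example:
#   capture web-problem.pcap
```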

If you just want a vague overview of the networking on your machine, try iptraf-ng.

Don't forget the classic netstat either. You can see all the connections to your machine, as well as the networking statistics with netstat -s.
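When the connection list is too long to read, a per-state count is often enough to spot trouble (a pile of TIME_WAIT or SYN_RECV, say). A minimal sketch, assuming the usual netstat -tan layout with the state in the sixth column; conn_states is a hypothetical name.

```shell
#!/bin/sh
# conn_states: read `netstat -tan` output on stdin and print the number
# of TCP connections in each state, busiest first.
# Hypothetical helper for illustration.
conn_states() {
    awk '
        # Only tcp/tcp6 rows; the header lines do not start with "tcp".
        $1 ~ /^tcp/ { count[$6]++ }
        END { for (s in count) print count[s], s }
    ' | sort -rn
}

# Example:
#   netstat -tan | conn_states
```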

If your problem is IO

If your problem is IO, and it's being caused by swapping, you have three options.
1. Move your swap partition onto another physical disk. This isn't ideal, and doesn't really solve the problem.
2. Use less physical memory. (Shut some programs down)
3. Have more physical memory. (Buy some more)

If your problem is IO, and it's not swap, you probably want to find what program is doing all the reading and writing to disk. Use iotop for this.
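By default iotop is interactive, which isn't much use if you want to save the output or paste it into a ticket. A sketch of a non-interactive snapshot follows; io_snapshot and its defaults (3 iterations, 5 seconds apart) are just illustrative choices. It generally needs root.

```shell
#!/bin/sh
# io_snapshot [ITERATIONS] [DELAY]: print a few plain-text iotop samples.
# Hypothetical wrapper for illustration.
io_snapshot() {
    # -o  only show processes actually doing IO
    # -b  batch mode (plain text, no curses UI)
    # -n  number of iterations before exiting
    # -d  seconds between iterations
    iotop -o -b -n "${1:-3}" -d "${2:-5}"
}

# Example:
#   io_snapshot 3 5 > iotop.txt
```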

Finally, when you find out which process is causing the problem, if it's something that's normally well behaved, you can take a peek into its gizzards with strace.
strace -f -p <pid>
This will sometimes spew an awful lot of stuff out, so you might want to send it to a file. strace writes to stderr, so use -o file (or 2> file) rather than a plain > redirect.
It's not the perfect tool for this, but it has often given me an insight into a problem.
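Once the trace is in a file, a count of which syscalls appear most often is a quick first pass before reading it line by line. The sketch below assumes the trace was written with -f and -o, in which case each line starts with a pid; the names trace_pid and syscall_counts and the default filename strace.out are invented for the example.

```shell
#!/bin/sh
# trace_pid PID [FILE]: attach to PID and its children, writing to FILE.
# Hypothetical wrapper; stop it with Ctrl-C.
trace_pid() {
    strace -f -p "$1" -o "${2:-strace.out}"
}

# syscall_counts: read an strace output file on stdin and print how many
# times each syscall starts, most frequent first.
syscall_counts() {
    awk '
        {
            line = $0
            sub(/^[0-9]+ +/, "", line)   # drop the pid prefix added by -f -o
            # Count lines that begin a call, e.g. read(3, ...). Signal,
            # exit and "<... resumed>" lines do not match and are skipped.
            if (match(line, /^[a-z_][a-z0-9_]*\(/))
                count[substr(line, 1, RLENGTH - 1)]++
        }
        END { for (s in count) print count[s], s }
    ' | sort -rn
}

# Example:
#   trace_pid 1234 apache.trace
#   syscall_counts < apache.trace
```

strace can also produce its own summary with -c, which may be enough if you only want totals.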

PS. Disks die with boring (and statistically predictable) regularity. Back your stuff up. The stuff you only have on one hard drive you won't have in the future.

calum.org:~#