Sunday, March 12, 2006

Troubleshooting UNIX Systems with Lsof

One of the least-talked-about tools in a UNIX sysadmin's toolkit is lsof. Lsof lists information about files opened by processes. But that's really an understatement.

Most people forget that, in UNIX, (almost) everything is a file. The OS makes hardware available to applications by way of files in /dev. Kernel, system, memory and device information is made available through files in /proc. TCP/UDP sockets are sometimes represented internally as files. Even directories are really just files containing other filenames.

Lsof works by examining kernel data-structures and provides a variety of information related to files, pipes, sockets and more.

Lsof is installed by default on most Linux distributions, BSD distributions and OS X. Binary packages for Solaris, AIX, HP-UX, *cough*SCO OpenServer*cough* and many other UNIXes (Unices?) are available on the web.

So, just how useful is lsof?

Deciphering its Output

Switch to root, and type lsof on the command line.
linux# lsof
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
init 1 root cwd DIR 3,65 4096 2 /
init 1 root rtd DIR 3,65 4096 2 /
init 1 root txt REG 3,65 29556 172317 /sbin/init
init 1 root mem REG 3,65 1166880 93908 /lib/libc-2.3.5.so
init 1 root mem REG 3,65 103053 93909 /lib/ld-2.3.5.so
init 1 root 10u FIFO 3,65 48438 /dev/initctl
ksoftirqd 2 root cwd DIR 3,65 4096 2 /
ksoftirqd 2 root rtd DIR 3,65 4096 2 /
ksoftirqd 2 root txt unknown /proc/2/exe
events/0 3 root cwd DIR 3,65 4096 2 /
events/0 3 root rtd DIR 3,65 4096 2 /
events/0 3 root txt unknown /proc/3/exe

...SNIP...

syslog-ng 6529 root txt REG 3,69 114132 84690 /usr/sbin/syslog-ng
syslog-ng 6529 root mem REG 3,65 1166880 93908 /lib/libc-2.3.5.so
syslog-ng 6529 root mem REG 3,65 64568 93943 /lib/libresolv-2.3.5.so
syslog-ng 6529 root mem REG 3,65 75176 93924 /lib/libnsl-2.3.5.so
syslog-ng 6529 root mem REG 3,65 103053 93909 /lib/ld-2.3.5.so
syslog-ng 6529 root 0u CHR 1,3 47320 /dev/null
syslog-ng 6529 root 1u CHR 1,3 47320 /dev/null
syslog-ng 6529 root 2u CHR 1,3 47320 /dev/null
syslog-ng 6529 root 3u unix 0xdea00e00 672127 /dev/log

...SNIP...

asterisk 7001 root 10u IPv4 6015 TCP localhost:5038 (LISTEN)
asterisk 7001 root 11r FIFO 3,70 306 /var/run/asterisk/autodial.ctl
asterisk 7001 root 12u IPv4 6834 UDP *:5060
asterisk 7001 root 13r FIFO 0,5 6019 pipe
asterisk 7001 root 14u IPv4 6016 TCP localhost:5038->localhost:32768 (ESTABLISHED)
asterisk 7001 root 15u IPv4 6835 UDP *:2727
asterisk 7001 root 16u IPv4 6861 UDP *:4569
asterisk 7001 root 17u REG 3,70 0 593222 /var/lib/asterisk/astdb
asterisk 7001 root 18r FIFO 0,5 6883 pipe
asterisk 7001 root 19u REG 3,70 39402 32066 /var/tmp/iaxy.bin-1909889093 (deleted)
asterisk 7001 root 20w FIFO 0,5 6883 pipe

...LOTS MORE SNIPPED...

You will be presented with a very long list of open files, which you might want to pipe through your favourite pager.

By default (on Linux), lsof displays the following information about each open file:

  • COMMAND: The name of the UNIX command associated with the process.

  • PID: The Process ID.

  • USER: The user ID or login name of the user to whom the process belongs.

  • FD: The file descriptor number of the file or a code representing more information about the structure. See manual page for details.

  • TYPE: The type of the node associated with the file. E.g. REG signifies a regular file, IPv4 or IPv6 signifies an IP socket, DIR a directory, "unix" a UNIX domain socket, etc.

  • DEVICE: Usually contains major and minor device numbers for the files, or addresses/references for other structures.

  • SIZE: The size of the file or the file offset, in bytes (if available). For files that don't have a true size (e.g., sockets, pipes), lsof displays the amount of content in their kernel buffer descriptors.

  • NODE: Node number / inode / Internet protocol type (TCP) etc.

  • NAME: The name of the file / mount point / device / Internet address / etc.

For a comprehensive description of these fields, refer to the lsof manual page.
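
The columns above are handy for reading, but awkward for scripts, because some of them can be blank. Lsof's -F option prints one field per line instead, each prefixed with a one-letter identifier (p for PID, c for command, f for file descriptor, n for name). Here is a minimal sketch that turns that output into tab-separated rows. The function name lsof_fields_to_tsv is made up for this example, and the sketch assumes the name field is the last one printed for each file, which should hold when only these four fields are requested.

```sh
# Convert "lsof -F pcfn" output (read from stdin) into rows of
# COMMAND <tab> PID <tab> FD <tab> NAME.
# lsof_fields_to_tsv is a hypothetical helper, not part of lsof.
lsof_fields_to_tsv() {
    awk '
        /^p/ { pid = substr($0, 2); next }   # start of a process set
        /^c/ { cmd = substr($0, 2); next }   # command name
        /^f/ { fd  = substr($0, 2); next }   # start of a file set
        /^n/ { printf "%s\t%s\t%s\t%s\n", cmd, pid, fd, substr($0, 2) }
    '
}

# Typical use (needs lsof installed, ideally run as root):
#   lsof -F pcfn | lsof_fields_to_tsv
```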

Since lsof works by examining kernel memory, you will need root access to be able to fully utilize it. A non-root user will not have access to information that belongs to other users.

Common Usage

Lsof is usually run with one or more of the following options:

  • /path/to/file: List processes, owners and open file descriptors that are currently using the specified file.

  • -i [46][protocol][@hostname|hostaddr][:service|port]: List Internet files / sockets.

  • -u name: List files opened by processes belonging to the specified user (login name or user ID).

  • -p pid: List files open by specified process.

  • -t: Terse output. No headers, only PIDs. Useful within scripts.

  • -n: Disable resolving of network names.

  • -N: List NFS files

These options are ORed by default.

Display all internet files OR files opened by user "foobar".
# lsof -u foobar -i

To display all internet files that are opened by foobar, you need to apply the AND (-a) condition between the switches.
# lsof -u foobar -a -i


The following recipes demonstrate how lsof can be used to troubleshoot real-world problems.

Recipe #1: Finding Port Hogs

Your web-server is refusing to come up because port 80 is in use by another process. How do you track down the offending process?

# lsof -i

... SNIP ...

asterisk 7554 root 16u IPv4 6861 UDP *:4569
postmaste 7688 postgres 5u IPv4 5955 UDP localhost:32768->localhost:32768
postmaste 7689 postgres 5u IPv4 5955 UDP localhost:32768->localhost:32768
sshd 27038 root 3u IPv4 677971 TCP reddwarf:ssh->CPE.xxxx.com:61702 (ESTABLISHED)
sshd 27043 mohit 3u IPv4 677971 TCP reddwarf:ssh->CPE.xxxx.com:61702 (ESTABLISHED)

... SNIP ...

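Nice. A list of open Internet sockets, along with the processes, addresses and owners. Also note that (similar to netstat), the TCP states are displayed. Above, we can see an established ssh session in progress; both sshd lines share the same socket (device 677971), most likely the privileged sshd parent and its per-user child.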

Let's add a port filter and find exactly what we're looking for.

# lsof -i TCP:80
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
lighttpd 7356 lighttpd 3u IPv4 6409 TCP *:http (LISTEN)

Okay, so lighttpd is the reason why Apache won't run. That's probably a good thing.
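
In a start-up script you usually only care about who is listening on the port, not about every connection that touches it. A small sketch of that check follows. The function name listener_pids is hypothetical, and the -sTCP:LISTEN state filter assumes a reasonably recent lsof build; older versions may not support it.

```sh
# Print the PIDs of processes listening on the given TCP port.
# listener_pids is a made-up helper name for this example.
listener_pids() {
    # -n and -P skip host and port name lookups, -t prints bare PIDs,
    # -sTCP:LISTEN keeps only sockets in the LISTEN state.
    lsof -nP -t -iTCP:"$1" -sTCP:LISTEN
}

# Typical use:
#   listener_pids 80 || echo "port 80 is free"
```

Lsof exits with a non-zero status when nothing matches, which is what makes the one-line check in the comment work.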

Recipe #2: Finding Processes Within a Given Port Range

You need to find a range of free ports for your new multimedia application.

# lsof -i TCP:5000-5200
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
asterisk 7001 root 10u IPv4 6015 TCP localhost:5038 (LISTEN)
asterisk 7001 root 14u IPv4 6016 TCP localhost:5038->localhost:32768 (ESTABLISHED)
asterisk 7002 root 10u IPv4 6015 TCP localhost:5038 (LISTEN)
asterisk 7002 root 14u IPv4 6016 TCP localhost:5038->localhost:32768 (ESTABLISHED)
asterisk 7039 root 10u IPv4 6015 TCP localhost:5038 (LISTEN)
asterisk 7039 root 14u IPv4 6016 TCP localhost:5038->localhost:32768 (ESTABLISHED)
asterisk 7040 root 10u IPv4 6015 TCP localhost:5038 (LISTEN)
asterisk 7040 root 14u IPv4 6016 TCP localhost:5038->localhost:32768 (ESTABLISHED)
asterisk 7041 root 10u IPv4 6015 TCP localhost:5038 (LISTEN)
asterisk 7041 root 14u IPv4 6016 TCP localhost:5038->localhost:32768 (ESTABLISHED)
asterisk 7042 root 10u IPv4 6015 TCP localhost:5038 (LISTEN)
asterisk 7042 root 14u IPv4 6016 TCP localhost:5038->localhost:32768 (ESTABLISHED)
asterisk 7044 root 10u IPv4 6015 TCP localhost:5038 (LISTEN)
asterisk 7044 root 14u IPv4 6016 TCP localhost:5038->localhost:32768 (ESTABLISHED)
perl 7046 root 3u IPv4 6054 TCP *:5100 (LISTEN)
perl 7046 root 4u IPv4 6055 TCP *:5101 (LISTEN)
perl 7046 root 6u IPv4 6056 TCP localhost:32768->localhost:5038 (ESTABLISHED)
asterisk 7073 root 10u IPv4 6015 TCP localhost:5038 (LISTEN)
asterisk 7073 root 14u IPv4 6016 TCP localhost:5038->localhost:32768 (ESTABLISHED)
asterisk 7504 root 10u IPv4 6015 TCP localhost:5038 (LISTEN)
asterisk 7504 root 14u IPv4 6016 TCP localhost:5038->localhost:32768 (ESTABLISHED)
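
The listing shows what is taken; turning that into a list of what is free takes a little more work. The following is a rough sketch, assuming lsof's -F n output, where each name line looks like "n*:5100" or "n127.0.0.1:5038->127.0.0.1:32768". Both function names are invented for this example, and only TCP sockets visible to the calling user are considered.

```sh
# Extract the local port from each lsof name field read on stdin.
# Lines that are not name fields (e.g. "p7046") are ignored.
local_ports() {
    sed -n 's/^n[^>]*:\([0-9][0-9]*\)\(->.*\)\{0,1\}$/\1/p' | sort -un
}

# Print every port in the range $1..$2 that does not appear
# in the list of used ports read on stdin.
free_ports() {
    awk -v lo="$1" -v hi="$2" '
        { used[$1] = 1 }
        END { for (p = lo; p <= hi; p++) if (!(p in used)) print p }
    '
}

# Typical use:
#   lsof -nP -iTCP:5000-5200 -F n | local_ports | free_ports 5000 5200
```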

Recipe #3: Listing User Files

What files do users "foobar" and "apache" have open?
# lsof -u foobar,apache

List UDP ports in use by user "mohit".
# lsof -i UDP -a -u mohit

Who's responding to "who"?
# lsof -i UDP:who


Recipe #4: Unmounting a Disk or Filesystem

Sometimes you need to track down the user or process that's blocking you from unmounting a disk.
# umount /opt
umount: /opt: device is busy
umount: /opt: device is busy
# mount | grep "/opt"
/dev/hdb9 on /opt type ext3 (rw,noatime)
# lsof /dev/hdb9
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
perl 7046 root 2w REG 3,73 111 1376386 /opt/local/paynacea/var/state/callmanager.pid.err
perl 7046 root 5w REG 3,73 6783 1376385 /opt/local/paynacea/var/log/callmanager.log
# kill 7046
# umount /opt

Or the simpler:
# kill `lsof -t /opt`
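
One catch with the one-liner: if nothing has files open, lsof prints nothing and kill complains about missing arguments. A slightly more careful sketch is shown here; the function name kill_openers is made up, and it sends the default SIGTERM just like the command above.

```sh
# Send SIGTERM to every process holding files open on the given
# file, device or mount point. Does nothing if there are none.
# kill_openers is a hypothetical helper name.
kill_openers() {
    pids=$(lsof -t "$1" 2>/dev/null)
    if [ -z "$pids" ]; then
        return 0
    fi
    # Left unquoted on purpose: lsof -t prints one PID per line,
    # and each must become a separate argument to kill.
    kill $pids
}

# Typical use:
#   kill_openers /opt && umount /opt
```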


Recipe #5: Finding Device Hogs

Who's using the audio device?
# lsof /dev/audio

Why can't I start my alternate logger?
# lsof /dev/log

Why doesn't my CD eject?
# lsof /dev/cdrom


Recipe #6: Using Exclusions

The '^' (negated) modifier can prefix the User or Process ID parameters to exclude them from the resulting list. Since they represent exclusions, they are applied without ORing or ANDing and take effect before any other selection criteria are applied.

List all Internet files/sockets open by non-root users.
# lsof -i -u^root


Recipe #7: Recursing Directories

The '+D' option causes lsof to search for open files within a specified directory, recursing down to its complete depth.

List all processes that have files open in /tmp.
# lsof +D /tmp

The '+d' option does the same thing, but does _not_ descend the directory tree.

Recipe #8: Matching by Process Name

List all files open by processes beginning with the letters mpg.
# lsof -c mpg


Using a regular-expression.
# lsof -c '/post.*er/'


Recipe #9: Examining Suspicious Processes

Lsof can be used along with strace to examine and monitor the operation of viruses, worms or spyware.

What files are opened by PID 14554?
# lsof -p 14554


Who's looking at the password file?
# lsof /etc/passwd


Recipe #10: Repeat Mode

The -r switch puts lsof in repeat mode. It displays a listing, waits 15 seconds (unless another interval is specified), then displays the next one, until interrupted.

Watching a user's open files every 5 seconds:
# lsof -u badcop -r5

Monitoring the password file:
# lsof -r 2 /etc/passwd
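
Repeat mode is easier to digest if each listing is boiled down to a number. The sketch below counts the open files in each listing, assuming lsof ends every listing with a marker line made up of equals signs (which is how the manual page describes non -F repeat output) and repeats the COMMAND header each time. The function name count_per_listing is made up.

```sh
# Read lsof repeat-mode output on stdin and print the number of
# open-file lines in each listing. count_per_listing is a
# hypothetical helper name.
count_per_listing() {
    awk '
        /^=+$/      { print n + 0; n = 0; next }   # end-of-listing marker
        !/^COMMAND/ { n++ }                        # skip the header line
    '
}

# Typical use:
#   lsof -u badcop -r5 | count_per_listing
```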


Recipe #11: Finding Deleted Open Files

This recipe was added on 26/Mar/06 after an anonymous poster left a comment regarding deleted files.


One of the most annoying problems is a file-system quickly running out of space, without a hint of what file is responsible for it. This happens when a file (usually a log-file), gets deleted while it's still being written to. When you delete an open file, the kernel unlinks the file from the directory, but cannot remove the inode, since it's still open.

This causes the file to continue to grow, with no trace of its existence anywhere. Well... almost anywhere.

Lsof provides the +L parameter to display the link count of each open file. When followed by a number, lsof only displays files with a link count less than the specified number.
mohit@reddwarf ~ $ lsof +L3
COMMAND PID USER FD TYPE DEVICE SIZE NLINK NODE NAME
sshd 11540 mohit mem REG 3,69 303448 1 85869 /usr/sbin/sshd
sshd 11540 mohit mem REG 3,65 35404 1 94075 /lib/libnss_nis-2.3.5.so
sshd 11540 mohit mem REG 3,65 30928 1 94086 /lib/libnss_compat-2.3.5.so
sshd 11540 mohit mem REG 3,65 35236 1 93958 /lib/libnss_files-2.3.5.so
sshd 11540 mohit mem REG 3,65 28444 1 94094 /lib/libcrack.so.2.8.0

A deleted file has zero links. So the following command displays deleted-but-open files on a system.
$ lsof +L1

Display a list of deleted-but-open files within a specific filesystem.
$ lsof +aL1 /tmp
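
To see how much space those ghosts are holding, you can add up their sizes. This is a rough sketch that reads lsof's -F output with the D (device), i (inode) and s (size) fields and counts each device/inode pair once, since the same deleted file may be open in several processes. The function name deleted_bytes is made up, and the total is only as accurate as the sizes lsof reports.

```sh
# Sum the sizes of the files described by "lsof -F Dis" output on
# stdin, counting each device/inode pair only once.
# deleted_bytes is a hypothetical helper name.
deleted_bytes() {
    awk '
        # Account for the file set collected so far, then reset.
        function emit() {
            if (have && !((dev, ino) in seen)) {
                seen[dev, ino] = 1
                total += size
            }
            have = 0; dev = ""; ino = ""; size = 0
        }
        /^p/ { emit(); next }             # new process set
        /^f/ { emit(); have = 1; next }   # new file set
        /^D/ { dev  = substr($0, 2); next }
        /^i/ { ino  = substr($0, 2); next }
        /^s/ { size = substr($0, 2); next }
        END  { emit(); print total + 0 }
    '
}

# Typical use:
#   lsof +L1 -F Dis | deleted_bytes
```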


Finally

We barely scratched the surface with the above recipes, but as you can see, lsof is a powerful troubleshooting tool. I'd be interested in learning what other users do with lsof. Toy with it, tinker with it, use it and let me know how it has helped you.

18 comments:

  1. great article. thanks.

  2. Thanks, this was a good one. Always have trouble with umount.

    (Now I need a tutorial on autofs ... :-))

  3. This is a useful and well written article. I hadn't realized how capable lsof was. In the past, I have personally resorted to grepping through files under /proc/[0-9]*, using netstat to find port numbers etc, to do such detective work, not always successfully. I had no idea that there was a much simpler way. Thanks.

  4. Thanks for the comments. I hear ya about the /proc escapades. I sometimes still have to do that on systems where lsof cannot be installed.

  5. Great!

    There is a nice GUI for lsof here

  6. in addition to the things mentioned previously, lsof is the only tool I know of that allows you to find deleted but open files. eg:

    lsof /tmp | grep deleted

    open, unlinked files are troublesome bc their space is still being taken up on the filesystem, but they cannot be found with du. the space is not released until the process closes the file (or exits).

  7. That is indeed a very useful article for system administrators. I never knew LSOF had so many uses other than stock ones.
    Thanks indeed.
    Would definitely love to see more of these.

  8. On Mac OS X I've noticed something very bothersome. I've been reporting this issue for a while it seems, but to no avail.

    There appears to be some sort of problem with the number of system processes or such when you get up to around 2.5GB RAM utilization. I have 4 GB on this workstation.

    I thought I'd play with lsof again when I saw your article as I used to use it quite a bit back in my linux days and early Mac OS X days, but now I've not really used it much.... however when I try it with this certain apparent scenario I'm receiving the following error:
    lsof: can't read process table

    After some time passed, I thought I'd check it out again and then I received something different:
    COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
    lsof 7749 root cwd VDIR 14,8 3774 3123 /private/etc
    lsof 7749 root 0 can't read fileglob struct from 0x07c7eb58
    lsof 7749 root 1 can't read fileglob struct from 0x07c7eb58
    lsof 7749 root 2 can't read fileglob struct from 0x07c7eb58
    lsof 7749 root 3 can't read fileglob struct from 0x07c7fdb4

    Any idea as to what could possibly be going on here? I am wondering if sysctl or launchd is somehow buggered up or such.

  9. ylon,

    looks like you don't have enough permission to examine the process table.

    switch to root, or use sudo.

    $ sudo lsof ...

    hope this helps.

  10. Sorry, thought I'd mentioned that... :) I am su'd to root.

  11. Also... on a busy system lsof will hang.
    Using -C prevents lsof from reporting
    path name components from the kernel's
    name cache, which makes your query much
    faster.

  12. Great articles! Although I come from "big iron" UNIXes (Alpha servers et al), the explanations are great!

    As for lsof itself, it saved my ass once... We had a filesystem filling up and no culprit could be found. Indeed, we found a log file growing to no end. Incredibly enough, lsof is NOT installed by default in TRU64, so I had no idea what process was dealing with it.

    I found a binary of lsof on the web for this OS, and it pointed to the culprit. Incredibly enough, there were actually 2 files, the first we had found and a second -and till today I have no idea how did that happen- which was in the mountpoint of another filesystem. So, unless I umounted that FS (impossible since the box was live) I had no access to it. But killing the process in question solved almost all of the issues, and the offended file was deleted later on...

  13. You have a talent for explaining complex programs to average people.

    Thanks.

    Can you write another post explaining the tcpdump program? That's another common tool.

  14. Very nice article Mohit.
    Are you going to keep writing on this great blog?
    I would like to read more interesting articles like this one.

    Regards.

  15. Ah, I was just "maning" through lsof man page and tearing my hair off :-) Good article!
