Linux: Looking for Large Folders with du

  • One of the most common tasks in system administration is finding out what is using up the disk space on a machine. You might check a filesystem with df and find that it holds more data than you expected, and you want to find out where that space is going.

    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda1       114G   18G   91G  17% /

    By combining the du command with a few simple command line tools, we can quickly explore the filesystem by hand and look for the biggest space wasters. We start with a summary du on the root of the filesystem in question, which is / in this case.

    # du -smx --exclude=proc /* | sort -n | tail -n 5
    705	/opt
    1050	/root
    2755	/var
    5572	/usr
    7135	/home

    Wow, that seems like a long command. Let's break it down to understand what we just did. First, the du portion. We start with the -smx flags: s summarizes each argument (one total per directory instead of a line for every subdirectory), m displays sizes in megabytes, and x limits the recursive file discovery to the current filesystem (any other filesystem mounted under one of those locations is skipped). The --exclude=proc portion tells du not to read the /proc folder, as it is not an on-disk filesystem and reading it causes needless errors and delays. The /* argument tells du to read everything (the * wildcard) under the root / mount point.

    The output of du is then piped (see our lesson on BASH redirection and pipes) into the sort command, where the -n option makes it sort numerically instead of alphabetically. Finally, we pipe that output into the tail command to limit the output to the final (that is, the largest) five items found by du. The sorting is the reason we use megabytes instead of human readable form in the initial command: sort -n compares plain numbers and does not understand suffixes such as K, M or G.
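
    If human readable sizes are preferred, GNU sort also has a -h option that understands those suffixes. The following is only a sketch, assuming GNU coreutils (du and sort both supporting -h); the temporary directory and the file names in it are made up for the demonstration so that it is safe to run anywhere.

    ```shell
    # Build a throwaway tree (hypothetical names) instead of touching real data.
    demo=$(mktemp -d)
    mkdir "$demo/small" "$demo/medium" "$demo/large"
    head -c 100000  /dev/urandom > "$demo/small/a"    # about 100 KB
    head -c 2000000 /dev/urandom > "$demo/medium/b"   # about 2 MB
    head -c 5000000 /dev/urandom > "$demo/large/c"    # about 5 MB

    # du -h prints sizes with K/M/G suffixes; sort -h (GNU) orders them correctly.
    # The subshell keeps the cd from changing the current directory.
    largest=$(cd "$demo" && du -shx -- * | sort -h | tail -n 1 | cut -f2)
    echo "$largest"

    rm -rf "$demo"
    ```

    The last line of the sorted output is the biggest entry, just as with the -m and sort -n version.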

    That might seem like a lot at first, but once you know the simple building blocks of du, sort and tail, along with BASH command structures, it is quite simple and straightforward. It is also similar to many tasks that we will do as system administrators using standard tools.

    Now, given the output of the command that we just saw, we can delve deeper into the directory structure to narrow down where the culprits may be. We often do this task by hand because it is quick and easy and does not require a more complicated tool, but also because we can easily massage the data to account for things that we know about the system. For example, we might know that the /home directory contains things that we cannot delete, so investigating it is a waste of time (that is only an example and would not normally be true). In this case, we will assume that /var is using more space than we feel is appropriate, and we will look there to see what is taking up that space.

    We will change directory into the folder in question and run the original command again (removing the absolute path starting point to make it generic so that we can run it again and again.)

    # cd /var
    # du -smx --exclude=proc * | sort -n | tail -n 5
    5	log
    14	backups
    176	tmp
    297	lib
    2265	cache

    From this we now see that the cache is the big user of space within the /var directory. We can learn more about what is using space within that by repeating our steps from above.
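
    One caveat with the generic form: the shell's * wildcard does not match names that begin with a dot, so hidden directories never reach du. The sketch below, assuming GNU du (for -a together with --max-depth), shows the difference on a throwaway tree with made-up names.

    ```shell
    demo=$(mktemp -d)
    mkdir "$demo/.hidden" "$demo/visible"
    head -c 3000000 /dev/urandom > "$demo/.hidden/big"     # about 3 MB
    head -c 100000  /dev/urandom > "$demo/visible/small"   # about 100 KB

    # The glob only sees "visible"; the dot-directory is silently skipped.
    with_glob=$(cd "$demo" && du -smx -- * | sort -n | tail -n 1 | cut -f2)

    # -a with --max-depth=1 reports every direct child of ".", hidden or not.
    # du prints the total for "." last, so sed '$d' drops that line before sorting.
    with_depth=$(cd "$demo" && du -amx --max-depth=1 . | sed '$d' | sort -n | tail -n 1 | cut -f2)

    echo "glob: $with_glob"
    echo "max-depth: $with_depth"

    rm -rf "$demo"
    ```

    Here the glob version reports visible as the largest entry, while the --max-depth version finds ./.hidden.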

    # cd cache
    # du -smx --exclude=proc * | sort -n | tail -n 5
    2	man
    7	cups
    7	debconf
    87	apt-xapian-index
    2163	apt

    And now we see that the apt directory (its absolute path at this point is /var/cache/apt) is what is using nearly all of the space not only of cache but of var above it.
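
    Since the same pipeline keeps being retyped at every level, it can be wrapped in a small shell function. This is a minimal sketch; the name biggest and its arguments are hypothetical, not a standard command, and the demo tree is created only for illustration.

    ```shell
    # biggest [DIR] [COUNT]: hypothetical helper, not a standard command.
    # Prints the COUNT largest entries directly under DIR, sizes in megabytes.
    biggest() {
        # The subshell keeps the cd from changing the caller's directory.
        ( cd "${1:-.}" && du -smx -- * | sort -n | tail -n "${2:-5}" )
    }

    # Demonstration on a throwaway tree.
    demo=$(mktemp -d)
    mkdir "$demo/cache" "$demo/log"
    head -c 3000000 /dev/urandom > "$demo/cache/blob"     # about 3 MB
    head -c 10000   /dev/urandom > "$demo/log/messages"   # about 10 KB

    top=$(biggest "$demo" 1)
    echo "$top"

    rm -rf "$demo"
    ```

    With the function defined in the current shell, biggest /var/cache would stand in for the cd and du steps shown in this walk-through.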

    # cd apt
    # du -smx --exclude=proc * | sort -n | tail -n 5
    45	pkgcache.bin
    45	srcpkgcache.bin
    2074	archives

    Going down into apt, we see that archives accounts for nearly all of the space used within apt. We are learning a lot from a single, simple exercise. One more level and we will find what is going on:

    # cd archives
    # du -smx --exclude=proc * | sort -n | tail -n 5
    62	chromium-browser_48.0.2564.116-0ubuntu0.
    66	chromium-browser_49.0.2623.108-0ubuntu0.
    66	chromium-browser_49.0.2623.87-0ubuntu0.
    79	duck_4.9.2.19773_amd64.deb
    82	duck_4.7.5.18825_amd64.deb
    # pwd

    At this bottommost level our command turns up individual files of roughly the same size. This tells us that the final directory that we have arrived at (as shown with the pwd command) contains a large number of relatively small files that together add up to the large amount of space that we had observed. We can of course verify this using either the ls or du commands, but we already know it to be true. We can also do a quick count of the files in the directory to understand the scope:

    # ls | wc -l

    That is a lot of files. No wonder that, even though each one is fairly small, together they take up so much space.
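
    The file count and the total size can be checked together. The following is a sketch on a throwaway directory filled with made-up package files; it assumes a find that supports -maxdepth (GNU and BSD find do, although it is not required by POSIX).

    ```shell
    # Create 40 small files (hypothetical names) of about 100 KB each.
    demo=$(mktemp -d)
    i=1
    while [ "$i" -le 40 ]; do
        head -c 100000 /dev/urandom > "$demo/pkg_$i.deb"
        i=$((i + 1))
    done

    # Count only regular files directly in the directory.
    count=$(find "$demo" -maxdepth 1 -type f | wc -l)
    # One summary line for the whole directory, in megabytes.
    total=$(du -sm "$demo" | cut -f1)

    echo "$count files, $total MB in total"

    rm -rf "$demo"
    ```

    Many small files adding up to a few megabytes here is the same pattern, on a much smaller scale, as the archives directory above.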

    I recommend doing this as an exercise on your own system. Use du to delve into the filesystem and see what is taking up a large amount of space in different areas.
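
    Once the manual walk feels familiar, it can be automated. The sketch below is only one possible approach, not a replacement for looking at the numbers yourself: the function name drill is hypothetical, and it always follows the single largest entry, so it will miss cases where the space is spread evenly across several directories.

    ```shell
    # drill DIR: hypothetical helper that repeats the manual walk automatically.
    # Follows the largest entry at each level until that entry is not a directory.
    drill() {
        dir=$1
        while :; do
            # Largest entry directly under $dir (megabytes, current filesystem only).
            name=$(cd "$dir" && du -smx -- * 2>/dev/null | sort -n | tail -n 1 | cut -f2-)
            # Stop on an empty directory, a symlink, or when the largest entry is a file.
            [ -n "$name" ] && [ -d "$dir/$name" ] && [ ! -L "$dir/$name" ] || break
            dir=$dir/$name
        done
        printf '%s\n' "$dir"
    }

    # Demonstration on a throwaway tree that loosely mimics the example above.
    demo=$(mktemp -d)
    mkdir -p "$demo/var/cache/apt" "$demo/var/log" "$demo/home"
    head -c 4000000 /dev/urandom > "$demo/var/cache/apt/a.deb"
    head -c 4000000 /dev/urandom > "$demo/var/cache/apt/b.deb"
    head -c 1000000 /dev/urandom > "$demo/var/log/syslog"
    head -c 2000000 /dev/urandom > "$demo/home/notes"

    result=$(drill "$demo")
    echo "${result#"$demo"}"   # prints /var/cache/apt

    rm -rf "$demo"
    ```

    The walk stops at the directory whose largest entry is a plain file, which is the point where the manual exercise stopped as well.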

    Part of a series on Linux Systems Administration by Scott Alan Miller

  • I've always just used -h for human readable. I never realized -m would give you the MB size.

  • @johnhooks said in Linux: Looking for Large Folders with du:

    I've always just used -h for human readable. I never realized -m would give you the MB size.

    I have the "advantage" of having learned this stuff before the human readable flag was added 😉