Sparse files – what, why, and how

23 May 2008

Sparse files are basically just like any other file except that blocks that only contain zeros (i.e., nothing) are not actually stored on disk. This means you can have an apparently 16G file – with 16G of “data” – only taking up 1G of space on disk.

This can be particularly useful in situations where the full disk may never be completely used. One such situation would be virtual machines. If a virtual machine never fills the disk entirely, then a certain amount of the disk will never have anything but zeros in it – permitting the saving of disk space by using a sparse file.

The operating system (which supports sparse files) knows that the block “exists” but is null, so it provides the zero-filled block out of thin air. As soon as the block contains data, the data is written to disk in the appropriate way and the file on disk grows.

There are problems with using sparse files. The most egregious would be that any utility that does not recognize or utilize sparse files can replicate the entire file on disk, so that a 500M sparse file could suddenly balloon to 6G. Even utilities that can work with sparse files must be told to do so, so this trap is easy to fall into.

Another problem is that everything about disk management is based on how many sectors or blocks are used – and the disk size reported for a sparse file is the full size of the file (not the actual number of blocks on disk). This also means that any utilities that work solely with blocks on disk (e.g., du) will report different amounts than other utilities.

Yet another problem is whether backup programs recognize or preserve sparse files. A backup program that does not recognize (and store) sparse files may well be quite oversized. A restore of this flawed backup – or indeed, a restore that does not recognize properly backed up sparse files – will balloon in size as mentioned before. In a worst case, there would not be enough room on disk to restore the data, if the sparse file expands enough to fill the disk.

To get the ls command to report actual on-disk sizes (instead of just the file size typically reported) use these options:

# ls -ls
total 96
8 -rw-r--r-- 1 root root 2003 Aug 14 2006 anaconda-ks.cfg
68 -rw-r--r-- 1 root root 59663 Aug 14 2006 install.log
8 -rw-r--r-- 1 root root 3317 Aug 14 2006 install.log.syslog
12 -rw------- 1 root root 10164 Oct 25 2006 mbox

The blocks used are in the left side column; the bytes used are in their usual column just before the date. In this case, all of the files are using appropriate number of blocks – that is, none of these files are sparse.

Creating a sparse file basically amounts to a simple process: create a file, and seek to the desired end of the file. This can be done at the command line with a command like (to create a 1M sparse file):

dd if=/dev/zero of=sparse-file bs=1 count=1 seek=1024k

Here is an example, run on HP-UX:

# dd if=/dev/zero of=sparse bs=1 count=1 seek=1024k
1+0 records in
1+0 records out
# ls -ls sparse*
2 -rw-r--r-- 1 root sys 1048577 May 22 12:58 sparse
#

Looking at HP-UX (and perhaps other environments), there does not appear to be a wide amount of support for sparse files in the typical utilities (such as tar, cp, cpio, etc.), even as the operating system itself will dutifully create sparse files.

However, GNU cp supports the –sparse option. By default, GNU cp attempts to detect sparse files and recreates them as warranted. A file can be copied into a sparse format using –sparse=always, or into a nonsparse format using –sparse=never. The default, alluded to earlier, is –sparse=auto.

GNU tar uses the –sparse option (or the equivalent -S option) to make tar store sparse files appropriately.

GNU cpio supports the –sparse option, which operates similarly: any suitable length of zeros is recorded as a “empty” and the file is stored or created as a sparse file.

The command rsync, while not a GNU project, does support sparse files. The option –sparse (or -S) will attempt to handle sparse files efficiently, that is, not creating file blocks full of only zeros.

It appears that utilities pax, scp, sftp, and ftp do not in general support sparse files. Using such a utility, then, would make a sparse file balloon in size.

Given this utility support for sparse files, the best native environment is proably Linux or MacOS X, or other open platform. More specifically, any platform where the GNU utilities are available, along rsync.

Recognizing a sparse file can be done at the command line with the previously mentioned ls -ls command. To see a sparse file (right down to the block level), one way would be to write a C program (or other language) to start at the end of the file, and truncate the file one block shorter – if the file on disk becomes shorter, then that block was on disk; if not, then it was a sparse block. Of course, any block that contains data is on disk; it is only blocks with zeros with which there is any question as to whether it is on disk or not. Jonathan Corbet had an article in December of 2007 in the Linux Weekly News that described a method that the Solaris ZFS developers had proposed, and posed the question as to whether Linux should support similar system calls that will seek to the next hole or the next set of data in a file.

Entry Filed under: HP-UX, Linux, OpenSolaris, Solaris, Tips, UNIX. Tags: , , , , , .

4 Comments Add your own

  • 1. Virtual Machine Disk Imag&hellip  |  16 December 2008 at 2:08 am

    [...] UNIX sparse files [...]

    Reply
  • 2. Compressing Virtual Image&hellip  |  16 December 2008 at 7:36 pm

    [...] UNIX sparse files [...]

    Reply
  • 3. Paul Evans  |  17 December 2008 at 4:10 pm

    Excellent article, David. There is a fair amount of interest on the Web on sparse files thanks to the proliferation of VM’s. Users prefer VM’s that can grow dynamically instead of allocating a large amount of storage statically at creation time. I have linked to your article in my survey at

    http://sharevm.wordpress.com/2008/12/13/virtual-machine-disk-image-compression/

    to help people understand the nuances of their choices

    Reply
  • 4. yungchin  |  24 June 2009 at 2:17 pm

    Thank you for typing this up, it was incredibly useful to me. You beat the Wikipedia page in comprehensiveness: it doesn’t include tar, cpio or rsync.

    Of course I wish I’d read this before ballooning a file and filling my whole disk, but at least I learnt how to get back using cp now :)

    Reply

Leave a Comment

Required

Required, hidden

Some HTML allowed:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <pre> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Trackback this post  |  Subscribe to the comments via RSS Feed


David Douthitt

David is an experienced UNIX and Linux system administrator, a former Linux distribution maintainer, and author of two books ("Advanced Topics in System Administration" and "GNU Screen: A Comprehensive Manual"). View David Douthitt's profile on LinkedIn Support freedom The Internet Traffic Report monitors the flow of data around the world. It then displays a value between zero and 100. Higher values indicate faster and more reliable connections.

Recent Posts

Top Posts

RSS Sharky’s Column!

Calendar

May 2008
M T W T F S S
« Apr   Jun »
 1234
567891011
12131415161718
19202122232425
262728293031  

Recent Comments

Anthony on About
MikeT on Stress Relief: Laugh Out Loud…
yungchin on Sparse files – what, why…
Randal L. Schwartz on Perl Tidbits: Annoyances and…
Court on Perl Tidbits: Annoyances and…

Category Cloud

BSD Career Conferences Debian Debugging Disaster recovery Fedora FreeBSD HP-UX Legal Linux MacOS X Mobile Computing Networking OpenBSD OpenSolaris OpenVMS Personal Notes Portable Code Presentations Productivity Programming Red Hat Scripting Security Solaris Storage Tips Ubuntu UNIX

Archives

Feeds

Blogroll

Pages

Meta