Sparse files – what, why, and how

Sparse files are like any other file, except that blocks containing only zeros (i.e., nothing) are not actually stored on disk. A file can therefore appear to be 16G in size, with 16G of “data”, while taking up only 1G of space on disk.

This is particularly useful in situations where the full size may never be used. Virtual machines are one such situation: if a virtual machine never fills its disk entirely, then part of the disk image will never contain anything but zeros, and storing the image as a sparse file saves that disk space.

An operating system that supports sparse files knows that the block “exists” but is empty, so it provides the zero-filled block out of thin air whenever it is read. As soon as data is written to the block, the block is allocated on disk in the usual way and the file on disk grows.

There are problems with using sparse files. The most serious is that any utility that does not recognize sparse files will write every zero-filled block out in full when it copies the file, so that a 500M sparse file could suddenly balloon to 6G. Even utilities that can work with sparse files must often be told to do so, so this trap is easy to fall into.

Another problem is that disk management is based on how many sectors or blocks are in use, yet the size normally reported for a sparse file is its full apparent size, not the number of blocks actually allocated on disk. As a result, utilities that count blocks on disk (e.g., du) will report different amounts than utilities that report file sizes.
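
The difference between the two views can be shown directly. The following is a minimal sketch rather than a definitive recipe: it assumes GNU coreutils (for truncate and for du --apparent-size) and a filesystem that supports sparse files, and the file name demo.img is made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes GNU coreutils and a filesystem with sparse-file support.
set -eu
workdir=$(mktemp -d)

# Extend an empty file to 16 MiB without writing any data (one big hole).
truncate -s 16M "$workdir/demo.img"

# Apparent size: the figure that ls -l and most other tools report.
apparent_kb=$(du -k --apparent-size "$workdir/demo.img" | cut -f1)
# On-disk usage: what du reports by default (blocks actually allocated).
ondisk_kb=$(du -k "$workdir/demo.img" | cut -f1)

echo "apparent: ${apparent_kb}K, on disk: ${ondisk_kb}K"
rm -rf "$workdir"
```

On a filesystem that supports holes, the on-disk figure should come out far smaller than the apparent one; the exact number depends on the filesystem.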

Yet another problem is whether backup programs recognize and preserve sparse files. A backup made by a program that does not recognize (and store) sparse files may well be considerably oversized. A restore from such a backup, or indeed a restore that does not properly recreate correctly backed-up sparse files, will balloon in size as mentioned before. In the worst case, the sparse files expand enough to fill the disk and there is not enough room to restore the data.

To get the ls command to report actual on-disk sizes (instead of just the file size typically reported) use these options:

# ls -ls
total 96
8 -rw-r--r-- 1 root root 2003 Aug 14 2006 anaconda-ks.cfg
68 -rw-r--r-- 1 root root 59663 Aug 14 2006 install.log
8 -rw-r--r-- 1 root root 3317 Aug 14 2006 install.log.syslog
12 -rw------- 1 root root 10164 Oct 25 2006 mbox

The blocks used are in the left-hand column; the size in bytes is in its usual column just before the date. In this case, every file uses an appropriate number of blocks for its size; that is, none of these files is sparse.

Creating a sparse file is a simple process: create a file, and seek to the desired end of the file before writing. This can be done at the command line with a command like the following, which creates a 1M sparse file (strictly, 1M plus the single byte written at the end):

dd if=/dev/zero of=sparse-file bs=1 count=1 seek=1024k

Here is an example, run on HP-UX:

# dd if=/dev/zero of=sparse bs=1 count=1 seek=1024k
1+0 records in
1+0 records out
# ls -ls sparse*
2 -rw-r--r-- 1 root sys 1048577 May 22 12:58 sparse
#
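
Where GNU coreutils is installed, truncate offers another way to reach the same result without writing even the single byte that dd writes. The sketch below is an illustration under stated assumptions, not part of the HP-UX session above: it assumes GNU dd, truncate, and stat, and the file names via-dd and via-truncate are made up.

```shell
#!/bin/sh
# Sketch only: assumes GNU coreutils (dd, truncate, stat -c).
set -eu
workdir=$(mktemp -d)

# The dd approach from the text: seek 1M into the file, then write one byte.
dd if=/dev/zero of="$workdir/via-dd" bs=1 count=1 seek=1024k 2>/dev/null

# GNU truncate: set the length directly; no data is written at all.
truncate -s 1048577 "$workdir/via-truncate"

# %s = apparent size in bytes, %b = allocated blocks, %B = bytes per block.
dd_bytes=$(stat -c %s "$workdir/via-dd")
truncate_bytes=$(stat -c %s "$workdir/via-truncate")
stat -c '%n: %s bytes, %b blocks of %B bytes' \
    "$workdir/via-dd" "$workdir/via-truncate"
rm -rf "$workdir"
```

Both files end up with the same apparent size; how many blocks each one occupies is up to the filesystem.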

On HP-UX (and perhaps other environments), the typical utilities (such as tar, cp, and cpio) do not appear to offer much support for sparse files, even though the operating system itself will dutifully create sparse files.

However, GNU cp supports the --sparse option. By default, GNU cp attempts to detect sparse files and recreates them as warranted. A file can be copied into a sparse format using --sparse=always, or into a nonsparse format using --sparse=never. The default, alluded to earlier, is --sparse=auto.
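
As a rough illustration of recovering from a ballooned file with GNU cp, the sketch below writes a fully allocated file of zeros and copies it with --sparse=always. It assumes GNU coreutils; the file names full.img and resparsed.img are made up, and how much space is actually reclaimed depends on the filesystem.

```shell
#!/bin/sh
# Sketch only: assumes GNU cp, dd, and du.
set -eu
workdir=$(mktemp -d)

# A fully allocated 4 MiB file of zeros, standing in for a "ballooned" file.
dd if=/dev/zero of="$workdir/full.img" bs=1M count=4 2>/dev/null

# Copy it, asking cp to turn runs of zeros into holes.
cp --sparse=always "$workdir/full.img" "$workdir/resparsed.img"

full_kb=$(du -k "$workdir/full.img" | cut -f1)
resparsed_kb=$(du -k "$workdir/resparsed.img" | cut -f1)

# The contents must be byte-for-byte identical; only the allocation differs.
if cmp -s "$workdir/full.img" "$workdir/resparsed.img"; then
    identical=yes
else
    identical=no
fi

echo "full: ${full_kb}K on disk, resparsed: ${resparsed_kb}K on disk"
echo "identical contents: $identical"
rm -rf "$workdir"
```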

GNU tar uses the --sparse option (or the equivalent -S option) to store sparse files appropriately.
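
The effect of -S on archive size can be seen with a small experiment. This is a sketch that assumes GNU tar and GNU coreutils; the directory and file names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes GNU tar plus GNU truncate and stat.
set -eu
workdir=$(mktemp -d)
mkdir "$workdir/src" "$workdir/restore"

# An 8 MiB file that is entirely one hole.
truncate -s 8M "$workdir/src/sparse.img"

# Archive it twice: once plainly, once with -S (--sparse).
tar -cf "$workdir/plain.tar" -C "$workdir/src" sparse.img
tar -cSf "$workdir/sparse.tar" -C "$workdir/src" sparse.img

plain_tar_bytes=$(stat -c %s "$workdir/plain.tar")
sparse_tar_bytes=$(stat -c %s "$workdir/sparse.tar")

# Extracting the sparse archive restores the original apparent size.
tar -xf "$workdir/sparse.tar" -C "$workdir/restore"
restored_bytes=$(stat -c %s "$workdir/restore/sparse.img")

echo "plain.tar: $plain_tar_bytes bytes, sparse.tar: $sparse_tar_bytes bytes"
echo "restored sparse.img: $restored_bytes bytes"
rm -rf "$workdir"
```

The plain archive carries all of the zeros, while the archive made with -S records only where the holes are.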

GNU cpio supports the --sparse option, which operates similarly: any suitably long run of zeros is recorded as “empty”, and the file is stored or created as a sparse file.

The rsync command, while not a GNU project, also supports sparse files. The --sparse option (or -S) attempts to handle sparse files efficiently, that is, without creating file blocks full of only zeros.

It appears that the utilities pax, scp, sftp, and ftp do not in general support sparse files. Transferring a sparse file with such a utility would therefore make it balloon in size.

Given this utility support for sparse files, the best native environment is probably Linux or Mac OS X, or another open platform; more specifically, any platform where the GNU utilities are available, along with rsync.

Recognizing a sparse file can be done at the command line with the previously mentioned ls -ls command. To examine a sparse file right down to the block level, one way would be to write a program in C (or another language) that starts at the end of the file and truncates it one block at a time (on a scratch copy, since truncation destroys the data): if the space used on disk shrinks, then that block was on disk; if not, then it was a sparse block. Of course, any block that contains data is on disk; the question of whether a block is on disk arises only for blocks of zeros.

Jonathan Corbet had an article in Linux Weekly News in December of 2007 that described a method proposed by the Solaris ZFS developers, and posed the question of whether Linux should support similar system calls that seek to the next hole or the next run of data in a file.
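
The ls -ls comparison can also be scripted. The sketch below is a heuristic, not a block-level inspection: it assumes GNU stat, and the helper name looks_sparse and the file names are hypothetical. Filesystem compression or inline storage of small files can produce the same signature as a hole.

```shell
#!/bin/sh
# Sketch only: assumes GNU stat (-c with %s, %b, %B), truncate, and dd.
set -eu

# Hypothetical helper: succeeds if fewer bytes are allocated on disk than the
# file's apparent size. A heuristic only; it cannot say where the holes are.
looks_sparse() {
    size_bytes=$(stat -c %s "$1")
    allocated_bytes=$(( $(stat -c %b "$1") * $(stat -c %B "$1") ))
    [ "$allocated_bytes" -lt "$size_bytes" ]
}

workdir=$(mktemp -d)
# A file that is all hole, and a file that is all (random) data.
truncate -s 4M "$workdir/holey.img"
dd if=/dev/urandom of="$workdir/dense.img" bs=64k count=1 2>/dev/null

if looks_sparse "$workdir/holey.img"; then
    holey_result="looks sparse"
else
    holey_result="fully allocated"
fi
if looks_sparse "$workdir/dense.img"; then
    dense_result="looks sparse"
else
    dense_result="fully allocated"
fi

echo "holey.img: $holey_result"
echo "dense.img: $dense_result"
rm -rf "$workdir"
```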

14 Responses to Sparse files – what, why, and how

  1. Pingback: Virtual Machine Disk Image Compression « Share Virtual Machines

  2. Pingback: Compressing Virtual Images « Share Virtual Machines

  3. Paul Evans says:

    Excellent article, David. There is a fair amount of interest on the Web on sparse files thanks to the proliferation of VM’s. Users prefer VM’s that can grow dynamically instead of allocating a large amount of storage statically at creation time. I have linked to your article in my survey at

    http://sharevm.wordpress.com/2008/12/13/virtual-machine-disk-image-compression/

    to help people understand the nuances of their choices

  4. yungchin says:

    Thank you for typing this up, it was incredibly useful to me. You beat the Wikipedia page in comprehensiveness: it doesn’t include tar, cpio or rsync.

    Of course I wish I’d read this before ballooning a file and filling my whole disk, but at least I learnt how to get back using cp now :)

  5. valk says:

    Very userful thank you ! I was about to fall into the ballooning trap before i read your article. ;)

  6. sustainabilityblogger says:

    very Useful David, i will share this link okey
    by mvo

  7. Mark says:

    Very informative; thank you. I’m sure for each of us that posts, there’s 100 others who do not.

  8. 10010 says:

    Great one! This’s the only src that helped me understand sparse file and its usage

    Thx

  9. TrafficWomble says:

    Many thanks for this info … Am off to get a copy of the GNU versions of tar and cpio right away!

  10. Pingback: Backup size limiting

  11. Louis says:

    Many thanks !

  12. Mr lonely says:

    i am working on a open pilot for marine graph plotter and and i am getting this error if you can help me ..because it is related with sparse file handler the error is

    cpl_vsil.obj : error LNK2019: unresolved external symbol _VSIInstallSparseFileHandler referenced in function “private: static class VSIFileManager * __cdecl VSIFileManager::Get(void)” (?Get@VSIFileManager@@CAPAV1@XZ)

    a function is called like VSIInstallSparseFileHandler() and i am getting nothing related to this
