Sparse files – what, why, and how
23 May 2008 14 Comments
Sparse files are basically just like any other file except that blocks that only contain zeros (i.e., nothing) are not actually stored on disk. This means you can have an apparently 16G file – with 16G of “data” – only taking up 1G of space on disk.
This can be particularly useful in situations where the full disk may never be completely used. One such situation would be virtual machines. If a virtual machine never fills the disk entirely, then a certain amount of the disk will never have anything but zeros in it – permitting the saving of disk space by using a sparse file.
The operating system (which supports sparse files) knows that the block “exists” but is null, so it provides the zero-filled block out of thin air. As soon as the block contains data, the data is written to disk in the appropriate way and the file on disk grows.
There are problems with using sparse files. The most egregious would be that any utility that does not recognize or utilize sparse files can replicate the entire file on disk, so that a 500M sparse file could suddenly balloon to 6G. Even utilities that can work with sparse files must be told to do so, so this trap is easy to fall into.
Another problem is that everything about disk management is based on how many sectors or blocks are used – and the disk size reported for a sparse file is the full size of the file (not the actual number of blocks on disk). This also means that any utilities that work solely with blocks on disk (e.g., du) will report different amounts than other utilities.
Yet another problem is whether backup programs recognize or preserve sparse files. A backup program that does not recognize (and store) sparse files may well be quite oversized. A restore of this flawed backup – or indeed, a restore that does not recognize properly backed up sparse files – will balloon in size as mentioned before. In a worst case, there would not be enough room on disk to restore the data, if the sparse file expands enough to fill the disk.
To get the ls command to report actual on-disk sizes (instead of just the file size typically reported) use these options:
# ls -ls
8 -rw-r--r-- 1 root root 2003 Aug 14 2006 anaconda-ks.cfg
68 -rw-r--r-- 1 root root 59663 Aug 14 2006 install.log
8 -rw-r--r-- 1 root root 3317 Aug 14 2006 install.log.syslog
12 -rw------- 1 root root 10164 Oct 25 2006 mbox
The blocks used are in the left side column; the bytes used are in their usual column just before the date. In this case, all of the files are using appropriate number of blocks – that is, none of these files are sparse.
Creating a sparse file basically amounts to a simple process: create a file, and seek to the desired end of the file. This can be done at the command line with a command like (to create a 1M sparse file):
dd if=/dev/zero of=sparse-file bs=1 count=1 seek=1024k
Here is an example, run on HP-UX:
# dd if=/dev/zero of=sparse bs=1 count=1 seek=1024k
1+0 records in
1+0 records out
# ls -ls sparse*
2 -rw-r--r-- 1 root sys 1048577 May 22 12:58 sparse
Looking at HP-UX (and perhaps other environments), there does not appear to be a wide amount of support for sparse files in the typical utilities (such as tar, cp, cpio, etc.), even as the operating system itself will dutifully create sparse files.
However, GNU cp supports the –sparse option. By default, GNU cp attempts to detect sparse files and recreates them as warranted. A file can be copied into a sparse format using –sparse=always, or into a nonsparse format using –sparse=never. The default, alluded to earlier, is –sparse=auto.
GNU tar uses the –sparse option (or the equivalent -S option) to make tar store sparse files appropriately.
GNU cpio supports the –sparse option, which operates similarly: any suitable length of zeros is recorded as a “empty” and the file is stored or created as a sparse file.
The command rsync, while not a GNU project, does support sparse files. The option –sparse (or -S) will attempt to handle sparse files efficiently, that is, not creating file blocks full of only zeros.
It appears that utilities pax, scp, sftp, and ftp do not in general support sparse files. Using such a utility, then, would make a sparse file balloon in size.
Given this utility support for sparse files, the best native environment is proably Linux or MacOS X, or other open platform. More specifically, any platform where the GNU utilities are available, along rsync.
Recognizing a sparse file can be done at the command line with the previously mentioned ls -ls command. To see a sparse file (right down to the block level), one way would be to write a C program (or other language) to start at the end of the file, and truncate the file one block shorter – if the file on disk becomes shorter, then that block was on disk; if not, then it was a sparse block. Of course, any block that contains data is on disk; it is only blocks with zeros with which there is any question as to whether it is on disk or not. Jonathan Corbet had an article in December of 2007 in the Linux Weekly News that described a method that the Solaris ZFS developers had proposed, and posed the question as to whether Linux should support similar system calls that will seek to the next hole or the next set of data in a file.