Apparently this installer wants a title

Various ramblings from David Benjamin

Tar-filled pipes

Follow-up to A Very Subtle Bug from nelhage, reposted from a discussion on zephyr.

The tar format is, conceptually, a very simple one. You concatenate a bunch of files together and preface each with a metadata header (path, size, etc.). Partial extraction of a single file requires a linear walk across the archive until you find the record you want. Of course, once you’ve extracted it, the file can be closed and no more work need be done. This, combined with a piped gzip and Python’s odd SIGPIPE handling, gives the problem from nelhage’s A Very Subtle Bug.
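To make the linear walk concrete, here is a minimal sketch in Python (not the C that tar itself is written in). It builds a small archive in memory and then steps through it header by header. The helper names `build_archive` and `walk_headers` are made up for illustration, and it assumes the plain ustar layout: 512-byte blocks, the name in the first 100 bytes of the header, and the size as an octal string at bytes 124 to 135.

```python
import io
import tarfile

BLOCK = 512  # tar headers and file data are laid out in 512-byte blocks


def build_archive(files):
    """Build an uncompressed ustar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def walk_headers(archive):
    """Yield (name, size) for each member by visiting headers in order."""
    offset = 0
    while offset + BLOCK <= len(archive):
        header = archive[offset:offset + BLOCK]
        if header == bytes(BLOCK):
            break  # an all-NUL block where a header should be: stop here
        name = header[0:100].split(b"\0", 1)[0].decode()
        size_field = header[124:136].split(b"\0", 1)[0].decode().strip()
        size = int(size_field or "0", 8)
        yield name, size
        # Skip this header plus its data, rounded up to a whole block.
        offset += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK


archive = build_archive({"a.txt": b"hello", "b.txt": b"x" * 1000})
members = list(walk_headers(archive))
print(members)
```

There is no index anywhere in the file: the only way to reach `b.txt` is to read the header of `a.txt` and skip over its data.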

But the details don’t quite seem to work that way. lbzip2 on reddit notes that, on a large file,

GNU tar 1.20 didn’t stop reading from lbzip2 after finding and extracting the file from the tar stream. (That stream continues after the specified file for another 270M or so, and the compressed tarball continues for another 47M or so.)

So what is going on here? Because it’s so much fun, let’s source-dive! The primary loop is a read_and function in src/list.c (abridged):

/* Main loop for reading an archive.  */
void
read_and (void (*do_something) (void))
{
  /* [Initialize some things...] */
  open_archive (ACCESS_READ);
  do
    {
      prev_status = status;
      tar_stat_destroy (&current_stat_info);

      status = read_header (false);
      /* [Call do_something () per appropriate header] */
    }
  while (!all_names_found (&current_stat_info));

  close_archive ();
  names_notfound ();            /* print names not found */
}

Certainly looks like we close the archive after we’ve seen everything we care about. Looking at all_names_found from src/names.c, it iterates over the arguments reasonably and checks if they’ve all been seen. However, there is one funny check before that loop:

  if (!p->file_name || occurrence_option == 0 || p->had_trailing_slash)
    return false;

occurrence_option corresponds to the --occurrence option. Quoth the man page:

       process only the NUMBERth occurrence of each file in the archive;

What does that mean? Well, like I said, tar files are very simple. You concatenate files together. They are so simple that duplicate files are allowed. Both versions get extracted and the later ones override the earlier ones. tar does not, and cannot, abort upon seeing all files because there may be newer versions later. The --occurrence option allows you to specify that you want a particular set of versions. Only then will tar prematurely cut off the pipe.
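A small sketch of the duplicate-member behaviour, using Python's tarfile module rather than GNU tar. The file names and the `add_bytes` helper are invented for the example. It relies on the documented behaviour of `TarFile.getmember`, which treats the last occurrence of a name as the most up-to-date version.

```python
import io
import tarfile


def add_bytes(tar, name, data):
    """Append one member with the given name and contents."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    add_bytes(tar, "notes.txt", b"old version")
    add_bytes(tar, "other.txt", b"unrelated")
    add_bytes(tar, "notes.txt", b"new version")  # same name, appended later

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()
    # getmember() picks the last occurrence of a duplicated name.
    latest = tar.extractfile(tar.getmember("notes.txt")).read()

print(names)
print(latest)
```

A reader that stopped after the first `notes.txt` would never see the newer copy, which is why stopping early is only safe when the caller says which occurrence it wants.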

Given that, why the occasional SIGPIPE bug? We’ve established that, by default, tar will not prematurely close the pipe after extracting, so there must be some place where we close the pipe. Looking back to read_and, it does break out of the loop in other cases: end of file (HEADER_END_OF_FILE) and NUL block (HEADER_ZERO_BLOCK). The latter is handled by this snippet (abridged):

      if (block_number_option)
        {
          char buf[UINTMAX_STRSIZE_BOUND];
          fprintf (stdlis, _("block %s: ** Block of NULs **\n"),
                   STRINGIFY_BIGINT (current_block_ordinal (), buf));
        }

      set_next_block_after (current_header);

      if (!ignore_zeros_option)
        {
          /* [Long comment about POSIX compatibility, disabled warning] */
          break;
        }
      status = prev_status;

Unless one passes -i or --ignore-zeros, NUL blocks are treated as EOF. And indeed, if one inspects a random tar file with -i and --block-number,

davidben@rupert:/tmp% tar -tzf tar_1.22.orig.tar.gz -i --block-number | tail
block 22151: ** Block of NULs **
block 22152: ** Block of NULs **
block 22153: ** Block of NULs **
block 22154: ** Block of NULs **
block 22155: ** Block of NULs **
block 22156: ** Block of NULs **
block 22157: ** Block of NULs **
block 22158: ** Block of NULs **
block 22159: ** Block of NULs **
block 22160: ** End of File **

(This file appears to end in 22 of them.) And now we have the culprit. Tar files end with a few NUL blocks, signifying end-of-file. tar closes the pipe on the first, leaving a few blocks that gzip still has to write and tar will never read. This race allows tar to finish before gzip does, so gzip ends up writing into a closed pipe, triggering the Python problem.
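The size of that unread tail can be estimated with a short Python sketch. This models a reader that stops at the first NUL header block; it is not tar's actual code. The archive is generated with Python's tarfile module, so the exact amount of padding is an assumption about that writer and will differ for archives made by other tools. `make_archive` and `first_nul_block` are hypothetical helper names.

```python
import io
import tarfile

BLOCK = 512  # tar block size


def make_archive():
    """Create a one-file ustar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        data = b"hello\n"
        info = tarfile.TarInfo("hello.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def first_nul_block(archive):
    """Return the block index of the first all-NUL header, or None."""
    offset = 0
    while offset + BLOCK <= len(archive):
        header = archive[offset:offset + BLOCK]
        if header == bytes(BLOCK):
            return offset // BLOCK
        size_field = header[124:136].split(b"\0", 1)[0].decode().strip()
        size = int(size_field or "0", 8)
        # Move past this header and its block-padded data.
        offset += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    return None


archive = make_archive()
total_blocks = len(archive) // BLOCK
stop_at = first_nul_block(archive)
# Blocks the writer still emits after the reader has decided it is done.
unread_blocks = total_blocks - (stop_at + 1)
print(f"reader stops at block {stop_at}, {unread_blocks} blocks left unread")
```

Whether the writer notices depends on timing and buffering: if those leftover blocks are already sitting in the pipe when the reader exits, nothing visible happens.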

A final note: don’t start passing --occurrence to all your tar calls. The logic in all_names_found does rather odd things with directories and misbehaves on some tarballs. This will be the subject of a future post, possibly after some mail with bug-tar.

Written by davidben

February 28th, 2010 at 5:07 pm

Posted in Software

7 Responses to 'Tar-filled pipes'

  1. Jeebus, thank you so much for posting this. I ran into this issue a couple weeks ago and it was pretty difficult to debug given that it only happened on a specific build machine for some reason. I’ll sleep better tonight knowing that there’s nothing more insidious going on.


    28 Feb 10 at 11:34 pm

  2. That is interesting, and makes perfect sense. I have written a backup system for Cyrus that makes very sneaky use of the tar format – I wound up writing a custom Perl module that can read and write tar streams without having to touch the disk – in particular they can be used to strip stale files out of a tar file by streaming it through a decider function.

    I use an external gzip process, but obviously not an external tar, so I close it explicitly if I don’t need to finish it. It’s possible to embed the gzip directly inside the process too, but I kind of like the “use multiple cores” side effect of letting a separate process do the compression.

    Bron Gondwana

    1 Mar 10 at 2:40 am

  3. This is awesome. I actually just got a VPS (no sites on it yet) and posts like this make me happy. Sometimes simple things can be daunting if you aren’t familiar, and I am a stranger to python but hope to learn.

    Matt Sandy

    1 Mar 10 at 4:05 am

  4. Thank you very much for investigating the phenomenon in the tar source.

    I posted an inquiry to the help-tar mailing list:

    Perhaps the discussion should be migrated there (or bug-tar as you say).

    lacos / lbzip2

    lacos (lbzip2)

    1 Mar 10 at 10:13 am

  5. lacos: Oh, the thing I wanted to poke bug-tar for was something else; all_names_found can’t really reliably check if you’ve exhausted a directory. The current code assumes all directories are contiguous (not sure if the spec mandates that they be contiguous… it’s certainly very easy to create one where they aren’t), but does the check in a somewhat strange manner such that, for non-contiguous directories, you get very different results depending on surrounding file names.

    As far as I’m aware, the NUL blocks should explain the SIGPIPE sending. It’s a fairly tight race condition, which fits with it only occurring some of the time; there’s a small enough tail that it depends on how much buffering both your filters and tar do, as well as who happens to get ahead of the other.


    1 Mar 10 at 11:04 am

  6. I should probably clarify for folks — I didn’t find or investigate the Python SIGPIPE thing. Just looked into a detail of a friend’s discovery I found curious.


    1 Mar 10 at 11:14 am

  7. Re: davidben, March 1, 2010 at 11:04 am: yes, the NUL blocks do explain the broken pipe. IIRC I sent the inquiry to help-tar first, then went to reddit second, and saw your blog entry posted there. Thanks.


    1 Mar 10 at 3:37 pm