TA: the Transparent Archivist

(Version 1.8)

Copyright © 2005 David Flater.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program.  If not, see <http://www.gnu.org/licenses/>.

THE TA SOFTWARE DISTRIBUTION IS AVAILABLE FROM: https://flaterco.com/files/ta/

Contents

  1. What TA is
  2. Requirements
  3. Limitations
  4. Installation
  5. Building archives
  6. Burning to disc
  7. Mounting and unmounting archives
  8. Validating archives
  9. Restoring archives
  10. Adding hashes to other things
  11. The -l switch
  12. Windows XP portability problems
  13. Troubleshooting
  14. Support
  15. Changelog

What TA is

TA, the Transparent Archivist, is a front-end program that reduces the fuss and muss involved in building archives under Linux.  It is "transparent" in that the archives that it produces can be examined with ls and restored with cp.  You do not need TA to recover an archive built by TA.

TA is not a disc burning program.  TA builds disc images but does not burn them.  The value added by TA is:

  1. automatically determines where to break archives across disc boundaries;
  2. automatically stores and validates SHA-512 hashes for all files;
  3. simplifies the process of building disc images.

TA offers eight choices of file system for the archives, including three variants of ISO 9660, three variants of ext2, squashfs, and UDF.  However, UDF support is experimental.

Requirements

Not all packages are required in all modes (e.g., you don't need squashfs-tools unless you are making squashfs discs).

Package / program                  Version tested
Linux kernel                       3.7.2
bash                               4.2
GNU coreutils (cp and rm)          8.19
e2fsprogs (mke2fs and tune2fs)     1.42.6
file                               5.11
GNU find                           4.4.2
libdstr                            1.0
mhash (libmhash)                   0.9.9.9
zlib                               1.2.6
cdrtools (mkisofs)                 3.01a08 *
udftools (mkudffs)                 1.0.0b3
util-linux (losetup, mount)        2.21.2
squashfs-tools (mksquashfs)        4.2

* In certain Linux distributions, mkisofs is a symbolic link to, or simply replaced by, a program called genisoimage.  This is not cdrtools but a forked project known as cdrkit.  Genisoimage has issues and its use with TA is not supported.

To run TA you must have plenty of free disk space.  In addition to the room needed for the final images, iso, isoj, isorr and squashfs require room for a temporary copy of all of the files on a given disc.
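A quick pre-flight check of free space can save a failed run.  The following is only a sketch and is not part of TA; the function name has_free_space is made up for illustration, and it relies on nothing more than the POSIX output format of df -P -k.

```shell
# has_free_space DIR BYTES
# Succeeds if the file system holding DIR has at least BYTES available.
# Hypothetical helper, not part of the TA distribution.
has_free_space() (
  need_kib=$(( ($2 + 1023) / 1024 ))   # round the requirement up to whole KiB
  # df -P prints a header line, then one line whose 4th field is the
  # available space in 1024-byte blocks (because of -k).
  avail_kib=$(df -P -k "$1" | awk 'NR == 2 { print $4 }')
  [ "$avail_kib" -ge "$need_kib" ]
)
```

For example, has_free_space /some/workingdir 4700372992 && echo ok checks for roughly one single-layer DVD+R of room.  The figure 4700372992 is the nominal DVD+R capacity in bytes and is an assumption here; the exact size that TA uses for dvd+r may differ.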

To build ext2, le2a, le2f and udf archives you must have sufficient privileges to mount and unmount file systems.

Preserving ownership on archived files or archiving files with unfriendly permissions requires TA to be run as root.

Your kernel must support whatever file systems you are using.

Limitations

The limits stated below for the various file systems are not standard or theoretical limits, but real limits determined by testing under Linux 2.6.x.  Your mileage should not vary.

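Identifier:  isoj
  Description:  ISO-9660:1988 Level 1 extended by Joliet and deep directories
  File name encoding:  UTF-16*
  File name length limit:  206 B (103 char.)
  File size limit:  (2^32 − 2) B = 4294967294 B
  Date limits:  1970–2027
  File ownership, permissions, symbolic links?  No
  Compression?  No
  Problems reading under Windows XP:  Lower file size limit, dates messed up

Identifier:  iso
  Description:  ISO-9660:1999
  File name encoding:  8-bit agnostic
  File name length limit:  207 B
  File size limit:  (2^32 − 2) B = 4294967294 B
  Date limits:  1970–2027
  File ownership, permissions, symbolic links?  No
  Compression?  No
  Problems reading under Windows XP:  Charset mismatch, forbidden characters, lower file size limit, dates messed up

Identifier:  isorr
  Description:  iso plus Rock Ridge protocol
  File name encoding:  8-bit agnostic
  File name length limit:  248 B
  File size limit:  (2^32 − 2) B = 4294967294 B
  Date limits:  1970–2027
  File ownership, permissions, symbolic links?  Yes
  Compression?  No
  Problems reading under Windows XP:  Rock Ridge ignored, names truncated, charset mismatch, forbidden characters, lower file size limit, dates messed up

Identifier:  ext2
  Description:  Linux file system
  File name encoding:  8-bit agnostic
  File name length limit:  255 B
  File size limit:  > 9 GiB
  Date limits:  1902–2037
  File ownership, permissions, symbolic links?  Yes
  Compression?  No
  Problems reading under Windows XP:  No support

Identifier:  le2a
  Description:  ext2 plus 256-bit AES encryption, LUKS-compliant
  File name encoding:  8-bit agnostic
  File name length limit:  255 B
  File size limit:  > 9 GiB
  Date limits:  1902–2037
  File ownership, permissions, symbolic links?  Yes
  Compression?  No
  Problems reading under Windows XP:  No support

Identifier:  le2f
  Description:  ext2 plus 256-bit Twofish encryption, LUKS-compliant
  File name encoding:  8-bit agnostic
  File name length limit:  255 B
  File size limit:  > 9 GiB
  Date limits:  1902–2037
  File ownership, permissions, symbolic links?  Yes
  Compression?  No
  Problems reading under Windows XP:  No support

Identifier:  squashfs
  Description:  Compressed file system
  File name encoding:  8-bit agnostic
  File name length limit:  255 B
  File size limit:  > 9 GiB
  Date limits:  1902–2037
  File ownership, permissions, symbolic links?  Yes
  Compression?  gzip
  Problems reading under Windows XP:  No support

Identifier:  udf
  Description:  Universal Disk Format (EXPERIMENTAL)
  File name encoding:  UTF-8
  File name length limit:  254 B
  File size limit:  > 9 GiB
  Date limits:  1970–2037
  File ownership, permissions, symbolic links?  Yes
  Compression?  No
  Problems reading under Windows XP:  Unreliable, dates messed up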

* The following characters, which are allowed in Linux file names, are lost in translation to Joliet:  *:;?\

UDF support is experimental.  Kernel stability problems have been experienced while populating UDF file systems under Linux.

TA does not handle any file types other than regular files, directories, and symbolic links.

TA does not preserve empty directories.

TA does not preserve timestamps on directories.

TA does not handle files that are too big to fit on the target media in one piece.  TA does not split files and it does not reorder files when filling up discs.
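Since TA will not split a file, it can be worth looking for oversized files before starting.  The sketch below is not part of TA; find_oversized is a made-up name, and comparing against the raw disc capacity is only an approximation because file system overhead reduces the usable space further.

```shell
# find_oversized BYTES DIR
# Lists regular files under DIR that are strictly larger than BYTES.
# Hypothetical helper, not part of the TA distribution.
find_oversized() {
  # "-size +Nc" matches files of more than N bytes ("c" selects byte units).
  find "$2" -type f -size +"$1"c
}
```

For example, find_oversized 4700372992 /some/src lists files that cannot possibly fit on a single-layer DVD+R, assuming the nominal capacity of 4700372992 bytes.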

You may not archive a file in the root directory called ta-hashes.txt, because that is where ta stores the hashes.

If a file to be archived changes while TA is running, its hash will be wrong.  If a file to be archived is deleted while TA is running, TA will exit with an error.

Installation

TA is packaged with the popular and portable GNU automake, so all usual GNU tricks should work.  Help on configuration options can be found in the INSTALL file or obtained by entering ./configure --help.

Normally, one should only need to do the following to compile and install the programs ta, tahash, and taval:

bash-3.1$ ./configure
bash-3.1$ make
bash-3.1$ su
bash-3.1# make install

The distribution includes source for a program called createfile that is only useful for testing TA.  It is not normally built.  If for some strange reason you want to build it, configure with --enable-test-progs.

Building archives

Usage: ta [options] discsize filesystem workingdir imagedir src [src...]

Options:
-l      Tweak ext2/le2a/le2f to maximize usable space.
-nornd  Don't initialize encrypted volumes with random data.
-p      Force file permissions to reasonable defaults.
-r src  Replicate file src in *every* image.
-w      Wait for confirmation after completing each image.

discsize:  cd74, cd80, dvd+r, dvd+rdl, or an arbitrary size specified in bytes.
filesystem:  iso, isoj, isorr, ext2, le2a, le2f, squashfs or udf.
workingdir:  for ext2, le2a, le2f or udf this is just a mount point that we can
  use.  For others, this must be an existing, empty directory that we can fill
  up and then wipe clean.
imagedir:  disc images will be written here, overwriting any files that are in
  the way.  Make sure it is on a file system that can handle big files if you
  are creating DVD-sized images.
src:  stuff to archive.  Should usually be a directory, but you can do single
  files if you want.

The identifiers cd74 and cd80 refer to 74- and 80-minute CD-Rs or RWs.  The identifiers dvd+r and dvd+rdl refer to DVD+R/RW and DVD+R DL.  Standard capacities are not available for DVD-R/RW or DVD-R DL.

The -p option will set the permissions on directories and executable files to rwxr-xr-x and on non-executable files to rw-r--r--.  (Although the archive will be read-only, making files unwritable by owner creates more trouble than it is worth.)  For iso and isoj this option has no effect.

The -w option is useful if you have inadequate disk space to store all of the images being produced.  TA will wait for you to burn and delete the previous image before beginning the next one.

The -l option is useful only if you need a few more megabytes to fit a few large files onto a DVD.  See details below.

The -nornd option will speed up the creation of encrypted volumes for le2a and le2f filesystems at the cost of not obfuscating the location of encrypted data on a less-than-full volume.

The translation of src paths into paths within the image is done more or less the way that tar does it:  /mumble/foo (absolute) and mumble/foo (relative) both translate to mumble/foo in the image.  However, references to "." are removed from the final file names, and references to ".." are not allowed.
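The rule just described can be sketched as a small shell function.  This is an illustration of the stated behavior only, not TA's actual code; the name image_path is invented for the example.

```shell
# image_path SRC
# Prints the path that SRC would get inside the image, following the
# rule described above.  Illustrative only; not taken from TA's source.
image_path() (
  src=$1
  out=
  set -f        # no globbing while the path is split on "/"
  IFS=/
  for part in $src; do
    case $part in
      ''|.) ;;  # leading "/", doubled "/", and "." contribute nothing
      ..) echo "'..' is not allowed in source paths: $src" >&2; exit 1 ;;
      *) out=${out:+$out/}$part ;;
    esac
  done
  printf '%s\n' "$out"
)
```

With this sketch, image_path /mumble/foo and image_path ./mumble/foo both print mumble/foo, while image_path ../foo fails.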

TA leaves the disc images in the directory that you specify as imagedir.  Disc images are named image001.iso, image002.iso, and so forth.  Even non-iso images are called .iso because anything else can confuse disc-burning applications.

If you are building DVD images, imagedir must be on a file system that can support files larger than 4 GiB (i.e., not vfat).

It is not a good idea to do other work on the side while TA is archiving.  If you modify a file in TA's list, the hash will be wrong and the file will not validate.  If you delete a file in TA's list, TA will fail.

Burning to disc

You can use whatever disc burning application you like to burn the images to disc.  Following are sample commands that seem to work under Linux.  Your mileage may vary.

Target media    Burning command
CD-R            cdrecord -v dev=/dev/cdrom -dao image001.iso
CD-RW           cdrecord -v blank=fast dev=/dev/cdrom -dao image001.iso
DVD+R/RW/DL     growisofs -dvd-compat -Z /dev/dvd=image001.iso

(N.B., using the source of cdrtools-2.01, the sao and dao options to cdrecord are completely equivalent.)
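When an archive spans several discs, it can help to print the burning command for each image first and run them one at a time as discs are swapped.  The following is a dry-run sketch that only prints commands; print_burn_commands is a made-up name, and the growisofs form is the DVD sample command given above.

```shell
# print_burn_commands IMAGEDIR DEVICE
# Prints (does not run) one growisofs command per TA image in IMAGEDIR.
# Hypothetical helper, not part of the TA distribution.
print_burn_commands() {
  for image in "$1"/image*.iso; do
    [ -e "$image" ] || continue   # the glob matched nothing
    printf 'growisofs -dvd-compat -Z %s=%s\n' "$2" "$image"
  done
}
```

For example, print_burn_commands /some/imagedir /dev/dvd lists the commands in image number order, because the shell expands the glob in sorted order.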

Mounting and unmounting archives

In theory, mounting an archive is a simple application of the standard mount command, but there are enough special cases to warrant the following quick reference guide.

iso:
  mount -t iso9660 -o ro,map=o /dev/cdrom /mnt
isoj, isorr:
  mount -t iso9660 -o ro /dev/cdrom /mnt
  (or -o ro,utf8 to read a Joliet disc in a UTF-8 locale)
ext2:
  mount -t ext2 -o ro /dev/cdrom /mnt
udf:
  mount -t udf -o ro /dev/cdrom /mnt
  (or -o ro,iocharset=iso8859-1 to read a UDF disc in a Latin-1 locale)
squashfs:
  mount -t squashfs -o ro /dev/cdrom /mnt
le2a, le2f:
  MAPNAME=`date +%N` # Pick a unique map name
  LOOPDEV=`losetup -f --show /dev/cdrom`
  cryptsetup luksOpen $LOOPDEV $MAPNAME
  mount -t ext2 -o ro /dev/mapper/$MAPNAME /mnt

Unmounting:

le2a, le2f:
  umount /mnt
  cryptsetup luksClose $MAPNAME
  losetup -d $LOOPDEV
Everything else:
  umount /mnt

Use of le2a and le2f archives can be simplified by adding the following to your ~/.bash_profile:

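function mountle2 {
  # Usage: mountle2 device-or-image
  MAPNAME=`date +%N` # Pick a unique map name
  # Attach the archive read-only to the first free loop device.
  LOOPDEV=`losetup -r -f --show "$1"` || return 1
  # Unlock the LUKS volume (prompts for the passphrase); detach on failure.
  if ! cryptsetup --readonly luksOpen "$LOOPDEV" "$MAPNAME"; then
    losetup -d "$LOOPDEV"
    return 1
  fi
  mount -t ext2 -o ro "/dev/mapper/$MAPNAME" /mnt
  echo "$1" "is now mounted on /mnt.  Exit this shell to unmount it."
  # Run an interactive subshell; cleanup happens when it exits.
  PS1="le2# " bash -i
  umount /mnt
  cryptsetup luksClose "$MAPNAME"
  losetup -d "$LOOPDEV"
}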

Then you need type only mountle2 /dev/dvd and the passphrase to mount an encrypted archive and exit to unmount it:

bash-3.1# mountle2 /dev/dvd 
Enter LUKS passphrase: 
key slot 0 unlocked.
Command successful.
/dev/dvd is now mounted on /mnt.  Exit this shell to unmount it.
le2# ls -l /mnt
total 177
-rwxr-xr-x 1 root root 178843 2008-08-08 16:46 ta
-rw-r--r-- 1 root root    134 2008-08-11 13:18 ta-hashes.txt
le2# exit
exit
bash-3.1#

Validating archives

To validate a disc, mount the disc someplace and do taval someplace.  Taval will check the hashes on all regular files.

mount -o ro,map=o /dev/cdrom /mnt
taval /mnt

For a second opinion on the validity of a given file, you can manually compare the contents of /ta-hashes.txt with the output of gpg --print-md sha512.
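If gpg is not at hand, sha512sum from GNU coreutils computes the same hash.  The sketch below is not part of TA, and the names sha512_of and hash_is_listed are invented.  It assumes only that ta-hashes.txt records each hash as one unbroken run of hexadecimal digits; the exact layout of that file is not described here, so the search ignores case and everything else on the line.

```shell
# sha512_of FILE
# Prints the SHA-512 hash of FILE as 128 lowercase hexadecimal digits.
sha512_of() {
  # Reading from stdin keeps the file name out of the output.
  sha512sum < "$1" | cut -d ' ' -f 1
}

# hash_is_listed FILE HASHFILE
# Succeeds if the hash of FILE appears anywhere in HASHFILE.
# Assumes HASHFILE stores hashes as contiguous hex digits (an assumption).
hash_is_listed() {
  grep -q -i -F "$(sha512_of "$1")" "$2"
}
```

For example, hash_is_listed /mnt/some/file /mnt/ta-hashes.txt && echo listed.  Note that this confirms only that the hash appears somewhere in the file, not that it is recorded against that particular file name.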

Taval only checks the contents of files that are listed in /ta-hashes.txt.  It does not ensure that the dates, permissions, or other metadata were correctly preserved, nor does it notice if other files were added.

Using MD5 hashes

As a convenience, taval can also validate an archive against a file of MD5 hashes that was produced by some other program, e.g., md5sum.  To validate an archive against MD5 hashes instead of ta-hashes.txt, use the -md5 switch of taval:  taval -md5 md5file dir.

Each line of the MD5 hashfile must be 32 hexadecimal digits (the MD5 hash), two spaces, and a filename:

d41d8cd98f00b204e9800998ecf8427e  null
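A file in this format can be produced with md5sum and find.  The following is a sketch, not part of TA; make_md5_list is a made-up name, and it assumes that the file names in the hashfile should be relative to the directory later given to taval, which is not confirmed here.

```shell
# make_md5_list DIR
# Prints "hash, two spaces, relative file name" for every regular file
# under DIR.  Hypothetical helper, not part of the TA distribution.
make_md5_list() (
  cd "$1" || exit 1
  # md5sum prints "<hash>  ./path"; strip the leading "./" from each
  # name, then sort by name for a stable listing.
  find . -type f -exec md5sum {} + | sed 's|  \./|  |' | sort -k 2
)
```

For example, make_md5_list /some/dir > /tmp/md5file followed by taval -md5 /tmp/md5file /some/dir.  Write the list somewhere outside the directory being hashed so that it does not end up listing itself.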

Restoring archives

Since the archives are completely transparent, you can go directly to the disc(s) and directories that you want if you are in a hurry to retrieve something specific.

mount -o ro,map=o /dev/cdrom /mnt
cd /mnt
ls

Otherwise, repeat for each disc:

mount -o ro,map=o /dev/cdrom /mnt
cp -a /mnt/* /
umount /mnt

If not running as root, you might have to change some permissions in order to get all of the files to copy in.  When done, delete the extraneous file /ta-hashes.txt.

Adding hashes to other things

In some cases, e.g. an FTP archive, it is handy to generate hashes without getting involved in making disc images.  You can do this with tahash dir, and the directory's contents can subsequently be validated with taval dir.

The -l switch

The -l switch causes TA to create ext2/le2a/le2f file systems with tweaked options to maximize usable space.  On a single-sided DVD+R, this reduces the overhead on a newly created, empty filesystem from 81176 KiB to a mere 500 KiB.  However, it limits the number of files that can be placed on a single disc to around 1100 (1152 inodes), eliminates extended attributes, and may slow down file access (no B-tree indices).

The command used is mke2fs -m 0 -N 1 -O none,sparse_super,filetype -I 128 ...  As usual, no space is reserved for super-user, and the lost+found directory is removed.
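To put the overhead figures for a single-sided DVD+R in perspective, the space gained is simple arithmetic on the two numbers quoted above.  The variable names are only for illustration.

```shell
# Overhead of an empty file system on a single-sided DVD+R, in KiB,
# as quoted in the text above.
default_kib=81176
tweaked_kib=500

saved_kib=$(( default_kib - tweaked_kib ))
echo "$saved_kib KiB"                # 80676 KiB
echo "$(( saved_kib / 1024 )) MiB"   # 78 MiB (rounded down)
```

So the tweak buys a little under 79 MiB on that medium, in exchange for the limits on file count and features described above.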

Windows XP portability problems

About character sets and portability

Character sets are irrelevant if you never venture beyond the plain old U.S. keyboard for naming your files.  However, if your file names have umlauts in them, then you'll need to pay attention to this.

"8-bit agnostic" file name encoding means that file names are recorded as strings of 8-bit characters with no translation.  Whether your ambient codeset is ISO 8859-1 (the Linux default), UTF-8, or whatever, that is what goes on the disc.  As long as you read the disc in the same context in which it was recorded, all file names should survive intact.  Alas, Windows XP interprets agnostic characters according to its own default code page, which unfortunately is usually cp437.  In theory, you should be able to say chcp 1252 and then access a disc encoded as ISO 8859-1 with no trouble.  In practice, that doesn't work.

One solution is to switch your Linux environment over to code page 437.  Yeah.  Right.  The better solution is to use isoj.

Joliet and UDF don't have the character set portability problem because they both specify unambiguous encodings for file names.  All characters translate correctly under Windows XP as long as the codeset assumed when writing the disc was correct.  TA determines your codeset from your locale.  If your codeset is supported, whatever you see in a directory listing under Linux is what you should see under Windows.

Joliet has only been tested with ISO 8859-1, but other ISO 8859 codesets should work.  UTF-8 is not supported for isoj because it is not supported by mkisofs.

UDF has only been tested with ISO 8859-1 and UTF-8, but other ISO 8859 codesets should work.

Codesets other than ISO 8859 and UTF-8 definitely will not work with isoj or udf because I cannot guess how to map the codeset names reported by nl_langinfo to the codeset names expected by mkisofs and mount.

Ironically, it is less simple to read Joliet and UDF file names correctly under Linux than it is under Windows because the mount program does not automatically detect your locale.  Joliet discs default to ISO 8859-1 translation and UDF discs default to UTF-8.  To read a Joliet disc in a UTF-8 locale, use the utf8 mount option, e.g., mount -t iso9660 -o ro,utf8 /dev/cdrom /mnt.  To read a UDF disc in an ISO 8859-1 locale, use the iocharset=iso8859-1 mount option, e.g., mount -t udf -o ro,iocharset=iso8859-1 /dev/cdrom /mnt.
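The choice of mount options can be sketched as a function of the archive type and the current codeset.  This is an illustration of the rules given in this document, not something shipped with TA; mount_options is a made-up name, and the codeset strings are assumed to be the ones that locale charmap prints (e.g., UTF-8 or ISO-8859-1).

```shell
# mount_options FILESYSTEM CODESET
# Prints the -o argument suggested by this document for reading an
# archive of the given type in a locale with the given codeset.
# Hypothetical helper, not part of the TA distribution.
mount_options() {
  case "$1:$2" in
    iso:*)          echo "ro,map=o" ;;                 # avoid case-smashed names
    isoj:UTF-8)     echo "ro,utf8" ;;                  # Joliet defaults to ISO 8859-1
    udf:ISO-8859-1) echo "ro,iocharset=iso8859-1" ;;   # UDF defaults to UTF-8
    *)              echo "ro" ;;
  esac
}
```

For example, mount -t udf -o "$(mount_options udf "$(locale charmap)")" /dev/cdrom /mnt.  The le2a and le2f types still need the cryptsetup steps before anything can be mounted.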

Other portability issues

Lower file size limit:  For ISO 9660 discs, the file size limit under Windows XP is (2^32 − 2048) B = 4294965248 B.  Files larger than this produce an "Input/Output error" on attempt to open.

Dates messed up:  Dates in the archive are often wrong by an hour (apparently Daylight Saving Time run amok), and years before 1980 are not supported.

Forbidden characters:  Windows XP has different rules for what characters are legal in file names.  Files whose names contain an asterisk, backslash, or question mark appear to be inaccessible under Windows XP.  Files whose names contain a colon or semicolon are accessible from a Cygwin command line, but they cannot be opened in Windows Explorer.  If Joliet is used, forbidden characters are suppressed; consequently, the files are accessible under Windows but they do not validate because their names were changed.

Unreliable:  UDF support is experimental.  Windows XP sometimes has problems reading UDF discs, for reasons unknown.

Troubleshooting

Permission denied

Even when -p is used, TA can fail with "permission denied" if a source directory is not writable by owner.  This problem is a consequence of how GNU cp propagates permissions and is not efficiently fixable in TA.  The recommended workaround is to run TA as root.

Nero refuses to burn images created by TA

The following configuration changes suffice to adjust the attitude of Nero Express 6.

Under Configure→General, uncheck "Check for correct disc format before burning" and "Check Joliet file names before burning."

Under Configure→Misc, set "Burn DVD if the data compilation does not fit on CD" to Always.

For some reason, when Nero Express 6 is presented with a CD image that is precisely at the capacity of the disc, it claims that it is 0.01 second too long (79:57.74 instead of 79:57.73).  Since by default TA creates ext2, le2a, le2f and udf images to fill the entire disc, you may need to enable overburning to write these types of images, or else manually set the TA discsize slightly smaller.

Nero fails with "Parameter error" when trying to start burn

No solution found except to use different burning software.

Validation errors—files have incorrect names on disc

Cause #1:  The default Linux mount options are wrong for iso.  The disc is actually fine; you just need to mount it differently.  To prevent file names from being case-smashed, you must mount with the option map=o; e.g., mount -t iso9660 -o ro,map=o /dev/cdrom /mnt.  This problem only occurs with iso.

Cause #2:  If your working directory is on a vfat partition, using the wrong mount options will result in a corrupt disc.  To prevent short file names from being case-smashed, you must mount the vfat partition with the option shortname=winnt.

Cause #3:  Character set problems.  This is very likely if you are trying to read an isoj disc in a UTF-8 locale or a UDF disc in the default Linux locale.  To read a Joliet disc in a UTF-8 locale, use the utf8 mount option, e.g., mount -t iso9660 -o ro,utf8 /dev/cdrom /mnt.  To read a UDF disc in an ISO 8859-1 locale, use the iocharset=iso8859-1 mount option, e.g., mount -t udf -o ro,iocharset=iso8859-1 /dev/cdrom /mnt.  See About character sets and portability for more information.

Cause #4:  The following characters, which are allowed in Linux file names, are lost in translation to Joliet:  *:;?\  Any affected files will fail validation.
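Affected files can be found before building an isoj archive.  The sketch below is not part of TA, and find_joliet_unsafe is a made-up name; it simply looks for names containing any of the five characters listed under Cause #4.

```shell
# find_joliet_unsafe DIR
# Lists files and directories under DIR whose names contain any of
# the characters * : ; ? \
# Hypothetical helper, not part of the TA distribution.
find_joliet_unsafe() {
  # Inside the bracket expression, "\\" stands for one literal backslash.
  find "$1" -name '*[*:;?\\]*'
}
```

For example, find_joliet_unsafe /some/src prints nothing if every name will survive the translation to Joliet.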

How to mount an ext2aes disc

Versions of TA prior to version 1.4 supported a file system called ext2aes, which was ext2 plus 256-bit AES encryption via the now deprecated cryptoloop module of the Linux kernel.  ext2aes has no LUKS header to tell you what it is; you just have to know.

To mount an ext2aes disc, use this command:  mount -t ext2 -o ro,encryption=aes /dev/dvd /mnt.  It will prompt you for a password.  Given the correct one, the disc contents should then appear under /mnt.  When finished, just umount /mnt.

If your kernel does not include the deprecated cryptoloop module or you just want to do it the hard way, you can also mount an ext2aes disc using dm-crypt, as follows:

MAPNAME=`date +%N`                        # Pick a unique map name
LOOPDEV=`losetup -f --show /dev/dvd`
cryptsetup -c aes-cbc-plain -s 256 -h plain create $MAPNAME $LOOPDEV
mount -t ext2 -o ro /dev/mapper/$MAPNAME /mnt

To unmount:

umount /mnt
cryptsetup remove $MAPNAME
losetup -d $LOOPDEV

How to mount a ziso disc

Versions of TA prior to version 1.5 supported a file system called ziso, which was isorr plus a Linux-specific transparent decompression extension.  In version 1.5, ziso was replaced by squashfs.

ziso archives can be mounted and unmounted using the same commands as isorr.  However, to read ziso archives with transparent decompression, you must have a Linux kernel that was compiled with support for transparent decompression.  As of kernel.org kernel 2.6.27.31, the relevant option appears in make menuconfig as File systems → CD-ROM/DVD Filesystems → ISO 9660 CDROM file system support → Transparent decompression extension.

If kernel support is lacking, the content can be non-transparently decompressed using the mkzftree program included in the zisofs-tools package (with the -u option for uncompress).

-p fails at "setperm"

You must install the setperm script (provided in the TA distribution) somewhere in your path.  Setperm is a shell script that automagically decides what permissions a file should get.

Hard disk thrashing

If multiple physical hard drives are available, the best solution by far is to locate the source files, workingdir, and imagedir in such a way that all copy operations are going from one drive to another instead of from one place to another place on the same drive.  For ext2, le2a, le2f or udf, this means that imagedir should be on a different physical device than the source files.  For all other file systems, this means that workingdir should be on a different physical device than the source files, and imagedir should be on a different physical device than workingdir (but it can be on the same device as the source files).
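A partial check of such a layout is possible with GNU stat.  The sketch below is not part of TA, and same_filesystem is a made-up name.  It can only prove that two paths share a file system, and therefore a drive; differing device numbers do not prove different physical drives, since two partitions of one disk also differ.

```shell
# same_filesystem PATH1 PATH2
# Succeeds if both paths live on the same file system.
# Hypothetical helper, not part of the TA distribution.
same_filesystem() {
  # %d is the device number of the file system holding the path (GNU stat).
  [ "$(stat -c %d "$1")" = "$(stat -c %d "$2")" ]
}
```

For example, same_filesystem /some/src /some/imagedir && echo "same drive: expect thrashing".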

The efficiency of same-drive copies depends heavily on the behavior of the I/O scheduler.  As of version 2.6.17.4 of the Linux kernel, three different I/O schedulers are available.  Disk thrashing has been a problem with the one called Anticipatory I/O scheduler.  If you can switch to the CFQ I/O scheduler, you should do so.  If not, the script stop-thrashing.sh, provided in the TA distribution and shown below, will tune the anticipatory scheduler in favor of large file copies to correct the thrashing.  The changes that it makes are temporary and will revert after a reboot.

#!/bin/bash
#
# EDIT THIS SCRIPT:
# Replace sda with the name of your hard disk device (e.g., hda).
#
# Documentation about the tuning parameters is in
# /usr/src/linux/Documentation/block/as-iosched.txt.
echo 1000 > /sys/block/sda/queue/iosched/read_expire
echo 1000 > /sys/block/sda/queue/iosched/write_expire
echo 1000 > /sys/block/sda/queue/iosched/read_batch_expire
echo 1000 > /sys/block/sda/queue/iosched/write_batch_expire
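To see which I/O scheduler a drive is using, read the kernel's scheduler file in sysfs, which lists the available schedulers with the active one in square brackets.  The helper below is a sketch, not part of TA; active_scheduler is a made-up name, and sda is a placeholder for your disk device as in the script above.

```shell
# active_scheduler FILE
# Prints the scheduler named in [brackets] in FILE, which is normally
# /sys/block/<device>/queue/scheduler.
# Hypothetical helper, not part of the TA distribution.
active_scheduler() {
  sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}
```

For example, active_scheduler /sys/block/sda/queue/scheduler might print anticipatory.  On kernels that offer CFQ, running echo cfq > /sys/block/sda/queue/scheduler as root switches to it until the next reboot.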

It is not uncommon for hard disk drives in PC-type systems to enter invalid states if they are thrashed for several minutes.  The system locks up and the hard disk either gets stuck in a loop or takes itself offline.  If this happens, you can recover by pressing the reset button or cycling power.


Support

Any questions, problems, or bug reports for TA should be directed to dave@flaterco.com.


Changelog

Version 1.8, 2013-02-07:

Version 1.7.2, 2011-09-03:

Version 1.7.1, 2011-04-10:

Version 1.7, 2011-03-04:

Version 1.6.1, 2010-12-24:

Version 1.6, 2010-12-15:

Version 1.5.1, 2010-03-29:

Version 1.5, 2010-03-03:

Version 1.4, 2008-08-11:

Version 1.3.1, 2008-03-06:

Version 1.3, 2008-02-29:

Version 1.2.2, 2008-01-25:

Version 1.2.1, 2006-08-25:

Version 1.2, 2006-08-23:

Documentation rev. 2006-07-23:  Noted UDF troubles with XP.

Documentation rev. 2006-07-22:  Updated troubleshooting info for disk thrashing.  Added -v to CD burning command.

Version 1.1.2, 2006-07-04:

Version 1.1.1, 2006-07-04:

Version 1.1, 2006-07-03:

Documentation rev. 2006-05-27:  Added example command for burning DVD image.  Removed statement about Nero.

Version 1.0, 2006-01-02

