LDEO/Columbia
Media-prep is a Perl script that facilitates the creation of multiple volumes of data from a single data store, in preparation for writing that data to some media (CDs, DVDs, DAT tapes, etc.).
Suppose you have a data archive of 50GB of various file types and sizes. Maybe this is stored on some distant network-attached-storage appliance. You need to archive those 50GB of data to DVDs. How do you do it? First, you need to get the data off the network-attached-storage device onto a system which can directly interface with a DVD burner. Second, you need to somehow split your 50GB of data into roughly 4.5GB chunks that can be written to DVD. When you look around, you'll find it difficult to locate DVD burning software that can take a single source directory and span the resulting files over multiple DVDs. (I don't know why this doesn't exist yet, but I am unaware of anything that does.) You might also want to create an index for each volume and a master index for all the volumes.
Enter media-prep.
media-prep's config file (media-prep.conf by default) explains a lot so we should start there. media-prep.conf has two main sections.
The first section allows one to specify details regarding if, and how, media-prep should mount the source data directory on another computer. Media-prep looks for the source directory and, if it can't find it, uses the items specified in this section to attempt to mount the distant source directory for you. Here you specify things like the source computer name, how you want to mount the source directory (smb or nfs), the mount point, etc.
[NOTE: This section was created so that non-unix types need not worry about the ugly details of how to nfs or smb mount a distant file system, and for times when the mount points occasionally break and no one's watching. media-prep takes care of it for you. However, if your data is all on the local file system already (maybe you've already mounted the distant file system), media-prep will detect it and this first section is effectively ignored.]
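The "look first, mount only if needed" decision described above can be sketched roughly as follows. This is an illustration in Python, not media-prep's own Perl code. The function names and parameters are hypothetical, and the mount commands are the usual Linux forms (nfs, and cifs for smb shares); other systems such as Mac OS X mount things differently. The sketch only builds the command and never runs it.

```python
import os


def mount_command(method, host, remote_path, mount_point):
    """Build (but do not run) a Linux mount command for the given method."""
    if method == "nfs":
        return ["mount", "-t", "nfs", f"{host}:{remote_path}", mount_point]
    if method == "smb":
        # Assumption: smb shares are mounted with the cifs file system type.
        share = remote_path.strip("/")
        return ["mount", "-t", "cifs", f"//{host}/{share}", mount_point]
    raise ValueError(f"unsupported mount method: {method}")


def plan_mount(source_dir, method, host, remote_path, mount_point):
    """Return None if the source directory is already there, else a mount command."""
    if os.path.isdir(source_dir):
        # Source is visible locally, so the mount settings are ignored.
        return None
    return mount_command(method, host, remote_path, mount_point)
```

If the source directory already exists, plan_mount returns None and nothing needs to be mounted, which mirrors the "effectively ignored" behavior in the note above.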
The second section specifies details about the transfer itself. For example, one specifies a parent target directory where you want the data to go, an archive ID into which the resulting volumes will be created and your media volume capacity. Media-prep uses the "rsync" utility to conduct the file transfers. Several arguments are optionally passed through media-prep directly to the rsync commands, including "--dry-run" which allows one to test media-prep and allow rsync to tell you what it would have transferred.
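To make the rsync step a little more concrete, here is a rough Python sketch that assembles a per-volume rsync command with an optional "--dry-run". The flags are copied from the rsync command in the sample log further down; the function name and its parameters are hypothetical, and media-prep itself does this in Perl, quite possibly differently.

```python
import shlex


def build_rsync_command(index_file, source_dir, target_dir, dry_run=False):
    """Return the rsync argument list for transferring one volume."""
    cmd = [
        "rsync", "-av", "--delete", "--force", "--ignore-errors",
        f"--include-from={index_file}",
        "--delete-excluded", "--include=*/", "--exclude=*",
        "--stats",
    ]
    if dry_run:
        # rsync reports what it would transfer without copying anything.
        cmd.append("--dry-run")
    # Trailing slashes: copy the contents of the source into the target.
    cmd += [source_dir.rstrip("/") + "/", target_dir.rstrip("/") + "/"]
    return cmd


if __name__ == "__main__":
    example = build_rsync_command(
        "Photos-Archive_indexes/vol-1.index",
        "/usr/export/photos",
        "/usr/scratch/data-prep-tests/Photos-Archive/media-vol1",
        dry_run=True,
    )
    print(shlex.join(example))
```

Building the command as a list (rather than one long string) avoids quoting problems if a path contains spaces.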
media-prep also has several built-in indexing features. First, media-prep creates an index for each volume and puts it in the volume directory. This index contains each file name, its modification date, and optionally the result of a checksum on the file. Media-prep also creates a master index for all the volumes and puts a copy in each of the target volume directories. This way, if in 20 years someone is wading through the media volumes looking for something and can't find it in the vol1-index, they can quickly look in the master-index to see which volume it's in.
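As a rough illustration of the indexing idea (again in Python, not media-prep's actual Perl), the sketch below produces one line per file with its modification time, the file name, and optionally a checksum, plus a simple master listing of which volume holds which file. The function names, the use of MD5, and the master index layout are all assumptions; this README does not say which checksum or layout media-prep uses.

```python
import hashlib
import os
import time


def file_checksum(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks (MD5 is an assumed choice)."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def index_lines(source_dir, relative_paths, with_checksum=False):
    """One line per file: mtime, file name, and optionally a checksum."""
    lines = []
    for rel in relative_paths:
        full = os.path.join(source_dir, rel.lstrip("/"))
        stamp = time.ctime(os.path.getmtime(full))
        line = f"{stamp} {rel}"
        if with_checksum:
            line += f" {file_checksum(full)}"
        lines.append(line)
    return lines


def master_index_lines(volumes):
    """volumes maps a volume number to the list of files it holds."""
    return [f"vol-{n} {rel}" for n in sorted(volumes) for rel in volumes[n]]
```

A copy of the master listing would then be written into every volume directory, so any single disc can tell you where a file lives.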
Finally, media-prep goes through an auditing routine. It checks the target volumes against the parent source directory and tells you if any files are missing or have different sizes and which files those are.
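The audit amounts to comparing what is in the source against what landed in the volumes. A minimal Python sketch of that comparison, assuming each volume directory mirrors the source's relative paths (as the rsync transfer would produce), might look like this. The function names are hypothetical and this is not media-prep's own code.

```python
import os


def collect_sizes(root):
    """Map each file's path (relative to root) to its size in bytes."""
    sizes = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            sizes[os.path.relpath(full, root)] = os.path.getsize(full)
    return sizes


def audit(source_dir, volume_dirs):
    """Return (missing, mismatched) lists of source files, by relative path."""
    source = collect_sizes(source_dir)
    target = {}
    for vol in volume_dirs:
        target.update(collect_sizes(vol))
    missing = sorted(p for p in source if p not in target)
    mismatched = sorted(
        p for p in source if p in target and source[p] != target[p]
    )
    return missing, mismatched
```

Extra files that exist only in the volumes (index files, for instance) are ignored here, since the question is whether everything in the source made it across.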
Media-prep has been tested on Linux and Mac OS X (Darwin), using Perl 5.8.
The portion that's likely to fail on other systems is the auto-mounting feature, as different systems mount other file systems in various ways. You can likely get around this limitation by ensuring the source directory is mounted before running media-prep.
Perl modules required for media-prep are:
File::Basename
File::Find
File::Path
locale
File::Copy
which I think are all bundled with most recent Perl distributions.
All the user-defined configuration instructions are contained in comments in the media-prep configuration file: media-prep.conf. Remember, this file must be written in well-formed Perl. If you're not familiar with Perl, don't fret; have a look at the file, its syntax is quite clear.
There are two potentially large security issues with media-prep that are completely optional (and hence easily avoided). When mounting the source directory via smb, one may specify a username and password. These are specified in the config file which is not a very good place to have them hanging around. Similarly, one may optionally conduct rsync data transfers tunneling the connection over ssh rather than using the nfs or smb mounted share. This also requires one to specify a username and password and these again are specified in the config file.
I'll reiterate: one need neither let media-prep automatically mount the source file system using smb (one could use nfs, or mount the source file system manually) nor conduct the rsync transfer via ssh. In these cases, one can leave the username and password variables in the config file blank.
One may wonder why there is the option to transfer the data via ssh if you have to mount the distant file system anyway using nfs or smb. In creating media-prep it was not immediately clear (to me) whether rsync's on-the-fly compression feature might provide a large savings in data transfer time. One would not want to use this feature when conducting the transfer over NFS, as rsync would not know in that case that the data is not local, and would first read the data via NFS uncompressed only to compress it, uncompress it, and write it. However, when operating via ssh, an rsync client on the distant end performs the compression before sending the file. So conceivably one could transfer the data via ssh, specify on-the-fly compression, and gain some real savings in data transfer time. It turned out, however, that the time it takes to compress the data was far larger than the time required to transfer it uncompressed on all but the slowest of networks and the most compressible of data types. Therefore, the on-the-fly compression option was removed altogether. Still, at some point in the future, one would like to be able to execute media-prep entirely tunneled over ssh. Keeping the option to conduct the rsync data transfers via ssh leaves only the problem of how to collect the metadata about the source directory remotely. That is something I'll work on at some point.
With all the details specified in the config file, one need only execute it like so:
./media-prep.pl media-prep.conf
Once media-prep cranks up it will create a log file in the local directory entitled media-prep-<archiveID>.log, where archiveID is whatever has been specified in the config file. It is best to monitor this log file to see that things are behaving as you expect. Something like this will do the trick:
tail -f media-prep-archiveID.log
As an example, I've given things a run against my photos archive, with a media_capacity set in the config file of 700MB - a reasonable value if I were thinking to burn them all to CDs as a backup mechanism. This first portion of the log runs through the execution of the first rsync command for the first volume.
Starting media_prep.pl: Mon Oct 18 13:26:55 EDT 2004
Checking for rsync...Found: /usr/bin/rsync.
Looking for /usr/export/photos...Success
Gathering file system metadata...Done
File Count: 3450
Directory Count: 160
Total Size: 3035.67712497711 MB or 2.96452844236046 GB
To be split into 5 volumes of 700 MB (0.68359375 GB) each.
Found archive index directory Photos-Archive_indexes
Writing indexes for volumes and volume directories...
Writing >Photos-Archive_indexes/vol-1.index...
Volume 1: 734943260 bytes (700.896511077881 MB, 0.684469249099493 GB)
Writing >Photos-Archive_indexes/vol-2.index...
Volume 2: 734243110 bytes (700.228796005249 MB, 0.683817183598876 GB)
Writing >Photos-Archive_indexes/vol-3.index...
Volume 3: 737174384 bytes (703.024276733398 MB, 0.686547145247459 GB)
Writing >Photos-Archive_indexes/vol-4.index...
Volume 4: 736131316 bytes (702.029529571533 MB, 0.6855757124722 GB)
Writing >Photos-Archive_indexes/vol-5.index...
Volume 5: 240646107 bytes (229.49801158905 MB, 0.224119151942432 GB)
Creating destination volume directories if required...
Found /usr/scratch/data-prep-tests/Photos-Archive/media-vol1
Transferring Photos-Archive_indexes/vol-1.index....
Executing...
rsync -av --delete --force --ignore-errors \
--include-from="Photos-Archive_indexes/vol-1.index" \
--delete-excluded --include="*/" --exclude="*" \
--stats \
/usr/export/photos/ \
/usr/scratch/data-prep-tests/Photos-Archive/media-vol1/ 2>&1
Please wait...
Each rsync command executes in turn, listing the files it has transferred in the log file. These listings are not included in this example, as they're trivial and boring. At the end, however, the source and target audit occurs, as shown below:
Verifying target volumes...
Files: Total Size: Date Sum:
Source: 3450 3183138177 3675010495617
Target: 3450 3183138177 3675011689502
Starting Index file Creation: Mon Oct 18 13:39:09 EDT 2004
Finished Volume 1...
Finished Volume 2...
Finished Volume 3...
Finished Volume 4...
Finished Volume 5...
Finished media-prep: Mon Oct 18 13:39:10 EDT 2004
################################################################
################################################################
Here we see that the source and target directories do indeed have the same size and number of files. (Please disregard the "date sum" - this is experimental.)
Looking quickly at the results, we see the following in the target directory (in this case Photos-Archive/):
du -sh *
708M media-vol1
706M media-vol2
710M media-vol3
708M media-vol4
232M media-vol5
Note that the media-volumes are not exactly 700MB. media-prep doesn't do anything sophisticated when deciding when to rotate from one volume to the next. Rather, it rotates as soon as the cumulative sum is greater than the specified media-capacity. For this reason, the media-capacity should always be specified a bit smaller than the actual media capacity, to ensure that the final file in the volume doesn't inadvertently make the volume larger than will actually fit on the media.
Here's what the first few lines of the first media-volume index look like (without the checksumming feature):
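The rotation rule can be sketched in a few lines of Python. This is my reading of the behavior described above, not media-prep's actual Perl: the file that pushes the running total past the capacity stays in the current volume, and the next file starts a new one. The function name and the (path, size) input format are hypothetical.

```python
def split_into_volumes(files, capacity_bytes):
    """Group (path, size) pairs into volumes, rotating once capacity is exceeded."""
    volumes = []
    current, running = [], 0
    for path, size in files:
        current.append(path)
        running += size
        if running > capacity_bytes:
            # The file that crossed the limit stays in this volume,
            # which is why volumes come out slightly over capacity.
            volumes.append(current)
            current, running = [], 0
    if current:
        volumes.append(current)
    return volumes


if __name__ == "__main__":
    sample = [("a", 400), ("b", 400), ("c", 300), ("d", 100)]
    for number, volume in enumerate(split_into_volumes(sample, 700), start=1):
        print(number, volume)
```

This matches the log above: 700 MB is 734003200 bytes, and Volume 1 came out at 734943260 bytes, an overshoot of a little under 1 MB caused by the last file added.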
# Photos-Archive --- vol-1.index
#
# For each file in the this volume, this index provides the filename
# and it's last modification date-time stamp (mtime).
# DATE FILENAME
Thu Sep 28 16:56:18 2000 /Kokua/P8080001.JPG
Thu Sep 28 16:56:34 2000 /Kokua/P8080002.JPG
Thu Sep 28 16:56:54 2000 /Kokua/P8080003.JPG
Thu Sep 28 16:57:14 2000 /Kokua/P8080004.JPG
That's it. Best of luck!