5. How it works

ISiBackup is not magic. It is, more or less, a wrapper script for collection, compression and encryption tools. ISiBackup runs in four phases:

  1. Initialisation, configuration file and command line options interpretation

  2. Directory selection

  3. For each directory

    • File selection

    • Collection

    • Compression

    • Encryption

  4. Statistics

5.1. Initialisation

The initialisation sequence is approximately:

5.2. Directory selection

This takes place inside the do_backup function.

5.3. Directory Processing

Now the backup of the individual directories starts, using the steps of "file selection", "collection", "compression" and "encryption".

5.3.1. File selection

File selection determines which files to back up.

  • In case of a differential backup, only files that have changed since the last full backup need to be backed up. As file selection uses the find command as well, the parameter -newer {Date} is added, where {Date} is the date of the last full backup (which was determined during initialisation).

  • Resetting the statistic counters allows for a progress indication later.

  • Now the loop through the selected directories begins.

  • First, the current statistic values are fetched and progress is shown.

  • Source and target directory are stripped of double slashes and characters incompatible with the VFAT filesystem are converted (recodeVFATChars) in order to prevent later write errors [This is due to the fact that at IMSEC, early target directories were on VFAT filesystems]. The only character currently being replaced is the colon (:). The replacement is the sequence %1A.

  • The target directory is created. Before doing so, it is checked whether the name fits within the path length restriction. This limit is given as 220 in MAX_PATHLEN, but may be overridden inside the configuration files. Be aware that by placing the backup directory somewhere inside the tree, the path length grows by the number of characters in that path! The limit is not as far away as it may seem. Paths longer than this limit will just be skipped and will not be part of the backup (however, a warning will be displayed).
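The path handling described above can be sketched as follows. This is a minimal illustration, not the actual ISiBackup code: only recodeVFATChars and MAX_PATHLEN are names taken from the text, while strip_double_slashes and check_path_length are hypothetical helpers introduced here.

```shell
# Illustrative sketch only; strip_double_slashes and check_path_length
# are hypothetical names, not functions known to exist in ISiBackup.
MAX_PATHLEN=${MAX_PATHLEN:-220}

# Collapse runs of slashes into a single slash.
strip_double_slashes() {
    printf '%s\n' "$1" | tr -s '/'
}

# Replace every colon (the only character currently recoded) by %1A.
recodeVFATChars() {
    printf '%s\n' "$1" | sed 's/:/%1A/g'
}

# Succeed if the path fits into MAX_PATHLEN; otherwise warn and fail,
# so that the caller can skip this directory.
check_path_length() {
    if [ "${#1}" -gt "$MAX_PATHLEN" ]; then
        echo "WARNING: path too long, skipping: $1" >&2
        return 1
    fi
}
```

A caller would typically normalise the path first, recode it, and create the target directory only if check_path_length succeeds.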

5.3.2. Collection

  • The function createFileList is used to determine the files to be backed up inside this directory.

  • In case the list of include file patterns (from include_files.lst) is empty, all files in the directory will be found; otherwise a loop through all the include patterns makes sure that just those files are selected that match. In any case, the "-newer" restriction from above applies in case of a differential backup.

  • The next step is to exclude all files from the file exclusion list in exclude_files.lst. If no pattern was defined, no exclusion takes place, i.e. the file selection remains the same.

  • If the file selection renders an empty list (no inclusion, or too many exclusions, or no newer files in case of a differential backup), the actual process of backing up the files is skipped, and an empty target directory remains (it will be deleted later).

  • The normal mode of operation is to pack the files inside the directory into one single archive, which is called "collect". The program tar is a typical collector.

    But there is a second operating mode called "separate". When "collect" runs into size limitations (imagine a directory with two or more files of 1.5 GB each), the mode is switched from "collect" to handling on a per-file basis (mode "separate"), that is, each file is put into its own archive instead of being collected with the others in this directory. So in "collect" we end up with one file per directory, whereas in "separate" we get as many files as there were before. The definition of "size limitations" is given in the constant MAX_FSFILESIZE, which defaults to 2 GB, but can be overridden in either of the two configuration files.

    To check if we run into size trouble, we first need to count the files and sum their sizes. But let's look into that step by step.

  • As explained above, the DIRLIST_ENTRY is used to back up the directory information. This needs space as well and goes into the size calculations with DEF_PACKEDBLOCKSIZE per directory (DEF_PACKEDBLOCKSIZE defaults to 1).

  • The actual size calculation is done in the function countFiles. Here we also check whether there are files among them which the selected packer cannot pack (e.g. zoo cannot handle symbolic links). To do so, we filter the file list again by file type and skip the files that cannot be handled (a warning is issued).

    The file size calculations rely on the output of stat. In order not to stress the internal bash arithmetic too much, the size is rounded to the kilobyte. In addition, integrity checks for arithmetic overflow have been added, and detecting such an error no longer breaks the backup process; it just sets the maximum size requirement for the target directory (which is MAX_FSFILESIZE).

  • So if the target path length is short enough (see the restriction further up), the calculated size determines whether we run in "collect" or "separate" mode.

  • Special attention is needed when we back up the root directory. Normally, the name of the individual archive is the same as that of the directory that has been backed up inside it (except for VFAT character conversions). In the case of the root directory, there is no name, so we call that one "rootdir". This will lead to trouble if there actually happens to be a directory called "rootdir", but this is an uncommon case that has not been handled yet.

  • The next step is to loop over all the output files. This is kind of a trick, as in "collect" mode there is just one output file, named like the directory. Only in "separate" mode are there several files.

  • The target file name gets recoded to prevent any characters not compatible with the VFAT filesystem.

  • Warning

    The available space on the target is checked. The conservative assumption is that when there is less space left on the target than the total size of the (uncompressed) directory, space is too tight, and the backup aborts immediately. There is no way to override that, but it has proven to be a good indicator of a nearly full backup drive, and hence this is a good thing as it explicitly requires measures to be taken.

  • Next, the target file extension is determined. This follows the usual convention that each processing tool adds its extension to the file (e.g. using a sequence of cpio and bzip2 results in the target file having the extension of .cpio.bz2).

  • There is another restriction that can lead to an abort of the ISiBackup script: Normally, file operations are executed neither on the source drive nor on the target drive, but in a temporary directory that defaults to /var/tmp. As this temporary directory may need to hold the "collected" file (e.g. collected by cpio) as well as the compressed file (e.g. by bzip2), and as compression may not reduce the size, the temporary directory must have 220% of the total size of the directory contents available (100% of the added source file sizes for collection, up to 100% for the compressed file and a 20% reserve). In case that fails, it is checked whether that much space is available in the target directory, in which case file operations are done there. If neither has that much space, ISiBackup aborts. While doing file operations on the target drive works, it may be very slow (e.g. on a network-mounted target directory, or on a removable-media target directory).

  • Next, the name of the encrypted target file is determined. If encryption is enabled and set to gpg, for example, then the extension .gpg will be added to the resulting file name.

  • As creating a collected, compressed, and encrypted file can take some time, that time can be saved in case there already is a target file with the same content. This is called "in-place refresh of a previously created backup". But how can one determine if the contents of an encrypted file are the same as the ones in the source directory? One can't, so we need another trick here. When backing up files, ISiBackup sets the file date to the date of the newest file included in the backup (e.g. using the zip -o option). If none of the files in the source is newer than the date of the pre-existing file, backing up the directory can be skipped, as the target still holds a complete backup file. If there is a newer file, and if the file was not encrypted, then some of the compression programs offer an update option (such as zip -u) which can be used to further decrease the time needed for the backup. As a result, we have a "skip", an "update" and a "create" mode for the backup file.
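The three decisions described in the list above (collect or separate, where to work, and skip, update or create) can be sketched as small shell helpers. This is a simplified illustration under stated assumptions: all function names are hypothetical, sizes are assumed to be in kilobytes, and the exact comparison at the MAX_FSFILESIZE boundary is a guess. Only MAX_FSFILESIZE and the 220% rule come from the text.

```shell
# Illustrative sketch only; all function names below are hypothetical.
# Sizes are assumed to be in kilobytes, as the text says sizes are rounded to KB.
MAX_FSFILESIZE=${MAX_FSFILESIZE:-2097152}   # 2 GB expressed in KB

# choose_mode TOTAL_KB: one archive per directory if it stays below the limit.
choose_mode() {
    if [ "$1" -lt "$MAX_FSFILESIZE" ]; then
        echo collect
    else
        echo separate
    fi
}

# required_work_kb TOTAL_KB: 100% collected + up to 100% compressed + 20% reserve.
required_work_kb() {
    echo $(( $1 * 220 / 100 ))
}

# choose_workdir TOTAL_KB TMP_FREE_KB TARGET_FREE_KB: prefer the temporary
# directory, fall back to the target, otherwise give up.
choose_workdir() {
    need=$(required_work_kb "$1")
    if [ "$2" -ge "$need" ]; then
        echo tmp
    elif [ "$3" -ge "$need" ]; then
        echo target
    else
        echo abort
    fi
}

# refresh_mode SRC_DIR TARGET_FILE ENCRYPTED CAN_UPDATE
# ENCRYPTED and CAN_UPDATE are "yes" or "no".
refresh_mode() {
    src=$1
    target=$2
    encrypted=$3
    can_update=$4
    if [ ! -e "$target" ]; then
        echo create                       # no previous backup file
    elif [ -z "$(find "$src" -type f -newer "$target" | head -n 1)" ]; then
        echo skip                         # nothing newer than the backup file
    elif [ "$encrypted" = no ] && [ "$can_update" = yes ]; then
        echo update                       # e.g. zip -u on an unencrypted archive
    else
        echo create
    fi
}
```

Keeping the decisions free of side effects makes them easy to check in isolation; the real script presumably obtains the free-space figures from the file system before making these choices.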

5.3.3. Compression

  • The next step is to actually produce the collected file (function createPackedArchive). Note that some programs are pure "collectors", such as tar and cpio, others are pure packers, such as gzip and bzip2, and yet others integrate both functions so tightly that they can hardly be separated, such as zip. Here, collection and compression are done in two integrated steps, resulting in a collected, compressed archive. Additionally, the size of the input directory is recorded for statistics.
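As an illustration of combining a collector with a packer, the following sketch pipes tar into gzip and names the result by appending each tool's extension. It is not the real createPackedArchive: the function names create_packed_archive and dir_size_kb are hypothetical, tar and gzip are chosen only as widely available examples, and streaming through a pipe (instead of writing an intermediate collected file) is a simplification.

```shell
# Illustrative sketch only; create_packed_archive and dir_size_kb are
# hypothetical names, and tar/gzip merely stand in for the configured tools.

# dir_size_kb DIR: size of the input directory in kilobytes, for statistics.
dir_size_kb() {
    du -sk "$1" | cut -f1
}

# create_packed_archive SRC_DIR WORK_DIR NAME
# Prints the path of the resulting archive.
create_packed_archive() {
    src=$1
    work=$2
    name=$3
    out="$work/$name.tar.gz"     # each tool appends its own extension
    # Collect with tar and compress the stream with gzip.
    # Note: the exit status checked here is that of gzip, the last command.
    tar -cf - -C "$src" . | gzip -c > "$out" || return 1
    printf '%s\n' "$out"
}
```

With cpio and bzip2 configured instead, the same scheme would yield a file ending in .cpio.bz2, as described in the file extension step.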

5.3.4. Encryption

  • The collected, compressed file is then encrypted (function createCryptedFile). Encryption is done to the configured backup key. Regardless of whether encryption is enabled or disabled, the size of the resulting output file is recorded for statistics.

  • If file operations were not on the target directory, the file is transferred there from the temporary working directory.

  • This concludes the actual backup process, which is repeated for each output file, and for each input directory. The rest is outputting the statistical information, writing the termination messages to the log and to stdout, and cleaning up the various temporary files that were used. Also, a time stamp is used to formally record the date of the backup; this is needed later for differential backups.
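The interplay between the recorded time stamp and the "-newer" restriction of a differential backup can be sketched as follows. This assumes that the time stamp is kept as a file whose modification time marks the last full backup, since find -newer compares against a reference file. The function names and the use of -maxdepth 1 (one directory at a time) are assumptions made for this illustration.

```shell
# Illustrative sketch only; both function names are hypothetical, and storing
# the time stamp as a file's modification time is an assumption.

# record_full_backup STAMP_FILE: remember when the full backup ran.
record_full_backup() {
    touch "$1"
}

# select_changed_files DIR STAMP_FILE: list regular files directly inside DIR
# that were modified after the time stamp.
select_changed_files() {
    find "$1" -maxdepth 1 -type f -newer "$2"
}
```

A differential run would then pass only the files printed by select_changed_files on to the collection step.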