Fixity checking
Transmission fixity
Islandora Workbench supports transmission fixity validation, which means it can detect when a file is not ingested into Islandora intact, in other words, when the file became corrupted during ingest. It does this by generating a checksum (a.k.a. "hash") for each file before ingesting it. After the file is ingested, Workbench asks Drupal for a checksum on the file generated using the same hash algorithm. If the two checksums are identical, Workbench has confirmed that the file was not corrupted during the ingest process. If they are not identical, the file became corrupted.
This functionality is available within create and add_media tasks. Only files named in the file CSV column are checked.
To enable this feature, include the fixity_algorithm option in your create or add_media configuration file, specifying one of the "md5", "sha1", or "sha256" hash algorithms. For example, to use the "md5" algorithm, include the following in your config file:
fixity_algorithm: md5
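The before-and-after comparison described above can be sketched in a few lines of Python. This is an illustration of the general technique, not Workbench's actual code: the names file_checksum and checksums_match are hypothetical, and the second checksum stands in for whatever value Drupal reports after ingest.

```python
import hashlib

# The three algorithms accepted by the fixity_algorithm setting.
SUPPORTED_ALGORITHMS = ("md5", "sha1", "sha256")


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of the file at `path` (hypothetical helper)."""
    if algorithm not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"Unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in fixed-size chunks so a large file is never held
        # in memory all at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums_match(local_checksum, remote_checksum):
    """Compare two hex digests, ignoring case and surrounding whitespace."""
    return local_checksum.strip().lower() == remote_checksum.strip().lower()
```

In this sketch, a caller would compute file_checksum() before ingest and pass that value, along with the checksum obtained after ingest, to checksums_match(). Both values must come from the same algorithm for the comparison to be meaningful.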
Comparing checksums to known values
Comparison to known checksum values can be done both during the transmission fixity check, above, and during Workbench's --check phase, as described below.
If you want Workbench to compare the checksum it generates for a file to a known checksum value (for example, one generated by a platform you are migrating from, or during some other phase of your migration workflow), include a checksum column in your create or add_media CSV input file. No further configuration other than indicating the fixity_algorithm as described above is necessary. If the checksum column is present, Workbench will compare the hash it generates with the value in that column and report matches and mismatches. Note that the checksum in your CSV must have been generated using the same algorithm specified in your fixity_algorithm configuration setting.
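If you do not already have checksums from another system, one way to produce the checksum column is a small script run before Workbench. The following is only a sketch: add_checksum_column is a hypothetical helper, not part of Workbench, and it assumes every non-empty value in the file column is a local path readable from where the script runs.

```python
import csv
import hashlib


def add_checksum_column(input_csv, output_csv, algorithm="md5"):
    """Copy input_csv to output_csv, filling a 'checksum' column from the
    file named in each row's 'file' column (hypothetical helper)."""
    with open(input_csv, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames)
        if "checksum" not in fieldnames:
            fieldnames.append("checksum")
        rows = list(reader)

    for row in rows:
        if not row["file"]:
            # Rows without a file get an empty checksum.
            row["checksum"] = ""
            continue
        digest = hashlib.new(algorithm)
        with open(row["file"], "rb") as f:
            # Hash in chunks to keep memory use flat for large files.
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                digest.update(chunk)
        row["checksum"] = digest.hexdigest()

    with open(output_csv, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

The algorithm argument passed here should be the same one named in fixity_algorithm, since checksums generated with a different algorithm will never match.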
Validating checksums during --check
If you have pregenerated checksum values for your files (as described in the "Comparing checksums to known values" section, above), you can tell Workbench to compare those checksums with the ones it generates during its --check phase. To do this, include the following options in your create or add_media configuration file:
fixity_algorithm: md5
validate_fixity_during_check: true
You must also include both a file and checksum column in your input CSV, and ensure that the checksums in the CSV column were generated using the algorithm named in the fixity_algorithm setting. Results of the checks are written to the log file.
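To illustrate the kind of purely local comparison this involves, here is a minimal Python sketch. It is not Workbench's implementation: validate_checksums is a hypothetical function, the log messages are invented for the example, and the file column is assumed to hold local paths.

```python
import csv
import hashlib
import logging


def validate_checksums(csv_path, algorithm="md5"):
    """Return (file, expected, actual) tuples for every mismatch found
    in the CSV at csv_path (hypothetical helper)."""
    mismatches = []
    with open(csv_path, newline="", encoding="utf-8") as src:
        for row in csv.DictReader(src):
            expected = row["checksum"].strip().lower()
            digest = hashlib.new(algorithm)
            with open(row["file"], "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    digest.update(chunk)
            actual = digest.hexdigest()
            if actual == expected:
                # Matches are logged as well as mismatches.
                logging.info("Checksum match for %s", row["file"])
            else:
                logging.warning(
                    "Checksum mismatch for %s: expected %s, got %s",
                    row["file"], expected, actual,
                )
                mismatches.append((row["file"], expected, actual))
    return mismatches
```

An empty return value means every file matched its pregenerated checksum. Any entries in the list identify files worth replacing before an ingest is attempted.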
Some things to note:
- Fixity checking is currently only available for files named in the file CSV column, and not for files in any "additional files" columns.
- For the purposes of fixity verification, md5 is sufficient. Using it is also faster than either sha1 or sha256. However, you will need to use sha1 or sha256 if your pregenerated checksums were created using those algorithms.
- If you are including pregenerated checksum values in your CSV file (in the checksum column), the checksums must have been generated using the same hash algorithm indicated in your fixity_algorithm configuration setting: "md5", "sha1", or "sha256". If the existing checksums were generated using a different algorithm, all of your checksum comparisons will fail.
- Workbench logs the outcome of all checksum comparisons, whether they result in matches or mismatches.
- If there is a mismatch, Workbench will continue to ingest the file and create the accompanying media. For this reason, it is prudent to perform your checksum validation during the --check phase. If any comparisons fail, you have an opportunity to replace the files before committing to ingesting them into Drupal.
- Validation during --check happens entirely on the computer running Workbench. During --check, Workbench does not query Drupal for the purposes of checksum validation, since the files haven't yet been ingested into Islandora at that point.
- Fixity checking slows Workbench down (and also Drupal if you perform transmission fixity checks) to a certain extent, especially when files are large. This is unavoidable since calculating a file's checksum requires reading the entire file.