Fixity checking

Transmission fixity

Islandora Workbench can perform transmission fixity validation: it detects when a file is not ingested into Islandora intact, in other words, when the file became corrupted during ingest. Workbench generates a checksum (a.k.a. "hash") for each file before ingesting it. After the file is ingested, Workbench asks Drupal for a checksum of the file generated using the same hash algorithm. If the two checksums are identical, Workbench has confirmed that the file was not corrupted during the ingest process. If they are not identical, the file became corrupted.
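The comparison described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Workbench's actual implementation: file_checksum and checksums_match are hypothetical helper names, and the remote checksum is assumed to already be available as a hexadecimal string, however it was obtained from Drupal.

```python
import hashlib

# The three algorithms named in this documentation.
SUPPORTED_ALGORITHMS = ("md5", "sha1", "sha256")


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of the file at `path` (hypothetical helper)."""
    if algorithm not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"Unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in fixed-size chunks to keep memory use bounded.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums_match(local_checksum, remote_checksum):
    """Compare two hex digests, ignoring case and surrounding whitespace."""
    return local_checksum.strip().lower() == remote_checksum.strip().lower()
```

In this sketch, the local checksum would be computed before ingest and passed to checksums_match together with the value reported after ingest. A False result corresponds to the "file became corrupted" case.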

This functionality is available within create and add_media tasks. Only files named in the file CSV column are checked.

To enable this feature, include the fixity_algorithm option in your create or add_media configuration file, set to one of the "md5", "sha1", or "sha256" hash algorithms. For example, to use the "md5" algorithm, include the following in your config file:

fixity_algorithm: md5

Comparing checksums to known values

Comparison to known checksum values can be done both during the transmission fixity check, above, and during Workbench's --check phase, as described below.

If you want Workbench to compare the checksum it generates for a file to a known checksum value (for example, one generated by a platform you are migrating from, or during some other phase of your migration workflow), include a checksum column in your create or add_media CSV input file. No further configuration other than indicating the fixity_algorithm as described above is necessary. If the checksum column is present, Workbench will compare the hash it generates with the value in that column and report matches and mismatches. Note that the checksum in your CSV must have been generated using the same algorithm specified in your fixity_algorithm configuration setting.
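If your source platform did not export checksums, one way to produce a checksum column is to compute the values yourself before running Workbench. The following is a rough sketch, not part of Workbench: add_checksum_column is a hypothetical helper, and it assumes that the values in the file column are paths relative to base_dir. Rows with an empty file value get an empty checksum.

```python
import csv
import hashlib
import os


def add_checksum_column(input_csv, output_csv, algorithm="md5", base_dir="."):
    """Copy input_csv to output_csv, adding a 'checksum' column (hypothetical helper)."""
    with open(input_csv, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        if "checksum" not in fieldnames:
            fieldnames.append("checksum")
        rows = list(reader)

    for row in rows:
        file_value = (row.get("file") or "").strip()
        if not file_value:
            # Nothing to hash for rows without a file.
            row["checksum"] = ""
            continue
        digest = hashlib.new(algorithm)
        with open(os.path.join(base_dir, file_value), "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                digest.update(chunk)
        row["checksum"] = digest.hexdigest()

    with open(output_csv, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

The algorithm argument passed here should be the same value you set in fixity_algorithm, since the comparison only succeeds when both sides use the same algorithm.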

Validating checksums during --check

If you have pregenerated checksum values for your files (as described in the "Comparing checksums to known values" section, above), you can tell Workbench to compare those checksums with the checksums it generates during its --check phase. To do this, include the following options in your create or add_media configuration file:

fixity_algorithm: md5
validate_fixity_during_check: true

You must also include both a file and checksum column in your input CSV, and ensure that the checksums in the CSV column were generated using the algorithm named in the fixity_algorithm setting. Results of the checks are written to the log file.
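Conceptually, this kind of local validation amounts to hashing each file named in the file column and comparing the result with the checksum column. The sketch below illustrates the idea only. It is not Workbench's code, validate_checksums is a hypothetical name, the log messages are illustrative rather than Workbench's actual log format, and file values are assumed to be paths relative to base_dir.

```python
import csv
import hashlib
import logging
import os


def validate_checksums(csv_path, algorithm="md5", base_dir="."):
    """Return (file, matches) tuples for rows with both a file and a checksum."""
    results = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            file_value = (row.get("file") or "").strip()
            expected = (row.get("checksum") or "").strip().lower()
            if not file_value or not expected:
                # Skip rows that cannot be validated.
                continue
            digest = hashlib.new(algorithm)
            with open(os.path.join(base_dir, file_value), "rb") as data:
                for chunk in iter(lambda: data.read(1024 * 1024), b""):
                    digest.update(chunk)
            matches = digest.hexdigest() == expected
            if matches:
                logging.info("Checksum match for %s", file_value)
            else:
                logging.warning("Checksum mismatch for %s", file_value)
            results.append((file_value, matches))
    return results
```

Everything in this sketch runs on the local machine and makes no network requests.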

Some things to note:

  • Fixity checking is currently available only for files named in the file CSV column, and not for files in any "additional files" columns.
  • For the purposes of fixity verification, md5 is sufficient. Using it is also faster than either sha1 or sha256. However, you will need to use sha1 or sha256 if your pregenerated checksums were created using those algorithms.
  • If you are including pregenerated checksum values in your CSV file (in the checksum column), the checksums must have been generated using the same hash algorithm indicated in your fixity_algorithm configuration setting: "md5", "sha1", or "sha256". If the existing checksums were generated using a different algorithm, all of your checksum comparisons will fail.
  • Workbench logs the outcome of all checksum comparisons, whether they result in matches or mismatches.
  • If there is a mismatch, Workbench will continue to ingest the file and create the accompanying media. For this reason, it is prudent to perform your checksum validation during the --check phase. If any comparisons fail, you have an opportunity to replace the files before committing to ingesting them into Drupal.
  • Validation during --check happens entirely on the computer running Workbench. During --check, Workbench does not query Drupal for the purposes of checksum validation, since the files haven't yet been ingested into Islandora at that point.
  • Fixity checking slows Workbench down (and also Drupal, if you perform transmission fixity checks), especially when files are large. This is unavoidable, since calculating a file's checksum requires reading the entire file.