Optimizing your use of Workbench
This page lists a number of things you can do to optimize your use of Islandora Workbench. Following these recommendations will decrease the likelihood that you will encounter problems, and increase the value you get out of Workbench.
If you have found a way to improve your use of Workbench that you think others might want to know about, share it!
Keep your copy of Workbench up to date
Islandora Workbench is under active development, both in terms of bug fixes and new features. It's good practice to update your copy of Workbench every couple of weeks.
Protect user credentials in your config files
Workbench requires the credentials of a Drupal user. You should consult with your local IT group to determine which option they prefer for managing this user's password. You don't want Islandora Workbench to become a security vulnerability.
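As a sketch of one common approach (the host and username below are placeholders, and you should confirm against the Workbench documentation for your version that it prompts for a password when the setting is omitted), you can leave the password out of the configuration file entirely so it never lives on disk:
task: create
host: https://islandora.example.edu
username: workbench_user
# password omitted on purpose; Workbench should prompt for it at run time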
Run --check
Running Workbench with its --check argument tells it to confirm that it can connect to Drupal, and to confirm that your input data is formatted as expected. It also performs validation on values in fields that have specific requirements, such as EDTF and typed relation fields. If --check passes, or only flags expected issues (such as files you know are missing), you reduce the likelihood that Workbench will encounter problems when you remove --check and commit your changes to your Islandora repository.
It can sometimes take several runs of --check for you to find and resolve all the issues it finds. The extra time this takes is worth it. You may want to try the perform_soft_checks: true config setting to tell --check to be more forgiving.
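As a quick sketch (the config file name is just an example), a check run looks like this:
workbench --config=config.yml --check
and the more forgiving behavior is enabled by adding this line to that same config file:
perform_soft_checks: true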
Don't forget to always look at your Workbench log file after running --check. It will likely contain useful detail that will help you identify potential problems.
Run Workbench on a small sample first
Even if --check doesn't reveal any problems, running Workbench using a small set of sample input is a very good idea. Workbench provides a number of options for doing this, including processing only the first few rows of your input CSV, processing only the last few rows, processing specific rows, or processing only rows that meet a certain condition. This page provides more information on how to use these options.
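For example, assuming your version of Workbench supports the csv_rows_to_process setting used by its "Processing specific CSV rows" feature, a sketch of a config that limits a run to a few rows (the identifiers below are placeholders for values from your CSV's ID column) looks like:
csv_rows_to_process:
- item_001
- item_002
- item_003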
Break large jobs up into smaller jobs
Workbench is capable of processing a very large set of input data, but many users prefer to break up large jobs into several smaller subsets to decrease the chance of overloading their Drupal server, to limit the impact of network glitches, and so on. Splitting up your input CSV into smaller files is one way of doing this, but Workbench's "Using CSV row ranges" or "Processing specific CSV rows" features let you use a single input CSV and define the subsets in your configuration file.
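For instance, assuming your version of Workbench supports the csv_start_row and csv_stop_row settings behind the "Using CSV row ranges" feature, a sketch of a config that processes only the first 500 rows of a single large CSV (the file name is illustrative) might look like:
input_csv: large_batch.csv
csv_start_row: 1
csv_stop_row: 500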
Defer Solr indexing
Configuring Solr to not index newly added or updated items immediately can result in noticeably faster create and update jobs. To do this, visit /admin/config/search/search-api/index/default_solr_index/edit and, in the "Index options" fieldset, make sure "Index items immediately" is unchecked and save the form. Indexing will happen during cron runs. Thanks to Born-Digital for suggesting this!
Create a View that allows you to use the "Checking if nodes already exist" feature
Creating a View that you can use in conjunction with Workbench's node_exists_verification_view_endpoint config setting lets you rerun the same job, using the same input CSV, multiple times without worrying about ingesting duplicate nodes. You will find this feature useful if your job crashes part way through. More information on creating this View, and configuring Workbench to use it, is available here.
Use adaptive pause
Adaptive pause is an effective way to decrease the likelihood that Workbench will time out due to excessive load on the Drupal server. Since it only pauses Workbench when needed, the small delays it occasionally introduces are well worth the increased reliability you get.
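A sketch of enabling it, assuming the adaptive_pause config setting (whose value is the number of seconds to pause) available in current versions of Workbench:
adaptive_pause: 3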
Assume you will need the node IDs of items you create later
Having the node IDs generated during create tasks, and being able to relate them to a given Workbench job, is very useful, for example if you need to run a remedial update or add_media task on a large set of new nodes. Workbench provides a few ways to capture those node IDs.
1) The CSV ID to node ID map
Workbench's CSV ID to node ID map can make recovering from an interrupted create task very easy. You should configure your create tasks to use a directory for the map's database file that is not your system's temporary directory (which is the default). This is done using the csv_id_to_node_id_map_dir configuration setting. More information is available here.
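For example (the directory path here is only an illustration; use a persistent location that makes sense on your system):
csv_id_to_node_id_map_dir: /home/me/workbench_id_maps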
2) Rollback CSV files
You should always assume that you will need to roll back a set of nodes created during create tasks. Even if you don't need to do that, having the new node IDs easily accessible can be very convenient, for example if you need to do a remedial update task on the new nodes. Using the rollback configuration settings, particularly the following two, can ensure that you have this data intact and ready to use:
timestamp_rollback: true
rollback_file_include_node_info: true
3) The output_csv setting
During create tasks, you can use the output_csv setting to tell Workbench to write a CSV file that contains the new node IDs. More information is available here.
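A sketch (the file name is illustrative):
output_csv: metadata_with_node_ids.csv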
Rely on recovery mode
Once you have configured Workbench to look for the CSV ID to node ID map in a stable location, you are ready to use recovery mode! You may never need to use it, but like insurance, you don't need it until you need it.
Create local shortcuts and aliases
Some users create aliases for Workbench commands. How to do this depends on your operating system, but on Unix-based systems such as macOS and Linux, a single command such as alias wb='workbench --config=config.yml' can let you run wb many times without having to use the full command. Thanks to rosiel for this tip.
Specify configuration settings on the command line
Managing small changes in a configuration file over multiple runs of Workbench can be tedious. Some configuration settings can be applied at the command line, which makes editing the configuration file unnecessary.
Balance speed optimizations with convenience
Consider the available techniques for speeding up Workbench. However, they all come at the cost of some convenience. You decide which ones are worth it!
Make Workbench tell you about its progress
Watching lines of output say that a node has been created is BORING. Try adding show_percentage_of_csv_input_processed: true or progress_bar: true to your config file to make using Workbench a little less boring.
Use the protected_vocabularies setting
allow_adding_terms is an essential Workbench config setting. But it can lead to problems. For example, you don't want a typo in your field_model CSV column to create a new Islandora Model! To be very safe, you should include the protected_vocabularies setting in your configuration files to list the vocabularies Workbench is not allowed to add new terms to. At Simon Fraser University Library, we use:
protected_vocabularies:
- islandora_model
- islandora_display
- islandora_media_use
- issuance_modes
- resource_types
- physical_form
- genre
- copyright
- terms_of_use
- concerns_corrections