# Running scripts on existing entities
Islandora Workbench provides a task, `run_scripts`, that enables users to run custom scripts on specific nodes, media, or taxonomy terms identified in a CSV file. This ability extends Workbench's functionality to include whatever can be packaged into a script.
## Use cases
Even though Islandora Workbench can run scripts on newly created or updated entities using its post-action hooks, `run_scripts` tasks can run scripts on existing nodes, media, and taxonomy terms regardless of how or when they were created. Some use cases for this ability include:
- rendering a node to warm the web server's cache, or to pregenerate Cantaloupe tiles for images so they appear to the user faster (see the sketch after this list)
- performing automated or large-scale quality checks, for example to ensure that all the nodes being checked have the expected derivative media
- exporting and packaging content in ways that cannot be done using Workbench's built-in export functionality, such as creating Bags or other types of structured content packages
- integration into workflows that require custom representations of Islandora objects for use in other applications such as Archivematica.
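As a concrete illustration of the first use case, a cache-warming script only needs to request each node's page. The following is a minimal sketch, not one of the sample scripts shipped with Workbench; it assumes the argument-passing convention described under "Writing scripts" below (the config file path first, then an entity ID):

```python
#!/usr/bin/env python
"""Hypothetical cache-warming script for a "run_scripts" task."""

import sys

import requests
from ruamel.yaml import YAML

# Workbench passes the config file path and a single entity ID, in that order.
config_file = sys.argv[1]
node_id = sys.argv[2]

yaml = YAML()
with open(config_file, "r") as stream:
    config = yaml.load(stream)

# Requesting the node's HTML page causes Drupal to render (and cache) it.
url = f'{config["host"].rstrip("/")}/node/{node_id}'
response = requests.get(url)

# Exit non-zero on failure so Workbench logs this node ID as failed.
sys.exit(0 if response.ok else 1)
```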
## Workbench configuration
A Workbench configuration file for the `run_scripts` task looks like this:
```yaml
task: run_scripts
host: https://islandora.dev
username: admin
password: password
input_csv: nodes_to_process.csv
# The following config settings are specific to the run_scripts task.
# "run_scripts_entity_type" and "run_scripts" are required.
run_scripts_entity_type: node
run_scripts:
  - /home/mark/hacking/islandora_workbench/scripts/script_to_run_node_sample.py
  - /home/mark/hacking/islandora_workbench/scripts/script_to_run_node_sample_2.py
# If on Windows, paths to scripts look like this.
# - 'C:\LocalApps\scripts\script_to_run_node_sample.py'
# - 'C:\LocalApps\scripts\script_to_run_node_sample_2.py'
run_scripts_threads: 5
run_scripts_log_script_output: false
```
- `run_scripts_entity_type` (required): one of "node", "media", or "term".
- `run_scripts` (required): a list containing the absolute paths to the scripts. Scripts can be anywhere; they do not need to be in the `workbench/scripts` directory.
- `run_scripts_threads` (optional): number of asynchronous threads to use. Default is 1.
- `run_scripts_log_script_output` (optional): whether or not to log the output of scripts in the Workbench log file. Default is `true`, which you may want to set to `false` if your script writes its own log.
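With a configuration file like the one above (named, for example, `run_scripts.yml`; the filename is an assumption for illustration), you start the task the same way as any other Workbench task:

```bash
./workbench --config run_scripts.yml
```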
Scripts are run in the order they are listed within `run_scripts`. Workbench will apply an earlier script to all IDs in the CSV before moving on to the next script. This means that the first script in the list could fetch content from Drupal and write it to a location, and the next script could use the previous script's output as its input.
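For example, a first script could dump each node's JSON to a shared directory, and a second script could read those files to build packages. The following is a hypothetical sketch of the first script in that pattern; the output directory and filenames are invented for illustration:

```python
#!/usr/bin/env python
"""Hypothetical first script: export each node's JSON for a later script to consume."""

import json
import os
import sys

import requests
from ruamel.yaml import YAML

config_file = sys.argv[1]
node_id = sys.argv[2]

with open(config_file, "r") as stream:
    config = YAML().load(stream)

url = f'{config["host"].rstrip("/")}/node/{node_id}?_format=json'
node = requests.get(url).json()

# Invented output location; the second script registered in "run_scripts"
# would read /tmp/run_scripts_output/<node_id>.json as its input.
os.makedirs("/tmp/run_scripts_output", exist_ok=True)
with open(f"/tmp/run_scripts_output/{node_id}.json", "w") as out:
    json.dump(node, out)
```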
The "run_scripts_threads" setting
Workbench can execute scripts on groups of IDs asynchronously, speeding up processing substantially. For example, if you set `run_scripts_threads` to 5, Workbench will process the IDs in the input CSV in groups of 5 in parallel.
However, keep in mind that while setting this config option higher than 1 will shorten the total amount of time it takes your scripts to process all IDs, doing so will also add load to Drupal, which can slow it down. Some trial and error will likely be required to find the best number of threads for a given script.
Also note that when a group of threads executes in parallel, the order in which the threads in that group complete is not guaranteed. In most cases this will not be an issue, since each script invocation processes a single Drupal node, media item, or term that is independent of the others in the group.
## The input CSV
The input CSV requires only a single column named either `node_id`, `media_id`, or `term_id`, depending on the value of `run_scripts_entity_type`. Because `run_scripts` tasks ignore all other columns that might be present, you can use the CSV generated by Workbench's rollback functionality, or a CSV created during an `export_csv` task.
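For example, with `run_scripts_entity_type: node`, a minimal input CSV looks like this (the node IDs are placeholders):

```csv
node_id
2316
2317
2318
```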
## Writing scripts
Scripts can be written in any language. They are automatically passed two arguments: 1) the path to the Workbench config file and 2) an entity ID from the CSV file, in that order. The logic in scripts applies to the single entity ID provided as the second argument -- Workbench does the looping through the list of entity IDs from the CSV for you.
Also, Workbench receives a script's exit code, and logs success or failure based on that code -- an exit code of `0` indicates success (the general convention for command-line programs, with very few exceptions), and a non-`0` exit code signals failure. If you are including exception-handling code in your scripts that makes them exit on an exception or error (as is illustrated in the example script below), be sure to exit with a non-zero code so Workbench can reliably detect the failure. Otherwise, Workbench will think the script executed successfully.
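In Python, that convention reduces to a skeleton like the following (`do_work` is a hypothetical stand-in for your script's real logic):

```python
import sys


def do_work(entity_id):
    # Hypothetical placeholder for the script's real logic.
    ...


try:
    do_work(sys.argv[2])
except Exception:
    # A non-zero exit code tells Workbench this ID failed.
    sys.exit(1)

# Exiting normally (code 0) tells Workbench this ID succeeded.
sys.exit(0)
```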
Workbench comes with two sample scripts, `scripts/script_to_run_node_sample.py` and `scripts/script_to_run_node_sample_2.py`.
The second script's purpose is solely to illustrate registering multiple scripts in the `run_scripts` config setting; it merely prints and logs the ID it received as an argument.
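A script that does nothing more than that might look like the following sketch (this approximates, but is not necessarily identical to, the shipped sample):

```python
#!/usr/bin/env python
"""Sketch of a script that only prints and logs the ID it was given."""

import logging
import sys

logging.basicConfig(filename="script_to_run_sample.log", level=logging.INFO)

# sys.argv[1] is the config file path; sys.argv[2] is the entity ID.
entity_id = sys.argv[2]

print(f"Received ID {entity_id}.")
logging.info(f"Received ID {entity_id}.")
```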
The first script, `script_to_run_node_sample.py`, illustrates the basic functionality a script needs. Some of the points made above are expanded on in the inline comments:
```python
#!/usr/bin/env python

"""Sample script to illustrate Workbench's "run_scripts" task."""

import sys
import logging
import json
import requests
from ruamel.yaml import YAML

# All scripts registered in "run_scripts" are passed the path to the Workbench config file
# and a single entity ID, in that order.
config_file = sys.argv[1]
node_id = sys.argv[2]

# Note that like hook scripts, scripts registered in "run_scripts" only have access
# to configuration values defined in the Workbench config file, not to default values.
# Therefore, if you want to reuse the Workbench config file in your scripts, the config
# file should include settings that would otherwise have the default values.
yaml = YAML()
with open(config_file, "r") as stream:
    config = yaml.load(stream)

# Scripts should do their own logging, although if the Workbench config setting
# "run_scripts_log_script_output" is set to true (its default value), the output
# of your scripts will be logged in the Workbench log file. Set "run_scripts_log_script_output"
# to false to omit scripts' output from the Workbench log.
logging.basicConfig(
    filename="script_to_run_sample.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%d-%b-%y %H:%M:%S",
)

requests.packages.urllib3.disable_warnings()

# The demonstration purpose of this script is to fetch each node's title and print/log it,
# using the "host" setting defined in the Workbench config file.
url = f'{config["host"].rstrip("/")}/node/{node_id}?_format=json'
result = requests.get(url, verify=False)
try:
    node = json.loads(result.text)
    title = node["title"][0]["value"]
    logging.info(f'Title is "{title}".')
    print(f'Title is "{title}".')
except Exception as e:
    logging.error(
        f"Could not retrieve data from {url}, HTTP response code was {result.status_code}: {e}."
    )
    # Scripts should always exit with a non-0 code on failure so Workbench can detect the failure.
    sys.exit(1)
```
Note that scripts must be set to executable, and that they must contain a shebang line (in the case of the above Python script, `#!/usr/bin/env python`), as is the convention of the language the script is written in.
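On Linux and macOS, for example, a script can be made executable with `chmod`:

```bash
chmod +x scripts/script_to_run_node_sample.py
```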
Also, because registered scripts take the path to the Workbench configuration file as their first command-line argument and an entity ID as their second argument, it is possible to execute the scripts on the command line, like this:

```bash
./script_to_run_node_sample.py path/to/myworkbenchconfig.yml 2316
```

Since the script has no dependencies on Workbench itself, it will execute as expected.