Running scripts on existing entities

Islandora Workbench provides a task, run_scripts, that enables users to run custom scripts on specific nodes, media, or taxonomy terms identified in a CSV file. This ability extends Workbench's functionality to include whatever can be packaged into a drop-in script.

Use cases

Islandora Workbench can already run scripts on newly created or updated entities using its post-action hooks, but run_scripts tasks can run scripts on existing nodes, media, and taxonomy terms regardless of how or when they were created. Some use cases for this ability include:

  • rendering a node to warm the web server's cache, or to pregenerate Cantaloupe tiles for images so they appear to the user faster
  • performing automated or large-scale quality checks, for example to ensure that all the nodes being checked have the expected derivative media
  • exporting and packaging content in ways that cannot be done using Workbench's built-in export functionality, such as creating Bags or other types of structured content packages
  • integration into workflows that require custom representations of Islandora objects for use in other applications such as Archivematica

Workbench configuration

A Workbench configuration file for the run_scripts task looks like this:

task: run_scripts
host: https://islandora.dev
username: admin
password: password
input_csv: nodes_to_process.csv

# The following config settings are specific to the run_scripts task.
# "run_scripts_entity_type" and "run_scripts" are required.
run_scripts_entity_type: node
run_scripts:
  - /home/mark/hacking/islandora_workbench/scripts/script_to_run_node_sample.py
  - /usr/bin/sh /home/mark/hacking/islandora_workbench/scripts/script_to_run_node_sample_2.sh
  # If on Windows, paths to scripts look like this.
  # - 'C:\LocalApps\scripts\script_to_run_node_sample.py'
  # - 'C:\LocalApps\scripts\script_to_run_node_sample_2.bat'
run_scripts_threads: 5
run_scripts_log_script_output: false

  • run_scripts_entity_type (required): one of "node", "media", or "term".
  • run_scripts (required): a list containing the absolute paths to the scripts. Scripts can be anywhere; they do not need to be in the workbench/scripts directory.
  • run_scripts_threads (optional): the number of concurrent threads to use. Default is 1.
  • run_scripts_log_script_output (optional): whether or not to log the output of scripts in the Workbench log file. Default is true; you may want to set it to false if your script writes its own log.

If a script contains a shebang line (in the case of the above Python script, #!/usr/bin/env python), you should not need to explicitly provide an interpreter for the script. However, if the script does not, you must prepend the script's path with the path to the applicable interpreter. Workbench looks for a space in the run_scripts entry and treats the string to the left of the space as the path to the interpreter and the string to the right of the space as the path to the script.
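For illustration, the splitting described above amounts to something like this (a sketch, not Workbench's actual code):

# Hypothetical sketch of how a "run_scripts" entry with an explicit
# interpreter is split; not Workbench's actual implementation.
entry = "/usr/bin/sh /home/mark/hacking/islandora_workbench/scripts/script_to_run_node_sample_2.sh"

if " " in entry:
    # The string to the left of the space is the interpreter; the
    # string to the right is the script.
    interpreter, script = entry.split(" ", 1)
else:
    # No space: the script's shebang line supplies the interpreter.
    interpreter, script = None, entry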

Scripts are run in the order they are listed within run_scripts. Workbench will apply an earlier script to all IDs in the CSV before moving on to the next script. This means that the first script in the list could fetch content from Drupal and write it to a location, and the next script could use the previous script's output as its input.
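For example, a run_scripts list like this one (the script names are hypothetical) could fetch each node's JSON for every ID in the CSV in a first pass, then package the fetched files in a second pass:

run_scripts:
  - /home/mark/scripts/fetch_node_json.py
  - /home/mark/scripts/package_fetched_json.py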

The "run_scripts_threads" setting

Workbench can execute scripts on groups of IDs concurrently, thereby speeding up processing. For example, if you set run_scripts_threads to 5, Workbench will process the IDs in the input CSV in groups of 5 in parallel.

However, keep in mind that while setting this option higher than 1 will shorten the total amount of time it takes your scripts to process all IDs, doing so will probably add load to Drupal, which can slow it down. Some trial and error will likely be required to find the best number of threads for a given script.

Also note that if a group of threads executes in parallel, the order in which each one completes executing within that group is not guaranteed. In most cases this will not be an issue since these scripts process a single Drupal node, media, or term that is independent of the others in the group.
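Conceptually, the threading model resembles a thread pool that runs one script invocation per ID (a simplified sketch, not Workbench's actual implementation):

# Simplified sketch of how IDs could be processed concurrently;
# not Workbench's actual implementation.
import subprocess
from concurrent.futures import ThreadPoolExecutor

config_file = "run_scripts.yml"
script = "/home/mark/hacking/islandora_workbench/scripts/script_to_run_node_sample.py"
node_ids = ["3684", "3723", "3730", "3731", "3745", "3750"]

def run_one(node_id):
    # Each invocation processes a single entity ID, receiving the
    # config file path and the ID as its two arguments.
    return subprocess.run([script, config_file, node_id], capture_output=True)

# With run_scripts_threads: 5, up to five IDs are processed in parallel.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(run_one, node_ids))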

The input CSV

The input CSV requires only a single column named node_id, media_id, or term_id, depending on the value of run_scripts_entity_type. Because run_scripts tasks ignore all other columns, you can use the CSV generated by Workbench's rollback functionality, or a CSV created during an export_csv task. You can also use a Google Sheet or an Excel file as the input CSV. Values in the node_id/media_id/term_id column can be simple numeric IDs or full URLs/aliases.
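For example, all of the following are valid values in a node_id column (the full URL and the alias shown here are hypothetical):

node_id
3684
https://islandora.dev/node/3690
/explorers-collection/moon-photos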

You can comment out rows of your CSV to skip processing them, process a range of rows, or process a list of specific rows. For example, Workbench will skip the two rows commented out with # below:

node_id
3684
#3690
#3692
3723

To process a range of rows, or a specific list of rows, use the following config settings (csv_rows_to_process takes either a list of ID values or the path to a text file containing one ID per line):

csv_start_row: 10
csv_stop_row: 15
csv_rows_to_process: ["test_001", "test_007", "test_103"]
csv_rows_to_process: process_these.txt

Writing scripts

Scripts can be written in any language. They are automatically passed two arguments, 1) the absolute path to the Workbench config file and 2) an entity ID from the CSV file, in that order. The logic in scripts applies to the single entity ID provided as the second argument -- Workbench does the looping through the list of entity IDs from the CSV for you.

Also, Workbench receives each script's exit code and logs success or failure based on it: an exit code of 0 indicates success (a general convention for command-line programs, with very few exceptions), and a non-zero exit code signals failure. If your scripts include exception-handling code that makes them exit on an exception or error (as illustrated in the example script below), be sure they exit with a non-zero code so Workbench can reliably detect the failure. Otherwise, Workbench will think the script executed successfully.

Workbench comes with two sample scripts, scripts/script_to_run_node_sample.py and scripts/script_to_run_node_sample_2.sh.

The second script's purpose is to illustrate registering multiple scripts in the run_scripts config setting; it merely prints and logs the ID it received as an argument.
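A script of this kind can be very short. For example, this sketch (not the exact contents of the bundled script) prints the entity ID passed as the second argument:

#!/usr/bin/env python
# Minimal sketch of a "run_scripts" script: print the entity ID that
# Workbench passes as the second command-line argument.
import sys

entity_id = sys.argv[2]
print(f"Processing entity ID {entity_id}.")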

The first script, script_to_run_node_sample.py, illustrates the minimal form that a "run_scripts" script can take. Some of the points made above are expanded on in the inline comments:

#!/usr/bin/env python

"""Sample script to illustrate Workbench's "run_scripts" task."""

import sys
import logging
import json
import requests
from ruamel.yaml import YAML

# All scripts registered in "run_scripts" are passed the path to the Workbench config file
# and a single entity ID, in that order.
config_file = sys.argv[1]
node_id = sys.argv[2]

# Note that like hook scripts, scripts registered in "run_scripts" only have access
# to configuration values defined in the Workbench config file, not to default values.
# Therefore, if you want to reuse the Workbench config file in your scripts, the config
# file should include settings that would otherwise have the default values.
yaml = YAML()
with open(config_file, "r") as stream:
    config = yaml.load(stream)

# Scripts should do their own logging, although if the Workbench config setting
# "run_scripts_log_script_output" is set to true (its default value), the output
# of your scripts will be logged in the Workbench log file. Set "run_scripts_log_script_output"
# to false to omit scripts' output from the Workbench log.
logging.basicConfig(
    filename="script_to_run_sample.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%d-%b-%y %H:%M:%S",
)

# The demonstration purpose of this script is to fetch each node's title and print/log it,
# using the "host" setting defined in the Workbench config file.
requests.packages.urllib3.disable_warnings()
url = f'{config["host"].rstrip("/")}/node/{node_id}?_format=json'
result = requests.get(url, verify=False)
try:
    node = json.loads(result.text)
    title = node["title"][0]["value"]
    logging.info(f'Title is "{title}".')
    print(f'Title is "{title}".')
except Exception as e:
    logging.error(
        f"Could not retrieve data from {url}; HTTP response code was {result.status_code} ({e})."
    )
    # Scripts should always exit with a non-0 code on failure so Workbench can detect the failure.
    sys.exit(1)

Note that scripts must be executable, and that they must either contain the shebang line applicable to the language the script is written in or, if they do not, have the interpreter to use explicitly included in their run_scripts entry as illustrated above. Windows .bat scripts require neither a shebang line nor an explicit interpreter.
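On Linux and macOS, you can make a script executable with chmod:

chmod +x /home/mark/hacking/islandora_workbench/scripts/script_to_run_node_sample.py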

Also, because registered scripts take the absolute path to the Workbench configuration file as their first command-line argument and an entity ID as their second argument, it is possible to execute these scripts on the command line, like this:

./script_to_run_node_sample.py /path/to/myworkbenchconfig.yml 2316

Since the scripts will have no dependencies on Workbench itself, they will execute as standalone scripts.

An example

The Cantaloupe Image Server used by Islandora generates tiles (subsections of image data) from a source image file to send to an IIIF viewer such as Mirador or OpenSeadragon. It does this on the first request for the image. For large images, the initial tiling can take more time than users probably will wait, so it is useful to pregenerate the tiles so that when the node wrapping the viewer is seen by a human, the tiles appear quickly.

The tile_warmer.py script included in Workbench's scripts directory uses Selenium to drive a headless web browser that renders Mirador (thereby triggering the tile generation), and a computer-vision library called OpenCV to detect whether Mirador has displayed the image data or contains only a gray area where the image should be. The latter occurs if Cantaloupe has timed out while generating its tiles, or an error has occurred that prevents the image data from rendering in Mirador.

The script pauses for a configurable amount of time (say 35 seconds) to allow Cantaloupe to generate its tiles, and then takes a screenshot of the node's page, which it analyzes with computer vision to check whether the tile generation succeeded. The screenshots look like this (using a small set of three nodes in this example):

[Screenshot: a directory listing of three node-page screenshots taken by tile_warmer.py]
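The gray-area detection can be sketched like this (a hypothetical simplification; the actual tile_warmer.py logic, file names, and threshold values may differ):

# Hypothetical sketch of a gray-area check; the real tile_warmer.py
# implementation may differ.
import cv2
import numpy as np

screenshot = cv2.imread("node_3684.png")
gray = cv2.cvtColor(screenshot, cv2.COLOR_BGR2GRAY)

# Count pixels whose intensity falls in a narrow band around the
# placeholder gray Mirador shows before tiles load (band assumed here).
placeholder_band = (gray > 115) & (gray < 125)
gray_fraction = np.count_nonzero(placeholder_band) / gray.size

# If a large portion of the viewport is still placeholder gray, assume
# the tiles did not finish rendering.
if gray_fraction > 0.3:
    print("Large gray area detected; tiles probably did not render.")
else:
    print("Image data appears to have rendered.")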

The Workbench configuration file that runs this script looks like this:

task: run_scripts
host: https://digital.lib.sfu.ca
username: xxxxxxxxx
password: xxxxxxxxx

input_csv: tile_warmer_nids.csv
run_scripts_entity_type: node
run_scripts_threads: 2
run_scripts:
  - /usr/bin/python /home/mark/hacking/islandora_workbench/scripts/tile_warmer.py

In the directory illustrated above, the screenshot on the left and the one on the right show that Mirador rendered complete image data. Tiles for those nodes will appear quickly for their first human viewers since the tiles have been fully pregenerated by the tile_warmer.py script. The middle image appears incomplete; the script detects this via OpenCV and logs that it found a large gray area. Running the script on that node again will likely generate all of its tiles.

A possible workflow for running the script over a collection of image or page nodes is to use the rollback file generated during a create task as the input CSV for the tile_warmer.py script.

A couple of things to note about this example:

  • The tile_warmer.py script uses several Python libraries you will need to install: selenium, opencv (available on PyPI as opencv-python), and numpy.
  • Setting run_scripts_threads to more than 2 or 3 can put considerable strain on your Islandora server.