Creating paged, compound, and collection content
Islandora Workbench provides three ways to create paged and compound content:
- using a subdirectory structure to define the relationship between the parent item and its children
- using page-level metadata in the CSV to establish that relationship
- using a secondary task.
Enable the subdirectory method by including
paged_content_from_directories: true in your configuration file. Use this method when you are creating books, newspaper issues, or other paged content where your pages don't have their own metadata. This method groups page-level files into subdirectories that correspond to each parent, and does not require (or allow) page-level metadata in the CSV file. Each parent (book, newspaper issue, etc.) has a row in the CSV file, e.g.:
id,title,field_model,field_display_hints
book1,How to Use Islandora Workbench like a Pro,28,2
book2,Using Islandora Workbench for Fun and Profit,28,2
Unlike every other Islandora Workbench "create" configuration, the metadata CSV should not contain a
file column. This means that content created using this method cannot be created using the same CSV file as other content.
Each parent's pages are located in a subdirectory of the input directory that is named to match the value of the
id field of the parent item they are pages of:
books/
├── book1
│   ├── page-001.jpg
│   ├── page-002.jpg
│   └── page-003.jpg
├── book2
│   ├── isbn-1843341778-001.jpg
│   ├── using-islandora-workbench-page-002.jpg
│   └── page-003.jpg
└── metadata.csv
The page filenames have significance. The sequence of the page is determined by the last segment of each filename before the extension, and is separated from the rest of the filename by a dash (
-), although you can use another character by setting the
paged_content_sequence_separator option in your configuration file. For example, using the filenames for "book1" above, the sequence of "page-001.jpg" is "001". Dashes (or whatever your separator character is) can exist elsewhere in filenames, since Workbench will always use the string after the last dash as the sequence number; for example, the sequence of "isbn-1843341778-001.jpg" for "book2" is also "001". Workbench takes this sequence number, strips off any leading zeros, and uses it to populate the
field_weight in the page nodes, so "001" becomes a weight value of 1, "002" becomes a weight value of 2, and so on.
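The filename rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not Workbench's own code; the function name page_weight is hypothetical, and the separator argument stands in for the paged_content_sequence_separator option.

```python
import os


def page_weight(filename: str, separator: str = "-") -> int:
    """Return the weight implied by the last segment of a page filename."""
    stem, _extension = os.path.splitext(filename)   # drop the extension
    sequence = stem.rsplit(separator, 1)[-1]        # text after the last separator
    return int(sequence)                            # int() discards leading zeros


for name in ["page-001.jpg", "isbn-1843341778-001.jpg", "page-003.jpg"]:
    print(name, "->", page_weight(name))
```

Splitting from the right means separators earlier in the filename are ignored, which matches the "isbn-1843341778-001.jpg" example.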
Titles for pages are generated automatically using the pattern
parent_title + , page + sequence_number, where "parent title" is inherited from the page's parent node and "sequence number" is the page's sequence. For example, if a page's parent has the title "How to Write a Book" and its sequence number is 450, its automatically generated title will be "How to Write a Book, page 450". You can override this pattern by including the
page_title_template setting in your configuration file. The value of this setting is a simple string template. The default, which generates the page title pattern described above, is
'$parent_title, page $weight'. There are only two variables you can include in the template,
$parent_title and $weight, although you do not need to include either one if you don't want that information appearing in your page titles.
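To see how such a template expands, here is a minimal sketch using Python's standard string.Template class, which happens to use the same $name placeholder syntax. Whether Workbench uses this class internally is an assumption; the function name page_title is hypothetical.

```python
from string import Template


def page_title(template: str, parent_title: str, weight: int) -> str:
    """Fill a page title template with the parent's title and the page's weight."""
    # safe_substitute() leaves unknown placeholders untouched instead of raising.
    return Template(template).safe_substitute(parent_title=parent_title, weight=weight)


# The default template described above.
print(page_title("$parent_title, page $weight", "How to Write a Book", 450))
# A custom template that omits the parent title.
print(page_title("Page $weight", "How to Write a Book", 450))
```

The first call prints "How to Write a Book, page 450", matching the example in the text.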
Finally, even though only minimal metadata is assigned to pages using this method (i.e., the automatically generated title and Islandora model), you can add additional metadata to pages using a separate "update" task.
Important things to note when using this method:
- To use this method of creating paged content, you must include paged_content_page_model_tid in your configuration file and set it to your Islandora's term ID for the "Page" term in the Islandora Models vocabulary (or to that term's URI).
- The Islandora model of the parent is not set automatically. You need to include a field_model value for each item in your CSV file.
- You should also include a field_display_hints column in your CSV. This value is applied to the parent nodes and also the page nodes, unless the paged_content_page_display_hints setting is present in your configuration file. However, if you normally don't set the "Display hints" field in your objects but use a Context to determine how objects display, you should not include a field_display_hints column in your CSV file.
- id can be defined as another field name using the id_field configuration option. If you do define a different ID field using the id_field option, creating the parent/paged item relationships will still work.
- The Drupal content type for page nodes is inherited from the parent, unless you specify a different content type in the paged_content_page_content_type setting in your configuration file.
With page/child-level metadata
Using this method, the metadata CSV file contains a row for every item, both parents and children. You should use this method when you are creating books, newspaper issues, or other paged or compound content where each page has its own metadata, or when you are creating compound objects of any Islandora model. The file for each page/child is named explicitly in the page/child's
file column rather than being in a subdirectory. To link the pages to the parent, Workbench establishes parent/child relationships between items with
parent_id values (the pages/children) that are the same as the
id value of another item (the parent). For this to work, your CSV file must contain a
parent_id field plus the standard Islandora fields
field_weight, field_member_of, and field_model (the role of these last three fields will be explained below). The
id field is required in all CSV files used to create content, so in this case, your CSV needs both an
id field and a
parent_id field.
The following example illustrates how this works. Here is the raw CSV data:
id,parent_id,field_weight,file,title,field_description,field_model,field_member_of
001,,,,Postcard 1,The first postcard,28,197
003,001,1,image456.jpg,Front of postcard 1,The first postcard's front,29,
004,001,2,image389.jpg,Back of postcard 1,The first postcard's back,29,
002,,,,Postcard 2,The second postcard,28,197
006,002,1,image2828.jpg,Front of postcard 2,The second postcard's front,29,
007,002,2,image777.jpg,Back of postcard 2,The second postcard's back,29,
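The empty cells make this CSV difficult to read. Here is the same data laid out as a table:
| id | parent_id | field_weight | file | title | field_description | field_model | field_member_of |
|---|---|---|---|---|---|---|---|
| 001 | | | | Postcard 1 | The first postcard | 28 | 197 |
| 003 | 001 | 1 | image456.jpg | Front of postcard 1 | The first postcard's front | 29 | |
| 004 | 001 | 2 | image389.jpg | Back of postcard 1 | The first postcard's back | 29 | |
| 002 | | | | Postcard 2 | The second postcard | 28 | 197 |
| 006 | 002 | 1 | image2828.jpg | Front of postcard 2 | The second postcard's front | 29 | |
| 007 | 002 | 2 | image777.jpg | Back of postcard 2 | The second postcard's back | 29 | |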
The data contains rows for two postcards (rows with
id values "001" and "002") plus a back and front for each (the remaining four rows). The
parent_id value for items with
id values "003" and "004" is the same as the
id value for item "001", which will tell Workbench to make both of those items children of item "001"; the
parent_id value for items with
id values "006" and "007" is the same as the
id value for item "002", which will tell Workbench to make both of those items children of the item "002". We can't populate
field_member_of for the child pages in our CSV because we won't have node IDs for the parents until they are created as part of the same batch as the children.
In this example, the rows for our postcard objects have empty
parent_id, field_weight, and file columns because our postcards are not children of other nodes and don't have their own media. (However, the records for our postcard objects do have a value in
field_member_of, which is the node ID of the "Postcards" collection that already/hypothetically exists.) Rows for the postcard front and back image objects have values in their
parent_id and field_weight fields, and they have values in their
file column because we are creating objects that contain image media. Importantly, they have no value in their
field_member_of field because the node ID of the parent isn't known when you create your CSV; instead, Islandora Workbench assigns each child's
field_member_of dynamically, just after its parent node is created.
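The way a child's field_member_of is filled in from its parent's newly created node can be sketched as follows. This is an illustration only: no Drupal requests are made, the node IDs starting at 1000 are invented, and the function name assign_parents is hypothetical.

```python
import csv
import io

CSV_DATA = """id,parent_id,field_weight,file,title,field_description,field_model,field_member_of
001,,,,Postcard 1,The first postcard,28,197
003,001,1,image456.jpg,Front of postcard 1,The first postcard's front,29,
004,001,2,image389.jpg,Back of postcard 1,The first postcard's back,29,
002,,,,Postcard 2,The second postcard,28,197
006,002,1,image2828.jpg,Front of postcard 2,The second postcard's front,29,
007,002,2,image777.jpg,Back of postcard 2,The second postcard's back,29,
"""


def assign_parents(rows, first_node_id=1000):
    """Walk rows top to bottom, pretending each one becomes a node."""
    id_to_node_id = {}   # CSV id -> (made-up) Drupal node ID
    created = []
    next_node_id = first_node_id
    for row in rows:
        row = dict(row)
        if row["parent_id"]:
            # The parent must already have been "created" above this row.
            row["field_member_of"] = str(id_to_node_id[row["parent_id"]])
        id_to_node_id[row["id"]] = next_node_id
        created.append((next_node_id, row))
        next_node_id += 1
    return created


for node_id, row in assign_parents(csv.DictReader(io.StringIO(CSV_DATA))):
    print(node_id, row["title"], "member of", row["field_member_of"])
```

Rows without a parent_id keep whatever field_member_of they had in the CSV (here, the collection's node ID 197).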
Some important things to note:
- Currently, you need to include the option allow_missing_files: true in your configuration file when using this method to create paged/compound content. See this issue for more information.
- id can be defined as another field name using the id_field configuration option. If you do define a different ID field using the id_field option, creating the parent/child relationships will still work.
- The values of the id and parent_id columns do not have to follow any sequential pattern. Islandora Workbench treats them as simple strings and matches them on that basis, without looking for sequential relationships of any kind between the two fields.
- The CSV records for children items don't need to come immediately after the record for their parent, but they do need to come after that CSV record. (--check will tell you if it finds any child rows that come before their parent rows.) This is because Workbench creates nodes in the order their records are in the CSV file (top to bottom). As long as the parent node has already been created when a child node is created, the parent/child relationship via the child's field_member_of will be correct. See the next paragraph for some suggestions on planning for large ingests of paged or compound items.
- Currently, you must include values in the children's field_weight column (except when creating a collection and its members at the same time; see below). It may be possible to automatically generate values for this field (see this issue).
- Currently, Islandora model values (e.g. "Paged Content", "Page") are not automatically assigned. You must include the correct "Islandora Models" taxonomy term IDs in your field_model column for all parent and child records, as you would for any other Islandora objects you are creating. As with field_weight, it may be possible to automatically generate values for this field (see this issue).
Since parent items (collections, book-level items, newspaper issue-level items, top-level items in compound items, etc.) need to exist in Drupal before their children can be ingested, you need to plan your "create" tasks accordingly. For example:
- If you want to use a single "create" task to ingest all the parents and children at the same time, for each compound item, the parent CSV record must come before the records for the children/pages.
- If you would rather use multiple "create" tasks, you can create all your collections first, then, in subsequent "create" tasks, use their respective node IDs in the field_member_of CSV column for their members. If you use a separate "create" task to create members of a single collection, you can define the value of field_member_of in a CSV field template.
- If you are ingesting a large set of books, you can ingest the book-level items first, then use their node IDs in a separate CSV for the pages of all books (each using its parent book node's node ID in its field_member_of column). Or, you could run a separate "create" task for each book, and use a CSV field template with a field_member_of entry set to the book item's node ID.
- For newspapers, you could create the top-level newspaper first, then use its node ID in a subsequent "create" task for both newspaper issues and pages. In this task, the field_member_of column in rows for newspaper issues would contain the newspaper's node ID, but the rows for newspaper pages would have a blank field_member_of and a parent_id using the parent issue's id value.
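The ordering requirement (a parent's CSV record must come before its children's records) is easy to check before running a large ingest. The following is a rough sketch of such a check, not the logic used by --check itself; the function name children_before_parents is hypothetical.

```python
def children_before_parents(rows):
    """Return the ids of child rows whose parent row has not appeared yet."""
    seen_ids = set()
    problems = []
    for row in rows:
        parent_id = row.get("parent_id", "")
        if parent_id and parent_id not in seen_ids:
            problems.append(row["id"])
        seen_ids.add(row["id"])
    return problems


good_order = [
    {"id": "001", "parent_id": ""},
    {"id": "003", "parent_id": "001"},
    {"id": "004", "parent_id": "001"},
]
bad_order = [
    {"id": "003", "parent_id": "001"},   # child listed before its parent
    {"id": "001", "parent_id": ""},
    {"id": "004", "parent_id": "001"},
]

print(children_before_parents(good_order))
print(children_before_parents(bad_order))
```

Because ids are compared as plain strings, the check does not care whether the values are sequential.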
Using a secondary task
You can configure Islandora Workbench to execute two "create" tasks - a primary and a secondary - that will result in all of the objects described in both CSV files being ingested during the same Workbench job. Parent/child relationships between items are created by referencing the row IDs in the primary task's CSV file from the secondary task's CSV file. The benefit of using this method is that each task has its own configuration file, allowing you to create children that have a different Drupal content type than their parents.
The primary task's CSV describes the parent objects, and the secondary task's CSV describes the children. The two are linked via references from the children CSV's
parent_id values to their parent's
id values, much the same way as in the "With page/child-level metadata" method described above. The difference is that the references span CSV files. The parents and children each have their own CSV input file (and in fact, their own configuration file). Each task is a standard Islandora Workbench "create" task, joined by one setting in the primary's configuration file.
In the following example, the top CSV file (the primary) describes the parents, and the bottom CSV file (the secondary) describes the children:
As you can see, values in the
parent_id column in the secondary CSV reference values in the
id column in the primary CSV:
parent_id 001 in the secondary CSV matches
id 001 in the primary,
parent_id 003 in the secondary matches
id 003 in the primary, and so on.
You configure secondary tasks by adding the
secondary_tasks setting to your primary configuration file, like this:
task: create
host: "http://localhost:8000"
username: admin
password: islandora
# This is the setting that links the two configuration files together.
secondary_tasks: ['children.yml']
input_csv: parents.csv
nodes_only: true
In the secondary_tasks setting, you name the configuration file of the secondary task. The secondary task's configuration file (in this example, named "children.yml") contains no indication that it's a secondary task:
task: create
host: "http://localhost:8000"
username: admin
password: islandora
input_csv: kids.csv
csv_field_templates:
 - field_model: http://purl.org/coar/resource_type/c_c513
The nodes_only setting in the above example primary configuration file and the
csv_field_templates setting in the secondary configuration file are not relevant to the primary/secondary task functionality; they're included to illustrate that the two configuration files can differ.
When you run Workbench, it executes the primary task first, then the secondary task. Workbench keeps track of pairs of
id + node IDs created in the primary task, and during the execution of the secondary task, uses these to populate the
field_member_of values in the secondary task with the node IDs corresponding to the referenced primary id values.
Some things to note about secondary tasks:
- Only "create" tasks can be used as the primary and secondary tasks.
- When you have a secondary task configured, running --check will validate both tasks' configuration and input data.
- The secondary CSV must contain parent_id and field_member_of columns. field_member_of must be empty, since it is auto-populated by Workbench using node IDs from the newly created parent objects. If you want to assign an order to the child objects within each parent object, include field_weight with the appropriate values (1, 2, 3, etc., the lower numbers being earlier/higher in sort order).
- If a row in the secondary task CSV does not have a parent_id that matches an id of a row in the primary CSV, or if there is a matching row in the primary CSV and Workbench failed to create the described node, Workbench will skip creating the child and add an entry to the log indicating it did so.
- As already stated, each task has its own configuration file, which means that you can specify a content_type value in your secondary configuration file that differs from the content_type of the primary task.
- You can include more than one secondary task in your configuration. For example, secondary_tasks: ['first.yml', 'second.yml'] will execute the primary task, then the "first.yml" secondary task, then the "second.yml" secondary task in that order. You would use multiple secondary tasks if you wanted to add children of different content types to the parent nodes.
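The linking and skipping behaviour described in these notes can be pictured with a small sketch. It assumes the primary task has produced a mapping from CSV id values to node IDs; the mapping, the node IDs, and the function name link_children are all hypothetical and do not reflect how Workbench stores this information.

```python
def link_children(primary_id_to_node_id, secondary_rows):
    """Populate field_member_of for children; skip those with no created parent."""
    linked = []
    skipped = []
    for row in secondary_rows:
        node_id = primary_id_to_node_id.get(row["parent_id"])
        if node_id is None:
            # No matching parent row, or the parent node was never created.
            skipped.append(row["id"])
            continue
        linked.append({**row, "field_member_of": node_id})
    return linked, skipped


# Pretend the primary task created nodes for ids 001 and 003 only.
primary_map = {"001": 500, "003": 501}
children = [
    {"id": "c1", "parent_id": "001", "field_member_of": ""},
    {"id": "c2", "parent_id": "003", "field_member_of": ""},
    {"id": "c3", "parent_id": "999", "field_member_of": ""},
]

linked, skipped = link_children(primary_map, children)
print(linked)
print(skipped)
```

In this sketch the child with parent_id "999" ends up in the skipped list, mirroring the case where Workbench skips a child and logs that it did so.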
Specifying paths to the python interpreter and to the workbench script
When using secondary tasks, there are a couple of situations where you may need to tell Workbench where the python interpreter is located, and where the "workbench" script is located.
The first is when you use a secondary task within a scheduled job (such as running Workbench via Linux's cron). Depending on how you configure the cron job, you will likely need to tell Workbench the absolute path to the python interpreter and the absolute path to the workbench script. This is because, unless your cron job changes into Workbench's working directory, Workbench will be looking in the wrong directory for the secondary task. The two config options you should use are:
- path_to_python
- path_to_workbench_script
An example of using these settings is:
secondary_tasks: ['children.yml']
path_to_python: '/usr/bin/python'
path_to_workbench_script: '/home/mark/islandora_workbench/workbench'
The second situation is when using a secondary task when running Workbench in Windows and "python.exe" is not in the PATH of the user running the scheduled job. Specifying the absolute path to "python.exe" will ensure that Workbench can execute the secondary task properly, like this:
secondary_tasks: ['children.yml']
path_to_python: 'c:/program files/python39/python.exe'
path_to_workbench_script: 'd:/users/mark/islandora_workbench/workbench'
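To illustrate why absolute paths matter here, the sketch below assembles the kind of command line that would launch a secondary task as a separate process. How Workbench actually builds and runs this command is an assumption; the function name secondary_command and the fallback values are hypothetical. The command is only constructed, not executed.

```python
def secondary_command(config: dict, secondary_config_path: str) -> list:
    """Build an argument list for running one secondary task."""
    # Hypothetical fallbacks: rely on PATH and the current directory when unset,
    # which is exactly what tends to fail under cron or a Windows scheduled job.
    python = config.get("path_to_python", "python")
    script = config.get("path_to_workbench_script", "./workbench")
    return [python, script, "--config", secondary_config_path]


config = {
    "secondary_tasks": ["children.yml"],
    "path_to_python": "/usr/bin/python",
    "path_to_workbench_script": "/home/mark/islandora_workbench/workbench",
}

for secondary in config["secondary_tasks"]:
    print(secondary_command(config, secondary))
```

With both paths given explicitly, the resulting command no longer depends on the scheduled job's working directory or PATH.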
Creating collections and members together
Using a variation of the "With page/child-level metadata" approach, you can create a collection node and assign members to it at the same time (i.e., in a single Workbench job). Here is a simple example CSV which shows the references from the members'
parent_id field to the collections'
id field:
id,parent_id,file,title,field_model,field_member_of,field_weight
1,,,A collection of animal photos,24,,
2,1,cat.jpg,Picture of a cat,25,,
3,1,dog.jpg,Picture of a dog,25,,
4,1,horse.jpg,Picture of a horse,25,,
The use of the
parent_id and field_member_of fields is the same here as when creating paged or compound children. However, unlike with paged or compound objects, in this case we leave the values in
field_weight empty, since Islandora collections don't use
field_weight to determine order of members. Collection Views are sorted using other fields.
Creating collection nodes and member nodes using this method assumes that collection nodes and member nodes have the same Drupal content type. If your collection objects have a Drupal content type that differs from their members' content type, you need to use the "Using a secondary task" method to ingest collections and members in the same Workbench job.
The following table summarizes the different ways Workbench can be used to create parent/child relationships between nodes:
|Method|Relationships created by|field_weight|Advantage|
|---|---|---|---|
|Subdirectories|Directory structure|Do not include column in CSV; autopopulated.|Useful for creating paged content where pages don't have their own metadata.|
|Parent/child-level metadata in same CSV|References from child's parent_id to parent's id in the same CSV|Column required; values required in child rows|Allows including parent and child metadata in same CSV.|
|Secondary task|References from parent_id in the secondary (child) CSV to id in the primary (parent) CSV|Column and values recommended in secondary (child) CSV data|Primary and secondary tasks have their own configuration and CSV files, which allows children to have a Drupal content type that differs from their parents' content type. Allows creation of parents and children in same Workbench job.|
|Collections and members together|References from child (member) parent_id to parent (collection) id in the same CSV|Column required in CSV but must be empty (collections do not use weight to determine sort order)|Allows creation of collection and members in same Islandora Workbench job.|