Creating paged, compound, and collection content
Islandora Workbench provides three ways to create paged and compound content:
- using a subdirectory structure to define the relationship between the parent item and its children
- using page-level metadata in the CSV to establish that relationship
- using a secondary task.
Using subdirectories
Enable this method by including paged_content_from_directories: true in your configuration file. Use this method when you are creating books, newspaper issues, or other paged content where your pages don't have their own metadata. This method groups page-level files into subdirectories that correspond to each parent, and does not require (or allow) page-level metadata in the CSV file. Each parent (book, newspaper issue, etc.) has a row in the CSV file, e.g.:
id,title,field_model,field_display_hints
book1,How to Use Islandora Workbench like a Pro,28,2
book2,Using Islandora Workbench for Fun and Profit,28,2
Note
Unlike every other Islandora Workbench "create" configuration, the metadata CSV should not contain a file
column. This means that content created using this method cannot be created using the same CSV file as other content.
Each parent's pages are located in a subdirectory of the input directory that is named to match the value of the id
field of the parent item they are pages of:
books/
├── book1
│ ├── page-001.jpg
│ ├── page-002.jpg
│ └── page-003.jpg
├── book2
│ ├── isbn-1843341778-001.jpg
│ ├── using-islandora-workbench-page-002.jpg
│ └── page-003.jpg
└── metadata.csv
The page filenames are significant. Each page's sequence is taken from the last segment of its filename before the extension. That segment is separated from the rest of the filename by a dash (-), although you can use another character by setting the paged_content_sequence_separator option in your configuration file. For example, using the filenames for "book1" above, the sequence of "page-001.jpg" is "001". Dashes (or whatever your separator character is) can exist elsewhere in filenames, since Workbench will always use the string after the last dash as the sequence number; for example, the sequence of "isbn-1843341778-001.jpg" for "book2" is also "001". Workbench takes this sequence number, strips off any leading zeros, and uses it to populate field_weight in the page nodes, so "001" becomes a weight value of 1, "002" becomes a weight value of 2, and so on.
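The filename rule above can be sketched in a few lines of Python. This is an illustration only, not Workbench's own code: the function name page_weight and its separator parameter are hypothetical, with the parameter standing in for the paged_content_sequence_separator setting.

```python
import os


def page_weight(filename, separator="-"):
    """Return a page's weight from the last separator-delimited segment of its filename."""
    # Drop the extension: "page-001.jpg" -> "page-001".
    stem, _extension = os.path.splitext(filename)
    # rpartition splits on the LAST occurrence of the separator,
    # so separators elsewhere in the filename are ignored.
    _prefix, _sep, sequence = stem.rpartition(separator)
    # int() discards leading zeros: "001" -> 1.
    return int(sequence)


for name in ("page-001.jpg", "isbn-1843341778-001.jpg", "page-003.jpg"):
    print(name, page_weight(name))
```

A filename whose last segment is not numeric would raise a ValueError in this sketch; how Workbench itself reports such files is not covered here.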
Titles for pages are generated automatically using the pattern parent_title + ", page " + sequence_number, where "parent title" is inherited from the page's parent node and "sequence number" is the page's sequence. For example, if a page's parent has the title "How to Write a Book" and its sequence number is 450, its automatically generated title will be "How to Write a Book, page 450". You can override this pattern by including the page_title_template setting in your configuration file. The value of this setting is a simple string template. The default, which generates the page title pattern described above, is '$parent_title, page $weight'. There are only two variables you can include in the template, $parent_title and $weight, although you do not need to include either one if you don't want that information appearing in your page titles.
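The template syntax shown above matches that of Python's standard string.Template class, so the substitution can be illustrated with it. This is a sketch under that assumption; whether Workbench uses this class internally is not stated here, and the function name page_title is hypothetical.

```python
from string import Template


def page_title(parent_title, weight, template="$parent_title, page $weight"):
    """Fill a page title template with the parent's title and the page's weight."""
    # Variables that are absent from the template are simply not used,
    # so a template may contain both, one, or neither of them.
    return Template(template).safe_substitute(parent_title=parent_title, weight=weight)


print(page_title("How to Write a Book", 450))
# Using only one of the two variables:
print(page_title("How to Write a Book", 450, template="Page $weight"))
```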
Finally, even though only minimal metadata is assigned to pages using this method (i.e., the automatically generated title and Islandora model), you can add additional metadata to pages using a separate update
task.
Important things to note when using this method:
- To use this method of creating paged content, you must include paged_content_page_model_tid in your configuration file and set it to your Islandora's term ID for the "Page" term in the Islandora Models vocabulary (or to http://id.loc.gov/ontologies/bibframe/part).
- The Islandora model of the parent is not set automatically. You need to include a field_model value for each item in your CSV file.
- You should also include a field_display_hints column in your CSV. This value is applied to the parent nodes and also the page nodes, unless the paged_content_page_display_hints setting is present in your configuration file. However, if you normally don't set the "Display hints" field in your objects but use a Context to determine how objects display, you should not include a field_display_hints column in your CSV file.
- id can be defined as another field name using the id_field configuration option. If you do define a different ID field using the id_field option, creating the parent/paged item relationships will still work.
- The Drupal content type for page nodes is inherited from the parent, unless you specify a different content type in the paged_content_page_content_type setting in your configuration file.
With page/child-level metadata
Using this method, the metadata CSV file contains a row for every item, both parents and children. You should use this method when you are creating books, newspaper issues, or other paged or compound content where each page has its own metadata, or when you are creating compound objects of any Islandora model. The file for each page/child is named explicitly in the page/child's file column rather than being in a subdirectory. To link the pages to the parent, Workbench establishes parent/child relationships between items whose parent_id values (the pages/children) are the same as the id value of another item (the parent). For this to work, your CSV file must contain a parent_id field plus the standard Islandora fields field_weight, field_member_of, and field_model (the role of these last three fields will be explained below). The id field is required in all CSV files used to create content, so in this case, your CSV needs both an id field and a parent_id field.
The following example illustrates how this works. Here is the raw CSV data:
id,parent_id,field_weight,file,title,field_description,field_model,field_member_of
001,,,,Postcard 1,The first postcard,28,197
003,001,1,image456.jpg,Front of postcard 1,The first postcard's front,29,
004,001,2,image389.jpg,Back of postcard 1,The first postcard's back,29,
002,,,,Postcard 2,The second postcard,28,197
006,002,1,image2828.jpg,Front of postcard 2,The second postcard's front,29,
007,002,2,image777.jpg,Back of postcard 2,The second postcard's back,29,
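The empty cells make this CSV difficult to read. Here is the same data laid out as a table:
id | parent_id | field_weight | file | title | field_description | field_model | field_member_of |
---|---|---|---|---|---|---|---|
001 | | | | Postcard 1 | The first postcard | 28 | 197 |
003 | 001 | 1 | image456.jpg | Front of postcard 1 | The first postcard's front | 29 | |
004 | 001 | 2 | image389.jpg | Back of postcard 1 | The first postcard's back | 29 | |
002 | | | | Postcard 2 | The second postcard | 28 | 197 |
006 | 002 | 1 | image2828.jpg | Front of postcard 2 | The second postcard's front | 29 | |
007 | 002 | 2 | image777.jpg | Back of postcard 2 | The second postcard's back | 29 | |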
The data contains rows for two postcards (rows with id
values "001" and "002") plus a back and front for each (the remaining four rows). The parent_id
value for items with id
values "003" and "004" is the same as the id
value for item "001", which will tell Workbench to make both of those items children of item "001"; the parent_id
value for items with id
values "006" and "007" is the same as the id
value for item "002", which will tell Workbench to make both of those items children of the item "002". We can't populate field_member_of
for the child pages in our CSV because we won't have node IDs for the parents until they are created as part of the same batch as the children.
In this example, the rows for our postcard objects have empty parent_id
, field_weight
, and file
columns because our postcards are not children of other nodes and don't have their own media. (However, the records for our postcard objects do have a value in field_member_of
, which is the node ID of the "Postcards" collection that already/hypothetically exists.) Rows for the postcard front and back image objects have a value in their field_weight
field, and they have values in their file
column because we are creating objects that contain image media. Importantly, they have no value in their field_member_of
field because the node ID of the parent isn't known when you create your CSV; instead, Islandora Workbench assigns each child's field_member_of
dynamically, just after its parent node is created.
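The dynamic assignment of field_member_of can be illustrated with a small simulation. This sketch does not talk to Drupal: the function link_children is hypothetical, and the node IDs (starting at an arbitrary 500) are stand-ins for the IDs Drupal would assign as each node is created.

```python
import csv
import io
from itertools import count

CSV_DATA = """id,parent_id,field_weight,file,title,field_description,field_model,field_member_of
001,,,,Postcard 1,The first postcard,28,197
003,001,1,image456.jpg,Front of postcard 1,The first postcard's front,29,
004,001,2,image389.jpg,Back of postcard 1,The first postcard's back,29,
"""


def link_children(csv_text, first_node_id=500):
    """Simulate top-to-bottom node creation, filling in each child's field_member_of."""
    node_ids = count(first_node_id)  # stand-in for node IDs assigned by Drupal
    id_to_node_id = {}  # CSV id value -> node ID of the node created for that row
    created = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["parent_id"]:
            # The parent's row came earlier, so its node ID is already known.
            row["field_member_of"] = str(id_to_node_id[row["parent_id"]])
        node_id = next(node_ids)
        id_to_node_id[row["id"]] = node_id
        created.append((node_id, row["title"], row["field_member_of"]))
    return created


for node in link_children(CSV_DATA):
    print(node)
```

In this run the postcard keeps its collection's node ID (197) in field_member_of, while the front and back both receive the simulated node ID of the postcard itself.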
Some important things to note:
- Currently, you need to include the option allow_missing_files: true in your configuration file when using this method to create paged/compound content. See this issue for more information.
- id can be defined as another field name using the id_field configuration option. If you do define a different ID field using the id_field option, creating the parent/child relationships will still work.
- The values of the id and parent_id columns do not have to follow any sequential pattern. Islandora Workbench treats them as simple strings and matches them on that basis, without looking for sequential relationships of any kind between the two fields.
- The CSV records for child items don't need to come immediately after the record for their parent, but they do need to come after that CSV record. (--check will tell you if it finds any child rows that come before their parent rows.) This is because Workbench creates nodes in the order their records are in the CSV file (top to bottom). As long as the parent node has already been created when a child node is created, the parent/child relationship via the child's field_member_of will be correct. See the next paragraph for some suggestions on planning for large ingests of paged or compound items.
- Currently, you must include values in the children's field_weight column (except when creating a collection and its members at the same time; see below). It may be possible to automatically generate values for this field (see this issue).
- Currently, Islandora model values (e.g. "Paged Content", "Page") are not automatically assigned. You must include the correct "Islandora Models" taxonomy term IDs in your field_model column for all parent and child records, as you would for any other Islandora objects you are creating. As with field_weight, it may be possible to automatically generate values for this field (see this issue).
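If you want to confirm the row-order requirement yourself before running --check, a check along the following lines would do it. This is a hypothetical helper, not the logic Workbench uses; note that it also flags children whose parent_id does not appear anywhere in the CSV.

```python
import csv
import io


def children_before_parents(csv_text):
    """Return the ids of child rows whose parent_id has not appeared in an earlier row."""
    seen_ids = set()
    problems = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        parent_id = row.get("parent_id", "")
        # ids are compared as plain strings, with no sequential meaning.
        if parent_id and parent_id not in seen_ids:
            problems.append(row["id"])
        seen_ids.add(row["id"])
    return problems


good = "id,parent_id,title\n001,,Postcard 1\n003,001,Front of postcard 1\n"
bad = "id,parent_id,title\n003,001,Front of postcard 1\n001,,Postcard 1\n"
print(children_before_parents(good))
print(children_before_parents(bad))
```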
Since parent items (collections, book-level items, newspaper issue-level items, top-level items in compound items, etc.) need to exist in Drupal before their children can be ingested, you need to plan your "create" tasks accordingly. For example:
- If you want to use a single "create" task to ingest all the parents and children at the same time, for each compound item, the parent CSV record must come before the records for the children/pages.
- If you would rather use multiple "create" tasks, you can create all your collections first, then, in subsequent "create" tasks, use their respective node IDs in the field_member_of CSV column for their members. If you use a separate "create" task to create members of a single collection, you can define the value of field_member_of in a CSV field template.
- If you are ingesting a large set of books, you can ingest the book-level items first, then use their node IDs in a separate CSV for the pages of all books (each using its parent book node's node ID in its field_member_of column). Or, you could run a separate "create" task for each book, and use a CSV field template containing a field_member_of entry containing the book item's node ID.
- For newspapers, you could create the top-level newspaper first, then use its node ID in a subsequent "create" task for both newspaper issues and pages. In this task, the field_member_of column in rows for newspaper issues would contain the newspaper's node ID, but the rows for newspaper pages would have a blank field_member_of and a parent_id using the parent issue's id value.
Using a secondary task
You can configure Islandora Workbench to execute two "create" tasks - a primary and a secondary - that will result in all of the objects described in both CSV files being ingested during the same Workbench job. Parent/child relationships between items are created by referencing the row IDs in the primary task's CSV file from the secondary task's CSV file. The benefit of using this method is that each task has its own configuration file, allowing you to create children that have a different Drupal content type than their parents.
The primary task's CSV describes the parent objects, and the secondary task's CSV describes the children. The two are linked via references from the children CSV's parent_id values to their parents' id values, in much the same way as in the "With page/child-level metadata" method described above. The difference is that the references span CSV files. The parents and children each have their own CSV input file (and in fact, their own configuration file). Each task is a standard Islandora Workbench "create" task, joined by one setting in the primary's configuration file.
In the following example, the top CSV file (the primary) describes the parents, and the bottom CSV file (the secondary) describes the children:
As you can see, values in the parent_id
column in the secondary CSV reference values in the id
column in the primary CSV: parent_id
001 in the secondary CSV matches id
001 in the primary, parent_id
003 in the secondary matches id
003 in the primary, and so on.
You configure secondary tasks by adding the secondary_tasks
setting to your primary configuration file, like this:
task: create
host: "http://localhost:8000"
username: admin
password: islandora
# This is the setting that links the two configuration files together.
secondary_tasks: ['children.yml']
input_csv: parents.csv
nodes_only: true
In the secondary_tasks
setting, you name the configuration file of the secondary task. The secondary task's configuration file (in this example, named "children.yml") contains no indication that it's a secondary task:
task: create
host: "http://localhost:8000"
username: admin
password: islandora
input_csv: kids.csv
csv_field_templates:
- field_model: http://purl.org/coar/resource_type/c_c513
Note
The nodes_only
setting in the above example primary configuration file and the csv_field_templates
setting in the secondary configuration file are not relevant to the primary/secondary task functionality; they're included to illustrate that the two configuration files can differ.
When you run Workbench, it executes the primary task first, then the secondary task. Workbench keeps track of each pairing of an id value with the node ID created for it in the primary task. During the execution of the secondary task, it uses these pairings to populate the field_member_of values in the secondary task with the node IDs corresponding to the referenced primary id values.
Some things to note about secondary tasks:
- Only "create" tasks can be used as the primary and secondary tasks.
- When you have a secondary task configured, running --check will validate both tasks' configuration and input data.
- The secondary CSV must contain parent_id and field_member_of columns. field_member_of must be empty, since it is auto-populated by Workbench using node IDs from the newly created parent objects. If you want to assign an order to the child objects within each parent object, include field_weight with the appropriate values (1, 2, 3, etc., the lower numbers being earlier/higher in sort order).
- If a row in the secondary task CSV does not have a parent_id that matches an id of a row in the primary CSV, or if there is a matching row in the primary CSV and Workbench failed to create the described node, Workbench will skip creating the child and add an entry to the log indicating it did so.
- As already stated, each task has its own configuration file, which means that you can specify a content_type value in your secondary configuration file that differs from the content_type of the primary task.
- You can include more than one secondary task in your configuration. For example, secondary_tasks: ['first.yml', 'second.yml'] will execute the primary task, then the "first.yml" secondary task, then the "second.yml" secondary task, in that order. You would use multiple secondary tasks if you wanted to add children of different content types to the parent nodes.
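The hand-off between the primary and secondary tasks described in this section, including the skipping of children without a usable parent, can be sketched as follows. The function assign_parents, the node IDs, and the sample rows are all hypothetical; this illustrates the behavior rather than reproducing Workbench's implementation.

```python
def assign_parents(primary_node_ids, secondary_rows):
    """Split secondary rows into rows to create and ids of rows to skip.

    primary_node_ids maps primary CSV id values to the node IDs that were
    created for them; a primary row whose node failed to be created is
    simply absent from the map.
    """
    to_create, skipped = [], []
    for row in secondary_rows:
        node_id = primary_node_ids.get(row["parent_id"])
        if node_id is None:
            # No matching parent, or the parent was not created: skip the child.
            skipped.append(row["id"])
            continue
        # Copy the row and fill in the previously empty field_member_of.
        to_create.append({**row, "field_member_of": str(node_id)})
    return to_create, skipped


# Hypothetical results of a primary task: id "003" failed, so it has no entry.
primary_node_ids = {"001": 1201, "002": 1202}
secondary_rows = [
    {"id": "c1", "parent_id": "001", "field_member_of": ""},
    {"id": "c2", "parent_id": "003", "field_member_of": ""},
]
to_create, skipped = assign_parents(primary_node_ids, secondary_rows)
print(to_create)
print(skipped)
```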
Specifying paths to the python interpreter and to the workbench script
When using secondary tasks, there are a couple of situations where you may need to tell Workbench where the python interpreter is located, and where the "workbench" script is located.
The first is when you use a secondary task within a scheduled job (such as running Workbench via Linux's cron). Depending on how you configure the cron job, you will likely need to tell Workbench the absolute paths to the python interpreter and to the workbench script. This is because, unless your cron job changes into Workbench's working directory, Workbench will be looking in the wrong directory for the secondary task. The two config options you should use are:
path_to_python
path_to_workbench_script
An example of using these settings is:
secondary_tasks: ['children.yml']
path_to_python: '/usr/bin/python'
path_to_workbench_script: '/home/mark/islandora_workbench/workbench'
The second situation is when you use a secondary task while running Workbench in Windows and "python.exe" is not in the PATH of the user running the scheduled job. Specifying the absolute path to "python.exe" will ensure that Workbench can execute the secondary task properly, like this:
secondary_tasks: ['children.yml']
path_to_python: 'c:/program files/python39/python.exe'
path_to_workbench_script: 'd:/users/mark/islandora_workbench/workbench'
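To see why absolute paths matter, consider how a command line for a secondary task might be assembled from these two settings. This is an assumption-laden sketch: the function secondary_task_command and its fallback values ("python" and "./workbench") are hypothetical and are not taken from Workbench's source. Only the --config option is Workbench's documented way of naming a configuration file.

```python
def secondary_task_command(config, secondary_config_path):
    """Build an argument list for launching a secondary task as a separate process."""
    # Hypothetical fallbacks: relative names like these depend on the PATH and
    # on the current working directory, which is what goes wrong under cron.
    python = config.get("path_to_python", "python")
    script = config.get("path_to_workbench_script", "./workbench")
    return [python, script, "--config", secondary_config_path]


config = {
    "path_to_python": "/usr/bin/python",
    "path_to_workbench_script": "/home/mark/islandora_workbench/workbench",
}
print(secondary_task_command(config, "children.yml"))
print(secondary_task_command({}, "children.yml"))
```

With both settings present, the resulting command no longer depends on the directory the scheduled job happens to start in.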
Creating collections and members together
Using a variation of the "With page/child-level metadata" approach, you can create a collection node and assign members to it at the same time (i.e., in a single Workbench job). Here is a simple example CSV which shows the references from the members' parent_id
field to the collections' id
field:
id,parent_id,file,title,field_model,field_member_of,field_weight
1,,,A collection of animal photos,24,,
2,1,cat.jpg,Picture of a cat,25,,
3,1,dog.jpg,Picture of a dog,25,,
4,1,horse.jpg,Picture of a horse,25,,
The use of the parent_id
and field_member_of
fields is the same here as when creating paged or compound children. However, unlike with paged or compound objects, in this case we leave the values in field_weight
empty, since Islandora collections don't use field_weight
to determine order of members. Collection Views are sorted using other fields.
Warning
Creating collection nodes and member nodes using this method assumes that collection nodes and member nodes have the same Drupal content type. If your collection objects have a Drupal content type that differs from their members' content type, you need to use the "Using a secondary task" method to ingest collections and members in the same Workbench job.
Summary
The following table summarizes the different ways Workbench can be used to create parent/child relationships between nodes:
Method | Relationships created by | field_weight | Advantage |
---|---|---|---|
Subdirectories | Directory structure | Do not include column in CSV; autopopulated. | Useful for creating paged content where pages don't have their own metadata. |
Parent/child-level metadata in same CSV | References from child's parent_id to parent's id in same CSV data | Column required; values required in child rows | Allows including parent and child metadata in same CSV. |
Secondary task | References from parent_id in child CSV file to id in parent CSV file | Column and values recommended in secondary (child) CSV data | Primary and secondary tasks have their own configuration and CSV files, which allows children to have a Drupal content type that differs from their parents' content type. Allows creation of parents and children in same Workbench job. |
Collections and members together | References from child (member) parent_id fields to parent (collection) id fields in same CSV data | Column required in CSV but must be empty (collections do not use weight to determine sort order) | Allows creation of collection and members in same Islandora Workbench job. |