Islandora 7 – splitting CSV data on ingest

June 24, 2014

Posted by ficial in code fixes, islandora, techy, xsl.
TLDR:
It’s tricky to tokenize CSV values on ingest using a MODS form. To do so, write an XSL that splits the appropriate fields, put it in …/sites/all/modules/islandora_xml_forms/builder/self_transforms/, and set it as the self-transform for the relevant form. You’ll need to write your own CSV tokenizer, since Islandora 7 uses a version of XSL older than 2.0, which has no built-in tokenizing function. See below for example code.

LONG FORM:
In our Islandora install we’re using MODS as the main metadata schema. That is, the ingest forms are set up to generate MODS XML. However, the way the form is set up is anti-helpful for some of the people who are doing our data loads. Specifically, the subject-topic, subject-geographic, and subject-temporal fields were not being processed as people expected.

Those three fields are multi-value ones, meaning they support a structure like:

...
<subject>
  <topic>cows</topic>
  <topic>bovines</topic>
  <topic>farm animals</topic>
  <geographic>field</geographic>
  <geographic>farm</geographic>
  <temporal>historic</temporal>
  <temporal>1800s</temporal>
</subject>
...

However, when using the form we want to be able to enter them as CSV values, e.g. ‘cows, bovines, farm animals’. Unfortunately, the default behavior is to treat such input as a single value, giving a result like:

...
<subject>
  <topic>cows, bovines, farm animals</topic>
  <geographic>field, farm</geographic>
  <temporal>historic, 1800s</temporal>
</subject>
...

The Islandora 7 ingest forms system does provide a place where this can be corrected, but it’s subtle and tricky. Specifically, one has to create an XSL to do the proper tokenizing and set that up as a ‘self transform’ for the form. Creating the tokenizing XSL is in turn made more difficult because Islandora 7 uses a version of XSL earlier than 2.0, which means that there is no built-in tokenizing function. The XSL file needs to go in …/sites/all/modules/islandora_xml_forms/builder/self_transforms/, which took me a while to find because I was misled by the ‘builder’ folder name: code in that folder relates not only to building forms, but also to using and processing them.
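
For contrast, here is roughly what the split could look like under an XSLT 2.0 processor, where the XPath 2.0 tokenize() function is available. This is only a sketch to show what the hand-written tokenizer has to stand in for; it will not run in the Islandora 7 setup described here, and it handles just the topic field.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- XSLT 2.0 ONLY: shown for comparison, not usable with an XSLT 1.0 processor -->
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:mods="http://www.loc.gov/mods/v3">
  <!-- Split the topic on commas, skip blank tokens, emit one element per token -->
  <xsl:template match="mods:subject/mods:topic">
    <xsl:for-each select="tokenize(., ',')[normalize-space()]">
      <mods:topic><xsl:value-of select="normalize-space(.)"/></mods:topic>
    </xsl:for-each>
  </xsl:template>
  <!-- Copy everything else through unchanged -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```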

Following some suggestions on various sites, I organized my tokenizing code in a separate file and included/imported it into the self-transform. Here’s where I ended up:

TOKENIZER (csv_tokenizer.xsl):
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:mods="http://www.loc.gov/mods/v3">
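    <!-- Recursively splits $commaStr on commas and emits one element,
         named by $tagLabel, for each non-blank token. -->
    <xsl:template name="csvtokenizer" >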
      <xsl:param name="commaStr"/>
      <xsl:param name="tagLabel"/>
      <xsl:if test="normalize-space($commaStr) != ''">
        <xsl:choose>
          <xsl:when test="contains($commaStr, ',')">
            <xsl:call-template name="csvtokenizer">
              <xsl:with-param name="commaStr" select="substring-before($commaStr,',')"/>
              <xsl:with-param name="tagLabel" select="$tagLabel"/>
            </xsl:call-template>
            <xsl:call-template name="csvtokenizer">
              <xsl:with-param name="commaStr" select="substring-after($commaStr,',')"/>
              <xsl:with-param name="tagLabel" select="$tagLabel"/>
            </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
            <xsl:if test="normalize-space($tagLabel) != ''">
              <xsl:element name="{$tagLabel}">
                <!-- trim leading and trailing whitespace from the token -->
                <xsl:value-of select="normalize-space($commaStr)"/>
              </xsl:element>
            </xsl:if>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:if>
    </xsl:template>
</xsl:stylesheet>
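
To check the tokenizer on its own before wiring it into a form, a small throwaway stylesheet along these lines may help. The file name test_tokenizer.xsl and the hard-coded sample string are assumptions for illustration; the only thing taken from the code above is the csvtokenizer template and its two parameters.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- test_tokenizer.xsl: hypothetical helper, placed next to csv_tokenizer.xsl -->
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:mods="http://www.loc.gov/mods/v3">
  <xsl:import href="csv_tokenizer.xsl"/>
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
  <!-- Ignore the input document and tokenize a fixed sample string -->
  <xsl:template match="/">
    <mods:subject>
      <xsl:call-template name="csvtokenizer">
        <xsl:with-param name="commaStr" select="'cows, bovines, farm animals'"/>
        <xsl:with-param name="tagLabel" select="'mods:topic'"/>
      </xsl:call-template>
    </mods:subject>
  </xsl:template>
</xsl:stylesheet>
```

Run against any well-formed XML file with an XSLT 1.0 processor (for example, xsltproc test_tokenizer.xsl some.xml, if xsltproc is installed), this should produce a mods:subject wrapper containing three mods:topic elements: cows, bovines, and farm animals.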

SELF TRANSFORM (cleanup_mods.xsl - NOTE: this also removes empty fields):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:mods="http://www.loc.gov/mods/v3">
<xsl:import href="csv_tokenizer.xsl"/>
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" media-type="text/xml"/>
<xsl:strip-space elements="*"/>
<!-- Drop elements that have no child nodes at all. -->
<xsl:template match="*[not(node())]"/>
<xsl:template match="mods:subject/mods:topic">
  <xsl:call-template name="csvtokenizer">
    <xsl:with-param name="commaStr" select="normalize-space(.)"/>
    <xsl:with-param name="tagLabel" select="'mods:topic'"/>
  </xsl:call-template>
</xsl:template>
<xsl:template match="mods:subject/mods:geographic">
  <xsl:call-template name="csvtokenizer">
    <xsl:with-param name="commaStr" select="normalize-space(.)"/>
    <xsl:with-param name="tagLabel" select="'mods:geographic'"/>
  </xsl:call-template>
</xsl:template>
<xsl:template match="mods:subject/mods:temporal">
  <xsl:call-template name="csvtokenizer">
    <xsl:with-param name="commaStr" select="normalize-space(.)"/>
    <xsl:with-param name="tagLabel" select="'mods:temporal'"/>
  </xsl:call-template>
</xsl:template>
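<!-- Identity copy for everything else, skipping nodes and attributes that are empty or whitespace-only. -->
<xsl:template match="node()|@*">
  <xsl:copy>
    <xsl:apply-templates select="node()[normalize-space()]|@*[normalize-space()]"/>
  </xsl:copy>
</xsl:template>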
</xsl:stylesheet>

I could have combined the three tokenizing template matches into a single one, with a match pattern that or-s the three fields together and a dynamic tag label, but I find the code here much easier to read and the maintenance cost very low.
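
For anyone who prefers the combined form, a minimal sketch might look like the following. The file name cleanup_mods_combined.xsl is hypothetical, and building the label as concat('mods:', local-name()) is an assumption, not something from the code above; it is used so that the mods: prefix is always the one declared in csv_tokenizer.xsl, whatever prefix the incoming document happens to use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- cleanup_mods_combined.xsl: hypothetical variant of cleanup_mods.xsl -->
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:mods="http://www.loc.gov/mods/v3">
<xsl:import href="csv_tokenizer.xsl"/>
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" media-type="text/xml"/>
<xsl:strip-space elements="*"/>
<!-- Drop elements that have no child nodes at all -->
<xsl:template match="*[not(node())]"/>
<!-- One template for all three CSV-style subject fields -->
<xsl:template match="mods:subject/mods:topic
                   | mods:subject/mods:geographic
                   | mods:subject/mods:temporal">
  <xsl:call-template name="csvtokenizer">
    <xsl:with-param name="commaStr" select="normalize-space(.)"/>
    <!-- assumption: rebuild the tag name from the matched element's local name -->
    <xsl:with-param name="tagLabel" select="concat('mods:', local-name())"/>
  </xsl:call-template>
</xsl:template>
<!-- Identity copy for everything else, skipping empty nodes and attributes -->
<xsl:template match="node()|@*">
  <xsl:copy>
    <xsl:apply-templates select="node()[normalize-space()]|@*[normalize-space()]"/>
  </xsl:copy>
</xsl:template>
</xsl:stylesheet>
```

The trade-off is the one described above: fewer lines, but adding or removing a field now means editing a match pattern rather than copying or deleting a whole template.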

The self-transform runs before any other transforms, so the splitting done here propagates downstream without any further work.
