Islandora 7 – SOLR Faceting by Collection Name (or Label) April 14, 2014

Posted by ficial in islandora, techy.

One of the basic features we need from our Islandora system is the ability to facet search results by collection name. Getting this working turns out to be a non-trivial project. The essential problem is that although objects in Islandora know which collection(s) they belong to, that information is stored in the object only as a relationship identified by a computer-y PID. Faceting directly on that relationship gives something that looks like ‘somenamespace:stuffcollection’ rather than the actual name of the collection, ‘Our Collection of Stuff’. In brief, the solution I used was to alter the way the objects are processed by fedoragsearch so that it sends actual collection names rather than just PID info. I did this by extending the RELS-EXT processing to load and use the relevant collection data when handling isMemberOfCollection fields.

The faceting options that are available are determined by what information is in the SOLR indexes – faceting is NOT driven directly by the fedora object store! To allow faceting by collection name we need to tell SOLR the names of the collection(s) of the object. This means that, similar to getting full text search working, we need to touch both the fedoragsearch system, to deliver the desired info to SOLR, and the SOLR config, to make the desired fields available for searching (and faceting).

In our fedoragsearch setup we already had pieces in place to process the RELS-EXT info, which is where collection membership (among other things) resides. This part of the object’s FOXML looks something like this:

  <foxml:datastream ID="RELS-EXT" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
    <foxml:datastreamVersion ID="RELS-EXT.0" LABEL="Fedora Object to Object Relationship Metadata." CREATED="2013-11-08T15:49:50.889Z" MIMETYPE="application/rdf+xml" FORMAT_URI="info:fedora/fedora-system:FedoraRELSExt-1.0" SIZE="548">
      <foxml:xmlContent>
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:fedora="info:fedora/fedora-system:def/relations-external#" xmlns:fedora-model="info:fedora/fedora-system:def/model#" xmlns:islandora="http://islandora.ca/ontology/relsext#">
          <rdf:Description rdf:about="info:fedora/somenamespace:52">
            <fedora:isMemberOfCollection rdf:resource="info:fedora/somenamespace:projectid"/>
            <fedora-model:hasModel rdf:resource="info:fedora/islandora:sp-audioCModel"/>
          </rdf:Description>
        </rdf:RDF>
      </foxml:xmlContent>
    </foxml:datastreamVersion>
  </foxml:datastream>

where the object has a PID of ‘somenamespace:52’ and is a member of the collection with PID ‘somenamespace:projectid’.

In the main gsearch_solr folder we have a sub-folder called islandora_transforms, in which there is a file called RELS-EXT_to_solr.xslt. This file is used by demoFoxmlToSolr.xslt via a straightforward include:

  <xsl:include href="/usr/local/fedora/tomcat/webapps/fedoragsearch/WEB-INF/classes/config/index/gsearch_solr/islandora_transforms/RELS-EXT_to_solr.xslt"/>

which initially was just this:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- RELS-EXT -->
  <xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:foxml="info:fedora/fedora-system:def/foxml#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    exclude-result-prefixes="rdf">
    <xsl:template match="foxml:datastream[@ID='RELS-EXT']/foxml:datastreamVersion[last()]" name="index_RELS-EXT">
      <xsl:param name="content"/>
      <xsl:param name="prefix">RELS_EXT_</xsl:param>
      <xsl:param name="suffix">_ms</xsl:param>
      <xsl:for-each select="$content//rdf:Description/*[@rdf:resource]">
        <field>
          <xsl:attribute name="name">
            <xsl:value-of select="concat($prefix, local-name(), '_uri', $suffix)"/>
          </xsl:attribute>
          <xsl:value-of select="@rdf:resource"/>
        </field>
      </xsl:for-each>
      <xsl:for-each select="$content//rdf:Description/*[not(@rdf:resource)][normalize-space(text())]">
        <field>
          <xsl:attribute name="name">
            <xsl:value-of select="concat($prefix, local-name(), '_literal', $suffix)"/>
          </xsl:attribute>
          <xsl:value-of select="text()"/>
        </field>
      </xsl:for-each>
    </xsl:template>
  </xsl:stylesheet>

The initial version of this file just directly processes the contents of the RELS-EXT datastream of the object’s FOXML, eventually creating the SOLR fields RELS_EXT_isMemberOfCollection_uri_ms/mt and RELS_EXT_hasModel_uri_ms/mt (fedoragsearch created the _uri info, which SOLR extends to the _ms/_mt versions). We can facet directly on those to get the desired breakdowns by collection (and by model, for that matter), but the text presented to the user is basically meaningless. So, I added some code to load the actual collection data for each isMemberOfCollection relation, and then pulled the human-readable collection title from that.

From my perspective there were three particularly tricky parts to this (further complicated by my limited proficiency/understanding of XSLT and XPath). First, how do I catch all the memberships and nothing else? Second, how do I get the actual collection PID? Third, how do I pull in and process additional content based on that PID? In bulling my way past these obstacles I ended up with code that I’m dead sure isn’t as pretty or efficient as it could be, but on the plus side it works for me.

In step one I re-used the looping example already in the file to look through all the description sub-fields that have a resource attribute, which processes this data:

  <rdf:Description rdf:about="info:fedora/somenamespace:52">
    <fedora:isMemberOfCollection rdf:resource="info:fedora/somenamespace:projectid"/>
    <fedora-model:hasModel rdf:resource="info:fedora/islandora:sp-audioCModel"/>
  </rdf:Description>

and hits both the fedora:isMemberOfCollection and fedora-model:hasModel fields. To make sure I’m not accidentally processing models I added an if test that examines the name of the field and makes sure I’m only proceeding with further work on isMemberOfCollection fields (NOTE: I’ll probably be adding a branch for processing hasModel at some point as well – the code will be very similar).

Once I’ve ensured that I’m working with the right data I need to get the PID of the collection. This part baffled me for a while because I hadn’t noticed that the value of that field wasn’t just the collection PID; it had ‘info:fedora/’ prepended to it. Once I realized what was going on I used a simple substring-after to pull out only the PID part.

Lastly, I needed to pull and process the collection with that PID in order to get its human-readable title. Luckily I had an analogous example of that kind of thing in the external datastream processing that happens in demoFoxmlToSolr.xslt: I loaded the collection’s DC datastream into a local variable, then processed that to pull out the title. Finally, once I’d grabbed all the text I needed, I created the appropriate fields to send on to SOLR (the collection_membership. prefix I used is one I just made up on the spot – there’s nothing special about it and it’s entirely possible that there’s some other structure/naming scheme I should be using instead). The final, modified RELS-EXT_to_solr.xslt looks like this:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- RELS-EXT -->
  <xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:foxml="info:fedora/fedora-system:def/foxml#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="rdf dc">

    <!-- The dc prefix is declared above because the dc:title lookup below needs it -->
    <xsl:template match="foxml:datastream[@ID='RELS-EXT']/foxml:datastreamVersion[last()]" name="index_RELS-EXT">
      <xsl:param name="content"/>
      <xsl:param name="prefix">RELS_EXT_</xsl:param>
      <xsl:param name="suffix">_ms</xsl:param>

      <!-- Relationships that point at another object (rdf:resource) -->
      <xsl:for-each select="$content//rdf:Description/*[@rdf:resource]">
        <field>
          <xsl:attribute name="name">
            <xsl:value-of select="concat($prefix, local-name(), '_uri', $suffix)"/>
          </xsl:attribute>
          <xsl:value-of select="@rdf:resource"/>
        </field>
      </xsl:for-each>

      <!-- Relationships that carry a literal text value -->
      <xsl:for-each select="$content//rdf:Description/*[not(@rdf:resource)][normalize-space(text())]">
        <field>
          <xsl:attribute name="name">
            <xsl:value-of select="concat($prefix, local-name(), '_literal', $suffix)"/>
          </xsl:attribute>
          <xsl:value-of select="text()"/>
        </field>
      </xsl:for-each>

      <!-- New: look up the human-readable title of each parent collection -->
      <xsl:for-each select="$content//rdf:Description/*[@rdf:resource]">

        <xsl:if test="local-name()='isMemberOfCollection'">
          <!-- Strip the 'info:fedora/' prefix to get the bare collection PID -->
          <xsl:variable name="collectionPID" select="substring-after(@rdf:resource,'info:fedora/')"/>
          <!-- Fetch the collection's DC datastream; $PROT, $HOST, etc. are expected to come from the including demoFoxmlToSolr.xslt -->
          <xsl:variable name="collectionContent" select="document(concat($PROT, '://', $FEDORAUSERNAME, ':', $FEDORAPASSWORD, '@', $HOST, ':', $PORT,'/fedora/objects/', $collectionPID, '/datastreams/', 'DC', '/content'))"/>

          <field name="collection_membership.pid_ms">
            <xsl:value-of select="$collectionPID"/>
          </field>

          <!-- One string field (for faceting) and one text field (for searching) per title -->
          <xsl:for-each select="$collectionContent//dc:title">
            <field name="collection_membership.title_ms">
              <xsl:value-of select="text()"/>
            </field>
            <field name="collection_membership.title_mt">
              <xsl:value-of select="text()"/>
            </field>
          </xsl:for-each>

        </xsl:if>
  <!--
        <xsl:if test="local-name()='hasModel'">
          <xsl:variable name="modelPID" select="substring-after(@rdf:resource,'info:fedora/')"/>
          <field name="CSW_test_if_model">
            <xsl:value-of select="$modelPID"/>
          </field>
        </xsl:if>
  -->
      </xsl:for-each>

    </xsl:template>

  </xsl:stylesheet>
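
To make the result concrete, here is roughly what the transform should contribute to the SOLR update document for the sample object shown earlier. This is an illustration worked out from the stylesheet, not captured output: it assumes the default prefix and suffix parameters, and it assumes (purely for the example) that the DC datastream of ‘somenamespace:projectid’ has the title ‘Our Collection of Stuff’.

```xml
<!-- Illustrative only: the title value is an assumed example -->
<field name="RELS_EXT_isMemberOfCollection_uri_ms">info:fedora/somenamespace:projectid</field>
<field name="RELS_EXT_hasModel_uri_ms">info:fedora/islandora:sp-audioCModel</field>
<field name="collection_membership.pid_ms">somenamespace:projectid</field>
<field name="collection_membership.title_ms">Our Collection of Stuff</field>
<field name="collection_membership.title_mt">Our Collection of Stuff</field>
```

The two RELS_EXT_ fields come from the original loop; the three collection_membership fields come from the new one.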

Similar to the approach I used for the OCR text work, I was able to watch the fedoragsearch logs and verify that the output was as expected/needed. The changes on the SOLR side are pretty minor. I added a couple of lines to schema.xml to handle the new fields:

<field name="collection_membership.title_ms" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="collection_membership.title_mt" type="text_fgs" indexed="true" stored="true" multiValued="true"/>

and, though not necessary for the faceting aspect of this work, I added collection_membership.title_mt to the field list of the standard search request handler in solrconfig.xml:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
  <str name="echoParams">explicit</str>
  <str name="fl">*</str>
  <str name="q.alt">*:*</str>
  <str name="qf">
  PID
  dc.title
  ....
  ....
  collection_membership.title_mt
  </str>
  </lst>
</requestHandler>
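
One possible simplification on the schema.xml side, untested against this particular setup: SOLR's copyField directive can fill the tokenized _mt field from the string _ms field at index time, in which case the XSLT would only need to emit collection_membership.title_ms. A sketch, assuming the same two field definitions as above:

```xml
<!-- schema.xml: sketch of an alternative to emitting both fields from the XSLT -->
<field name="collection_membership.title_ms" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="collection_membership.title_mt" type="text_fgs" indexed="true" stored="true" multiValued="true"/>
<!-- Copy every title value into the tokenized field when a document is indexed -->
<copyField source="collection_membership.title_ms" dest="collection_membership.title_mt"/>
```

Whether this is worth doing is a judgment call: it moves a little logic out of the XSLT, at the cost of splitting the field handling across two files.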

The final step is to re-index everything, and add the collection_membership.title_ms field to the list of facet fields in the web-based islandora solr config tool (Islandora > Solr index > Solr settings > Facet settings; add collection_membership.title_ms to the Facet fields, and give it a label of Collection).

And that’s that. If anyone has any suggestions/thoughts about how to improve my XSLT I’d be thrilled to hear them.
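
As a starting point for such suggestions, here is one possible tidy-up of the collection loop, offered as a sketch rather than a tested drop-in. It selects only the isMemberOfCollection elements in the for-each itself, so no if test is needed, and it matches on the namespace-qualified name rather than on local-name(). It assumes the fedora and dc prefixes are declared on the xsl:stylesheet element (with the URIs given in the comment), and that $PROT, $FEDORAUSERNAME, $FEDORAPASSWORD, $HOST and $PORT are available exactly as in the stylesheet above. The variable name collectionDC is just an illustrative choice.

```xml
<!-- Sketch only: would replace the third for-each inside the index_RELS-EXT template.
     Assumes these declarations on xsl:stylesheet:
       xmlns:fedora="info:fedora/fedora-system:def/relations-external#"
       xmlns:dc="http://purl.org/dc/elements/1.1/" -->
<xsl:for-each select="$content//rdf:Description/fedora:isMemberOfCollection[@rdf:resource]">
  <!-- Strip the 'info:fedora/' prefix to get the bare collection PID -->
  <xsl:variable name="collectionPID" select="substring-after(@rdf:resource, 'info:fedora/')"/>
  <!-- Load the collection's DC datastream over the Fedora REST API -->
  <xsl:variable name="collectionDC" select="document(concat($PROT, '://', $FEDORAUSERNAME, ':', $FEDORAPASSWORD, '@', $HOST, ':', $PORT, '/fedora/objects/', $collectionPID, '/datastreams/DC/content'))"/>

  <field name="collection_membership.pid_ms">
    <xsl:value-of select="$collectionPID"/>
  </field>

  <!-- Skip empty titles and trim stray whitespace from the rest -->
  <xsl:for-each select="$collectionDC//dc:title[normalize-space()]">
    <field name="collection_membership.title_ms">
      <xsl:value-of select="normalize-space()"/>
    </field>
    <field name="collection_membership.title_mt">
      <xsl:value-of select="normalize-space()"/>
    </field>
  </xsl:for-each>
</xsl:for-each>
```

The behavior should match the original for well-formed data; the main differences are that a same-named element from some other namespace would no longer be picked up, and blank titles would no longer produce empty facet values.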
