fedora gsearch | f i c i a l

Islandora 7 – SOLR Faceting by Collection Name (or Label) April 14, 2014

Posted by ficial in islandora, techy.
Tags: collection faceting, faceting, fedora commons, fedora gsearch, gsearch, islandora, islandora 7, solr

One of the basic features we need from our Islandora system is the ability to facet search results by collection name. Getting this working turns out to be a non-trivial project. The essential problem is that although objects in islandora know in which collection(s) they reside, that information is stored in the object only as a relationship identified by a computer-y PID. If one uses that relationship directly to do faceting one gets something that looks like ‘somenamespace:stuffcollection’, rather than the actual name of the collection ‘Our Collection of Stuff’. In brief, the solution I used was to alter the way the objects were processed by fedoragsearch to send actual collection names rather than just PID info. I did this by extending the RELS-EXT processing to load and use the relevant collection data when handling isMemberOfCollection fields.

The faceting options that are available are determined by what information is in the SOLR indexes – faceting is NOT driven directly by the fedora object store! To allow faceting by collection name we need to tell SOLR the names of the collection(s) of the object. This means that, similar to getting full text search working, we need to touch both the fedoragsearch system to deliver the desired info to SOLR, and the SOLR config info to make the desired fields available for searching (and faceting).

In our fedoragsearch set up we already had pieces in place to process the RELS-EXT info, which is where collection membership (among other things) resides. This part of the object’s FOXML looks something like this:

<foxml:datastream ID="RELS-EXT" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
  <foxml:datastreamVersion ID="RELS-EXT.0" LABEL="Fedora Object to Object Relationship Metadata." CREATED="2013-11-08T15:49:50.889Z" MIMETYPE="application/rdf+xml" FORMAT_URI="info:fedora/fedora-system:FedoraRELSExt-1.0" SIZE="548">
    <foxml:xmlContent>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:fedora="info:fedora/fedora-system:def/relations-external#"
               xmlns:fedora-model="info:fedora/fedora-system:def/model#"
               xmlns:islandora="http://islandora.ca/ontology/relsext#">
        <rdf:Description rdf:about="info:fedora/somenamespace:52">
          <fedora:isMemberOfCollection rdf:resource="info:fedora/somenamespace:projectid"/>
          <fedora-model:hasModel rdf:resource="info:fedora/islandora:sp-audioCModel"/>
        </rdf:Description>
      </rdf:RDF>
    </foxml:xmlContent>
  </foxml:datastreamVersion>
</foxml:datastream>

where the object has a PID of ‘somenamespace:52’ and is a member of the collection with PID ‘somenamespace:projectid’.

In the main gsearch_solr folder we have a sub-folder called islandora_transforms, in which there is a file called RELS-EXT_to_solr.xslt. This file is used by demoFoxmlToSolr.xslt via a straightforward include:

<xsl:include href="/usr/local/fedora/tomcat/webapps/fedoragsearch/WEB-INF/classes/config/index/gsearch_solr/islandora_transforms/RELS-EXT_to_solr.xslt"/>

RELS-EXT_to_solr.xslt initially was just this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:foxml="info:fedora/fedora-system:def/foxml#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    exclude-result-prefixes="rdf">
  <xsl:template match="foxml:datastream[@ID='RELS-EXT']/foxml:datastreamVersion[last()]" name='index_RELS-EXT'>
    <xsl:param name="content"/>
    <xsl:param name="prefix">RELS_EXT_</xsl:param>
    <xsl:param name="suffix">_ms</xsl:param>
    <xsl:for-each select="$content//rdf:Description/*[@rdf:resource]">
      <field>
        <xsl:attribute name="name">
          <xsl:value-of select="concat($prefix, local-name(), '_uri', $suffix)"/>
        </xsl:attribute>
        <xsl:value-of select="@rdf:resource"/>
      </field>
    </xsl:for-each>
    <xsl:for-each select="$content//rdf:Description/*[not(@rdf:resource)][normalize-space(text())]">
      <field>
        <xsl:attribute name="name">
          <xsl:value-of select="concat($prefix, local-name(), '_literal', $suffix)"/>
        </xsl:attribute>
        <xsl:value-of select="text()"/>
      </field>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

The initial version of this file just directly processes the contents of the RELS-EXT datastream of the object’s FOXML, eventually creating the SOLR fields RELS_EXT_isMemberOfCollection_uri_ms/mt and RELS_EXT_hasModel_uri_ms/mt (fedoragsearch created the _uri info, which SOLR extends to the _ms/_mt versions). We can facet directly on those to get the desired breakdowns by collection (and by model, for that matter), but the text presented to the user is basically meaningless. So, I added some code to load the actual collection data for each isMemberOfCollection relation, and then pulled the human-readable collection title from that.

From my perspective there were three particularly tricky parts to this (further complicated by my limited proficiency/understanding of XSLT and XPATH). First, how do I catch all the memberships and nothing else? Second, how do I get the actual collection PID? Third, how do I pull in and process additional content based on that PID? In bulling my way past these obstacles I ended up with code that I’m dead sure isn’t as pretty or efficient as it could be, but on the plus side it works for me.

In step one I re-used the looping example already in the file to look through all the description sub-fields that have a resource attribute, which processes this data:

<rdf:Description rdf:about="info:fedora/somenamespace:52">
  <fedora:isMemberOfCollection rdf:resource="info:fedora/somenamespace:projectid"/>
  <fedora-model:hasModel rdf:resource="info:fedora/islandora:sp-audioCModel"/>
</rdf:Description>

and hits both the fedora:isMemberOfCollection and fedora-model:hasModel fields. To make sure I’m not accidentally processing models I added an if test that examines the name of the field and makes sure I’m only proceeding with further work on isMemberOfCollection fields (NOTE: I’ll probably be adding a branch for processing hasModel at some point as well – the code will be very similar).

Once I’ve ensured that I’m working with the right data I need to get the PID of the collection. This part baffled me for a while because I hadn’t noticed that the value of that field wasn’t just the collection PID, it had ‘info:fedora/’ prepended to it. Once I realized what was going on I used a simple substring-after to pull out only the PID part.

Lastly, I needed to pull and process the collection with that PID in order to get its human-readable title. Luckily I had an analogous example of that kind of thing in the external datastream processing that happens in demoFoxmlToSolr.xslt – I loaded the collection’s DC datastream into a local variable, then processed that to pull out the title. Finally, once I’d grabbed all the text I needed, I created the appropriate fields to send on to SOLR (the collection_membership. prefix I used is one I just made up on the spot – there’s nothing special about it and it’s entirely possible that there’s some other structure/naming scheme I should be using instead). The final, modified RELS-EXT_to_solr.xslt looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- NOTE: xmlns:dc is needed to resolve the dc:title XPath below; it is declared here
     on the assumption that it was lost from the original listing -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:foxml="info:fedora/fedora-system:def/foxml#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="rdf dc">

  <xsl:template match="foxml:datastream[@ID='RELS-EXT']/foxml:datastreamVersion[last()]" name='index_RELS-EXT'>
    <xsl:param name="content"/>
    <xsl:param name="prefix">RELS_EXT_</xsl:param>
    <xsl:param name="suffix">_ms</xsl:param>

    <!-- relationships that point at another object: index the target URI -->
    <xsl:for-each select="$content//rdf:Description/*[@rdf:resource]">
      <field>
        <xsl:attribute name="name">
          <xsl:value-of select="concat($prefix, local-name(), '_uri', $suffix)"/>
        </xsl:attribute>
        <xsl:value-of select="@rdf:resource"/>
      </field>
    </xsl:for-each>

    <!-- relationships with literal values: index the text -->
    <xsl:for-each select="$content//rdf:Description/*[not(@rdf:resource)][normalize-space(text())]">
      <field>
        <xsl:attribute name="name">
          <xsl:value-of select="concat($prefix, local-name(), '_literal', $suffix)"/>
        </xsl:attribute>
        <xsl:value-of select="text()"/>
      </field>
    </xsl:for-each>

    <!-- collection memberships: look up each collection and index its human-readable title -->
    <xsl:for-each select="$content//rdf:Description/*[@rdf:resource]">
      <xsl:if test="local-name()='isMemberOfCollection'">
        <!-- strip the leading 'info:fedora/' to get the bare collection PID -->
        <xsl:variable name="collectionPID" select="substring-after(@rdf:resource,'info:fedora/')"/>
        <!-- load the DC datastream of the collection object -->
        <xsl:variable name="collectionContent" select="document(concat($PROT, '://', $FEDORAUSERNAME, ':', $FEDORAPASSWORD, '@', $HOST, ':', $PORT,'/fedora/objects/', $collectionPID, '/datastreams/', 'DC', '/content'))"/>
        <field name="collection_membership.pid_ms">
          <xsl:value-of select="$collectionPID"/>
        </field>
        <xsl:for-each select="$collectionContent//dc:title">
          <xsl:if test="local-name()='title'">
            <field name="collection_membership.title_ms">
              <xsl:value-of select="text()"/>
            </field>
            <field name="collection_membership.title_mt">
              <xsl:value-of select="text()"/>
            </field>
          </xsl:if>
        </xsl:for-each>
      </xsl:if>

  <!--
      <xsl:if test="local-name()='hasModel'">
        <xsl:variable name="modelPID" select="substring-after(@rdf:resource,'info:fedora/')"/>
        <field name="CSW_test_if_model">
          <xsl:value-of select="$modelPID"/>
        </field>
      </xsl:if>
  -->

    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
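
For anyone who wants to sanity-check the lookup logic outside of gsearch, here is a rough Python sketch of the same three steps: strip the ‘info:fedora/’ prefix, build the URL of the collection’s DC datastream, and pull the title(s) out of the DC XML. It is not part of the gsearch set up. The function names and the sample DC record are made up for illustration, and the sketch parses a string instead of actually fetching anything from fedora.

```python
import xml.etree.ElementTree as ET

FEDORA_URI_PREFIX = "info:fedora/"
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Made-up DC record standing in for what the collection's DC datastream might return.
SAMPLE_DC = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Our Collection of Stuff</dc:title>
  <dc:identifier>somenamespace:stuffcollection</dc:identifier>
</oai_dc:dc>"""


def collection_pid(resource_uri):
    """Mirror substring-after(@rdf:resource, 'info:fedora/'): '' if the prefix is absent."""
    _, found, rest = resource_uri.partition(FEDORA_URI_PREFIX)
    return rest if found else ""


def dc_datastream_url(base_url, pid):
    """Build the same datastream path the stylesheet hands to document()."""
    return f"{base_url}/fedora/objects/{pid}/datastreams/DC/content"


def collection_titles(dc_xml):
    """Return the text of every dc:title element in a DC record."""
    root = ET.fromstring(dc_xml)
    return [el.text for el in root.iterfind(".//dc:title", DC_NS) if el.text]


if __name__ == "__main__":
    pid = collection_pid("info:fedora/somenamespace:stuffcollection")
    print(pid)
    print(dc_datastream_url("http://fedorablahblahblah:8080", pid))
    print(collection_titles(SAMPLE_DC))
```

Note that the title lookup only works because the dc prefix is bound to the Dublin Core namespace; the XSLT has the same requirement for its dc:title XPath.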

Similar to the approach I used for the OCR text work, I was able to watch the fedoragsearch logs and verify that the output was as expected/needed. The changes on the SOLR side are pretty minor. I added a couple of lines to schema.xml to handle the new fields:

<field name="collection_membership.title_ms" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="collection_membership.title_mt" type="text_fgs" indexed="true" stored="true" multiValued="true"/>

and, though not necessary for the faceting aspect of this work, I added collection_membership.title_mt to the field list of the standard search request handler in solrconfig.xml:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="fl">*</str>
    <str name="q.alt">*:*</str>
    <str name="qf">
    PID
    dc.title
    ....
    ....
    collection_membership.title_mt
    </str>
  </lst>
</requestHandler>

The final step is to re-index everything, and add the collection_membership.title_ms field to the list of facet fields in the web-based islandora solr config tool (Islandora > Solr index > Solr settings > Facet settings; add collection_membership.title_ms to the Facet fields, and give it a label of Collection).
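
To check the facet without going through the Islandora UI, you can ask SOLR directly for facet counts on the new field. The Python sketch below builds such a query and unpacks the response. It assumes the select handler lives at /select under the same SOLR base URL as the admin pages and that JSON output (wt=json) is available. The helper names and the sample response are invented for illustration.

```python
from urllib.parse import urlencode

FACET_FIELD = "collection_membership.title_ms"


def facet_query_url(solr_base, field=FACET_FIELD):
    """Build a select URL that returns only facet counts (rows=0) for one field."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", field),
    ]
    return f"{solr_base}/select?{urlencode(params)}"


def facet_counts(response, field=FACET_FIELD):
    """Turn SOLR's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Made-up response fragment in the shape SOLR uses for facet fields in JSON output.
SAMPLE_RESPONSE = {
    "facet_counts": {
        "facet_fields": {
            FACET_FIELD: ["Our Collection of Stuff", 12, "Another Collection", 3]
        }
    }
}

if __name__ == "__main__":
    print(facet_query_url("http://fedorablahblahblah:8080/solr"))
    print(facet_counts(SAMPLE_RESPONSE))
```

If the facet values that come back are PIDs rather than titles, the objects were most likely indexed before the XSLT change and need re-indexing.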

And that’s that. If anyone has any suggestions/thoughts about how to improve my XSLT I’d be thrilled to hear them.

Islandora 7 – Making OCR text searchable via SOLR February 21, 2014

Posted by ficial in islandora, techy.
Tags: fedora commons, fedora gsearch, gsearch, islandora, islandora 7, ocr, solr

We recently tackled the issue of making OCR-ed text searchable for Islandora 7. I had difficulty finding solid, targeted answers online, so here’s what we did in case anyone else needs to do this – hopefully you will be able to recoup some of the time I spent figuring this out. :)

First, a quick re-cap of OCR data-flow and how to inspect the parts of that data flow:

We have an image that has some text (verify by visual examination of the image)
That image is ingested and creates an object with a PID (verify by inspection of the fedora repository – e.g. http://fedorablahblahblah:8080/fedora/objects/PID, where PID looks like namespace:number (e.g. wonderfulcollection:43))
As a part of the ingest process the OCR tool runs and creates a managed datastream on the object; that datastream references the actual text generated (verify by looking at the FOXML of the object and that datastream of the object – http://fedorablahblahblah:8080/fedora/objects/PID/objectXML, and http://fedorablahblahblah:8080/fedora/objects/PID/datastreams/OCR/content)
The gsearch utility runs and pulls the FOXML of the newly created object and uses an xslt to generate from that an update request that’s sent to SOLR; that update request contains all the data that SOLR is to index (verify by looking at the fedora gsearch log $FEDORA_ROOT/server/logs/fedoragsearch.log to see the XML that’s being sent to SOLR)
SOLR processes the request and puts the data in its indices (based on entries in its schema.xml file) (verify by looking at the solr admin tool schema browser – http://fedorablahblahblah:8080/solr/admin/schema.jsp and finding the OCR field (open the FIELDS navigation element on the left, then do a text search on the page for ocr) and making sure it has indexed at least one document)
The indexed fields are available to be searched and returned based on the request handlers defined (in the solrconfig.xml file – verify by finding (or adding) the name of the ocr field in the default search handler)
The islandora SOLR module is configured to use those fields (verify by searching for a string in the OCR-ed text and having the expected object in the search results; adjust the solr settings to have the ocr field available as an advanced search term and to adjust the default sort weights as desired)

To make this work there are 3 main parts that need to be resolved. First, the OCR process needs to work. Second, gsearch needs to send the resulting text to SOLR. Third, SOLR needs to process it and make it available for searching.

I. Making OCR work on ingest

This was something that I didn’t need to deal with because the excellent folks at Common Media did it for us – I’ll edit this to add a recap of the process/steps soon.

II. Making fedora gsearch handle the OCR-ed text

This is the part that gave us the most trouble.

The gsearch process works like this:

a. a new object is ingested into fedora
b. fedora sends a message to gsearch
c. gsearch runs an xslt that processes the object’s FOXML to create XML for an add request to SOLR

Parts a and b we don’t really have to worry about here since there are no special changes that need to be made to accommodate OCR-ed text. Part c is where the complications lie. Fedora gsearch uses a file called demoFoxmlToSolr.xslt in $FEDORA_ROOT/tomcat/webapps/fedoragsearch/WEB-INF/classes/config/index/gsearch_solr/. In that file we had to add a new template to create the OCR field:

  <xsl:template match="foxml:datastream[@ID='OCR']/foxml:datastreamVersion[last()]" name="index_text_nodes_as_a_text_field">
    <xsl:param name="content"/>
    <xsl:param name="prefix">ocr.</xsl:param>
    <xsl:param name="suffix"></xsl:param>
    <field>
      <xsl:attribute name="name">
        <xsl:value-of select="concat($prefix, ../@ID , $suffix)"/>
      </xsl:attribute>
      <xsl:variable name="text" select="normalize-space($content)"/>
      <!-- Only output non-empty text nodes (followed by a single space) -->
      <xsl:if test="$text">
        <xsl:value-of select="$text"/>
        <xsl:text> </xsl:text>
      </xsl:if>
    </field>
  </xsl:template>

and then we needed to call that template for the appropriate content and with the appropriate parameter:

  <xsl:when test="@CONTROL_GROUP='M' and @ID!='OCR'">
    <xsl:apply-templates select="foxml:datastreamVersion[last()]">
      <xsl:with-param name="content" select="document(concat($PROT, '://', $FEDORAUSERNAME, ':', $FEDORAPASSWORD, '@', $HOST, ':', $PORT, '/fedora/objects/', $PID, '/datastreams/', @ID, '/content'))"/>
    </xsl:apply-templates>
  </xsl:when>
  <!-- NOTE: OCR data is plain text rather than XML, so we can't use the document() function as above to get it -
       need exts:getDatastreamText() instead -->
  <xsl:when test="@CONTROL_GROUP='M' and @ID='OCR'">
    <xsl:apply-templates select="foxml:datastreamVersion[last()]">
      <xsl:with-param name="content" select="exts:getDatastreamText($PID, $REPOSITORYNAME,
      @ID, $FEDORASOAP, $FEDORAUSER, $FEDORAPASS, $TRUSTSTOREPATH, $TRUSTSTOREPASS)"/>
    </xsl:apply-templates>
  </xsl:when>

In our initial attempts at this we ran into problems in both places. The template that generated the ocr field wasn’t working correctly, and we hadn’t realized that the document() function only works on XML docs and not on plain text. As a result we were getting the ocr.OCR field created, but it would have no content. After getting some helpful information from others and an extended session of experimenting and debugging on our end we arrived at the above working code.
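
The document() limitation is easy to reproduce in any XML parser. The small Python sketch below is only an analogy, not what gsearch runs: it shows that a DC-style XML payload parses, while a plain-text payload like an OCR datastream does not. The function name and sample strings are made up.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(payload):
    """Return True if the payload is well-formed XML, False otherwise."""
    try:
        ET.fromstring(payload)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # XML datastream content (e.g. DC) parses without trouble.
    print(parses_as_xml("<dc><title>Our Collection of Stuff</title></dc>"))
    # Plain OCR text has no root element, so an XML parser rejects it.
    print(parses_as_xml("Some words recognized from a scanned page"))
```

That is the reason the OCR branch fetches its content as text with exts:getDatastreamText() instead of parsing it as a document.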

DEBUGGING/DEVELOPMENT NOTES: If you have to do work in this area keep in mind that any changes to demoFoxmlToSolr.xslt will require a restart of fedora to take effect. To run the file we used the gsearch REST API (http://fedorablahblahblah:8080/fedoragsearch/rest?operation=updateIndex) and repeatedly deleted and added a known PID from/to the index (bottom two forms on that page). While we did that we watched the gsearch log file (tail -f $FEDORA_ROOT/server/logs/fedoragsearch.log) to see the resulting XML. If demoFoxmlToSolr.xslt has errors then trying to get to the REST API pages will give various nasty and confusing error messages.

III. Making SOLR handle the OCR-ed text

This involves changes in the SOLR config files in $FEDORA_ROOT/gsearch_solr/solr/conf – schema.xml and solrconfig.xml. The former file essentially controls how data is organized / handled / indexed. The latter file controls which data is accessible to search. The changes here are actually pretty easy. First, add an entry in schema.xml to handle the ocr field that gsearch is sending to SOLR:

 <fields> 
    ....
    <dynamicField name="ocr*" type="text_fgs"    indexed="true"  stored="true" multiValued="true"/>
    ....    
 </fields>

To make sure this is working, restart fedora (needed to make the above addition/change take effect) then delete and add/update the index for a single PID as described above. After that you should be able to check the SOLR schema browser (http://fedorablahblahblah:8080/solr/admin/schema.jsp) and find your OCR field and see that it has one document.

Once you’ve made sure it’s working you’ll need to update/recreate your SOLR indexes:

go to the gsearch REST API update index page http://fedorablahblahblah:8080/fedoragsearch/rest?operation=updateIndex
click the updateIndex createEmpty button
in a console session on your fedora server stop fedora (/etc/init.d/fedora stop)
in that console remove (or set aside) the old solr index (rm -rf $FEDORA_ROOT/gsearch_solr/solr/data/index)
in that console start fedora (/etc/init.d/fedora start)
on the REST API update index page, click updateIndex fromFoxmlFiles
wait for the index to finish updating (you can open another instance of the REST API update index page to see the progress in the ‘Resulting number of index documents’ table cell – refresh that page periodically to see how many are indexed)

After that you should be able to check the SOLR schema browser (http://fedorablahblahblah:8080/solr/admin/schema.jsp) and find your OCR field and see that it has the expected number of documents indexed.
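
As an alternative to eyeballing the schema browser, the number of documents with OCR text can be read from a plain SOLR query. The Python sketch below builds a query that matches every document with any value in the ocr.OCR field and reads the hit count from the response. It assumes a /select handler under the same SOLR base URL and JSON output. The helper names and the sample response are invented.

```python
from urllib.parse import urlencode


def field_count_url(solr_base, field="ocr.OCR"):
    """Build a query matching every document that has any value in the field."""
    params = [("q", f"{field}:[* TO *]"), ("rows", "0"), ("wt", "json")]
    return f"{solr_base}/select?{urlencode(params)}"


def num_found(response):
    """Read the total hit count from a SOLR JSON response."""
    return response["response"]["numFound"]


# Made-up response in the shape SOLR returns for a rows=0 query.
SAMPLE_RESPONSE = {"response": {"numFound": 43, "start": 0, "docs": []}}

if __name__ == "__main__":
    print(field_count_url("http://fedorablahblahblah:8080/solr"))
    print(num_found(SAMPLE_RESPONSE))
```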

Lastly, you need to edit solrconfig.xml to have searches actually check that field. In that file find the standard request handler and add the field name (ocr.OCR) to the list of default fields:

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="fl">*</str>
      <str name="q.alt">*:*</str>
      <str name="qf">
      PID
      dc.title
      ...
      ocr.OCR
      ...
      </str>
    </lst>
  </requestHandler>

Restart fedora, and you should now be able to do basic searches in your islandora site for strings in the OCR-ed text and have the expected object be in the search results. I recommend also adjusting your islandora solr search settings to add the ocr field to your advanced search fields.