Islandora 7 – Making OCR text searchable via SOLR
February 21, 2014. Posted by ficial in islandora, techy.
Tags: fedora commons, fedora gsearch, gsearch, islandora, islandora 7, ocr, solr
We recently tackled the issue of making OCR-ed text searchable for Islandora 7. I had difficulty finding solid, targeted answers online, so here’s what we did in case anyone else needs to do this – hopefully you will be able to recoup some of the time I spent figuring this out. :)
First, a quick recap of the OCR data flow and how to inspect each part of it:
- We have an image that has some text (verify by visual examination of the image)
- That image is ingested and creates an object with a PID (verify by inspection of the fedora repository – e.g. http://fedorablahblahblah:8080/fedora/objects/PID, where PID looks like namespace:number (e.g. wonderfulcollection:43))
- As a part of the ingest process the OCR tool runs and creates a managed datastream on the object; that datastream references the actual text generated (verify by looking at the FOXML of the object and at that datastream of the object – http://fedorablahblahblah:8080/fedora/objects/PID/objectXML, and http://fedorablahblahblah:8080/fedora/objects/PID/datastreams/OCR/content)
- The gsearch utility runs, pulls the FOXML of the newly created object, and uses an XSLT to turn it into an update request that’s sent to SOLR; that update request contains all the data that SOLR is to index (verify by looking at the fedora gsearch log $FEDORA_ROOT/server/logs/fedoragsearch.log to see the XML that’s being sent to SOLR)
- SOLR processes the request and puts the data in its indices (based on entries in its schema.xml file) (verify by looking at the solr admin tool schema browser – http://fedorablahblahblah:8080/solr/admin/schema.jsp and finding the OCR field (open the FIELDS navigation element on the left, then do a text search on the page for ocr) and making sure it has indexed at least one document)
- The indexed fields are available to be searched and returned based on the request handlers defined in the solrconfig.xml file (verify by finding, or adding, the name of the ocr field in the default search handler)
- The islandora SOLR module is configured to use those fields (verify by searching for a string in the OCR-ed text and having the expected object in the search results; adjust the islandora solr settings to make the ocr field available as an advanced search term and to set the default sort weights as desired)
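The inspection URLs in the list above all follow the same pattern, so they can be generated for any PID. This is only a minimal sketch: the function name and dictionary keys are made up for illustration, and the host and paths are simply the ones used in this post, so adjust them to your own install.

```python
from urllib.parse import quote


def verification_urls(base_url, pid, dsid="OCR"):
    """Build the URLs used above to inspect one object at each stage.

    base_url is the server root (e.g. http://fedorablahblahblah:8080);
    the paths are the ones quoted in this post, not discovered from the server.
    """
    base = base_url.rstrip("/")
    # Keep the colon in the PID (namespace:number) unescaped
    obj = f"{base}/fedora/objects/{quote(pid, safe=':')}"
    return {
        "object": obj,
        "foxml": f"{obj}/objectXML",
        "ocr_text": f"{obj}/datastreams/{dsid}/content",
        "solr_schema_browser": f"{base}/solr/admin/schema.jsp",
    }


if __name__ == "__main__":
    urls = verification_urls("http://fedorablahblahblah:8080", "wonderfulcollection:43")
    for name, url in urls.items():
        print(f"{name}: {url}")
```

Opening each URL in turn for a known PID is a quick way to find the stage at which the OCR text goes missing.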
To make this work there are 3 main parts that need to be resolved. First, the OCR process needs to work. Second, gsearch needs to send the resulting text to SOLR. Third, SOLR needs to process it and make it available for searching.
I. Making OCR work on ingest
This was something that I didn’t need to deal with because the excellent folks at Common Media did it for us – I’ll edit this to add a recap of the process/steps soon.
II. Making fedora gsearch handle the OCR-ed text
This is the part that gave us the most trouble.
The gsearch process works like this:
a. a new object is ingested into fedora
b. fedora sends a message to gsearch
c. gsearch runs an XSLT that processes the object’s FOXML to create the XML for an add request to SOLR
Parts a and b we don’t really have to worry about here since there are no special changes that need to be made to accommodate OCR-ed text. Part c is where the complications lie. Fedora gsearch uses a file called demoFoxmlToSolr.xslt in $FEDORA_ROOT/tomcat/webapps/fedoragsearch/WEB-INF/classes/config/index/gsearch_solr/. In that file we had to add a new template to create the OCR field:
<xsl:template match="foxml:datastream[@ID='OCR']/foxml:datastreamVersion[last()]" name="index_text_nodes_as_a_text_field">
  <xsl:param name="content"/>
  <xsl:param name="prefix">ocr.</xsl:param>
  <xsl:param name="suffix"></xsl:param>
  <field>
    <xsl:attribute name="name">
      <xsl:value-of select="concat($prefix, ../@ID , $suffix)"/>
    </xsl:attribute>
    <xsl:variable name="text" select="normalize-space($content)"/>
    <!-- Only output non-empty text nodes (followed by a single space) -->
    <xsl:if test="$text">
      <xsl:value-of select="$text"/>
      <xsl:text> </xsl:text>
    </xsl:if>
  </field>
</xsl:template>
and then we needed to call that template for the appropriate content and with the appropriate parameter:
<xsl:when test="@CONTROL_GROUP='M' and @ID!='OCR'">
  <xsl:apply-templates select="foxml:datastreamVersion[last()]">
    <xsl:with-param name="content" select="document(concat($PROT, '://', $FEDORAUSERNAME, ':', $FEDORAPASSWORD, '@', $HOST, ':', $PORT, '/fedora/objects/', $PID, '/datastreams/', @ID, '/content'))"/>
  </xsl:apply-templates>
</xsl:when>
<!-- NOTE: OCR data is plain text rather than XML, so we can't use the document() function as above to get it - need exts:getDatastreamText() instead -->
<xsl:when test="@CONTROL_GROUP='M' and @ID='OCR'">
  <xsl:apply-templates select="foxml:datastreamVersion[last()]">
    <xsl:with-param name="content" select="exts:getDatastreamText($PID, $REPOSITORYNAME, @ID, $FEDORASOAP, $FEDORAUSER, $FEDORAPASS, $TRUSTSTOREPATH, $TRUSTSTOREPASS)"/>
  </xsl:apply-templates>
</xsl:when>
In our initial attempts at this we ran into problems in both places. The template that generated the ocr field wasn’t working correctly, and we hadn’t realized that the document() function only works on XML documents and not on plain text. As a result the ocr.OCR field was being created, but with no content. After getting some helpful information from others and an extended session of experimenting and debugging on our end, we arrived at the working code above.
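The plain-text problem is easy to reproduce outside of gsearch. The sketch below is Python rather than XSLT, and its function names are made up for illustration. It shows an XML parser rejecting raw OCR output, and then approximates what normalize-space() does to the text once it has been fetched as a string (Python's split() treats a few more characters as whitespace than XPath does, so this is only a rough stand-in).

```python
import xml.etree.ElementTree as ET


def parse_as_xml(content):
    """Return the parsed root element, or None if content is not well-formed XML."""
    try:
        return ET.fromstring(content)
    except ET.ParseError:
        return None


def normalize_space(text):
    """Approximate XPath normalize-space(): trim, then collapse whitespace runs."""
    return " ".join(text.split())


if __name__ == "__main__":
    ocr_text = "  The quick\nbrown   fox \n"      # hypothetical OCR output
    print(parse_as_xml(ocr_text) is None)        # plain text is not parseable as XML
    print(repr(normalize_space(ocr_text)))       # the value that would go into the field
```

Anything that insists on parsing the datastream as XML ends up with nothing usable for OCR text, which matches the empty ocr.OCR field we were seeing.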
DEBUGGING/DEVELOPMENT NOTES: If you have to do work in this area keep in mind that any changes to demoFoxmlToSolr.xslt will require a restart of fedora to take effect. To run the file we used the gsearch REST API (http://fedorablahblahblah:8080/fedoragsearch/rest?operation=updateIndex) and repeatedly deleted and added a known PID from/to the index (bottom two forms on that page). While we did that we watched the gsearch log file (tail -f $FEDORA_ROOT/server/logs/fedoragsearch.log) to see the resulting XML. If demoFoxmlToSolr.xslt has errors then trying to get to the REST API pages will give various nasty and confusing error messages.
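When checking the logged XML by eye gets tedious, a small script can tell you whether the OCR field actually has content. This is a sketch under assumptions: the sample below is a hand-written stand-in for an add request, not a real log excerpt, and you would need to copy the XML out of fedoragsearch.log yourself. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written example of an add request; real ones carry many more fields
SAMPLE_ADD_XML = """<add><doc>
  <field name="PID">wonderfulcollection:43</field>
  <field name="ocr.OCR">some recognized text </field>
</doc></add>"""


def field_values(add_xml, field_name):
    """Return the stripped text of every <field> with the given name."""
    root = ET.fromstring(add_xml)
    return [
        (field.text or "").strip()
        for field in root.iter("field")
        if field.get("name") == field_name
    ]


def has_ocr_text(add_xml, field_name="ocr.OCR"):
    """True if at least one matching field exists and is non-empty."""
    return any(field_values(add_xml, field_name))


if __name__ == "__main__":
    print(has_ocr_text(SAMPLE_ADD_XML))
```

A False result covers both failure modes described above: the field being absent, and the field being present but empty.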
III. Making SOLR handle the OCR-ed text
This involves changes in the SOLR config files in $FEDORA_ROOT/gsearch_solr/solr/conf – schema.xml and solrconfig.xml. The former file essentially controls how data is organized / handled / indexed. The latter file controls which data is accessible to search. The changes here are actually pretty easy. First, add an entry in schema.xml to handle the ocr field that gsearch is sending to SOLR:
<fields>
  ....
  <dynamicField name="ocr*" type="text_fgs" indexed="true" stored="true" multiValued="true"/>
  ....
</fields>
To make sure this is working, restart fedora (needed to make the above addition/change take effect) then delete and add/update the index for a single PID as described above. After that you should be able to check the SOLR schema browser (http://fedorablahblahblah:8080/solr/admin/schema.jsp) and find your OCR field and see that it has one document.
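If the field does not show up, it is worth double-checking that the dynamicField pattern really covers the field name gsearch sends. SOLR dynamic field names are simple globs with the * at the start or the end of the name. The following sketch mimics that matching in Python; the function is hypothetical and does not reproduce SOLR's rules for choosing between several matching patterns.

```python
def matches_dynamic_field(pattern, field_name):
    """Mimic dynamicField name matching: '*' only as a prefix or suffix wildcard."""
    if pattern.startswith("*"):
        return field_name.endswith(pattern[1:])
    if pattern.endswith("*"):
        return field_name.startswith(pattern[:-1])
    # No wildcard: only an exact name matches
    return pattern == field_name


if __name__ == "__main__":
    for name in ("ocr.OCR", "dc.title", "ocrsomething"):
        print(name, matches_dynamic_field("ocr*", name))
```

Note that ocr* also picks up any other field whose name merely starts with "ocr"; a narrower pattern such as ocr.* would be one way to avoid that if it ever matters.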
Once you’ve made sure it’s working you’ll need to update/recreate your SOLR indexes:
- go to the gsearch REST API update index page http://fedorablahblahblah:8080/fedoragsearch/rest?operation=updateIndex
- click the updateIndex createEmpty button
- in a console session on your fedora server stop fedora (/etc/init.d/fedora stop)
- in that console remove (or set aside) the old solr index (rm -rf $FEDORA_ROOT/gsearch_solr/solr/data/index)
- in that console start fedora (/etc/init.d/fedora start)
- on the REST API update index page, click updateIndex fromFoxmlFiles
- wait for the index to finish updating (you can open another instance of the REST API update index page to see the progress in the ‘Resulting number of index documents’ table cell; refresh that page periodically to see how many are indexed)
After that you should be able to check the SOLR schema browser (http://fedorablahblahblah:8080/solr/admin/schema.jsp) and find your OCR field and see that it has the expected number of documents indexed.
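The same count can be pulled with a query instead of the schema browser. The sketch below builds a range query that matches every document with a value in the OCR field and reads numFound from the XML response. It assumes the select handler lives at /solr/select on the same host, which may not be true of your install; the function names and the sample response are made up for illustration, and fetching the URL is left to you.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def ocr_count_url(base_url, field="ocr.OCR"):
    """Build a query URL that counts documents having any value in the field."""
    params = urlencode({"q": f"{field}:[* TO *]", "rows": 0, "wt": "xml"})
    # ASSUMPTION: the select handler is reachable at /solr/select
    return f"{base_url.rstrip('/')}/solr/select?{params}"


def num_found(response_xml):
    """Read numFound from a SOLR XML response body."""
    result = ET.fromstring(response_xml).find("result")
    return int(result.get("numFound"))


if __name__ == "__main__":
    print(ocr_count_url("http://fedorablahblahblah:8080"))
    # Hand-written response in the shape SOLR returns for wt=xml
    sample = (
        '<response><lst name="responseHeader"/>'
        '<result name="response" numFound="3" start="0"/></response>'
    )
    print(num_found(sample))
```

Comparing this number against the number of objects you expect to have an OCR datastream is a reasonable sanity check after a full reindex.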
Lastly, you need to edit solrconfig.xml to have searches actually check that field. In that file find the standard request handler and add the field name (ocr.OCR) to the list of default fields:
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="fl">*</str>
    <str name="q.alt">*:*</str>
    <str name="qf"> PID dc.title ... ocr.OCR ... </str>
  </lst>
</requestHandler>
Restart fedora, and you should now be able to do basic searches in your islandora site for strings in the OCR-ed text and have the expected object appear in the search results. I recommend also adjusting your islandora solr search settings to add the ocr field to your advanced search fields.