Using Solr and TikaOCR to search text inside an image.


By: Vijay Mhaskar | July 17, 2015

 

Tesseract is probably the most accurate open source OCR engine available and with Apache Tika 1.7 you can now use the awesome Tesseract OCR parser within Tika!

Solr 5.x has support for Tika 1.7 (See this) . I wanted to try this in Solr 5.2 so I configured this on my machine, Below are the steps required to make TikaOCR work with Solr 5.2.

Tika OCR works OOTB in Solr we just need to install Tesseract and set class path accordingly. You can download Tesseract from here. This is native tool and it does the actual work. I have used version 3.02.02 in my case.

Next step is to enable extract request handler in solrconfig.xml .

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<str name="captureAttr">true</str>
</lst>
</requestHandler>

We will use below image for OCR.

OM_1

Now from Solr admin we can upload image file to Extracting request handler with following parameters. literal.id=d1&uprefix=attr_&fmap.content=attr_content&commit=true

Sample Url is : http://{$solr_host}:{$solr_port}/solr/techproducts/update/extract?wt=json&literal.id=d1&uprefix=attr_&fmap.content=attr_content&commit=true

See below image for more information,
SolrImageExtract

Click Submit, If everything goes well this image will be indexed with id “d1″. we can query Solr for this id to see what Tika has extracted from image. it has extracted many features from uploaded image.

{  
   "responseHeader":{  
      "status":0,
      "QTime":1,
      "params":{  
         "q":"id:d1",
         "indent":"true",
         "fl":"*",
         "wt":"json"
      }
   },
   "response":{  
      "numFound":1,
      "start":0,
      "docs":[  
         {  
            "id":"d1",
            "attr_stream_size":[  
               "55422"
            ],
            "attr_x_parsed_by":[  
               "org.apache.tika.parser.DefaultParser",
               "org.apache.tika.parser.ocr.TesseractOCRParser",
               "org.apache.tika.parser.jpeg.JpegParser"
            ],
            "attr_stream_content_type":[  
               "image/jpeg"
            ],
            "attr_resolution_units":[  
               "inch"
            ],
            "attr_stream_source_info":[  
               "the-file"
            ],
            "attr_compression_type":[  
               "Progressive, Huffman"
            ],
            "attr_data_precision":[  
               "8 bits"
            ],
            "attr_number_of_components":[  
               "3"
            ],
            "attr_tiff_imagelength":[  
               "286"
            ],
            "attr_component_2":[  
               "Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert"
            ],
            "attr_component_1":[  
               "Y component: Quantization table 0, Sampling factors 2 horiz/2 vert"
            ],
            "attr_image_height":[  
               "286 pixels"
            ],
            "attr_x_resolution":[  
               "72 dots"
            ],
            "attr_image_width":[  
               "690 pixels"
            ],
            "attr_stream_name":[  
               "OM_1.jpg"
            ],
            "attr_component_3":[  
               "Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert"
            ],
            "attr_tiff_bitspersample":[  
               "8"
            ],
            "attr_tiff_imagewidth":[  
               "690"
            ],
            "content_type":[  
               "image/jpeg"
            ],
            "attr_y_resolution":[  
               "72 dots"
            ],
            "attr_content":[  
               " \n \n \n \n \n \n \n \n \n \n \n ‘ '\"I“ \" \"' ./\nlrast. Shortly before the classes started I was visiting a.\ncertain public school, a school set in a typically English\ncountryside, which on the June clay of my visit was wonder-\nfully beauliful. The Head Master—-no less typical than his\nschool and the country-side—pointed out the charms of\nboth, and his pride came out in the final remark which he made\nbeforehe left me. He explained that he had a class to take\nin'I'heocritus. Then (with a. buoyant gesture); “ Can you\n\n, conceive anything more delightful than a class in Theocritus,\n\non such a day and in such a place?\"\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n "
            ],
            "_version_":1506952201156689920
         }
      ]
   }
}

Here in above response if we see “attr_content” which is the actual field where OCR output is stored .

“attr_content”:[
” \n \n \n \n \n \n \n \n \n \n \n ‘ ‘\”I“ \” \”‘ ./\nlrast. Shortly before the classes started I was visiting a.\ncertain public school, a school set in a typically English\ncountryside, which on the June clay of my visit was wonder-\nfully beauliful. The Head Master—-no less typical than his\nschool and the country-side—pointed out the charms of\nboth, and his pride came out in the final remark which he made\nbeforehe left me. He explained that he had a class to take\nin’I’heocritus. Then (with a. buoyant gesture); “ Can you\n\n, conceive anything more delightful than a class in Theocritus,\n\non such a day and in such a place?\”\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n ”
]

These results are very much accurate.  If in case results are not so accurate then we can use some context-based or linguistic error correction algorithm to clean/correct text.

This post has been viewed 15,699 times

One thought on “Using Solr and TikaOCR to search text inside an image.

  1. Brian Hagan

    Thanks Vijay, I used this post successfully. The tricky part for me was building leptonica on Centos due to pathing and the required dependencies.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *


*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>