Apache Solr: data processing pipeline

Index-time processing of data is crucial when developing a search engine, and how much of it you need depends on how sophisticated your search engine is. Even if your search engine is very basic, you still need to perform some processing at index time. Even the basic tokenized field type text_general, in the techproducts configset that comes bundled with Solr, applies StopFilterFactory to filter out unwanted words from your search application.
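As a sketch, a field type along these lines (resembling text_general from the techproducts configset; the exact filter order and files may differ between Solr versions) applies stopword removal through StopFilterFactory:

```xml
<!-- Sketch of a tokenized field type similar to text_general;
     the filter order and word lists may vary by Solr version. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- drops the words listed in stopwords.txt from the token stream -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```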

The content being indexed generally goes through a preprocessing pipeline, for any of the following reasons.

  • Data cleansing: Remove noise, junk, stopwords, unknown characters, etc. These have little or no meaning in the content, and some, if not removed, can end up matching irrelevant documents.
  • Content extraction: Some search engines need to extract text from binary formats such as Word and PDF, and may depend entirely on the processing pipeline for the extraction.
  • Content enrichment: What enrichment you apply to the content depends on many factors, such as domain, requirements, and problem area. For example, a system that extracts content from binary files can also extract the associated metadata, a product review engine might need to compute ratings and popularity, and a semantic engine might perform entity extraction.
  • Feedback loop: The feedback can be in the form of click-through patterns, trending counts, etc., and should flow back into the system.

Solr provides multiple plug points for processing content. It also provides an array of analyzers, tokenizers, filters, and processors out of the box, so you can put a processing pipeline in place with a few configuration changes and without writing a line of Java code. If the provided set of processors doesn't satisfy your needs, you can extend the Java classes to plug your own component into the pipeline. Below are the available pluggable points in Solr.

Analyzer/Tokenizer: This is a good place for lightweight transformations such as removing stopwords, stripping HTML content, converting characters to their nearest ASCII equivalents, and stemming. But this approach has a few caveats. Firstly, the transformation is applied only to the indexed terms and is not stored; hence, you can use it for matching and relevance ranking but cannot return the transformed result to the user. For example, a system might want to not only perform entity extraction during the field analysis phase but also present the extracted entity information to the user, which analysis alone cannot do. Secondly, the processing is per field, and if you have to apply the same filter to multiple fields, the processing runs multiple times.
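A rough sketch of such a lightweight analysis chain, combining HTML stripping, ASCII folding, and stemming (the field type name is illustrative; all the factories are standard Solr components):

```xml
<!-- Hypothetical field type showing lightweight index-time transformations. -->
<fieldType name="text_clean" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- strip HTML markup before tokenization -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- fold accented characters to their nearest ASCII equivalents -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- reduce tokens to their stems -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Remember that everything this chain produces affects only the index: the stored value of the field remains the raw input.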

UpdateRequestProcessor: A better solution is an UpdateRequestProcessor. You can chain processors together and register the chain with the update handler; the chain runs on each update request and transforms the document while indexing, so the transformed data is also stored. Solr provides processors for deduplication, language detection, metadata extraction, and more.
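For instance, a chain using Solr's bundled language detection processor might be registered in solrconfig.xml along these lines (the chain name and field names are illustrative):

```xml
<!-- solrconfig.xml: a processor chain that detects the document language. -->
<updateRequestProcessorChain name="langdetect">
  <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- fields to inspect for language detection (hypothetical field names) -->
    <str name="langid.fl">title,description</str>
    <!-- field that will store the detected language -->
    <str name="langid.langField">language</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- attach the chain to the update handler -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">langdetect</str>
  </lst>
</requestHandler>
```

Because the chain runs before the document is written to the index, the detected language is both indexed and stored.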

Transformer: If you are using the DataImportHandler, you can also use transformers, which transform the document while importing and also store the transformed value.
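A sketch of a DataImportHandler configuration using RegexTransformer (the data source, table, and column names are hypothetical):

```xml
<!-- data-config.xml sketch: strip a leading currency symbol during import. -->
<dataConfig>
  <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:./example" user="sa"/>
  <document>
    <entity name="product" transformer="RegexTransformer"
            query="SELECT id, name, price AS price_raw FROM products">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <!-- RegexTransformer derives the price field from price_raw,
           keeping only what follows the dollar sign -->
      <field column="price" sourceColName="price_raw" regex="\$(.*)"/>
    </entity>
  </document>
</dataConfig>
```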

If your processing is heavier, components plugged inside Solr might slow down the indexer or consume Solr's resources. In that case, you should look at either a custom implementation outside Solr or solutions like UIMA, OpenPipeline, or Behemoth.

So make your choice based on what processing you do!

  • Dikshant Shahi May 6, 2015, 10:48 am
    Thanks Ram!
  • Ram May 6, 2015, 8:40 am
    Hi Dishant, Amazing content & very good insight ... its as informative as our session with you... Cheers !