Exactish phrase matching in Solr


By: Vijay Mhaskar | May 28, 2015

Phrase match

A simple way to achieve exact matching in Solr is to use the default string type, which matches only when the query is identical to the whole indexed value. string is a useful type for facets, where we search the index using text pulled from the index itself.
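As a minimal sketch, a string field could be declared in the schema like this. The field name title_str is a hypothetical example; solr.StrField is the class behind Solr's stock string type.

```xml
<!-- Stock string type: the value is indexed verbatim as a single token -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- Hypothetical field name, used here only for illustration -->
<field name="title_str" type="string" indexed="true" stored="true"/>
```

With this field, a query such as title_str:"Computer science in USA" matches only a value that is identical character for character, including case and punctuation.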

Exactish phrase match

Most of the time when searching for phrases, we want Solr to ignore differences in case, punctuation, whitespace, stemming and so on. If someone types in a full query but misses a bracket, Solr should still assume they want that particular item.

Solr’s default phrase matching doesn’t differentiate between a phrase that matches the whole target string and one that matches only part of it. For this, we’ll need a decent text fieldtype and a way to “anchor” the search to both ends of the target string.

Here we will create a text type that will only phrase match if the query string exactishly-matches the whole field. We’ll phrase-search on this field and boost it way up.

<fieldType name="text_lr" class="solr.TextField" positionIncrementGap="1000">
  <analyzer>
    <!-- Wrap the whole input in anchor tokens before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
      pattern="^(.*)$" replacement="AAAA $1 ZZZZ" />
    <!-- Tokenize, then fold case and diacritics so matching is "exactish" -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
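
To put the fieldtype to work, one option is to copy an existing text field into an anchored twin. This is only a sketch: the field names title and title_lr are hypothetical, and text_general is assumed to be a normal text type already present in your schema.

```xml
<!-- Normal searchable field (assumes a text_general type exists in the schema) -->
<field name="title" type="text_general" indexed="true" stored="true"/>

<!-- Anchored copy, intended only for exactish phrase matching -->
<field name="title_lr" type="text_lr" indexed="true" stored="false"/>

<!-- Index the same source text into both fields -->
<copyField source="title" dest="title_lr"/>
```

The anchored copy does not need to be stored, since it is only searched and never displayed.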

Example:

Let’s index “Computer science in USA” in a normal text field. A normal Solr phrase query q="Science in USA" will match that value, because the query phrase is fully contained in the indexed phrase. But what happens if we index into a text_lr field?

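Indexing “Computer science in USA” produces the tokens aaaa computer science in usa zzzz (the ICU folding filter lowercases them).
The search phrase “Science in USA” becomes aaaa science in usa zzzz. The phrase search then compares the two token sequences using normal Solr rules and finds that they do not match, because aaaa is followed by computer in the index but by science in the query. A search for the whole value, such as “computer science in usa”, does match.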

Things to remember

Any non-phrase query against a field of this type will match every indexed value, because the anchor tokens are present in every value and are added to the query as well. So use anchored fieldtypes for phrase queries only, when you want exactish matches.
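
One way to respect this is to keep the anchored field out of the normal query fields and use it only as a phrase-boost field. The following solrconfig.xml fragment is a sketch under assumptions: the handler name /exactish, the field names title (normal text) and title_lr (text_lr type), and the boost of 100 are all hypothetical. The defType, qf and pf parameters are standard edismax parameters.

```xml
<requestHandler name="/exactish" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Normal term matching happens on the regular text field only -->
    <str name="qf">title</str>
    <!-- The whole query is tried as a phrase on the anchored field and boosted -->
    <str name="pf">title_lr^100</str>
  </lst>
</requestHandler>
```

Documents whose whole title exactishly matches the query get the large boost, while every other document is ranked by the normal match on title.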

You can use other strings instead of AAAA and ZZZZ, as long as they will never appear in your data. By adding only one of ‘AAAA’ or ‘ZZZZ’, we can have left-anchored or right-anchored searches as well.
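
For instance, a left-anchored variant could look like the sketch below. The type name text_l is a hypothetical choice; the only change from text_lr is that the replacement drops the trailing ZZZZ.

```xml
<fieldType name="text_l" class="solr.TextField" positionIncrementGap="1000">
  <analyzer>
    <!-- Only the start of the value is anchored -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
      pattern="^(.*)$" replacement="AAAA $1" />
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With this type, the phrase “Computer science” matches a value of “Computer science in USA” because both start with aaaa computer science, while “Science in USA” does not match.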

 

You can get more details at http://robotlibrarian.billdueber.com/2012/03/boosting-on-exactish-anchored-phrase-matching-in-solr-sst-4/
