By: Vijay Mhaskar | May 26, 2015
This article explains the format used for specifying the “Min Number Should Match” criteria of the BooleanQuery objects built by the DisMaxRequestHandler. With this format you can specify how many, or what percentage, of the query words (or blocks) must appear in a document.
There are three types of “clauses” that Solr (Lucene) knows about: mandatory, prohibited, and “optional”. By default, all words or phrases specified in the “q” param are treated as “optional” clauses unless they are preceded by a “+” or a “-”. When dealing with these “optional” clauses, the “mm” option makes it possible to say that a certain minimum number of those clauses must match.
The minimum can be specified in several ways, from simple to complex:
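As a rough illustration of where “mm” sits in a request, the sketch below builds a select URL with the Python standard library. The host, port, and core name are placeholders, and it assumes a Solr version where the DisMax parser is selected with “defType=dismax”; adjust both for your setup.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint: host, port, and core name are placeholders.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_dismax_url(user_query: str, mm: str) -> str:
    """Return a select URL that asks the DisMax parser to apply the given mm spec."""
    params = {
        "defType": "dismax",  # assumption: DisMax is chosen via defType
        "q": user_query,      # words without a leading + or - are optional clauses
        "mm": mm,             # minimum number of optional clauses that must match
    }
    # urlencode escapes characters such as '+', '<' and '%' so the spec survives transport.
    return SOLR_SELECT_URL + "?" + urlencode(params)


# "solar" and "panel" are optional, "roof" is mandatory, "rental" is prohibited.
url = build_dismax_url("solar panel +roof -rental", "2")
print(url)
```

Only the two optional words are counted against “mm” here; the “+” and “-” clauses are handled separately.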
- At least 2 of the optional clauses must match, regardless of how many clauses there are: “2”.
- At least 75% of the optional clauses must match, rounded down: “75%”.
- If there are fewer than 3 optional clauses, they all must match; if there are 3 or more, then 75% must match, rounded up: “2<-25%”.
- If there are fewer than 3 optional clauses, they all must match; for 3 to 5 clauses, one less than the number of clauses must match; for 6 or more clauses, 80% must match, rounded down: “2<-1 5<80%”.
- Multiple conditional specifications can be separated by spaces, each one only being valid for numbers greater than the one before it. In this example, if there are 1 or 2 clauses, all of them are required; if there are 3-9 clauses, all but 25% are required; and if there are more than 9 clauses, all but three are required: “2<-25% 9<-3”.
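The rules above can be sketched as a small Python function. This is an illustrative reimplementation based only on the behavior described in this article, not Solr's actual code; the names min_should_match and _simple are made up for the example, and the final clamp to the range 1..clause_count follows the article's notes.

```python
def _simple(clause_count: int, expr: str) -> int:
    """Evaluate a non-conditional expression such as "2", "-1", "75%" or "-25%"."""
    if expr.endswith("%"):
        percent = int(expr[:-1])
        # The percentage of the clause count is always rounded down.
        magnitude = clause_count * abs(percent) // 100
        # A negative percentage means "all but that many clauses".
        return clause_count - magnitude if percent < 0 else magnitude
    number = int(expr)
    # A negative integer means "all but that many clauses".
    return clause_count + number if number < 0 else number


def min_should_match(clause_count: int, spec: str) -> int:
    """Return how many optional clauses must match for the given mm spec."""
    if clause_count < 1:
        return 0
    result = clause_count  # below the first threshold, every clause is required
    for part in spec.split():
        if "<" in part:
            bound, expr = part.split("<", 1)
            if clause_count <= int(bound):
                break  # this condition and all later ones do not apply
            result = _simple(clause_count, expr)
        else:
            result = _simple(clause_count, part)
    # Never more than the number of clauses, never less than 1.
    return max(1, min(result, clause_count))


for n in (2, 4, 9, 10, 12):
    print(n, min_should_match(n, "2<-25% 9<-3"))
```

With “2<-25% 9<-3”, a 4-clause query needs 3 matches and a 12-clause query needs 9.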
A few important notes…
- When dealing with percentages, negative values can be used to get different behavior in edge cases. 75% and -25% mean the same thing when dealing with 4 clauses (3 are required either way). With 5 clauses, however, 75% means 3 are required (3.75 rounded down), while -25% means 4 are required (5 minus 1.25 rounded down).
- No matter what number the calculation arrives at, a value greater than the number of optional clauses, or a value less than 1 will never be used.
- The lower the percentage, the more permutations of input terms there are that can produce matches, and the more documents will match. Solr therefore has more work to do when the percentage is low.
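To make the last note concrete, here is a toy Python model in which documents are plain sets of terms and a document matches when it contains at least the required number of query terms. The documents, the query, and the count_matches helper are invented for illustration; real Solr matching also involves analysis and scoring.

```python
# Toy corpus: each document is just a set of terms (made-up data).
docs = [
    {"solr", "query", "parser"},
    {"solr", "index"},
    {"lucene", "query", "parser", "score"},
    {"cooking", "recipes"},
]
query_terms = ["solr", "query", "parser", "score"]


def count_matches(documents, terms, required):
    """Count documents containing at least `required` of the query terms."""
    return sum(
        1
        for doc in documents
        if sum(term in doc for term in terms) >= required
    )


# Lower the requirement from "all 4 terms" down to "any 1 term".
counts = {
    required: count_matches(docs, query_terms, required)
    for required in range(len(query_terms), 0, -1)
}
print(counts)
```

As the required number drops, the match count never shrinks, so a looser “mm” leaves more candidate documents to score.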