Beider Morse Phonetic Matching in Solr
Phonetic Search Overview: Phonetic search finds terms that sound like the search term even when they are spelled differently, e.g. a search for AVANTAJE can match ADVANTAGE. Solr has built-in support for several phonetic search algorithms:
- Beider-Morse Phonetic Matching (BMPM)
- Daitch-Mokotoff Soundex
- Double Metaphone
- Metaphone
- Soundex
- Refined Soundex
- Caverphone
- Kölner Phonetik a.k.a. Cologne Phonetic
- NYSIIS
Introduction to Beider Morse Phonetic Matching (BMPM)
In this post, we provide an overview of Beider-Morse Phonetic Matching (BMPM) and show how to configure it in Solr. BMPM was developed by Alexander Beider and Stephen P. Morse. It was designed to return fewer irrelevant matches than older phonetic codecs such as Soundex, Metaphone, and Caverphone, mainly because it takes the likely language of a name into account. It works in three steps:
1. First, BMPM attempts to determine the language of the term. It uses predefined rules to guess the languages that a word may originate from. For example, if a word ends in “ault”, it infers that the word is French. These rules are stored in UTF-8 encoded resource files, which are systematically named following the pattern org/apache/commons/codec/language/bm/lang.txt.
The format of these resources is the following:
- Rules: whitespace-separated strings. Each row should have 3 columns, which are interpreted as:
  - pattern: a regular expression.
  - languages: a ‘+’-separated list of languages.
  - acceptOnMatch: ‘true’ or ‘false’, indicating whether a match rules the languages in or out.
- End-of-line comments: Any occurrence of ‘//’ will cause all text following on that line to be discarded as a comment.
- Multi-line comments: Any line starting with ‘/*’ will start multi-line commenting mode. This will skip all content until a line ending in ‘*/’ is found.
- Blank lines: All blank lines will be skipped.
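The rule format above can be illustrated with a small Python sketch. It is not the Commons Codec implementation: the sample rules and the language list are hypothetical, and the guessing logic (start from all languages, then narrow or exclude on each matching rule) is an assumption about how such rules are typically combined.

```python
import re

def strip_comments(lines):
    """Yield rule lines with comments and blank lines removed."""
    in_block = False
    for raw in lines:
        line = raw.strip()
        if in_block:
            # Multi-line comment mode ends on a line ending in */
            if line.endswith("*/"):
                in_block = False
            continue
        if line.startswith("/*"):
            # Assumption: a one-line /* ... */ comment closes on the same line
            in_block = not line.endswith("*/")
            continue
        line = line.split("//", 1)[0].strip()  # drop end-of-line comment
        if line:  # skip blank lines
            yield line

def parse_lang_rules(lines):
    """Parse 3-column rows: pattern, '+'-separated languages, acceptOnMatch."""
    rules = []
    for line in strip_comments(lines):
        parts = line.split()
        if len(parts) != 3:
            raise ValueError(f"expected 3 columns: {line!r}")
        pattern, langs, accept = parts
        rules.append((re.compile(pattern), set(langs.split("+")), accept == "true"))
    return rules

def guess_languages(word, rules, all_languages):
    """Start with every language, then narrow or exclude per matching rule."""
    candidates = set(all_languages)
    for pattern, langs, accept in rules:
        if pattern.search(word.lower()):
            if accept:
                candidates &= langs   # match rules these languages in
            else:
                candidates -= langs   # match rules these languages out
    return candidates

# Hypothetical rules for illustration; not copied from the real lang.txt.
SAMPLE_RULES = """
/* sample rules only,
   written for this sketch */
ault$   french          true   // "ault" ending suggests French
^sch    german+english  true
w       french+italian  false  // "w" rules these languages out

""".splitlines()

ALL_LANGUAGES = {"english", "french", "german", "italian"}
rules = parse_lang_rules(SAMPLE_RULES)

print(sorted(guess_languages("Renault", rules, ALL_LANGUAGES)))
print(sorted(guess_languages("Schwarz", rules, ALL_LANGUAGES)))
```

With these sample rules, "Renault" narrows to French only, while a word that matches no rule keeps every language as a candidate, which corresponds to the case where the language cannot be determined.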
2. After detecting the language, it applies the phonetic rules for that language and transliterates the name into a phonetic alphabet. If the language cannot be determined with a fair degree of certainty, it uses generic phonetic rules instead. The phonetic spelling of an input is represented using upper- and lower-case Roman characters. If there are multiple possible phonetic representations, they are joined with a pipe (|) character.
The format of each rule in the phonetic rule files is the following:
- Rules: whitespace-separated, double-quoted strings. Each row should have 4 columns, which are interpreted as:
  "pattern" "left context" "right context" "phonetic"
  For example:
  "ex" "^" "" "(ex|eks[english]|iks[english])"
  - pattern is a sequence of characters that might appear in the word to be transliterated
  - left context is the context that precedes the pattern
  - right context is the context that follows the pattern
  - phonetic is the result that this rule generates
- End-of-line comments: Any occurrence of ‘//’ will cause all text following on that line to be discarded as a comment.
- Multi-line comments: Any line starting with ‘/*’ will start multi-line commenting mode. This will skip all content until a line ending in ‘*/’ is found.
- Blank lines: All blank lines will be skipped.
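To make the four columns concrete, here is a minimal Python sketch that parses the example rule and checks whether it applies at a given position in a word. The function names are hypothetical, and treating the left context as a regular expression anchored at the end of the preceding text (and the right context as one anchored at the start of the following text) is an assumption, not a description of the actual Commons Codec code.

```python
import re

# Four double-quoted, whitespace-separated columns.
RULE_LINE = re.compile(r'"([^"]*)"\s+"([^"]*)"\s+"([^"]*)"\s+"([^"]*)"')

def parse_rule(line):
    """Return (pattern, left_context, right_context, phonetic)."""
    line = line.split("//", 1)[0].strip()  # drop end-of-line comment
    m = RULE_LINE.fullmatch(line)
    if m is None:
        raise ValueError(f"expected 4 quoted columns: {line!r}")
    return m.groups()

def expand_phonemes(phonetic):
    """Split '(a|b[lang])' into [(text, languages or None), ...]."""
    if phonetic.startswith("(") and phonetic.endswith(")"):
        phonetic = phonetic[1:-1]
    result = []
    for alt in phonetic.split("|"):
        m = re.fullmatch(r"([^\[]*)(?:\[([^\]]*)\])?", alt)
        if m is None:
            raise ValueError(f"malformed phoneme: {alt!r}")
        text, langs = m.group(1), m.group(2)
        result.append((text, set(langs.split("+")) if langs else None))
    return result

def rule_applies(word, i, rule):
    """Check pattern at position i plus left and right context."""
    pattern, left, right, _ = rule
    if not word.startswith(pattern, i):
        return False
    before, after = word[:i], word[i + len(pattern):]
    # Assumption: left context must match at the end of the preceding text,
    # right context at the start of the following text.
    return (re.search(f"(?:{left})$", before) is not None
            and re.match(right, after) is not None)

rule = parse_rule('"ex" "^" "" "(ex|eks[english]|iks[english])"')
print(expand_phonemes(rule[3]))
print(rule_applies("extra", 0, rule), rule_applies("alex", 2, rule))
```

In this sketch the left context "^" means the pattern must sit at the start of the word, so the rule applies to "extra" but not to the "ex" in "alex". The bracketed suffix restricts an alternative to the named language, while an untagged alternative such as "ex" is valid for any language.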
3. Finally, it applies language-independent rules regarding such things as voiced and unvoiced consonants and vowels to further ensure the reliability of the matches.
Using BMPM in Solr: To use BMPM in Solr, add the following filter to a field type's analyzer in Solr’s schema.xml:
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Emits Beider-Morse phonetic encodings for each token -->
  <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
</analyzer>
RuleType
- APPROX : Approximate rules, which will lead to the largest number of phonetic interpretations.
- EXACT : Exact rules, which will lead to a minimum number of phonetic interpretations.
NameType
- Supported types of names. Unless you are matching particular family names, use GENERIC. The GENERIC NameType should work reasonably well for non-name words. The other encodings are specifically tuned to family names, and may not work well at all for general text.
- ASHKENAZI (ash) : Ashkenazi family names.
- GENERIC (gen) : Generic names and words.
- SEPHARDIC (sep) : Sephardic family names.
Currently supported languages for BMPM searching in Solr:
- English
- French
- German
- Greek
- Hebrew written in Hebrew letters
- Hungarian
- Italian
- Lithuanian and Latvian
- Polish
- Romanian
- Russian written in Cyrillic letters
- Russian transliterated into English letters
- Spanish
- Turkish
Conclusion: Phonetic search increases recall, but it lowers precision. We can address this by ranking the results properly (correctly spelled results at the top of the list) or by providing options in the user interface to limit phonetic search results.
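One possible way to rank correctly spelled results first is to query an exact field and a phonetic field together and boost the exact one. The Python sketch below only builds the request parameters. It assumes Solr's Extended DisMax query parser (defType=edismax with a qf field list), and the field names name and name_phonetic, the boost value, and the helper function are all hypothetical.

```python
from urllib.parse import urlencode

def build_query_params(user_query, exact_field="name",
                       phonetic_field="name_phonetic",
                       exact_boost=10, include_phonetic=True):
    """Build Solr query parameters that favor exact matches.

    Field names and boost are placeholders for this sketch; the phonetic
    field is assumed to use an analyzer with the BeiderMorse filter.
    """
    fields = [f"{exact_field}^{exact_boost}"]  # exact matches score higher
    if include_phonetic:
        # UI option: leave the phonetic field out to limit results
        fields.append(phonetic_field)
    return {"defType": "edismax", "q": user_query, "qf": " ".join(fields)}

params = build_query_params("avantaje")
print(urlencode(params))
```

Setting include_phonetic=False corresponds to a user interface option that switches phonetic matching off and restores higher precision.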
Comments
- Vijay Mhaskar (August 7, 2015, 8:23 am): Here the "lang" parameter specifies the original language. If this value is not provided, the original language will be guessed. Increasing the number of rules will surely increase recall for a query, which may add more processing at every stage of the search.
- sachin shinde (August 7, 2015, 8:15 am): Nice article, simplifying search. Does it affect performance if support for more languages is included?