Beider Morse Phonetic Matching in Solr

Added: 30 Apr 2015
Author: Vijay Mhaskar
Views: 16690

Phonetic Search Overview: Phonetic search finds terms that sound like the search term even when they are spelled differently, e.g. a search for AVANTAJE can match the term ADVANTAGE. Solr has built-in support for several phonetic search algorithms:

Beider-Morse Phonetic Matching (BMPM)
Daitch-Mokotoff Soundex
Double Metaphone
Metaphone
Soundex
Refined Soundex
Caverphone
Kölner Phonetik a.k.a. Cologne Phonetic
NYSIIS

Introduction to Beider Morse Phonetic Matching (BMPM)

In this post, we'll provide an overview of Beider-Morse Phonetic Matching (BMPM) and how to configure it in Solr. BMPM was developed by Alexander Beider and Stephen P. Morse. It generally returns fewer irrelevant matches than older phonetic codecs such as Soundex, Metaphone, and Caverphone, because it takes the likely language of the term into account.

1. First, BMPM attempts to determine the language of the term. It uses predefined rules to guess the possible languages that a word originates from. For example, if a word ends in “ault”, it infers that the word is French. These rules are stored in UTF-8 encoded resource files, which are systematically named following the pattern org/apache/commons/codec/language/bm/lang.txt.

The format of these resources is the following:

Rules: whitespace separated strings. There should be 3 columns to each row, and these will be interpreted as:

pattern: a regular expression.
languages: a ‘+’-separated list of languages.
acceptOnMatch: ‘true’ or ‘false’ indicating if a match rules in or rules out the language.

End-of-line comments: Any occurrence of ‘//’ will cause all text following on that line to be discarded as a comment.
Multi-line comments: Any line starting with ‘/*’ will start multi-line commenting mode. This will skip all content until a line ending in ‘*/’ is found.
Blank lines: All blank lines will be skipped.
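The format above can be read with a few lines of code. The following is a minimal Python sketch, not the commons-codec implementation: the function names and the sample rules are made up for illustration, and the guessing logic assumes that a matching rule with acceptOnMatch=true narrows the candidates to the rule's languages, while acceptOnMatch=false removes them.

```python
import re

# Hypothetical sample rules written for this sketch,
# not the contents of the real lang.txt.
SAMPLE_RULES = """
/* Hypothetical rules for this sketch,
   not the contents of the real lang.txt */
ault$  french         true   // a match rules French in
^sch   german         true

k      french+spanish false  // a match rules these languages out
"""


def parse_lang_rules(text):
    """Parse rows of `pattern languages acceptOnMatch`, skipping comments and blank lines."""
    rules = []
    in_comment = False
    for raw in text.splitlines():
        line = raw.strip()
        if in_comment:
            # Multi-line comment mode ends at a line ending in */
            if line.endswith("*/"):
                in_comment = False
            continue
        if line.startswith("/*"):
            in_comment = True
            continue
        # Discard an end-of-line comment, then any trailing whitespace.
        line = line.split("//", 1)[0].strip()
        if not line:
            continue
        pattern, languages, accept = line.split()
        if accept not in ("true", "false"):
            raise ValueError("acceptOnMatch must be 'true' or 'false': " + raw)
        rules.append((re.compile(pattern), set(languages.split("+")), accept == "true"))
    return rules


def guess_languages(word, rules, all_languages):
    """Start from every language and let each matching rule narrow the set."""
    candidates = set(all_languages)
    lowered = word.lower()
    for pattern, languages, accept in rules:
        if pattern.search(lowered):
            if accept:
                candidates &= languages   # rule in: keep only these languages
            else:
                candidates -= languages   # rule out: drop these languages
    return candidates


ALL_LANGUAGES = {"english", "french", "german", "spanish"}
rules = parse_lang_rules(SAMPLE_RULES)
print(sorted(guess_languages("Renault", rules, ALL_LANGUAGES)))   # ['french']
print(sorted(guess_languages("Kowalski", rules, ALL_LANGUAGES)))  # ['english', 'german']
```

A word that matches no rule keeps the full candidate set, which corresponds to the case where the language cannot be narrowed down.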

2. After detecting the language, it applies the phonetic rules for that language and transliterates the name into a phonetic alphabet. If the language cannot be determined with a fair degree of certainty, it uses generic phonetic rules instead. The phonetic spellings of an input are represented using upper- and lower-case Roman characters. If there are multiple possible phonetic representations, they are joined with a pipe (|) character.
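To make the pipe-joined output concrete, here is a small Python sketch of how two encoded strings could be compared. It assumes that two terms count as a phonetic match when they share at least one alternative; the function names and the encoded strings are invented for illustration and are not real BMPM output.

```python
def phonetic_alternatives(encoded):
    """Split a pipe-joined encoding into its set of alternatives."""
    return {alt for alt in encoded.split("|") if alt}


def phonetically_equal(encoded_a, encoded_b):
    """Assumption: two terms match if any alternative is shared."""
    return not phonetic_alternatives(encoded_a).isdisjoint(
        phonetic_alternatives(encoded_b)
    )


# Made-up encodings, only meant to show the comparison.
print(phonetically_equal("avantaZe|avantaxe", "advantadZe|avantaZe"))  # True
print(phonetically_equal("avantaZe|avantaxe", "smit|zmit"))            # False
```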

The format of each rule entry in the table is the following:

Rules: whitespace separated, double-quoted strings. There should be 4 columns to each row, and these will be interpreted as:

"pattern" "left context" "right context" "phonetic"
"ex" "^" "" "(ex|eks[english]|iks[english])"

pattern is a sequence of characters that might appear in the word to be transliterated
left context is the context that precedes the pattern
right context is the context that follows the pattern
phonetic is the result that this rule generates
End-of-line comments: Any occurrence of ‘//’ will cause all text following on that line to be discarded as a comment.
Multi-line comments: Any line starting with ‘/*’ will start multi-line commenting mode. This will skip all content until a line ending in ‘*/’ is found.
Blank lines: All blank lines will be skipped.
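The four-column rule format can be sketched in Python as follows. This is an illustration under stated assumptions, not the library's parser: the helper names are hypothetical, the left context is treated as a regular expression that must match at the end of the text before the pattern, the right context as one that must match at the start of the text after it, and a bracketed suffix such as [english] is read as a language restriction on that alternative.

```python
import re

# Four double-quoted, whitespace-separated columns.
RULE_LINE = re.compile(r'"([^"]*)"\s+"([^"]*)"\s+"([^"]*)"\s+"([^"]*)"')


def parse_rule(line):
    """Parse one rule row, discarding any end-of-line comment."""
    line = line.split("//", 1)[0].strip()
    match = RULE_LINE.fullmatch(line)
    if match is None:
        raise ValueError("expected 4 double-quoted columns: " + line)
    pattern, left, right, phonetic = match.groups()
    return {"pattern": pattern, "left": left, "right": right, "phonetic": phonetic}


def expand_phonetic(phonetic):
    """Turn "(ex|eks[english])" into [(phoneme, language or None), ...]."""
    if phonetic.startswith("(") and phonetic.endswith(")"):
        phonetic = phonetic[1:-1]
    alternatives = []
    for alt in phonetic.split("|"):
        match = re.fullmatch(r"([^\[\]]*)(?:\[([^\]]*)\])?", alt)
        if match is None:
            raise ValueError("malformed alternative: " + alt)
        alternatives.append((match.group(1), match.group(2)))
    return alternatives


def rule_applies(rule, word, i):
    """Check the pattern at position i together with both contexts."""
    end = i + len(rule["pattern"])
    if word[i:end] != rule["pattern"]:
        return False
    # Left context must match at the end of the text before the pattern.
    left_ok = re.search("(?:" + rule["left"] + ")$", word[:i]) is not None
    # Right context must match at the start of the text after the pattern.
    right_ok = re.match(rule["right"], word[end:]) is not None
    return left_ok and right_ok


rule = parse_rule('"ex" "^" "" "(ex|eks[english]|iks[english])"')
print(expand_phonetic(rule["phonetic"]))
print(rule_applies(rule, "exner", 0))  # True: "ex" at the start of the word
print(rule_applies(rule, "alex", 2))   # False: left context "^" requires the word start
```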

3. Finally, it applies language-independent rules regarding such things as voiced and unvoiced consonants and vowels to further ensure the reliability of the matches.

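Using BMPM in Solr: To use BMPM in Solr, add the BeiderMorseFilterFactory filter to the analyzer of the relevant field type in Solr’s schema.xml. A typical configuration, following the example in the Solr Reference Guide (adjust the attribute values to your needs):

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto">
  </filter>
</analyzer>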

RuleType

APPROX : Approximate rules, which will lead to the largest number of phonetic interpretations.
EXACT : Exact rules, which will lead to a minimum number of phonetic interpretations.

NameType

Supported types of names. Unless you are matching particular family names, use GENERIC. The GENERIC NameType should work reasonably well for non-name words. The other encodings are specifically tuned to family names, and may not work well at all for general text.
ASHKENAZI (ash) : Ashkenazi family names.
GENERIC (gen) : Generic names and words.
SEPHARDIC (sep) : Sephardic family names.

Languages currently supported for BMPM searching in Solr:

English
French
German
Greek
Hebrew written in Hebrew letters
Hungarian
Italian
Lithuanian and Latvian
Polish
Romanian
Russian written in Cyrillic letters
Russian transliterated into English letters
Spanish
Turkish

Conclusion: Phonetic search increases recall, but at the cost of lower precision. We can address this by ranking the results properly (correctly spelled results at the top of the list) or by providing options in the user interface to limit phonetic search results.
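The ranking idea can be illustrated with a short Python sketch. It assumes the hits have already been returned by a phonetic search and simply moves exact (case-insensitive) matches to the top; the function name and the sample hits are hypothetical. In Solr itself, a similar effect is usually achieved at query time by boosting a non-phonetic copy of the field.

```python
def rank_hits(query, hits):
    """Put exact (case-insensitive) matches first, keeping the original order otherwise."""
    wanted = query.casefold()
    # sorted() is stable, and False sorts before True, so exact matches move
    # to the front while the remaining hits keep their relative order.
    return sorted(hits, key=lambda hit: hit.casefold() != wanted)


print(rank_hits("advantage", ["avantaje", "Advantage", "advantege"]))
# ['Advantage', 'avantaje', 'advantege']
```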

Tags: Beider Morse Phonetic Matching, Phonetic Search, Search Engine, Solr


Vijay Mhaskar August 7, 2015, 8:23 am
Here the "lang" parameter specifies the original language. If we do not provide this value, the original language will be guessed. Increasing the number of rules will increase recall for a query, which may add more processing at every stage of the search.
sachin shinde August 7, 2015, 8:15 am
Nice article... simplifying search. Does it affect performance if support for more languages is included?