Beider Morse Phonetic Matching in Solr


By: Vijay Mhaskar | April 30, 2015

Phonetic search overview

Phonetic search finds terms that sound like the search term even when they are spelled differently, e.g. a search for AVANTAJE can match a document that actually contains AVANTAGE. Solr has built-in support for several phonetic matching algorithms:

  1. Beider-Morse Phonetic Matching (BMPM)
  2. Daitch-Mokotoff Soundex
  3. Double Metaphone
  4. Metaphone
  5. Soundex
  6. Refined Soundex
  7. Caverphone
  8. Kölner Phonetik a.k.a. Cologne Phonetic
  9. NYSIIS
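
To make the idea of "phonetically equivalent" concrete, here is a minimal Python sketch of classic Soundex, the simplest algorithm in the list above. It is not BMPM and it is not Solr's implementation; it only illustrates how two different spellings can collapse to the same code. The function name is my own.

```python
def soundex(word):
    """Return the classic 4-character Soundex code of a word."""
    # Consonant groups that share a digit; vowels, H, W and Y have no digit.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""

    result = word[0]                  # the first letter is kept as-is
    prev = codes.get(word[0], "")     # digit of the previous coded letter
    for ch in word[1:]:
        if ch in "HW":
            continue                  # H and W do not separate equal digits
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                   # a vowel resets prev to ""
    return (result + "000")[:4]       # pad or truncate to 4 characters


# Both spellings from the example above get the same code.
print(soundex("AVANTAJE"), soundex("AVANTAGE"))  # A153 A153
```

Soundex ignores the language of the word entirely, which is one of the things BMPM tries to improve on.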

Introduction to Beider Morse Phonetic Matching (BMPM)

In this post I will give an overview of Beider-Morse Phonetic Matching (BMPM) and show how to configure it in Solr. BMPM was developed by Alexander Beider and Stephen P. Morse. It is designed to return fewer irrelevant matches than older phonetic codecs such as Soundex, Metaphone and Caverphone, mainly because it takes the likely language of a term into account. It works in three steps:

  1. First, BMPM attempts to determine the language of the term. For that purpose it uses predefined rules to guess the possible languages that a word originates from. For example, if a word ends in “ault”, it infers that the word is French. These rules are stored in UTF-8 encoded resource files, which are systematically named following the pattern org/apache/commons/codec/language/bm/lang.txt.

The format of these resources is the following:

  • Rules: whitespace separated strings. There should be 3 columns to each row, and these will be interpreted as:
    • pattern: a regular expression.
    • languages: a ‘+’-separated list of languages.
    • acceptOnMatch: ‘true’ or ‘false’ indicating if a match rules in or rules out the language.
  • End-of-line comments: Any occurrence of ‘//’ will cause all text following on that line to be discarded as a comment.
  • Multi-line comments: Any line starting with ‘/*’ will start multi-line commenting mode. This will skip all content until a line ending in ‘*/’ is found.
  • Blank lines: All blank lines will be skipped.
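
As a rough illustration of this format, the Python sketch below parses rule lines and applies them to a word. The sample rules and the language set are invented for the example and are not copied from the real resource files. I am also assuming that an accepting rule narrows the candidates to the rule's languages and a rejecting rule removes them; the actual library may differ in details.

```python
import re

# Hypothetical rules, written in the format described above.
SAMPLE_RULES = """
/* Invented rules,
   for illustration only */
ault$   french          true    // words ending in "ault"
^sch    german+english  true

w       french          false
"""


def parse_language_rules(text):
    """Parse rule lines into (compiled pattern, languages, accept) tuples."""
    rules = []
    in_block_comment = False
    for raw in text.splitlines():
        line = raw.strip()
        if in_block_comment:
            # Skip everything up to and including the line ending in '*/'.
            if line.endswith("*/"):
                in_block_comment = False
            continue
        if line.startswith("/*"):
            # My choice: a comment closed on the same line ends right away.
            in_block_comment = not line.endswith("*/")
            continue
        line = line.split("//", 1)[0].strip()   # drop end-of-line comment
        if not line:
            continue                             # blank lines are skipped
        parts = line.split()
        if len(parts) != 3:
            raise ValueError("expected 3 columns: %r" % raw)
        pattern, languages, accept = parts
        rules.append((re.compile(pattern), set(languages.split("+")),
                      accept == "true"))
    return rules


def guess_languages(word, rules, all_languages):
    """Start from every language and let each matching rule rule in or out."""
    candidates = set(all_languages)
    for pattern, languages, accept in rules:
        if pattern.search(word.lower()):
            if accept:
                candidates &= languages   # match rules the languages in
            else:
                candidates -= languages   # match rules the languages out
    return candidates


rules = parse_language_rules(SAMPLE_RULES)
languages = {"english", "french", "german"}
print(guess_languages("Renault", rules, languages))
print(guess_languages("Schwarz", rules, languages))
```

With these invented rules, "Renault" is narrowed down to French only, while a word that matches no rule keeps every language as a candidate.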

  2. After detecting the language, it applies the phonetic rules for that particular language and transliterates the name into a phonetic alphabet. If the language cannot be determined with a fair degree of certainty, it uses generic phonetic rules instead. The phonetic spelling of an input is represented using upper- and lower-case Roman characters. If there are multiple possible phonetic representations, they are joined with a pipe (|) character.

Each phonetic rule has the following format:

  • Rules: whitespace separated, double-quoted strings. There should be 4 columns to each row, and these will be interpreted as:
           "pattern" "left context" "right context" "phonetic"
    For example:
           "ex" "^" "" "(ex|eks[english]|iks[english])"
    • pattern: a sequence of characters that might appear in the word to be transliterated.
    • left context: the context that precedes the pattern.
    • right context: the context that follows the pattern.
    • phonetic: the result that this rule generates.
  • End-of-line comments: Any occurrence of ‘//’ will cause all text following on that line to be discarded as a comment.
  • Multi-line comments: Any line starting with ‘/*’ will start multi-line commenting mode. This will skip all content until a line ending in ‘*/’ is found.
  • Blank lines: All blank lines will be skipped.
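
The following Python sketch shows one way to read such a rule and test it against a word. It is only an approximation for illustration: the function names are mine, and I am assuming that the left context is a regular expression that must match the end of the text before the pattern, and the right context one that must match the start of the text after it.

```python
import re


def parse_phonetic_rule(line):
    """Split a rule line into its four double-quoted columns."""
    line = line.split("//", 1)[0].strip()     # drop end-of-line comment
    fields = re.findall(r'"([^"]*)"', line)
    if len(fields) != 4:
        raise ValueError("expected 4 quoted columns: %r" % line)
    pattern, left, right, phonetic = fields
    return {"pattern": pattern, "left": left, "right": right,
            "phonetic": phonetic}


def expand_phonetic(phonetic):
    """Turn '(a|b[lang])' into [(text, languages or None), ...]."""
    if phonetic.startswith("(") and phonetic.endswith(")"):
        phonetic = phonetic[1:-1]
    alternatives = []
    for alt in phonetic.split("|"):
        m = re.fullmatch(r"([^\[\]]*)(?:\[([^\]]*)\])?", alt)
        if m is None:
            raise ValueError("malformed alternative: %r" % alt)
        text, langs = m.group(1), m.group(2)
        # None means the alternative is not restricted to any language.
        alternatives.append((text, set(langs.split("+")) if langs else None))
    return alternatives


def matches_at(rule, word, i):
    """Check whether the rule applies to word at index i."""
    pattern = rule["pattern"]
    if not word.startswith(pattern, i):
        return False
    before, after = word[:i], word[i + len(pattern):]
    return (re.search(rule["left"] + "$", before) is not None
            and re.search("^" + rule["right"], after) is not None)


rule = parse_phonetic_rule('"ex" "^" "" "(ex|eks[english]|iks[english])"')
print(matches_at(rule, "example", 0))   # True: "ex" at the start of the word
print(matches_at(rule, "flex", 2))      # False: "ex" is not at the start
print(expand_phonetic(rule["phonetic"]))
```

Under these assumptions, the sample rule fires only for a word-initial "ex", and two of its three phonetic alternatives are restricted to English.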

  3. Finally, it applies language-independent rules regarding such things as voiced and unvoiced consonants and vowels to further ensure the reliability of the matches.
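
Since multiple phonetic representations are joined with a pipe, comparing two encoded terms comes down to comparing sets of alternatives. The short Python sketch below assumes that two terms count as a match when their encodings share at least one alternative. The encoded strings are made-up placeholders, not real BMPM output.

```python
def alternatives(encoding):
    """Split a pipe-joined encoding into a set of phonetic spellings."""
    return {alt for alt in encoding.split("|") if alt}


def phonetic_match(encoding_a, encoding_b):
    """Assumed matching rule: at least one alternative in common."""
    return bool(alternatives(encoding_a) & alternatives(encoding_b))


# Placeholder encodings, invented for the example.
indexed = "avantazi|avantagi"
queried = "avantazi|avantaxi"
print(phonetic_match(indexed, queried))       # True: "avantazi" is shared
print(phonetic_match(indexed, "smit|smid"))   # False: nothing in common
```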

Using BMPM in Solr

To use BMPM in Solr, add the following filter to an analyzer in Solr’s schema.xml:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
</analyzer>

RuleType

  • APPROX : Approximate rules, which will lead to the largest number of phonetic interpretations.
  • EXACT : Exact rules, which will lead to a minimum number of phonetic interpretations.

NameType

NameType selects the type of names to match. Unless you are matching particular family names, use GENERIC. The GENERIC NameType should work reasonably well for non-name words. The other encodings are specifically tuned to family names, and may not work well at all for general text.

  • ASHKENAZI (ash) : Ashkenazi family names.
  • GENERIC (gen) : Generic names and words.
  • SEPHARDIC (sep) : Sephardic family names.

Currently supported languages

For Solr, BMPM searching is available for the following languages:

  • English
  • French
  • German
  • Greek
  • Hebrew written in Hebrew letters
  • Hungarian
  • Italian
  • Lithuanian and Latvian
  • Polish
  • Romanian
  • Russian written in Cyrillic letters
  • Russian transliterated into English letters
  • Spanish
  • Turkish

Conclusion

Phonetic search increases recall, but it lowers precision. We can address this by ranking the results properly (correctly spelled results at the top of the list), or by providing options in the user interface to limit phonetic search results.
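
One possible way to put correctly spelled results at the top is to search an exact field and a phonetic field together and boost the exact one. The Python sketch below only builds the query parameters for Solr's edismax parser. The field names name and name_phonetic and the boost value are hypothetical; they assume the same content is indexed once without and once with the BMPM filter.

```python
from urllib.parse import urlencode


def build_query(term, exact_field="name", phonetic_field="name_phonetic",
                exact_boost=10, include_phonetic=True):
    """Build edismax parameters that rank exact matches above phonetic ones."""
    fields = ["%s^%d" % (exact_field, exact_boost)]   # boosted exact field
    if include_phonetic:
        # The user interface can switch this off to limit phonetic results.
        fields.append(phonetic_field)
    return {"q": term, "defType": "edismax", "qf": " ".join(fields)}


params = build_query("avantaje")
print(params["qf"])        # name^10 name_phonetic
print(urlencode(params))   # query string to append to the select URL
```

The include_phonetic flag corresponds to the second option mentioned above: letting the user turn phonetic matching off.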

4 thoughts on “Beider Morse Phonetic Matching in Solr”

    1. Vijay Mhaskar Post author

      Here the “lang” parameter specifies the original language. If we do not provide this value, the original language will be guessed. Increasing the number of rules will surely increase recall for a query, which may add more processing at every stage of the search.
