Home » Language analysis in Solr – Ultimate Solr Guide

Language analysis in Solr – Ultimate Solr Guide

Hi All, today I’m presenting another post on solr pertaining to language analysis. In most business applications, there comes a scenario where the business needs to deal with data in multiple languages, the most common scenario being dealing with customers of different geographies. Solr helps to deal with multiple languages in a unique way.

This section contains information about tokenizers and filters related to character set conversion or for use with specific languages.

For the European languages, tokenization is fairly straightforward. Tokens are delimited by white space and/or a relatively small set of punctuation characters.

In other languages the tokenization rules are often not so simple. Some European languages may also require special tokenization rules, such as rules for decompounding German words.

KeywordMarkerFilterFactory

Protects words from being modified by stemmers. A customized protected word list may be specified with the “protected” attribute in the schema. Any words in the protected word list will not be modified by any stemmer in Solr.

KeywordRepeatFilterFactory

Emits each token twice, one with the KEYWORD attribute and once without.

If placed before a stemmer, the result will be that you will get the unstemmed token preserved on the same position as the stemmed one. Queries matching the original exact term will get a better score while still maintaining the recall benefit of stemming. Another advantage of keeping the original token is that wildcard truncation will work as expected.

To configure, add the KeywordRepeatFilterFactory early in the analysis chain. It is recommended to also include RemoveDuplicatesTokenFilterFactory to avoid duplicates when tokens are not stemmed.

A sample fieldType configuration could look like this:

StemmerOverrideFilterFactory

Overrides stemming algorithms by applying a custom mapping, then protecting these terms from being modified by stemmers.

A customized mapping of words to stems, in a tab-separated file, can be specified to the “dictionary” attribute in the schema. Words in this mapping will be stemmed to the stems from the file, and will not be further changed by any stemmer.

Dictionary Compound Word Token Filter

This filter splits, or decompounds, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position.

Compound words are most commonly found in Germanic languages.

Factory class: solr.DictionaryCompoundWordTokenFilterFactory

Arguments:

dictionary
(required) The path of a file that contains a list of simple words, one per line. Blank lines and lines that begin with “#” are ignored. This path may be an absolute path, or path relative to the Solr config directory.
minWordSize
(integer, default 5) Any token shorter than this is not decompounded.
minSubwordSize
(integer, default 2) Subwords shorter than this are not emitted as tokens.
maxSubwordSize
(integer, default 15) Subwords longer than this are not emitted as tokens.
onlyLongestMatch
(true/false) If true (the default), only the longest matching subwords will generate new tokens.

Example:

Assume that germanwords.txt contains at least the following words: dumm kopf donau dampf schiff

In: “Donaudampfschiff dummkopf”

Tokenizer to Filter: “Donaudampfschiff”(1), “dummkopf”(2),

Out: “Donaudampfschiff”(1), “Donau”(1), “dampf”(1), “schiff”(1), “dummkopf”(2), “dumm”(2), “kopf”(2)

Unicode Collation

Unicode Collation is a language-sensitive method of sorting text that can also be used for advanced search purposes.

Unicode Collation in Solr is fast, because all the work is done at index time.

Rather than specifying an analyzer within <fieldtype …​ class="solr.TextField">, the solr.CollationField and solr.ICUCollationField field type classes provide this functionality. solr.ICUCollationField, which is backed by the ICU4J library, provides more flexible configuration, has more locales, is significantly faster, and requires less memory and less index space, since its keys are smaller than those produced by the JDK implementation that backs solr.CollationField.

solr.ICUCollationField is included in the Solr analysis-extras contrib – see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib in order to use it.

solr.ICUCollationField and solr.CollationField fields can be created in two ways:

  • Based upon a system collator associated with a Locale.
  • Based upon a tailored RuleBasedCollator ruleset.

Arguments for solr.ICUCollationField, specified as attributes within the <fieldtype> element:

Using a System collator:

locale
(required) RFC 3066 locale ID. See the ICU locale explorer for a list of supported locales.
strength
Valid values are primarysecondarytertiaryquaternary, or identical. See Comparison Levels in ICU Collation Concepts for more information.
decomposition
Valid values are no or canonical. See Normalization in ICU Collation Concepts for more information.

Using a Tailored ruleset:

custom
(required) Path to a UTF-8 text file containing rules supported by the ICU RuleBasedCollator
strength
Valid values are primarysecondarytertiaryquaternary, or identical. See Comparison Levels in ICU Collation Concepts for more information.
decomposition
Valid values are no or canonical. See Normalization in ICU Collation Concepts for more information.

Expert options:

alternate
Valid values are shifted or non-ignorable. Can be used to ignore punctuation/whitespace.
caseLevel
(true/false) If true, in combination with strength="primary", accents are ignored but case is taken into account. The default is false. See CaseLevel in ICU Collation Concepts for more information.
caseFirst
Valid values are lower or upper. Useful to control which is sorted first when case is not ignored.
numeric
(true/false) If true, digits are sorted according to numeric value, e.g. foobar-9 sorts before foobar-10. The default is false.
variableTop
Single character or contraction. Controls what is variable for alternate.
So, this is it for now, will be back with another post very soon.