{"id":775,"date":"2020-04-04T17:06:26","date_gmt":"2020-04-04T17:06:26","guid":{"rendered":"https:\/\/www.aeologic.com\/blog\/?p=775"},"modified":"2020-04-04T17:15:35","modified_gmt":"2020-04-04T17:15:35","slug":"language-analysis-in-solr-ultimate-solr-guide","status":"publish","type":"post","link":"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/","title":{"rendered":"Language analysis in Solr &#8211; Ultimate Solr Guide"},"content":{"rendered":"<p>Hi All, today I&#8217;m presenting another post on solr pertaining to language analysis. In most business applications, there comes a scenario where the business needs to deal with data in multiple languages, the most common scenario being dealing with customers of different geographies. Solr helps to deal with multiple languages in a unique way.<\/p>\n<div class=\"paragraph\">\n<p>This section contains information about tokenizers and filters related to character set conversion or for use with specific languages.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>For the European languages, tokenization is fairly straightforward. Tokens are delimited by white space and\/or a relatively small set of punctuation characters.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>In other languages the tokenization rules are often not so simple. Some European languages may also require special tokenization rules, such as rules for decompounding German words.<\/p>\n<h2 id=\"LanguageAnalysis-KeywordMarkerFilterFactory\" class=\"clickable-header top-level-header\">KeywordMarkerFilterFactory<\/h2>\n<p>Protects words from being modified by stemmers. A customized protected word list may be specified with the &#8220;protected&#8221; attribute in the schema. 
Any words in the protected word list will not be modified by any stemmer in Solr.<\/p>\n<p><img fetchpriority=\"high\" decoding=\"async\" class=\"alignnone wp-image-776\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-98.png\" alt=\"\" width=\"662\" height=\"251\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-98.png 1648w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-98-300x114.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-98-768x292.png 768w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-98-1024x389.png 1024w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-98-720x273.png 720w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-98-1180x448.png 1180w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-98-260x99.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-98-211x80.png 211w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-98-250x95.png 250w\" sizes=\"(max-width: 662px) 100vw, 662px\" \/><\/p>\n<h2 id=\"LanguageAnalysis-KeywordRepeatFilterFactory\" class=\"clickable-header top-level-header\">KeywordRepeatFilterFactory<\/h2>\n<div class=\"paragraph\">\n<p>Emits each token twice, once with the\u00a0<code>KEYWORD<\/code>\u00a0attribute and once without.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>If placed before a stemmer, the unstemmed token is preserved at the same position as the stemmed one. Queries matching the original exact term will get a better score while still maintaining the recall benefit of stemming. Another advantage of keeping the original token is that wildcard truncation will work as expected.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>To configure, add the\u00a0<code>KeywordRepeatFilterFactory<\/code>\u00a0early in the analysis chain. 
It is recommended to also include\u00a0<code>RemoveDuplicatesTokenFilterFactory<\/code>\u00a0to avoid duplicates when tokens are not stemmed.<\/p>\n<p>A sample fieldType configuration could look like this:<\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-777\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-99.png\" alt=\"\" width=\"681\" height=\"331\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-99.png 681w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-99-300x146.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-99-260x126.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-99-165x80.png 165w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-99-250x122.png 250w\" sizes=\"(max-width: 681px) 100vw, 681px\" \/><\/p>\n<h2 id=\"LanguageAnalysis-StemmerOverrideFilterFactory\" class=\"clickable-header top-level-header\">StemmerOverrideFilterFactory<\/h2>\n<div class=\"paragraph\">\n<p>Overrides stemming algorithms by applying a custom mapping, then protecting these terms from being modified by stemmers.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>A customized mapping of words to stems, in a tab-separated file, can be specified to the &#8220;dictionary&#8221; attribute in the schema. 
Words in this mapping will be stemmed to the stems from the file, and will not be further changed by any stemmer.<\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-778\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-100.png\" alt=\"\" width=\"842\" height=\"313\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-100.png 842w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-100-300x112.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-100-768x285.png 768w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-100-720x268.png 720w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-100-260x97.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-100-215x80.png 215w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-100-250x93.png 250w\" sizes=\"(max-width: 842px) 100vw, 842px\" \/><\/p>\n<h2 id=\"LanguageAnalysis-DictionaryCompoundWordTokenFilter\" class=\"clickable-header top-level-header\">Dictionary Compound Word Token Filter<\/h2>\n<div class=\"paragraph\">\n<p>This filter splits, or\u00a0<em>decompounds<\/em>, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>Compound words are most commonly found in Germanic languages.<\/p>\n<div class=\"paragraph\">\n<p><strong>Factory class:<\/strong>\u00a0<code>solr.DictionaryCompoundWordTokenFilterFactory<\/code><\/p>\n<\/div>\n<div class=\"paragraph\">\n<p><strong>Arguments:<\/strong><\/p>\n<\/div>\n<div class=\"dlist\">\n<dl>\n<dt class=\"hdlist1\"><code>dictionary<\/code><\/dt>\n<dd>(required) The path of a file that contains a list of simple words, one per line. 
Blank lines and lines that begin with &#8220;#&#8221; are ignored. This path may be an absolute path, or path relative to the Solr config directory.<\/dd>\n<dt class=\"hdlist1\"><code>minWordSize<\/code><\/dt>\n<dd>(integer, default 5) Any token shorter than this is not decompounded.<\/dd>\n<dt class=\"hdlist1\"><code>minSubwordSize<\/code><\/dt>\n<dd>(integer, default 2) Subwords shorter than this are not emitted as tokens.<\/dd>\n<dt class=\"hdlist1\"><code>maxSubwordSize<\/code><\/dt>\n<dd>(integer, default 15) Subwords longer than this are not emitted as tokens.<\/dd>\n<dt class=\"hdlist1\"><code>onlyLongestMatch<\/code><\/dt>\n<dd>(true\/false) If true (the default), only the longest matching subwords will generate new tokens.<\/dd>\n<\/dl>\n<div class=\"paragraph\">\n<p><strong>Example:<\/strong><\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>Assume that\u00a0<code>germanwords.txt<\/code>\u00a0contains at least the following words:\u00a0<code>dumm kopf donau dampf schiff<\/code><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-779\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-2020-04-04T223054.317.png\" alt=\"\" width=\"943\" height=\"259\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-2020-04-04T223054.317.png 943w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-2020-04-04T223054.317-300x82.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-2020-04-04T223054.317-768x211.png 768w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-2020-04-04T223054.317-720x198.png 720w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-2020-04-04T223054.317-260x71.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-2020-04-04T223054.317-291x80.png 291w, 
https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/carbon-2020-04-04T223054.317-250x69.png 250w\" sizes=\"(max-width: 943px) 100vw, 943px\" \/><\/p>\n<div class=\"paragraph\">\n<p><strong>In:<\/strong>\u00a0&#8220;Donaudampfschiff dummkopf&#8221;<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p><strong>Tokenizer to Filter:<\/strong>\u00a0&#8220;Donaudampfschiff&#8221;(1), &#8220;dummkopf&#8221;(2),<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p><strong>Out:<\/strong>\u00a0&#8220;Donaudampfschiff&#8221;(1), &#8220;Donau&#8221;(1), &#8220;dampf&#8221;(1), &#8220;schiff&#8221;(1), &#8220;dummkopf&#8221;(2), &#8220;dumm&#8221;(2), &#8220;kopf&#8221;(2)<\/p>\n<h2 id=\"LanguageAnalysis-UnicodeCollation\" class=\"clickable-header top-level-header\">Unicode Collation<\/h2>\n<div class=\"sectionbody\">\n<div class=\"paragraph\">\n<p>Unicode Collation is a language-sensitive method of sorting text that can also be used for advanced search purposes.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>Unicode Collation in Solr is fast, because all the work is done at index time.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>Rather than specifying an analyzer within\u00a0<code>&lt;fieldtype \u2026\u200b class=\"solr.TextField\"&gt;<\/code>, the\u00a0<code>solr.CollationField<\/code>\u00a0and\u00a0<code>solr.ICUCollationField<\/code>\u00a0field type classes provide this functionality.\u00a0<code>solr.ICUCollationField<\/code>, which is backed by\u00a0<a href=\"http:\/\/site.icu-project.org\/\">the ICU4J library<\/a>, provides more flexible configuration, has more locales, is significantly faster, and requires less memory and less index space, since its keys are smaller than those produced by the JDK implementation that backs\u00a0<code>solr.CollationField<\/code>.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p><code>solr.ICUCollationField<\/code>\u00a0is included in the Solr\u00a0<code>analysis-extras<\/code>\u00a0contrib &#8211; 
see\u00a0<code>solr\/contrib\/analysis-extras\/README.txt<\/code>\u00a0for instructions on which jars you need to add to your\u00a0<code>SOLR_HOME\/lib<\/code>\u00a0in order to use it.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p><code>solr.ICUCollationField<\/code>\u00a0and\u00a0<code>solr.CollationField<\/code>\u00a0fields can be created in two ways:<\/p>\n<\/div>\n<div class=\"ulist\">\n<ul>\n<li>Based upon a system collator associated with a Locale.<\/li>\n<li>Based upon a tailored\u00a0<code>RuleBasedCollator<\/code>\u00a0ruleset.<\/li>\n<\/ul>\n<\/div>\n<div class=\"paragraph\">\n<p><strong>Arguments for\u00a0<code>solr.ICUCollationField<\/code>, specified as attributes within the\u00a0<code>&lt;fieldtype&gt;<\/code>\u00a0element:<\/strong><\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>Using a System collator:<\/p>\n<\/div>\n<div class=\"dlist\">\n<dl>\n<dt class=\"hdlist1\"><code>locale<\/code><\/dt>\n<dd>(required)\u00a0<a href=\"http:\/\/www.rfc-editor.org\/rfc\/rfc3066.txt\">RFC 3066<\/a>\u00a0locale ID. See\u00a0<a href=\"http:\/\/demo.icu-project.org\/icu-bin\/locexp\">the ICU locale explorer<\/a>\u00a0for a list of supported locales.<\/dd>\n<dt class=\"hdlist1\"><code>strength<\/code><\/dt>\n<dd>Valid values are\u00a0<code>primary<\/code>,\u00a0<code>secondary<\/code>,\u00a0<code>tertiary<\/code>,\u00a0<code>quaternary<\/code>, or\u00a0<code>identical<\/code>. See\u00a0<a href=\"http:\/\/userguide.icu-project.org\/collation\/concepts#TOC-Comparison-Levels\">Comparison Levels in ICU Collation Concepts<\/a>\u00a0for more information.<\/dd>\n<dt class=\"hdlist1\"><code>decomposition<\/code><\/dt>\n<dd>Valid values are\u00a0<code>no<\/code>\u00a0or\u00a0<code>canonical<\/code>. 
See\u00a0<a href=\"http:\/\/userguide.icu-project.org\/collation\/concepts#TOC-Normalization\">Normalization in ICU Collation Concepts<\/a>\u00a0for more information.<\/dd>\n<\/dl>\n<\/div>\n<div class=\"paragraph\">\n<p>Using a Tailored ruleset:<\/p>\n<\/div>\n<div class=\"dlist\">\n<dl>\n<dt class=\"hdlist1\"><code>custom<\/code><\/dt>\n<dd>(required) Path to a UTF-8 text file containing rules supported by the ICU\u00a0<a href=\"http:\/\/icu-project.org\/apiref\/icu4j\/com\/ibm\/icu\/text\/RuleBasedCollator.html\"><code>RuleBasedCollator<\/code><\/a><\/dd>\n<dt class=\"hdlist1\"><code>strength<\/code><\/dt>\n<dd>Valid values are\u00a0<code>primary<\/code>,\u00a0<code>secondary<\/code>,\u00a0<code>tertiary<\/code>,\u00a0<code>quaternary<\/code>, or\u00a0<code>identical<\/code>. See\u00a0<a href=\"http:\/\/userguide.icu-project.org\/collation\/concepts#TOC-Comparison-Levels\">Comparison Levels in ICU Collation Concepts<\/a>\u00a0for more information.<\/dd>\n<dt class=\"hdlist1\"><code>decomposition<\/code><\/dt>\n<dd>Valid values are\u00a0<code>no<\/code>\u00a0or\u00a0<code>canonical<\/code>. See\u00a0<a href=\"http:\/\/userguide.icu-project.org\/collation\/concepts#TOC-Normalization\">Normalization in ICU Collation Concepts<\/a>\u00a0for more information.<\/dd>\n<\/dl>\n<\/div>\n<div class=\"paragraph\">\n<p>Expert options:<\/p>\n<\/div>\n<div class=\"dlist\">\n<dl>\n<dt class=\"hdlist1\"><code>alternate<\/code><\/dt>\n<dd>Valid values are\u00a0<code>shifted<\/code>\u00a0or\u00a0<code>non-ignorable<\/code>. Can be used to ignore punctuation\/whitespace.<\/dd>\n<dt class=\"hdlist1\"><code>caseLevel<\/code><\/dt>\n<dd>(true\/false) If true, in combination with\u00a0<code>strength=\"primary\"<\/code>, accents are ignored but case is taken into account. The default is false. 
See\u00a0<a href=\"http:\/\/userguide.icu-project.org\/collation\/concepts#TOC-CaseLevel\">CaseLevel in ICU Collation Concepts<\/a>\u00a0for more information.<\/dd>\n<dt class=\"hdlist1\"><code>caseFirst<\/code><\/dt>\n<dd>Valid values are\u00a0<code>lower<\/code>\u00a0or\u00a0<code>upper<\/code>. Useful to control which is sorted first when case is not ignored.<\/dd>\n<dt class=\"hdlist1\"><code>numeric<\/code><\/dt>\n<dd>(true\/false) If true, digits are sorted according to numeric value, e.g. foobar-9 sorts before foobar-10. The default is false.<\/dd>\n<dt class=\"hdlist1\"><code>variableTop<\/code><\/dt>\n<dd>Single character or contraction. Controls what is variable for\u00a0<code>alternate<\/code>.<\/dd>\n<\/dl>\n<\/div>\n<div><\/div>\n<div class=\"sect2\">So, this is it for now, will be back with another post very soon.<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Hi All, today I&#8217;m presenting another post on solr pertaining to language analysis. In most business applications, there comes a scenario where the business needs to deal with data in multiple languages, the most common scenario being dealing with customers of different geographies. Solr helps to deal with multiple languages in a unique way. This section contains information about tokenizers and filters related to character set conversion or for use with specific languages. For the European languages, tokenization is fairly straightforward. Tokens are delimited by white space and\/or a relatively small set of punctuation characters. In other languages the tokenization rules are often not so simple. Some European languages may also require special tokenization rules, such as rules for decompounding German words. KeywordMarkerFilterFactory Protects words from being modified by stemmers. A customized protected word list may be specified with the &#8220;protected&#8221; attribute in the schema. 
Any words in the protected word list will not be modified by any stemmer in Solr. KeywordRepeatFilterFactory Emits each token twice, once with the\u00a0KEYWORD\u00a0attribute and once without. If placed before a stemmer, the unstemmed token is preserved at the same position as the stemmed one. Queries matching the original exact term will get a better score while still maintaining the recall benefit of stemming. Another advantage of keeping the original token is that wildcard truncation will work as expected. To configure, add the\u00a0KeywordRepeatFilterFactory\u00a0early in the analysis chain. It is recommended to also include\u00a0RemoveDuplicatesTokenFilterFactory\u00a0to avoid duplicates when tokens are not stemmed. A sample fieldType configuration could look like this: StemmerOverrideFilterFactory Overrides stemming algorithms by applying a custom mapping, then protecting these terms from being modified by stemmers. A customized mapping of words to stems, in a tab-separated file, can be specified to the &#8220;dictionary&#8221; attribute in the schema. Words in this mapping will be stemmed to the stems from the file, and will not be further changed by any stemmer. Dictionary Compound Word Token Filter This filter splits, or\u00a0decompounds, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position. Compound words are most commonly found in Germanic languages. Factory class:\u00a0solr.DictionaryCompoundWordTokenFilterFactory Arguments: dictionary (required) The path of a file that contains a list of simple words, one per line. Blank lines and lines that begin with &#8220;#&#8221; are ignored. This path may be an absolute path, or path relative to the Solr config directory. 
minWordSize (integer, default 5) Any token shorter than this is not decompounded. minSubwordSize (integer, default 2) Subwords shorter than this are not emitted as tokens. maxSubwordSize (integer, default 15) Subwords longer than this are not emitted as tokens. onlyLongestMatch (true\/false) If true (the default), only the longest matching subwords will generate new tokens. Example: Assume that\u00a0germanwords.txt\u00a0contains at least the following words:\u00a0dumm kopf donau dampf schiff In:\u00a0&#8220;Donaudampfschiff dummkopf&#8221; Tokenizer to Filter:\u00a0&#8220;Donaudampfschiff&#8221;(1), &#8220;dummkopf&#8221;(2), Out:\u00a0&#8220;Donaudampfschiff&#8221;(1), &#8220;Donau&#8221;(1), &#8220;dampf&#8221;(1), &#8220;schiff&#8221;(1), &#8220;dummkopf&#8221;(2), &#8220;dumm&#8221;(2), &#8220;kopf&#8221;(2) Unicode Collation Unicode Collation is a language-sensitive method of sorting text that can also be used for advanced search purposes. Unicode Collation in Solr is fast, because all the work is done at index time. Rather than specifying an analyzer within\u00a0&lt;fieldtype \u2026\u200b class=&#8221;solr.TextField&#8221;&gt;, the\u00a0solr.CollationField\u00a0and\u00a0solr.ICUCollationField\u00a0field type classes provide this functionality.\u00a0solr.ICUCollationField, which is backed by\u00a0the ICU4J library, provides more flexible configuration, has more locales, is significantly faster, and requires less memory and less index space, since its keys are smaller than those produced by the JDK implementation that backs\u00a0solr.CollationField. solr.ICUCollationField\u00a0is included in the Solr\u00a0analysis-extras\u00a0contrib &#8211; see\u00a0solr\/contrib\/analysis-extras\/README.txt\u00a0for instructions on which jars you need to add to your\u00a0SOLR_HOME\/lib\u00a0in order to use it. solr.ICUCollationField\u00a0and\u00a0solr.CollationField\u00a0fields can be created in two ways: Based upon a system collator associated with a Locale. 
Based upon a tailored\u00a0RuleBasedCollator\u00a0ruleset. Arguments for\u00a0solr.ICUCollationField, specified as attributes within the\u00a0&lt;fieldtype&gt;\u00a0element: Using a System collator: locale (required)\u00a0RFC 3066\u00a0locale ID. See\u00a0the ICU locale explorer\u00a0for a list of supported locales. strength Valid values are\u00a0primary,\u00a0secondary,\u00a0tertiary,\u00a0quaternary, or\u00a0identical. See\u00a0Comparison Levels in ICU Collation Concepts\u00a0for more information. decomposition Valid values are\u00a0no\u00a0or\u00a0canonical. See\u00a0Normalization in ICU Collation Concepts\u00a0for more information. Using a Tailored ruleset: custom (required) Path to a UTF-8 text file containing rules supported by the ICU\u00a0RuleBasedCollator strength Valid values are\u00a0primary,\u00a0secondary,\u00a0tertiary,\u00a0quaternary, or\u00a0identical. See\u00a0Comparison Levels in ICU Collation Concepts\u00a0for more information. decomposition Valid values are\u00a0no\u00a0or\u00a0canonical. See\u00a0Normalization in ICU Collation Concepts\u00a0for more information. Expert options: alternate Valid values are\u00a0shifted\u00a0or\u00a0non-ignorable. Can be used to ignore punctuation\/whitespace. caseLevel (true\/false) If true, in combination with\u00a0strength=&#8221;primary&#8221;, accents are ignored but case is taken into account. The default is false. See\u00a0CaseLevel in ICU Collation Concepts\u00a0for more information. caseFirst Valid values are\u00a0lower\u00a0or\u00a0upper. Useful to control which is sorted first when case is not ignored. numeric (true\/false) If true, digits are sorted according to numeric value, e.g. foobar-9 sorts before foobar-10. The default is false. variableTop Single character or contraction. Controls what is variable for\u00a0alternate. 
So, this is it for now, will be back with another post very soon.<\/p>\n","protected":false},"author":3,"featured_media":782,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[41],"tags":[102],"class_list":["post-775","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-solr","tag-language-detection-solr"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.1 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Language analysis in Solr - Ultimate Solr Guide - Aeologic Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Language analysis in Solr - Ultimate Solr Guide - Aeologic Blog\" \/>\n<meta property=\"og:description\" content=\"Hi All, today I&#8217;m presenting another post on solr pertaining to language analysis. In most business applications, there comes a scenario where the business needs to deal with data in multiple languages, the most common scenario being dealing with customers of different geographies. Solr helps to deal with multiple languages in a unique way. This section contains information about tokenizers and filters related to character set conversion or for use with specific languages. For the European languages, tokenization is fairly straightforward. Tokens are delimited by white space and\/or a relatively small set of punctuation characters. In other languages the tokenization rules are often not so simple. Some European languages may also require special tokenization rules, such as rules for decompounding German words. 
KeywordMarkerFilterFactory Protects words from being modified by stemmers. A customized protected word list may be specified with the &#8220;protected&#8221; attribute in the schema. Any words in the protected word list will not be modified by any stemmer in Solr. KeywordRepeatFilterFactory Emits each token twice, once with the\u00a0KEYWORD\u00a0attribute and once without. If placed before a stemmer, the unstemmed token is preserved at the same position as the stemmed one. Queries matching the original exact term will get a better score while still maintaining the recall benefit of stemming. Another advantage of keeping the original token is that wildcard truncation will work as expected. To configure, add the\u00a0KeywordRepeatFilterFactory\u00a0early in the analysis chain. It is recommended to also include\u00a0RemoveDuplicatesTokenFilterFactory\u00a0to avoid duplicates when tokens are not stemmed. A sample fieldType configuration could look like this: StemmerOverrideFilterFactory Overrides stemming algorithms by applying a custom mapping, then protecting these terms from being modified by stemmers. A customized mapping of words to stems, in a tab-separated file, can be specified to the &#8220;dictionary&#8221; attribute in the schema. Words in this mapping will be stemmed to the stems from the file, and will not be further changed by any stemmer. Dictionary Compound Word Token Filter This filter splits, or\u00a0decompounds, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position. Compound words are most commonly found in Germanic languages. Factory class:\u00a0solr.DictionaryCompoundWordTokenFilterFactory Arguments: dictionary (required) The path of a file that contains a list of simple words, one per line. 
Blank lines and lines that begin with &#8220;#&#8221; are ignored. This path may be an absolute path, or path relative to the Solr config directory. minWordSize (integer, default 5) Any token shorter than this is not decompounded. minSubwordSize (integer, default 2) Subwords shorter than this are not emitted as tokens. maxSubwordSize (integer, default 15) Subwords longer than this are not emitted as tokens. onlyLongestMatch (true\/false) If true (the default), only the longest matching subwords will generate new tokens. Example: Assume that\u00a0germanwords.txt\u00a0contains at least the following words:\u00a0dumm kopf donau dampf schiff In:\u00a0&#8220;Donaudampfschiff dummkopf&#8221; Tokenizer to Filter:\u00a0&#8220;Donaudampfschiff&#8221;(1), &#8220;dummkopf&#8221;(2), Out:\u00a0&#8220;Donaudampfschiff&#8221;(1), &#8220;Donau&#8221;(1), &#8220;dampf&#8221;(1), &#8220;schiff&#8221;(1), &#8220;dummkopf&#8221;(2), &#8220;dumm&#8221;(2), &#8220;kopf&#8221;(2) Unicode Collation Unicode Collation is a language-sensitive method of sorting text that can also be used for advanced search purposes. Unicode Collation in Solr is fast, because all the work is done at index time. Rather than specifying an analyzer within\u00a0&lt;fieldtype \u2026\u200b class=&quot;solr.TextField&quot;&gt;, the\u00a0solr.CollationField\u00a0and\u00a0solr.ICUCollationField\u00a0field type classes provide this functionality.\u00a0solr.ICUCollationField, which is backed by\u00a0the ICU4J library, provides more flexible configuration, has more locales, is significantly faster, and requires less memory and less index space, since its keys are smaller than those produced by the JDK implementation that backs\u00a0solr.CollationField. solr.ICUCollationField\u00a0is included in the Solr\u00a0analysis-extras\u00a0contrib &#8211; see\u00a0solr\/contrib\/analysis-extras\/README.txt\u00a0for instructions on which jars you need to add to your\u00a0SOLR_HOME\/lib\u00a0in order to use it. 
solr.ICUCollationField\u00a0and\u00a0solr.CollationField\u00a0fields can be created in two ways: Based upon a system collator associated with a Locale. Based upon a tailored\u00a0RuleBasedCollator\u00a0ruleset. Arguments for\u00a0solr.ICUCollationField, specified as attributes within the\u00a0&lt;fieldtype&gt;\u00a0element: Using a System collator: locale (required)\u00a0RFC 3066\u00a0locale ID. See\u00a0the ICU locale explorer\u00a0for a list of supported locales. strength Valid values are\u00a0primary,\u00a0secondary,\u00a0tertiary,\u00a0quaternary, or\u00a0identical. See\u00a0Comparison Levels in ICU Collation Concepts\u00a0for more information. decomposition Valid values are\u00a0no\u00a0or\u00a0canonical. See\u00a0Normalization in ICU Collation Concepts\u00a0for more information. Using a Tailored ruleset: custom (required) Path to a UTF-8 text file containing rules supported by the ICU\u00a0RuleBasedCollator strength Valid values are\u00a0primary,\u00a0secondary,\u00a0tertiary,\u00a0quaternary, or\u00a0identical. See\u00a0Comparison Levels in ICU Collation Concepts\u00a0for more information. decomposition Valid values are\u00a0no\u00a0or\u00a0canonical. See\u00a0Normalization in ICU Collation Concepts\u00a0for more information. Expert options: alternate Valid values are\u00a0shifted\u00a0or\u00a0non-ignorable. Can be used to ignore punctuation\/whitespace. caseLevel (true\/false) If true, in combination with\u00a0strength=&quot;primary&quot;, accents are ignored but case is taken into account. The default is false. See\u00a0CaseLevel in ICU Collation Concepts\u00a0for more information. caseFirst Valid values are\u00a0lower\u00a0or\u00a0upper. Useful to control which is sorted first when case is not ignored. numeric (true\/false) If true, digits are sorted according to numeric value, e.g. foobar-9 sorts before foobar-10. The default is false. variableTop Single character or contraction. Controls what is variable for\u00a0alternate. 
So, this is it for now, will be back with another post very soon.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/\" \/>\n<meta property=\"og:site_name\" content=\"Aeologic Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/AeoLogicTech\/\" \/>\n<meta property=\"article:published_time\" content=\"2020-04-04T17:06:26+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-04-04T17:15:35+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/Language-Analysis-in-Solr.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1080\" \/>\n\t<meta property=\"og:image:height\" content=\"622\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Manoj Kumar\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@aeologictech\" \/>\n<meta name=\"twitter:site\" content=\"@aeologictech\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Manoj Kumar\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":[\"Article\",\"BlogPosting\"],\"@id\":\"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/\"},\"author\":{\"name\":\"Manoj Kumar\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/#\/schema\/person\/13549984ba8e5f441cc733ed20d7daa4\"},\"headline\":\"Language analysis in Solr &#8211; Ultimate Solr Guide\",\"datePublished\":\"2020-04-04T17:06:26+00:00\",\"dateModified\":\"2020-04-04T17:15:35+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/\"},\"wordCount\":843,\"publisher\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/Language-Analysis-in-Solr.png\",\"keywords\":[\"language detection solr\"],\"articleSection\":[\"Solr\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/\",\"url\":\"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/\",\"name\":\"Language analysis in Solr - Ultimate Solr Guide - Aeologic 
Blog\",\"isPartOf\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/Language-Analysis-in-Solr.png\",\"datePublished\":\"2020-04-04T17:06:26+00:00\",\"dateModified\":\"2020-04-04T17:15:35+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/#primaryimage\",\"url\":\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/Language-Analysis-in-Solr.png\",\"contentUrl\":\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/Language-Analysis-in-Solr.png\",\"width\":1080,\"height\":622},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.aeologic.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Language analysis in Solr &#8211; Ultimate Solr Guide\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/#website\",\"url\":\"https:\/\/www.aeologic.com\/blog\/\",\"name\":\"Aeologic 
Blog\",\"description\":\"Aeologic\",\"publisher\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.aeologic.com\/blog\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/#organization\",\"name\":\"AeoLogic Technologies\",\"url\":\"https:\/\/www.aeologic.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2022\/05\/new-logo-aeo.jpg\",\"contentUrl\":\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2022\/05\/new-logo-aeo.jpg\",\"width\":385,\"height\":162,\"caption\":\"AeoLogic Technologies\"},\"image\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/AeoLogicTech\/\",\"https:\/\/x.com\/aeologictech\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/#\/schema\/person\/13549984ba8e5f441cc733ed20d7daa4\",\"name\":\"Manoj Kumar\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/24ce77602da5eb5715d74a95733f6c7548e2af73f5a493f9bc0bf55f611d025e?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/24ce77602da5eb5715d74a95733f6c7548e2af73f5a493f9bc0bf55f611d025e?s=96&d=mm&r=g\",\"caption\":\"Manoj Kumar\"},\"description\":\"Manoj Kumar is a seasoned Digital Marketing Manager and passionate Tech Blogger with deep expertise in SEO, AI trends, and emerging digital technologies. He writes about innovative solutions that drive growth and transformation across industry. 
Featured on - YOURSTORY | TECHSLING | ELEARNINGINDUSTRY | DATASCIENCECENTRAL | TIMESOFINDIA | MEDIUM | DATAFLOQ\",\"sameAs\":[\"https:\/\/www.aeologic.com\/\",\"https:\/\/www.linkedin.com\/in\/manoj-kumar-rajput\/\"],\"url\":\"https:\/\/www.aeologic.com\/blog\/author\/manoj\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Language analysis in Solr - Ultimate Solr Guide - Aeologic Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/","og_locale":"en_US","og_type":"article","og_title":"Language analysis in Solr - Ultimate Solr Guide - Aeologic Blog","og_description":"Hi All, today I&#8217;m presenting another post on solr pertaining to language analysis. In most business applications, there comes a scenario where the business needs to deal with data in multiple languages, the most common scenario being dealing with customers of different geographies. Solr helps to deal with multiple languages in a unique way. This section contains information about tokenizers and filters related to character set conversion or for use with specific languages. For the European languages, tokenization is fairly straightforward. Tokens are delimited by white space and\/or a relatively small set of punctuation characters. In other languages the tokenization rules are often not so simple. Some European languages may also require special tokenization rules, such as rules for decompounding German words. KeywordMarkerFilterFactory Protects words from being modified by stemmers. A customized protected word list may be specified with the &#8220;protected&#8221; attribute in the schema. Any words in the protected word list will not be modified by any stemmer in Solr. 
KeywordRepeatFilterFactory Emits each token twice, once with the\u00a0KEYWORD\u00a0attribute and once without. If placed before a stemmer, the unstemmed token is preserved at the same position as the stemmed one. Queries matching the original exact term will get a better score while still maintaining the recall benefit of stemming. Another advantage of keeping the original token is that wildcard truncation will work as expected. To configure, add the\u00a0KeywordRepeatFilterFactory\u00a0early in the analysis chain. It is recommended to also include\u00a0RemoveDuplicatesTokenFilterFactory\u00a0to avoid duplicates when tokens are not stemmed. A sample fieldType configuration could look like this: StemmerOverrideFilterFactory Overrides stemming algorithms by applying a custom mapping, then protecting these terms from being modified by stemmers. A customized mapping of words to stems, in a tab-separated file, can be specified with the &#8220;dictionary&#8221; attribute in the schema. Words in this mapping will be stemmed to the stems from the file, and will not be further changed by any stemmer. Dictionary Compound Word Token Filter This filter splits, or\u00a0decompounds, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position. Compound words are most commonly found in Germanic languages. Factory class:\u00a0solr.DictionaryCompoundWordTokenFilterFactory Arguments: dictionary (required) The path of a file that contains a list of simple words, one per line. Blank lines and lines that begin with &#8220;#&#8221; are ignored. This path may be an absolute path, or a path relative to the Solr config directory. minWordSize (integer, default 5) Any token shorter than this is not decompounded. 
minSubwordSize (integer, default 2) Subwords shorter than this are not emitted as tokens. maxSubwordSize (integer, default 15) Subwords longer than this are not emitted as tokens. onlyLongestMatch (true\/false) If true (the default), only the longest matching subwords will generate new tokens. Example: Assume that\u00a0germanwords.txt\u00a0contains at least the following words:\u00a0dumm kopf donau dampf schiff In:\u00a0&#8220;Donaudampfschiff dummkopf&#8221; Tokenizer to Filter:\u00a0&#8220;Donaudampfschiff&#8221;(1), &#8220;dummkopf&#8221;(2), Out:\u00a0&#8220;Donaudampfschiff&#8221;(1), &#8220;Donau&#8221;(1), &#8220;dampf&#8221;(1), &#8220;schiff&#8221;(1), &#8220;dummkopf&#8221;(2), &#8220;dumm&#8221;(2), &#8220;kopf&#8221;(2) Unicode Collation Unicode Collation is a language-sensitive method of sorting text that can also be used for advanced search purposes. Unicode Collation in Solr is fast, because all the work is done at index time. Rather than specifying an analyzer within\u00a0&lt;fieldtype \u2026\u200b class=\"solr.TextField\"&gt;, the\u00a0solr.CollationField\u00a0and\u00a0solr.ICUCollationField\u00a0field type classes provide this functionality.\u00a0solr.ICUCollationField, which is backed by\u00a0the ICU4J library, provides more flexible configuration, has more locales, is significantly faster, and requires less memory and less index space, since its keys are smaller than those produced by the JDK implementation that backs\u00a0solr.CollationField. solr.ICUCollationField\u00a0is included in the Solr\u00a0analysis-extras\u00a0contrib &#8211; see\u00a0solr\/contrib\/analysis-extras\/README.txt\u00a0for instructions on which jars you need to add to your\u00a0SOLR_HOME\/lib\u00a0in order to use it. solr.ICUCollationField\u00a0and\u00a0solr.CollationField\u00a0fields can be created in two ways: Based upon a system collator associated with a Locale. Based upon a tailored\u00a0RuleBasedCollator\u00a0ruleset. 
Arguments for\u00a0solr.ICUCollationField, specified as attributes within the\u00a0&lt;fieldtype&gt;\u00a0element: Using a System collator: locale (required)\u00a0RFC 3066\u00a0locale ID. See\u00a0the ICU locale explorer\u00a0for a list of supported locales. strength Valid values are\u00a0primary,\u00a0secondary,\u00a0tertiary,\u00a0quaternary, or\u00a0identical. See\u00a0Comparison Levels in ICU Collation Concepts\u00a0for more information. decomposition Valid values are\u00a0no\u00a0or\u00a0canonical. See\u00a0Normalization in ICU Collation Concepts\u00a0for more information. Using a Tailored ruleset: custom (required) Path to a UTF-8 text file containing rules supported by the ICU\u00a0RuleBasedCollator strength Valid values are\u00a0primary,\u00a0secondary,\u00a0tertiary,\u00a0quaternary, or\u00a0identical. See\u00a0Comparison Levels in ICU Collation Concepts\u00a0for more information. decomposition Valid values are\u00a0no\u00a0or\u00a0canonical. See\u00a0Normalization in ICU Collation Concepts\u00a0for more information. Expert options: alternate Valid values are\u00a0shifted\u00a0or\u00a0non-ignorable. Can be used to ignore punctuation\/whitespace. caseLevel (true\/false) If true, in combination with\u00a0strength=\"primary\", accents are ignored but case is taken into account. The default is false. See\u00a0CaseLevel in ICU Collation Concepts\u00a0for more information. caseFirst Valid values are\u00a0lower\u00a0or\u00a0upper. Useful to control which is sorted first when case is not ignored. numeric (true\/false) If true, digits are sorted according to numeric value, e.g. foobar-9 sorts before foobar-10. The default is false. variableTop Single character or contraction. Controls what is variable for\u00a0alternate. 
So, this is it for now, will be back with another post very soon.","og_url":"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/","og_site_name":"Aeologic Blog","article_publisher":"https:\/\/www.facebook.com\/AeoLogicTech\/","article_published_time":"2020-04-04T17:06:26+00:00","article_modified_time":"2020-04-04T17:15:35+00:00","og_image":[{"width":1080,"height":622,"url":"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/Language-Analysis-in-Solr.png","type":"image\/png"}],"author":"Manoj Kumar","twitter_card":"summary_large_image","twitter_creator":"@aeologictech","twitter_site":"@aeologictech","twitter_misc":{"Written by":"Manoj Kumar","Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":["Article","BlogPosting"],"@id":"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/#article","isPartOf":{"@id":"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/"},"author":{"name":"Manoj Kumar","@id":"https:\/\/www.aeologic.com\/blog\/#\/schema\/person\/13549984ba8e5f441cc733ed20d7daa4"},"headline":"Language analysis in Solr &#8211; Ultimate Solr Guide","datePublished":"2020-04-04T17:06:26+00:00","dateModified":"2020-04-04T17:15:35+00:00","mainEntityOfPage":{"@id":"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/"},"wordCount":843,"publisher":{"@id":"https:\/\/www.aeologic.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/#primaryimage"},"thumbnailUrl":"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/Language-Analysis-in-Solr.png","keywords":["language detection 
solr"],"articleSection":["Solr"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/","url":"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/","name":"Language analysis in Solr - Ultimate Solr Guide - Aeologic Blog","isPartOf":{"@id":"https:\/\/www.aeologic.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/#primaryimage"},"image":{"@id":"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/#primaryimage"},"thumbnailUrl":"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/Language-Analysis-in-Solr.png","datePublished":"2020-04-04T17:06:26+00:00","dateModified":"2020-04-04T17:15:35+00:00","breadcrumb":{"@id":"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/#primaryimage","url":"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/Language-Analysis-in-Solr.png","contentUrl":"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/04\/Language-Analysis-in-Solr.png","width":1080,"height":622},{"@type":"BreadcrumbList","@id":"https:\/\/www.aeologic.com\/blog\/language-analysis-in-solr-ultimate-solr-guide\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.aeologic.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Language analysis in Solr &#8211; Ultimate Solr Guide"}]},{"@type":"WebSite","@id":"https:\/\/www.aeologic.com\/blog\/#website","url":"https:\/\/www.aeologic.com\/blog\/","name":"Aeologic 
Blog","description":"Aeologic","publisher":{"@id":"https:\/\/www.aeologic.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.aeologic.com\/blog\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.aeologic.com\/blog\/#organization","name":"AeoLogic Technologies","url":"https:\/\/www.aeologic.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.aeologic.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2022\/05\/new-logo-aeo.jpg","contentUrl":"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2022\/05\/new-logo-aeo.jpg","width":385,"height":162,"caption":"AeoLogic Technologies"},"image":{"@id":"https:\/\/www.aeologic.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/AeoLogicTech\/","https:\/\/x.com\/aeologictech"]},{"@type":"Person","@id":"https:\/\/www.aeologic.com\/blog\/#\/schema\/person\/13549984ba8e5f441cc733ed20d7daa4","name":"Manoj Kumar","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.aeologic.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/24ce77602da5eb5715d74a95733f6c7548e2af73f5a493f9bc0bf55f611d025e?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/24ce77602da5eb5715d74a95733f6c7548e2af73f5a493f9bc0bf55f611d025e?s=96&d=mm&r=g","caption":"Manoj Kumar"},"description":"Manoj Kumar is a seasoned Digital Marketing Manager and passionate Tech Blogger with deep expertise in SEO, AI trends, and emerging digital technologies. He writes about innovative solutions that drive growth and transformation across industry. 
Featured on - YOURSTORY | TECHSLING | ELEARNINGINDUSTRY | DATASCIENCECENTRAL | TIMESOFINDIA | MEDIUM | DATAFLOQ","sameAs":["https:\/\/www.aeologic.com\/","https:\/\/www.linkedin.com\/in\/manoj-kumar-rajput\/"],"url":"https:\/\/www.aeologic.com\/blog\/author\/manoj\/"}]}},"_links":{"self":[{"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/posts\/775","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/comments?post=775"}],"version-history":[{"count":0,"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/posts\/775\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/media\/782"}],"wp:attachment":[{"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/media?parent=775"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/categories?post=775"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/tags?post=775"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}