{"id":723,"date":"2020-03-18T11:08:05","date_gmt":"2020-03-18T11:08:05","guid":{"rendered":"https:\/\/www.aeologic.com\/blog\/?p=723"},"modified":"2020-03-18T11:08:31","modified_gmt":"2020-03-18T11:08:31","slug":"all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide","status":"publish","type":"post","link":"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/","title":{"rendered":"All About Indexing and Basic Data Operations &#8211; Part 3 &#8211; Ultimate Solr Guide"},"content":{"rendered":"<p>Hello, everyone! Today we are here with another post to further our discussion of basic indexing operations in Solr. Solr provides a rich mechanism for ingesting documents of varied types, such as PDF and Word. It does this by using the Apache Tika parser.<\/p>\n<h2 class=\"post-title-main\">Uploading Data with Solr Cell using Apache Tika<\/h2>\n<div class=\"paragraph\">\n<p>Solr uses code from the\u00a0<a href=\"http:\/\/lucene.apache.org\/tika\/\">Apache Tika<\/a>\u00a0project to provide a framework for incorporating many different file-format parsers such as\u00a0<a href=\"http:\/\/incubator.apache.org\/pdfbox\/\">Apache PDFBox<\/a>\u00a0and\u00a0<a href=\"http:\/\/poi.apache.org\/index.html\">Apache POI<\/a>\u00a0into Solr itself. 
Working with this framework, Solr\u2019s\u00a0<code>ExtractingRequestHandler<\/code>\u00a0can use Tika to support uploading binary files, including files in popular formats such as Word and PDF, for data extraction and indexing.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>When this framework was under development, it was called the Solr Content Extraction Library or CEL; from that abbreviation came this framework\u2019s name: Solr Cell.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>If you want to supply your own\u00a0<code>ContentHandler<\/code>\u00a0for Solr to use, you can extend the\u00a0<code>ExtractingRequestHandler<\/code>\u00a0and override the\u00a0<code>createFactory()<\/code>\u00a0method. This factory is responsible for constructing the\u00a0<code>SolrContentHandler<\/code>\u00a0that interacts with Tika, and allows literals to override Tika-parsed values. Set the parameter\u00a0<code>literalsOverride<\/code>, which normally defaults to\u00a0<code>true<\/code>, to\u00a0<code>false<\/code>\u00a0to append Tika-parsed values to literal values.<\/p>\n<h2 id=\"key-solr-cell-concepts\" class=\"clickable-header top-level-header\">Key Solr Cell Concepts<\/h2>\n<div class=\"paragraph\">\n<p>When using the Solr Cell framework, it is helpful to keep the following in mind:<\/p>\n<\/div>\n<div class=\"ulist\">\n<ul>\n<li>Tika will automatically attempt to determine the input document type (Word, PDF, HTML) and extract the content appropriately. If you like, you can explicitly specify a MIME type for Tika with the\u00a0<code>stream.type<\/code>\u00a0parameter.<\/li>\n<li>Tika works by producing an XHTML stream that it feeds to a SAX ContentHandler. SAX is a common interface implemented for many different XML parsers. 
For more information, see\u00a0<a class=\"bare\" href=\"http:\/\/www.saxproject.org\/quickstart.html\">http:\/\/www.saxproject.org\/quickstart.html<\/a>.<\/li>\n<li>Solr then responds to Tika\u2019s SAX events and creates the fields to index.<\/li>\n<li>Tika produces metadata such as Title, Subject, and Author according to specifications such as the DublinCore. See\u00a0<a class=\"bare\" href=\"http:\/\/tika.apache.org\/1.17\/formats.html\">http:\/\/tika.apache.org\/1.17\/formats.html<\/a>\u00a0for the file types supported.<\/li>\n<li>Tika adds all the extracted text to the\u00a0<code>content<\/code>\u00a0field.<\/li>\n<li>You can map Tika\u2019s metadata fields to Solr fields.<\/li>\n<li>You can pass in literals for field values. Literals will override Tika-parsed values, including fields in the Tika metadata object, the Tika content field, and any &#8220;captured content&#8221; fields.<\/li>\n<li>You can apply an XPath expression to the Tika XHTML to restrict the content that is produced.<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<p>One can use curl to send a sample PDF file via HTTP POST like below:<\/p>\n<p><img fetchpriority=\"high\" decoding=\"async\" class=\"alignnone size-full wp-image-724\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-70.png\" alt=\"\" width=\"1024\" height=\"223\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-70.png 1024w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-70-300x65.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-70-768x167.png 768w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-70-720x157.png 720w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-70-260x57.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-70-367x80.png 367w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-70-250x54.png 250w\" 
sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<div class=\"paragraph\">\n<p>The command above calls the Extracting Request Handler, uploads the file\u00a0<code>solr-word.pdf<\/code>\u00a0and assigns it the unique ID\u00a0<code>doc1<\/code>. Here\u2019s a closer look at the components of this command:<\/p>\n<\/div>\n<div class=\"ulist\">\n<ul>\n<li>The\u00a0<code>literal.id=doc1<\/code>\u00a0parameter provides the necessary unique ID for the document being indexed.<\/li>\n<li>The\u00a0<code>commit=true<\/code>\u00a0parameter causes Solr to perform a commit after indexing the document, making it immediately searchable. For optimum performance when loading many documents, don\u2019t call the commit command until you are done.<\/li>\n<li>The\u00a0<code>-F<\/code>\u00a0flag instructs curl to POST data using the Content-Type\u00a0<code>multipart\/form-data<\/code>\u00a0and supports the uploading of binary files. The @ symbol instructs curl to upload the attached file.<\/li>\n<li>The argument\u00a0<code>myfile=@tutorial.html<\/code>\u00a0needs a valid path, which can be absolute or relative.<\/li>\n<\/ul>\n<\/div>\n<div class=\"paragraph\">\n<p>You can also use\u00a0<code>bin\/post<\/code>\u00a0to send a PDF file into Solr (without the params, the\u00a0<code>literal.id<\/code>\u00a0parameter would be set to the absolute path to the file):<\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-725\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-71.png\" alt=\"\" width=\"799\" height=\"205\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-71.png 799w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-71-300x77.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-71-768x197.png 768w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-71-720x185.png 720w, 
https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-71-260x67.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-71-312x80.png 312w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-71-250x64.png 250w\" sizes=\"(max-width: 799px) 100vw, 799px\" \/><\/p>\n<p>Now you should be able to execute a query and find that document. You can make a request like:<\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-726\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-72.png\" alt=\"\" width=\"563\" height=\"205\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-72.png 563w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-72-300x109.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-72-260x95.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-72-220x80.png 220w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-72-250x91.png 250w\" sizes=\"(max-width: 563px) 100vw, 563px\" \/><\/p>\n<p>You may notice that although the content of the sample document has been indexed and stored, there are not a lot of metadata fields associated with this document. This is because unknown fields are ignored according to the default parameters configured for the\u00a0<code>\/update\/extract<\/code>\u00a0handler in\u00a0<code>solrconfig.xml<\/code>, and this behavior can be easily changed or overridden. 
For example, to store and see all metadata and content, execute the following:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-727\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-73.png\" alt=\"\" width=\"943\" height=\"205\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-73.png 943w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-73-300x65.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-73-768x167.png 768w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-73-720x157.png 720w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-73-260x57.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-73-368x80.png 368w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-73-250x54.png 250w\" sizes=\"(max-width: 943px) 100vw, 943px\" \/><\/p>\n<p>In this command, the\u00a0<code>uprefix=attr_<\/code>\u00a0parameter causes all generated fields that aren\u2019t defined in the schema to be prefixed with\u00a0<code>attr_<\/code>, which is a dynamic field that is stored and indexed.<\/p>\n<p>This command allows you to query the document using an attribute, as in:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-728\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-74.png\" alt=\"\" width=\"698\" height=\"205\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-74.png 698w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-74-300x88.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-74-260x76.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-74-272x80.png 272w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-74-250x73.png 250w\" 
sizes=\"(max-width: 698px) 100vw, 698px\" \/><\/p>\n<h2 id=\"solr-cell-input-parameters\" class=\"clickable-header top-level-header\">Solr Cell Input Parameters<\/h2>\n<p><code>capture<\/code><\/p>\n<p>Captures XHTML elements with the specified name for a supplementary addition to the Solr document. This parameter can be useful for copying chunks of the XHTML into a separate field. For instance, it could be used to grab paragraphs (<code>&lt;p&gt;<\/code>) and index them into a separate field. Note that content is still also captured into the overall &#8220;content&#8221; field.<\/p>\n<p><code>captureAttr<\/code><\/p>\n<p>Indexes attributes of the Tika XHTML elements into separate fields, named after the element. For example, if set to <code>true<\/code> when extracting from HTML, Tika can return the href attributes in &lt;a&gt; tags as fields named &#8220;a&#8221;.<\/p>\n<p><code>commitWithin<\/code><\/p>\n<p>Adds the document within the specified number of milliseconds.<\/p>\n<p><code>date.formats<\/code><\/p>\n<p>Defines the date format patterns to identify in the documents.<\/p>\n<p><code>defaultField<\/code><\/p>\n<p>If the\u00a0<code>uprefix<\/code>\u00a0parameter (see below) is not specified and a field cannot be determined, the default field will be used.<\/p>\n<p><code>extractOnly<\/code><\/p>\n<p>Default is\u00a0<code>false<\/code>. If\u00a0<code>true<\/code>, returns the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a string in the response. When viewing manually, it may be useful to use a response format other than XML to aid in viewing the embedded XHTML tags. 
For an example, see\u00a0<a class=\"bare\" href=\"http:\/\/wiki.apache.org\/solr\/TikaExtractOnlyExampleOutput\">http:\/\/wiki.apache.org\/solr\/TikaExtractOnlyExampleOutput<\/a>.<\/p>\n<p><code>extractFormat<\/code><\/p>\n<p>Controls the serialization format of the extracted content. The default is\u00a0<code>xml<\/code>; the other option is\u00a0<code>text<\/code>. The\u00a0<code>xml<\/code>\u00a0format is actually XHTML, the same format that results from passing the\u00a0<code>-x<\/code>\u00a0option to the Tika command line application, while the text format is like that produced by Tika\u2019s\u00a0<code>-t<\/code>\u00a0option. This parameter is valid only if\u00a0<code>extractOnly<\/code>\u00a0is set to true.<\/p>\n<p><code>fmap.<em>source_field<\/em><\/code><\/p>\n<p>Maps (moves) one field name to another. The\u00a0<code>source_field<\/code>\u00a0must be a field in incoming documents, and the value is the Solr field to map to. Example:\u00a0<code>fmap.content=text<\/code>\u00a0causes the data in the\u00a0<code>content<\/code>\u00a0field generated by Tika to be moved to Solr\u2019s\u00a0<code>text<\/code>\u00a0field.<\/p>\n<p><code>ignoreTikaException<\/code><\/p>\n<p>If\u00a0<code>true<\/code>, exceptions found during processing will be skipped. Any metadata available, however, will be indexed.<\/p>\n<p><code>literal.<em>fieldname<\/em><\/code><\/p>\n<p>Populates the named field with the specified value for each document. The data can be multivalued if the field is multivalued.<\/p>\n<p><code>literalsOverride<\/code><\/p>\n<p>If\u00a0<code>true<\/code>\u00a0(the default), literal field values will override other values with the same field name. If\u00a0<code>false<\/code>, literal values defined with\u00a0<code>literal.<em>fieldname<\/em><\/code>\u00a0will be appended to data already in the fields extracted from Tika. 
If setting\u00a0<code>literalsOverride<\/code>\u00a0to\u00a0<code>false<\/code>, the field must be multivalued.<\/p>\n<p><code>lowernames<\/code><\/p>\n<p>Values are\u00a0<code>true<\/code>\u00a0or\u00a0<code>false<\/code>. If\u00a0<code>true<\/code>, all field names will be mapped to lowercase with underscores, if needed. For example, &#8220;Content-Type&#8221; would be mapped to &#8220;content_type.&#8221;<\/p>\n<p><code>multipartUploadLimitInKB<\/code><\/p>\n<p>Defines the size, in KB, of documents to allow. This is useful when uploading very large documents.<\/p>\n<p><code>passwordsFile<\/code><\/p>\n<p>Defines the path and name of a file that maps file names to passwords.<\/p>\n<p><code>resource.name<\/code><\/p>\n<p>Specifies the optional name of the file. Tika can use it as a hint for detecting a file\u2019s MIME type.<\/p>\n<p><code>resource.password<\/code><\/p>\n<p>Defines a password to use for a password-protected PDF or OOXML file.<\/p>\n<p><code>tika.config<\/code><\/p>\n<p>Defines the path and name of a customized Tika configuration file. This is only required if you have customized your Tika implementation.<\/p>\n<p><code>uprefix<\/code><\/p>\n<p>Prefixes all fields that are not defined in the schema with the given prefix. This is very useful when combined with dynamic field definitions. 
Example:\u00a0<code>uprefix=ignored_<\/code>\u00a0would effectively ignore all unknown fields generated by Tika, given that the example schema contains\u00a0<code>&lt;dynamicField name=\"ignored_*\" type=\"ignored\"\/&gt;<\/code>.<\/p>\n<p><code>xpath<\/code><\/p>\n<p>When extracting, only return Tika XHTML content that satisfies the given XPath expression.<\/p>\n<h2 id=\"order-of-operations\" class=\"clickable-header top-level-header\">Order of Operations<\/h2>\n<div class=\"paragraph\">\n<p>Here is the order in which the Solr Cell framework, using the Extracting Request Handler and Tika, processes its input.<\/p>\n<\/div>\n<div class=\"olist arabic\">\n<ol class=\"arabic\">\n<li>Tika generates fields or passes them in as literals specified by\u00a0<code>literal.&lt;fieldname&gt;=&lt;value&gt;<\/code>. If\u00a0<code>literalsOverride=false<\/code>, literals will be appended as multiple values to the Tika-generated field.<\/li>\n<li>If\u00a0<code>lowernames=true<\/code>, Tika maps fields to lowercase.<\/li>\n<li>Tika applies the mapping rules specified by\u00a0<code>fmap.<em>source<\/em>=<em>target<\/em><\/code>\u00a0parameters.<\/li>\n<li>If\u00a0<code>uprefix<\/code>\u00a0is specified, any unknown field names are prefixed with that value, else if\u00a0<code>defaultField<\/code>\u00a0is specified, any unknown fields are copied to the default field.<\/li>\n<\/ol>\n<\/div>\n<h2 id=\"configuring-the-solr-extractingrequesthandler\" class=\"clickable-header top-level-header\">Configuring the Solr ExtractingRequestHandler<\/h2>\n<p>Add the following dependencies to the <code>solrconfig.xml<\/code> file.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-729\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-75.png\" alt=\"\" width=\"833\" height=\"223\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-75.png 833w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-75-300x80.png 300w, 
https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-75-768x206.png 768w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-75-720x193.png 720w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-75-260x70.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-75-299x80.png 299w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-75-250x67.png 250w\" sizes=\"(max-width: 833px) 100vw, 833px\" \/><\/p>\n<p>You can then configure the\u00a0<code>ExtractingRequestHandler<\/code>\u00a0in\u00a0<code>solrconfig.xml<\/code>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-730\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-76.png\" alt=\"\" width=\"1024\" height=\"493\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-76.png 1024w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-76-300x144.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-76-768x370.png 768w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-76-720x347.png 720w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-76-260x125.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-76-166x80.png 166w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-76-250x120.png 250w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<div class=\"paragraph\">\n<p>In the defaults section, we are mapping Tika\u2019s Last-Modified Metadata attribute to a field named\u00a0<code>last_modified<\/code>. We are also telling it to ignore undeclared fields. These are all overridden parameters.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>The\u00a0<code>tika.config<\/code>\u00a0entry points to a file containing a Tika configuration. 
The\u00a0<code>date.formats<\/code>\u00a0entry allows you to specify various\u00a0<code>java.text.SimpleDateFormat<\/code>\u00a0patterns for converting extracted input to a Date. Solr comes configured with the following date formats (see the\u00a0<code>DateUtil<\/code>\u00a0in Solr):<\/p>\n<\/div>\n<div class=\"ulist\">\n<ul>\n<li><code>yyyy-MM-dd'T'HH:mm:ss'Z'<\/code><\/li>\n<li><code>yyyy-MM-dd'T'HH:mm:ss<\/code><\/li>\n<li><code>yyyy-MM-dd<\/code><\/li>\n<li><code>yyyy-MM-dd hh:mm:ss<\/code><\/li>\n<li><code>yyyy-MM-dd HH:mm:ss<\/code><\/li>\n<li><code>EEE MMM d hh:mm:ss z yyyy<\/code><\/li>\n<li><code>EEE, dd MMM yyyy HH:mm:ss zzz<\/code><\/li>\n<li><code>EEEE, dd-MMM-yy HH:mm:ss zzz<\/code><\/li>\n<li><code>EEE MMM d HH:mm:ss yyyy<\/code><\/li>\n<\/ul>\n<h3 id=\"parser-specific-properties\" class=\"clickable-header\">Parser-Specific Properties<\/h3>\n<p>Parsers used by Tika may have specific properties to govern how data is extracted. For instance, when using the Tika library from a Java program, the PDFParserConfig class has a method\u00a0<code>setSortByPosition(boolean)<\/code>\u00a0that can extract vertically oriented text. To access that method via configuration with the ExtractingRequestHandler, one can add the\u00a0<code>parseContext.config<\/code>\u00a0property to the\u00a0<code>solrconfig.xml<\/code>\u00a0file (see above) and then set properties in Tika\u2019s PDFParserConfig as below. 
Consult the Tika Java API documentation for configuration parameters that can be set for any particular parsers that require this level of control.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-731\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-77.png\" alt=\"\" width=\"1024\" height=\"331\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-77.png 1024w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-77-300x97.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-77-768x248.png 768w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-77-720x233.png 720w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-77-260x84.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-77-247x80.png 247w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-77-250x81.png 250w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<h3 id=\"multi-core-configuration\" class=\"clickable-header\">Multi-Core Configuration<\/h3>\n<p>For a multi-core configuration, you can specify\u00a0<code>sharedLib='lib'<\/code>\u00a0in the\u00a0<code>&lt;solr\/&gt;<\/code>\u00a0section of\u00a0<code>solr.xml<\/code>\u00a0and place the necessary jar files there.<\/p>\n<h2 id=\"indexing-encrypted-documents-with-the-extractingupdaterequesthandler\" class=\"clickable-header top-level-header\">Indexing Encrypted Documents with the ExtractingRequestHandler<\/h2>\n<p>The ExtractingRequestHandler will decrypt encrypted files and index their content if you supply a password either in\u00a0<code>resource.password<\/code>\u00a0on the request or in a\u00a0<code>passwordsFile<\/code>\u00a0file.<\/p>\n<p>In the case of\u00a0<code>passwordsFile<\/code>, the file supplied must be formatted so there is one line per rule. 
Each rule contains a file name regular expression, followed by &#8220;=&#8221;, then the password in clear-text. Because the passwords are in clear-text, the file should have strict access restrictions.<\/p>\n<h2 id=\"solr-cell-examples\" class=\"clickable-header top-level-header\">Solr Cell Examples<\/h2>\n<div class=\"sectionbody\">\n<div class=\"sect2\">\n<h3 id=\"metadata-created-by-tika\" class=\"clickable-header\">Metadata Created by Tika<\/h3>\n<div class=\"paragraph\">\n<p>As mentioned before, Tika produces metadata about the document. Metadata describes different aspects of a document, such as the author\u2019s name, the number of pages, the file size, and so on. The metadata produced depends on the type of document submitted. For instance, PDFs have different metadata than Word documents do.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>In addition to Tika\u2019s metadata, Solr adds the following metadata (defined in\u00a0<code>ExtractingMetadataConstants<\/code>):<\/p>\n<p><code>stream_name<\/code><\/p>\n<p>The name of the Content Stream as uploaded to Solr. Depending on how the file is uploaded, this may or may not be set.<\/p>\n<p><code>stream_source_info<\/code><\/p>\n<p>Any source info about the stream. 
(See the section on Content Streams later in this section.)<\/p>\n<p><code>stream_size<\/code><\/p>\n<p>The size of the stream in bytes.<\/p>\n<p><code>stream_content_type<\/code><\/p>\n<p>The content type of the stream, if available.<\/p>\n<h3 id=\"examples-of-uploads-using-the-extracting-request-handler\" class=\"clickable-header\">Examples of Uploads Using the Extracting Request Handler<\/h3>\n<h4 id=\"capture-and-mapping\">Capture and Mapping<\/h4>\n<div class=\"paragraph\">\n<p>The command below captures\u00a0<code>&lt;div&gt;<\/code>\u00a0tags separately, and then maps all the instances of that field to a dynamic field named\u00a0<code>foo_t<\/code>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-732\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-78.png\" alt=\"\" width=\"1024\" height=\"223\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-78.png 1024w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-78-300x65.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-78-768x167.png 768w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-78-720x157.png 720w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-78-260x57.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-78-367x80.png 367w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-78-250x54.png 250w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<h4 id=\"using-literals-to-define-your-own-metadata\">Using Literals to Define Your Own Metadata<\/h4>\n<p>To add in your own metadata, pass in the literal parameter along with the file:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-733\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-79.png\" alt=\"\" width=\"1024\" height=\"241\" 
srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-79.png 1024w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-79-300x71.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-79-768x181.png 768w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-79-720x169.png 720w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-79-260x61.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-79-340x80.png 340w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-79-250x59.png 250w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<h4 id=\"xpath-expressions\">XPath Expressions<\/h4>\n<p>The example below passes in an XPath expression to restrict the XHTML returned by Tika:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-734\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-80.png\" alt=\"\" width=\"1024\" height=\"241\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-80.png 1024w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-80-300x71.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-80-768x181.png 768w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-80-720x169.png 720w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-80-260x61.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-80-340x80.png 340w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-80-250x59.png 250w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<h3 id=\"extracting-data-without-indexing-it\" class=\"clickable-header\">Extracting Data without Indexing It<\/h3>\n<p>Solr allows you to extract data without indexing. 
You might want to do this if you\u2019re using Solr solely as an extraction server or if you\u2019re interested in testing Solr extraction.<\/p>\n<p>The example below sets the\u00a0<code>extractOnly=true<\/code>\u00a0parameter to extract data without indexing it.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-735\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-81.png\" alt=\"\" width=\"1024\" height=\"223\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-81.png 1024w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-81-300x65.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-81-768x167.png 768w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-81-720x157.png 720w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-81-260x57.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-81-367x80.png 367w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-81-250x54.png 250w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>The output includes XML generated by Tika (and further escaped by Solr\u2019s XML) using a different output format to make it more readable (<code>-out yes<\/code>\u00a0instructs the tool to echo Solr\u2019s output to the console):<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-736\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-82.png\" alt=\"\" width=\"1024\" height=\"223\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-82.png 1024w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-82-300x65.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-82-768x167.png 768w, 
https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-82-720x157.png 720w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-82-260x57.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-82-367x80.png 367w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-82-250x54.png 250w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<h2 id=\"sending-documents-to-solr-with-a-post\" class=\"clickable-header top-level-header\">Sending Documents to Solr with a POST<\/h2>\n<p>The example below streams the file as the body of the POST, which does not, then, provide information to Solr about the name of the file.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-737\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-83.png\" alt=\"\" width=\"1024\" height=\"223\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-83.png 1024w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-83-300x65.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-83-768x167.png 768w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-83-720x157.png 720w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-83-260x57.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-83-367x80.png 367w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-83-250x54.png 250w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<h2 id=\"sending-documents-to-solr-with-solr-cell-and-solrj\" class=\"clickable-header top-level-header\">Sending Documents to Solr with Solr Cell and SolrJ<\/h2>\n<div class=\"paragraph\">\n<p>SolrJ is a Java client that you can use to add documents to the index, update the index, or query the index. 
You\u2019ll find more information on SolrJ in\u00a0<a href=\"https:\/\/lucene.apache.org\/solr\/guide\/7_4\/client-apis.html#client-apis\">Client APIs<\/a>.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>Here\u2019s an example of using Solr Cell and SolrJ to add documents to a Solr index.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>First, let\u2019s use SolrJ to create a new SolrClient, then we\u2019ll construct a request containing a ContentStream (essentially a wrapper around a file) and send it to Solr:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-738\" src=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-84.png\" alt=\"\" width=\"976\" height=\"349\" srcset=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-84.png 976w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-84-300x107.png 300w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-84-768x275.png 768w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-84-720x257.png 720w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-84-260x93.png 260w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-84-224x80.png 224w, https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/03\/carbon-84-250x89.png 250w\" sizes=\"(max-width: 976px) 100vw, 976px\" \/><\/p>\n<div class=\"paragraph\">\n<p>The sample code above calls the extract command, but you can easily substitute other commands that are supported by Solr Cell. The key class to use is the\u00a0<code>ContentStreamUpdateRequest<\/code>, which makes sure the ContentStreams are set properly. SolrJ takes care of the rest.<\/p>\n<\/div>\n<div class=\"paragraph\">\n<p>Note that the\u00a0<code>ContentStreamUpdateRequest<\/code> is not specific to Solr Cell. 
You can send CSV to the CSV Update handler and to any other Request Handler that works with Content Streams for updates.<\/p>\n<p>So, this is it for today. Stay tuned for another post.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Hello, Everyone! Today we are here with another post to further our discussion about basic indexing operations in solr. Solr provides a rich mechanism by which it can absorb documents of varied types such as PDF, word etc. The way it does that is by using Apache Tika Parser. Uploading Data with Solr Cell using Apache Tika Solr uses code from the\u00a0Apache Tika\u00a0project to provide a framework for incorporating many different file-format parsers such as\u00a0Apache PDFBox\u00a0and\u00a0Apache POI\u00a0into Solr itself. Working with this framework, Solr\u2019s\u00a0ExtractingRequestHandler\u00a0can use Tika to support uploading binary files, including files in popular formats such as Word and PDF, for data extraction and indexing. When this framework was under development, it was called the Solr Content Extraction Library or CEL; from that abbreviation came this framework\u2019s name: Solr Cell. If you want to supply your own\u00a0ContentHandler\u00a0for Solr to use, you can extend the\u00a0ExtractingRequestHandler\u00a0and override the\u00a0createFactory()\u00a0method. This factory is responsible for constructing the\u00a0SolrContentHandler\u00a0that interacts with Tika, and allows literals to override Tika-parsed values. Set the parameter\u00a0literalsOverride, which normally defaults to\u00a0true, to\u00a0false\u00a0to append Tika-parsed values to literal values. Key Solr Cell Concepts When using the Solr Cell framework, it is helpful to keep the following in mind: Tika will automatically attempt to determine the input document type (Word, PDF, HTML) and extract the content appropriately. 
If you like, you can explicitly specify a MIME type for Tika with the\u00a0stream.type\u00a0parameter. Tika works by producing an XHTML stream that it feeds to a SAX ContentHandler. SAX is a common interface implemented for many different XML parsers. For more information, see\u00a0http:\/\/www.saxproject.org\/quickstart.html. Solr then responds to Tika\u2019s SAX events and creates the fields to index. Tika produces metadata such as Title, Subject, and Author according to specifications such as the DublinCore. See\u00a0http:\/\/tika.apache.org\/1.17\/formats.html\u00a0for the file types supported. Tika adds all the extracted text to the\u00a0content\u00a0field. You can map Tika\u2019s metadata fields to Solr fields. You can pass in literals for field values. Literals will override Tika-parsed values, including fields in the Tika metadata object, the Tika content field, and any &#8220;captured content&#8221; fields. You can apply an XPath expression to the Tika XHTML to restrict the content that is produced. One can use curl to send a sample PDF file via HTTP POST like below: The URL above calls the Extracting Request Handler, uploads the file\u00a0solr-word.pdf\u00a0and assigns it the unique ID\u00a0doc1. Here\u2019s a closer look at the components of this command: The\u00a0literal.id=doc1\u00a0parameter provides the necessary unique ID for the document being indexed. The\u00a0commit=true parameter\u00a0causes Solr to perform a commit after indexing the document, making it immediately searchable. For optimum performance when loading many documents, don\u2019t call the commit command until you are done. The\u00a0-F\u00a0flag instructs curl to POST data using the Content-Type\u00a0multipart\/form-data\u00a0and supports the uploading of binary files. The @ symbol instructs curl to upload the attached file. The argument\u00a0myfile=@tutorial.html\u00a0needs a valid path, which can be absolute or relative. 
You can also use\u00a0bin\/post\u00a0to send a PDF file into Solr (without the params, the\u00a0literal.id\u00a0parameter would be set to the absolute path to the file): Now you should be able to execute a query and find that document. You can make a request like: You may notice that although the content of the sample document has been indexed and stored, there are not a lot of metadata fields associated with this document. This is because unknown fields are ignored according to the default parameters configured for the\u00a0\/update\/extract\u00a0handler in\u00a0solrconfig.xml, and this behavior can be easily changed or overridden. For example, to store and see all metadata and content, execute the following: In this command, the\u00a0uprefix=attr_\u00a0parameter causes all generated fields that aren\u2019t defined in the schema to be prefixed with\u00a0attr_, which is a dynamic field that is stored and indexed. This command allows you to query the document using an attribute, as in: Solr Cell Input Parameters capture Captures XHTML elements with the specified name for a supplementary addition to the Solr document. This parameter can be useful for copying chunks of the XHTML into a separate field. For instance, it could be used to grab paragraphs (&lt;p&gt;) and index them into a separate field. Note that content is still also captured into the overall &#8220;content&#8221; field. captureAttrIndexes attributes of the Tika XHTML elements into separate fields, named after the element. If set to true, for example, when extracting from HTML, Tika can return the href attributes in &lt;a&gt; tags as fields named &#8220;a&#8221;. commitWithin Add the document within the specified number of milliseconds. date.formatsDefines the date format patterns to identify in the documents. defaultFieldIf the\u00a0uprefix\u00a0parameter (see below) is not specified and a field cannot be determined, the default field will be used. extractOnlyDefault is\u00a0false. 
If\u00a0true, returns the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a string in the response. When viewing manually, it may be useful to use a response format other than XML to aid in viewing the embedded XHTML tags. extractOnly Default is\u00a0false. If\u00a0true, returns the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a string in the response. When viewing manually, it may be useful to use a response format other than XML to aid in viewing the embedded XHTML tags. For an example, see\u00a0http:\/\/wiki.apache.org\/solr\/TikaExtractOnlyExampleOutput. extractFormatThe default is\u00a0xml, but the other option is\u00a0text. Controls the serialization format of the extract content. The\u00a0xml\u00a0format is actually XHTML, the same format that results from passing the\u00a0-x\u00a0command to the Tika command line application, while the text format is like that produced by Tika\u2019s\u00a0-t\u00a0command. This parameter is valid only if\u00a0extractOnly\u00a0is set to true. fmap.source_fieldMaps (moves) one field name to another. The\u00a0source_field\u00a0must be a field in incoming documents, and the value is the Solr field to map to. Example:\u00a0fmap.content=text\u00a0causes the data in the\u00a0content\u00a0field generated by Tika to be moved to the Solr\u2019s\u00a0text\u00a0field. ignoreTikaExceptionIf\u00a0true, exceptions found during processing will be skipped. Any metadata available, however, will be indexed. literal.fieldnamePopulates a field with the name supplied with the specified value for each document. The data can be multivalued if the field is multivalued. 
literalsOverride If\u00a0true\u00a0(the default), literal [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":635,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[41],"tags":[],"class_list":["post-723","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-solr"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.1 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>All About Indexing and Basic Data Operations - Part 3 - Ultimate Solr Guide - Aeologic Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"All About Indexing and Basic Data Operations - Part 3 - Ultimate Solr Guide - Aeologic Blog\" \/>\n<meta property=\"og:description\" content=\"Hello, Everyone! Today we are here with another post to further our discussion about basic indexing operations in solr. Solr provides a rich mechanism by which it can absorb documents of varied types such as PDF, word etc. The way it does that is by using Apache Tika Parser. Uploading Data with Solr Cell using Apache Tika Solr uses code from the\u00a0Apache Tika\u00a0project to provide a framework for incorporating many different file-format parsers such as\u00a0Apache PDFBox\u00a0and\u00a0Apache POI\u00a0into Solr itself. Working with this framework, Solr\u2019s\u00a0ExtractingRequestHandler\u00a0can use Tika to support uploading binary files, including files in popular formats such as Word and PDF, for data extraction and indexing. 
When this framework was under development, it was called the Solr Content Extraction Library or CEL; from that abbreviation came this framework\u2019s name: Solr Cell. If you want to supply your own\u00a0ContentHandler\u00a0for Solr to use, you can extend the\u00a0ExtractingRequestHandler\u00a0and override the\u00a0createFactory()\u00a0method. This factory is responsible for constructing the\u00a0SolrContentHandler\u00a0that interacts with Tika, and allows literals to override Tika-parsed values. Set the parameter\u00a0literalsOverride, which normally defaults to\u00a0true, to\u00a0false\u00a0to append Tika-parsed values to literal values. Key Solr Cell Concepts When using the Solr Cell framework, it is helpful to keep the following in mind: Tika will automatically attempt to determine the input document type (Word, PDF, HTML) and extract the content appropriately. If you like, you can explicitly specify a MIME type for Tika with the\u00a0stream.type\u00a0parameter. Tika works by producing an XHTML stream that it feeds to a SAX ContentHandler. SAX is a common interface implemented for many different XML parsers. For more information, see\u00a0http:\/\/www.saxproject.org\/quickstart.html. Solr then responds to Tika\u2019s SAX events and creates the fields to index. Tika produces metadata such as Title, Subject, and Author according to specifications such as the DublinCore. See\u00a0http:\/\/tika.apache.org\/1.17\/formats.html\u00a0for the file types supported. Tika adds all the extracted text to the\u00a0content\u00a0field. You can map Tika\u2019s metadata fields to Solr fields. You can pass in literals for field values. Literals will override Tika-parsed values, including fields in the Tika metadata object, the Tika content field, and any &#8220;captured content&#8221; fields. You can apply an XPath expression to the Tika XHTML to restrict the content that is produced. 
One can use curl to send a sample PDF file via HTTP POST like below: The URL above calls the Extracting Request Handler, uploads the file\u00a0solr-word.pdf\u00a0and assigns it the unique ID\u00a0doc1. Here\u2019s a closer look at the components of this command: The\u00a0literal.id=doc1\u00a0parameter provides the necessary unique ID for the document being indexed. The\u00a0commit=true parameter\u00a0causes Solr to perform a commit after indexing the document, making it immediately searchable. For optimum performance when loading many documents, don\u2019t call the commit command until you are done. The\u00a0-F\u00a0flag instructs curl to POST data using the Content-Type\u00a0multipart\/form-data\u00a0and supports the uploading of binary files. The @ symbol instructs curl to upload the attached file. The argument\u00a0myfile=@tutorial.html\u00a0needs a valid path, which can be absolute or relative. You can also use\u00a0bin\/post\u00a0to send a PDF file into Solr (without the params, the\u00a0literal.id\u00a0parameter would be set to the absolute path to the file): Now you should be able to execute a query and find that document. You can make a request like: You may notice that although the content of the sample document has been indexed and stored, there are not a lot of metadata fields associated with this document. This is because unknown fields are ignored according to the default parameters configured for the\u00a0\/update\/extract\u00a0handler in\u00a0solrconfig.xml, and this behavior can be easily changed or overridden. For example, to store and see all metadata and content, execute the following: In this command, the\u00a0uprefix=attr_\u00a0parameter causes all generated fields that aren\u2019t defined in the schema to be prefixed with\u00a0attr_, which is a dynamic field that is stored and indexed. 
This command allows you to query the document using an attribute, as in: Solr Cell Input Parameters capture Captures XHTML elements with the specified name for a supplementary addition to the Solr document. This parameter can be useful for copying chunks of the XHTML into a separate field. For instance, it could be used to grab paragraphs (&lt;p&gt;) and index them into a separate field. Note that content is still also captured into the overall &#8220;content&#8221; field. captureAttrIndexes attributes of the Tika XHTML elements into separate fields, named after the element. If set to true, for example, when extracting from HTML, Tika can return the href attributes in &lt;a&gt; tags as fields named &#8220;a&#8221;. commitWithin Add the document within the specified number of milliseconds. date.formatsDefines the date format patterns to identify in the documents. defaultFieldIf the\u00a0uprefix\u00a0parameter (see below) is not specified and a field cannot be determined, the default field will be used. extractOnlyDefault is\u00a0false. If\u00a0true, returns the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a string in the response. When viewing manually, it may be useful to use a response format other than XML to aid in viewing the embedded XHTML tags. extractOnly Default is\u00a0false. If\u00a0true, returns the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a string in the response. When viewing manually, it may be useful to use a response format other than XML to aid in viewing the embedded XHTML tags. For an example, see\u00a0http:\/\/wiki.apache.org\/solr\/TikaExtractOnlyExampleOutput. extractFormatThe default is\u00a0xml, but the other option is\u00a0text. Controls the serialization format of the extract content. 
The\u00a0xml\u00a0format is actually XHTML, the same format that results from passing the\u00a0-x\u00a0command to the Tika command line application, while the text format is like that produced by Tika\u2019s\u00a0-t\u00a0command. This parameter is valid only if\u00a0extractOnly\u00a0is set to true. fmap.source_fieldMaps (moves) one field name to another. The\u00a0source_field\u00a0must be a field in incoming documents, and the value is the Solr field to map to. Example:\u00a0fmap.content=text\u00a0causes the data in the\u00a0content\u00a0field generated by Tika to be moved to the Solr\u2019s\u00a0text\u00a0field. ignoreTikaExceptionIf\u00a0true, exceptions found during processing will be skipped. Any metadata available, however, will be indexed. literal.fieldnamePopulates a field with the name supplied with the specified value for each document. The data can be multivalued if the field is multivalued. literalsOverride If\u00a0true\u00a0(the default), literal [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/\" \/>\n<meta property=\"og:site_name\" content=\"Aeologic Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/AeoLogicTech\/\" \/>\n<meta property=\"article:published_time\" content=\"2020-03-18T11:08:05+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-03-18T11:08:31+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/02\/Indexing-and-Basic-Data-Operations.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1080\" \/>\n\t<meta property=\"og:image:height\" content=\"622\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Manoj Kumar\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@aeologictech\" \/>\n<meta 
name=\"twitter:site\" content=\"@aeologictech\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Manoj Kumar\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":[\"Article\",\"BlogPosting\"],\"@id\":\"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/\"},\"author\":{\"name\":\"Manoj Kumar\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/#\/schema\/person\/13549984ba8e5f441cc733ed20d7daa4\"},\"headline\":\"All About Indexing and Basic Data Operations &#8211; Part 3 &#8211; Ultimate Solr Guide\",\"datePublished\":\"2020-03-18T11:08:05+00:00\",\"dateModified\":\"2020-03-18T11:08:31+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/\"},\"wordCount\":2113,\"publisher\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/02\/Indexing-and-Basic-Data-Operations.png\",\"articleSection\":[\"Solr\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/\",\"url\":\"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/\",\"name\":\"All About Indexing and Basic Data Operations - Part 3 - Ultimate Solr Guide - Aeologic 
Blog\",\"isPartOf\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/02\/Indexing-and-Basic-Data-Operations.png\",\"datePublished\":\"2020-03-18T11:08:05+00:00\",\"dateModified\":\"2020-03-18T11:08:31+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/#primaryimage\",\"url\":\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/02\/Indexing-and-Basic-Data-Operations.png\",\"contentUrl\":\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/02\/Indexing-and-Basic-Data-Operations.png\",\"width\":1080,\"height\":622},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.aeologic.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"All About Indexing and Basic Data Operations &#8211; Part 3 &#8211; Ultimate Solr Guide\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/#website\",\"url\":\"https:\/\/www.aeologic.com\/blog\/\",\"name\":\"Aeologic 
Blog\",\"description\":\"Aeologic\",\"publisher\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.aeologic.com\/blog\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/#organization\",\"name\":\"AeoLogic Technologies\",\"url\":\"https:\/\/www.aeologic.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2022\/05\/new-logo-aeo.jpg\",\"contentUrl\":\"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2022\/05\/new-logo-aeo.jpg\",\"width\":385,\"height\":162,\"caption\":\"AeoLogic Technologies\"},\"image\":{\"@id\":\"https:\/\/www.aeologic.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/AeoLogicTech\/\",\"https:\/\/x.com\/aeologictech\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/#\/schema\/person\/13549984ba8e5f441cc733ed20d7daa4\",\"name\":\"Manoj Kumar\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.aeologic.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/24ce77602da5eb5715d74a95733f6c7548e2af73f5a493f9bc0bf55f611d025e?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/24ce77602da5eb5715d74a95733f6c7548e2af73f5a493f9bc0bf55f611d025e?s=96&d=mm&r=g\",\"caption\":\"Manoj Kumar\"},\"description\":\"Manoj Kumar is a seasoned Digital Marketing Manager and passionate Tech Blogger with deep expertise in SEO, AI trends, and emerging digital technologies. He writes about innovative solutions that drive growth and transformation across industry. 
Featured on - YOURSTORY | TECHSLING | ELEARNINGINDUSTRY | DATASCIENCECENTRAL | TIMESOFINDIA | MEDIUM | DATAFLOQ\",\"sameAs\":[\"https:\/\/www.aeologic.com\/\",\"https:\/\/www.linkedin.com\/in\/manoj-kumar-rajput\/\"],\"url\":\"https:\/\/www.aeologic.com\/blog\/author\/manoj\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"All About Indexing and Basic Data Operations - Part 3 - Ultimate Solr Guide - Aeologic Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/","og_locale":"en_US","og_type":"article","og_title":"All About Indexing and Basic Data Operations - Part 3 - Ultimate Solr Guide - Aeologic Blog","og_description":"Hello, Everyone! Today we are here with another post to further our discussion about basic indexing operations in solr. Solr provides a rich mechanism by which it can absorb documents of varied types such as PDF, word etc. The way it does that is by using Apache Tika Parser. Uploading Data with Solr Cell using Apache Tika Solr uses code from the\u00a0Apache Tika\u00a0project to provide a framework for incorporating many different file-format parsers such as\u00a0Apache PDFBox\u00a0and\u00a0Apache POI\u00a0into Solr itself. Working with this framework, Solr\u2019s\u00a0ExtractingRequestHandler\u00a0can use Tika to support uploading binary files, including files in popular formats such as Word and PDF, for data extraction and indexing. When this framework was under development, it was called the Solr Content Extraction Library or CEL; from that abbreviation came this framework\u2019s name: Solr Cell. If you want to supply your own\u00a0ContentHandler\u00a0for Solr to use, you can extend the\u00a0ExtractingRequestHandler\u00a0and override the\u00a0createFactory()\u00a0method. 
This factory is responsible for constructing the\u00a0SolrContentHandler\u00a0that interacts with Tika, and allows literals to override Tika-parsed values. Set the parameter\u00a0literalsOverride, which normally defaults to\u00a0true, to\u00a0false\u00a0to append Tika-parsed values to literal values. Key Solr Cell Concepts When using the Solr Cell framework, it is helpful to keep the following in mind: Tika will automatically attempt to determine the input document type (Word, PDF, HTML) and extract the content appropriately. If you like, you can explicitly specify a MIME type for Tika with the\u00a0stream.type\u00a0parameter. Tika works by producing an XHTML stream that it feeds to a SAX ContentHandler. SAX is a common interface implemented for many different XML parsers. For more information, see\u00a0http:\/\/www.saxproject.org\/quickstart.html. Solr then responds to Tika\u2019s SAX events and creates the fields to index. Tika produces metadata such as Title, Subject, and Author according to specifications such as the DublinCore. See\u00a0http:\/\/tika.apache.org\/1.17\/formats.html\u00a0for the file types supported. Tika adds all the extracted text to the\u00a0content\u00a0field. You can map Tika\u2019s metadata fields to Solr fields. You can pass in literals for field values. Literals will override Tika-parsed values, including fields in the Tika metadata object, the Tika content field, and any &#8220;captured content&#8221; fields. You can apply an XPath expression to the Tika XHTML to restrict the content that is produced. One can use curl to send a sample PDF file via HTTP POST like below: The URL above calls the Extracting Request Handler, uploads the file\u00a0solr-word.pdf\u00a0and assigns it the unique ID\u00a0doc1. Here\u2019s a closer look at the components of this command: The\u00a0literal.id=doc1\u00a0parameter provides the necessary unique ID for the document being indexed. 
The\u00a0commit=true parameter\u00a0causes Solr to perform a commit after indexing the document, making it immediately searchable. For optimum performance when loading many documents, don\u2019t call the commit command until you are done. The\u00a0-F\u00a0flag instructs curl to POST data using the Content-Type\u00a0multipart\/form-data\u00a0and supports the uploading of binary files. The @ symbol instructs curl to upload the attached file. The argument\u00a0myfile=@tutorial.html\u00a0needs a valid path, which can be absolute or relative. You can also use\u00a0bin\/post\u00a0to send a PDF file into Solr (without the params, the\u00a0literal.id\u00a0parameter would be set to the absolute path to the file): Now you should be able to execute a query and find that document. You can make a request like: You may notice that although the content of the sample document has been indexed and stored, there are not a lot of metadata fields associated with this document. This is because unknown fields are ignored according to the default parameters configured for the\u00a0\/update\/extract\u00a0handler in\u00a0solrconfig.xml, and this behavior can be easily changed or overridden. For example, to store and see all metadata and content, execute the following: In this command, the\u00a0uprefix=attr_\u00a0parameter causes all generated fields that aren\u2019t defined in the schema to be prefixed with\u00a0attr_, which is a dynamic field that is stored and indexed. This command allows you to query the document using an attribute, as in: Solr Cell Input Parameters capture Captures XHTML elements with the specified name for a supplementary addition to the Solr document. This parameter can be useful for copying chunks of the XHTML into a separate field. For instance, it could be used to grab paragraphs (&lt;p&gt;) and index them into a separate field. Note that content is still also captured into the overall &#8220;content&#8221; field. 
captureAttrIndexes attributes of the Tika XHTML elements into separate fields, named after the element. If set to true, for example, when extracting from HTML, Tika can return the href attributes in &lt;a&gt; tags as fields named &#8220;a&#8221;. commitWithin Add the document within the specified number of milliseconds. date.formatsDefines the date format patterns to identify in the documents. defaultFieldIf the\u00a0uprefix\u00a0parameter (see below) is not specified and a field cannot be determined, the default field will be used. extractOnlyDefault is\u00a0false. If\u00a0true, returns the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a string in the response. When viewing manually, it may be useful to use a response format other than XML to aid in viewing the embedded XHTML tags. extractOnly Default is\u00a0false. If\u00a0true, returns the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a string in the response. When viewing manually, it may be useful to use a response format other than XML to aid in viewing the embedded XHTML tags. For an example, see\u00a0http:\/\/wiki.apache.org\/solr\/TikaExtractOnlyExampleOutput. extractFormatThe default is\u00a0xml, but the other option is\u00a0text. Controls the serialization format of the extract content. The\u00a0xml\u00a0format is actually XHTML, the same format that results from passing the\u00a0-x\u00a0command to the Tika command line application, while the text format is like that produced by Tika\u2019s\u00a0-t\u00a0command. This parameter is valid only if\u00a0extractOnly\u00a0is set to true. fmap.source_fieldMaps (moves) one field name to another. The\u00a0source_field\u00a0must be a field in incoming documents, and the value is the Solr field to map to. 
Example:\u00a0fmap.content=text\u00a0causes the data in the\u00a0content\u00a0field generated by Tika to be moved to the Solr\u2019s\u00a0text\u00a0field. ignoreTikaExceptionIf\u00a0true, exceptions found during processing will be skipped. Any metadata available, however, will be indexed. literal.fieldnamePopulates a field with the name supplied with the specified value for each document. The data can be multivalued if the field is multivalued. literalsOverride If\u00a0true\u00a0(the default), literal [&hellip;]","og_url":"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/","og_site_name":"Aeologic Blog","article_publisher":"https:\/\/www.facebook.com\/AeoLogicTech\/","article_published_time":"2020-03-18T11:08:05+00:00","article_modified_time":"2020-03-18T11:08:31+00:00","og_image":[{"width":1080,"height":622,"url":"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/02\/Indexing-and-Basic-Data-Operations.png","type":"image\/png"}],"author":"Manoj Kumar","twitter_card":"summary_large_image","twitter_creator":"@aeologictech","twitter_site":"@aeologictech","twitter_misc":{"Written by":"Manoj Kumar","Est. 
reading time":"12 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":["Article","BlogPosting"],"@id":"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/#article","isPartOf":{"@id":"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/"},"author":{"name":"Manoj Kumar","@id":"https:\/\/www.aeologic.com\/blog\/#\/schema\/person\/13549984ba8e5f441cc733ed20d7daa4"},"headline":"All About Indexing and Basic Data Operations &#8211; Part 3 &#8211; Ultimate Solr Guide","datePublished":"2020-03-18T11:08:05+00:00","dateModified":"2020-03-18T11:08:31+00:00","mainEntityOfPage":{"@id":"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/"},"wordCount":2113,"publisher":{"@id":"https:\/\/www.aeologic.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/#primaryimage"},"thumbnailUrl":"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/02\/Indexing-and-Basic-Data-Operations.png","articleSection":["Solr"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/","url":"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/","name":"All About Indexing and Basic Data Operations - Part 3 - Ultimate Solr Guide - Aeologic 
Blog","isPartOf":{"@id":"https:\/\/www.aeologic.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/#primaryimage"},"image":{"@id":"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/#primaryimage"},"thumbnailUrl":"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/02\/Indexing-and-Basic-Data-Operations.png","datePublished":"2020-03-18T11:08:05+00:00","dateModified":"2020-03-18T11:08:31+00:00","breadcrumb":{"@id":"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/#primaryimage","url":"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/02\/Indexing-and-Basic-Data-Operations.png","contentUrl":"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2020\/02\/Indexing-and-Basic-Data-Operations.png","width":1080,"height":622},{"@type":"BreadcrumbList","@id":"https:\/\/www.aeologic.com\/blog\/all-about-indexing-and-basic-data-operations-part-3-ultimate-solr-guide\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.aeologic.com\/blog\/"},{"@type":"ListItem","position":2,"name":"All About Indexing and Basic Data Operations &#8211; Part 3 &#8211; Ultimate Solr Guide"}]},{"@type":"WebSite","@id":"https:\/\/www.aeologic.com\/blog\/#website","url":"https:\/\/www.aeologic.com\/blog\/","name":"Aeologic 
Blog","description":"Aeologic","publisher":{"@id":"https:\/\/www.aeologic.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.aeologic.com\/blog\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.aeologic.com\/blog\/#organization","name":"AeoLogic Technologies","url":"https:\/\/www.aeologic.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.aeologic.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2022\/05\/new-logo-aeo.jpg","contentUrl":"https:\/\/www.aeologic.com\/blog\/wp-content\/uploads\/2022\/05\/new-logo-aeo.jpg","width":385,"height":162,"caption":"AeoLogic Technologies"},"image":{"@id":"https:\/\/www.aeologic.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/AeoLogicTech\/","https:\/\/x.com\/aeologictech"]},{"@type":"Person","@id":"https:\/\/www.aeologic.com\/blog\/#\/schema\/person\/13549984ba8e5f441cc733ed20d7daa4","name":"Manoj Kumar","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.aeologic.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/24ce77602da5eb5715d74a95733f6c7548e2af73f5a493f9bc0bf55f611d025e?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/24ce77602da5eb5715d74a95733f6c7548e2af73f5a493f9bc0bf55f611d025e?s=96&d=mm&r=g","caption":"Manoj Kumar"},"description":"Manoj Kumar is a seasoned Digital Marketing Manager and passionate Tech Blogger with deep expertise in SEO, AI trends, and emerging digital technologies. He writes about innovative solutions that drive growth and transformation across industry. 
Featured on - YOURSTORY | TECHSLING | ELEARNINGINDUSTRY | DATASCIENCECENTRAL | TIMESOFINDIA | MEDIUM | DATAFLOQ","sameAs":["https:\/\/www.aeologic.com\/","https:\/\/www.linkedin.com\/in\/manoj-kumar-rajput\/"],"url":"https:\/\/www.aeologic.com\/blog\/author\/manoj\/"}]}},"_links":{"self":[{"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/posts\/723","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/comments?post=723"}],"version-history":[{"count":0,"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/posts\/723\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/media\/635"}],"wp:attachment":[{"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/media?parent=723"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/categories?post=723"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aeologic.com\/blog\/wp-json\/wp\/v2\/tags?post=723"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}