Indexing and Searching Nested Documents in Solr – Ultimate Solr Guide

Hi all. Today I will be writing about another important aspect of Solr, one that meets a need shared by many business applications. Suppose we need to index the data of an e-commerce company that wants to host a variety of products on its website, along with their associated SKUs. This data needs to be indexed, updated, and searched in a way that makes sense to both end users and merchandisers. So a natural question arises: how do we represent this data? And what structure should we follow so that the data stays easy to maintain?

To address this problem, Solr provides a mechanism to associate multiple docs with each other in a relationship, along with the indexing and searching mechanisms needed to get that data in and out. What we are referring to here is called a “nested document structure”, or more technically, a BLOCK in Solr. Put simply, a block is a set of documents tied together by parent-child relationships.

Let’s understand this in two parts.

Indexing Nested Documents

Let’s assume a nested structure as below:

{
  "id": "100",
  "scope_ss": "product",
  "category_ss": "OutdoorSofa",
  "productType_ss": "sofa",
  "_childDocuments_": [
    {
      "id": "101",
      "scope_ss": "sku",
      "Fabric_ss": "Polyester",
      "Depth_ss": "Classic",
      "Finish_ss": "Fine",
      "Color_ss": "Grey"
    },
    {
      "id": "102",
      "scope_ss": "sku",
      "Fabric_ss": "nylon",
      "Depth_ss": "Petite",
      "Finish_ss": "Grey",
      "Color_ss": "Green"
    },
    {
      "id": "103",
      "scope_ss": "sku",
      "Fabric_ss": "acrylic",
      "Depth_ss": "Petite",
      "Finish_ss": "Espresso",
      "Color_ss": "Green"
    },
    {
      "id": "104",
      "scope_ss": "sku",
      "Fabric_ss": "Polyester",
      "Depth_ss": "Classic",
      "Finish_ss": "Grey",
      "Color_ss": "Dove"
    },
    {
      "id": "105",
      "scope_ss": "sku",
      "Fabric_ss": "nylon",
      "Depth_ss": "Classic",
      "Finish_ss": "Grey",
      "Color_ss": "Blue"
    }
  ]
}

_childDocuments_ tells Solr that each object in the array is a child of the doc that contains it. In essence, we have just created a block: one product document together with the SKU documents that give it meaning in the application. Now let’s see how we can index the doc. Because the field names end in _ss, Solr’s dynamic field rules index them without any schema changes (in the default configset, the _ss suffix maps to a multi-valued string field). A careful reader may ask why a multi-valued type is used when each field holds a single string. The answer lies in a more subtle understanding of the inner workings of Solr.

In my setup, when Solr indexed a nested document it treated the values of each field as a set, so every field was effectively multi-valued, hence the array-of-strings type. Indexing the same data with the single-valued _s suffix, or with no suffix at all, produced an error. This behaviour may depend on the schema and Solr version in use, so verify it against your own configuration. Now, with the basics settled, let’s continue and index the data.
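The block structure described above can also be assembled programmatically. The following is a minimal Python sketch, not part of the original setup: the helper name make_block is hypothetical, and the sample SKUs are a shortened version of the document shown earlier.

```python
import json


def make_block(product, skus):
    """Return a copy of the parent doc with its SKUs nested under _childDocuments_."""
    ids = [product["id"]] + [sku["id"] for sku in skus]
    if len(ids) != len(set(ids)):
        # Parent and children share one id space in the index.
        raise ValueError("every document in a block needs a distinct id")
    block = dict(product)
    block["_childDocuments_"] = [dict(sku) for sku in skus]
    return block


product = {
    "id": "100",
    "scope_ss": "product",
    "category_ss": "OutdoorSofa",
    "productType_ss": "sofa",
}
skus = [
    {"id": "101", "scope_ss": "sku", "Fabric_ss": "Polyester", "Color_ss": "Grey"},
    {"id": "102", "scope_ss": "sku", "Fabric_ss": "nylon", "Color_ss": "Green"},
]

block = make_block(product, skus)
print(json.dumps(block, indent=2))
```

The helper copies its inputs, so the original product and SKU dictionaries stay untouched and can be reused to build other blocks.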

Open the Solr Admin UI, select your core or collection, and go to its Documents screen.

Paste the JSON doc into the “Document(s)” box and submit it. If everything is configured correctly, a query on the index now returns all six documents of this block: the five SKUs and the product.

Note: Notice how the parent doc (the product information) is listed at the bottom, with the child docs piled on top of it. This is not by accident but by design: Lucene writes a block contiguously, children first and the parent last.
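The same indexing step can be done over HTTP instead of through the UI. The sketch below uses only the Python standard library and posts the block as a JSON array to the update handler. The host, port, and the core name "products" are assumptions for illustration, as are the helper names; the send function needs a running Solr and is therefore defined but not called.

```python
import json
from urllib.request import Request, urlopen

# Assumption: a local Solr instance with a core named "products" (hypothetical).
SOLR_UPDATE_URL = "http://localhost:8983/solr/products/update?commit=true"


def build_index_request(block, url=SOLR_UPDATE_URL):
    """Wrap one block in a JSON array and prepare a POST to the update handler."""
    body = json.dumps([block]).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode Solr's JSON reply (requires a running Solr)."""
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


block = {
    "id": "100",
    "scope_ss": "product",
    "_childDocuments_": [
        {"id": "101", "scope_ss": "sku", "Color_ss": "Grey"},
    ],
}

request = build_index_request(block)
print(request.get_method(), request.full_url)
```

Parent and children travel in one request, which is what lets Solr write them as a single block.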

Querying Nested Documents

Let’s dive into the basics of querying nested docs in Solr.

One of the most common use cases for e-commerce data sets is faceting. Faceting on nested docs works differently from faceting on traditional flat docs in Solr.

  • To understand this, let’s generate facets for Color_ss and Fabric_ss fields.
    • Query Syntax: &json.facet={Fabrics:{type:terms,field:Fabric_ss,limit:-1},Colors:{type:terms,field:Color_ss,limit:-1}}&rows=0
    • Response:

The response contains one list of buckets per field: the first for Fabrics and the second for Colors. It’s worth mentioning here that the names “Fabrics” and “Colors” are custom labels and can be changed at will.
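To make the shape of a terms facet concrete, here is a small Python emulation that counts values over the five sample SKUs. It is computed locally and is not Solr output; the function name terms_facet is hypothetical, and the order of buckets with equal counts may differ from what Solr returns.

```python
from collections import Counter

# The five sample SKUs, reduced to the two fields being faceted.
skus = [
    {"id": "101", "Fabric_ss": "Polyester", "Color_ss": "Grey"},
    {"id": "102", "Fabric_ss": "nylon", "Color_ss": "Green"},
    {"id": "103", "Fabric_ss": "acrylic", "Color_ss": "Green"},
    {"id": "104", "Fabric_ss": "Polyester", "Color_ss": "Dove"},
    {"id": "105", "Fabric_ss": "nylon", "Color_ss": "Blue"},
]


def terms_facet(docs, field):
    """Count how many docs carry each value of the field, most frequent first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [{"val": value, "count": count} for value, count in counts.most_common()]


facets = {
    "Fabrics": terms_facet(skus, "Fabric_ss"),
    "Colors": terms_facet(skus, "Color_ss"),
}
for name, buckets in facets.items():
    print(name, buckets)
```

Each facet is independent of the other: the Fabrics buckets say nothing about which colors go with which fabric.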

  • Let’s try something else: nested faceting.
    • Query: &json.facet={Fabrics:{type:terms,field:Fabric_ss,limit:-1,facet:{Colors:{type:terms,field:Color_ss,limit:-1}}}}&rows=0
    • Response:

Here, we have a bucket list called “Fabrics”, and nested within each of its buckets is another list called “Colors”. In essence, we are presented with more granular information: the colors available for each fabric type.
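The nesting can be pictured with a local Python emulation over the sample SKUs. Again, this is a sketch computed in memory rather than a Solr response, and nested_terms_facet is a hypothetical helper name.

```python
from collections import Counter, defaultdict

skus = [
    {"id": "101", "Fabric_ss": "Polyester", "Color_ss": "Grey"},
    {"id": "102", "Fabric_ss": "nylon", "Color_ss": "Green"},
    {"id": "103", "Fabric_ss": "acrylic", "Color_ss": "Green"},
    {"id": "104", "Fabric_ss": "Polyester", "Color_ss": "Dove"},
    {"id": "105", "Fabric_ss": "nylon", "Color_ss": "Blue"},
]


def nested_terms_facet(docs, outer_field, inner_field):
    """Group docs by the outer field, then count inner-field values per group."""
    groups = defaultdict(Counter)
    for doc in docs:
        if outer_field not in doc:
            continue
        inner_counts = groups[doc[outer_field]]  # creates the outer bucket
        if inner_field in doc:
            inner_counts[doc[inner_field]] += 1
    return {outer: dict(inner) for outer, inner in groups.items()}


colors_by_fabric = nested_terms_facet(skus, "Fabric_ss", "Color_ss")
for fabric, colors in colors_by_fabric.items():
    print(fabric, colors)
```

The inner counts are restricted to the docs that fall into the enclosing bucket, which is what distinguishes a nested facet from two side-by-side facets.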

Let’s try another interesting query, one combining nested faceting with independent buckets.

  • Query: &json.facet={Depth:{type:terms,field:Depth_ss,limit:-1},Fabrics:{type:terms,field:Fabric_ss,limit:-1,facet:{Colors:{type:terms,field:Color_ss,limit:-1}}}}&rows=0
  • Response: 

Here, one bucket list stands for the independent Depth facet and the other for the nested facet of Fabric with Color. This enables businesses using Solr to present product information at a granular level to their customers.
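Hand-writing these facet strings gets error-prone as they grow. One possible approach, sketched below in Python, is to build the facet definition as a dictionary and serialize it into the json.facet parameter. The helper name terms is hypothetical, and the sketch emits strict JSON rather than the relaxed syntax used in the queries above.

```python
import json
from urllib.parse import parse_qs, urlencode


def terms(field, **sub_facets):
    """Describe a terms facet on a field, optionally with nested sub-facets."""
    facet = {"type": "terms", "field": field, "limit": -1}
    if sub_facets:
        facet["facet"] = sub_facets
    return facet


json_facet = {
    "Depth": terms("Depth_ss"),
    "Fabrics": terms("Fabric_ss", Colors=terms("Color_ss")),
}

# URL-encoded query string, ready to append to the select handler URL.
params = urlencode({"q": "*:*", "rows": 0, "json.facet": json.dumps(json_facet)})
print(params)
```

Adding another level of nesting is then a matter of passing one more keyword argument to terms, with no brace counting involved.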

Some other useful queries include:

  • To view all products and SKU information:
    • Query: q=*:*&rows=10
    • Response:
  • To fetch the SKUs that belong to a specific product:
    • Query: q={!child%20of="scope_ss:product"}id:100&wt=json
    • Response:

{!child%20of="scope_ss:product"} – The of parameter must match every parent document in the index (here, every doc with scope_ss:product). Solr uses it to work out where each block begins and ends.

id:100 – This query selects “some” parents based on a specific field-value pair. The result contains the child docs of those parents; the parents themselves are not returned.
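The two parts of the child query can be composed in code, which also takes care of URL encoding. This is a minimal Python sketch; the function name child_query is hypothetical.

```python
from urllib.parse import parse_qs, urlencode


def child_query(all_parents_filter, parent_query):
    """Build a block-join query that returns the children of matching parents.

    all_parents_filter must match every parent doc in the index;
    parent_query narrows the result down to the parents of interest.
    """
    return '{!child of="%s"}%s' % (all_parents_filter, parent_query)


q = child_query("scope_ss:product", "id:100")
params = urlencode({"q": q, "wt": "json"})
print(q)
print(params)
```

Keeping the two arguments separate makes it harder to confuse the filter that marks all parents with the query that picks specific ones.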

Updating a nested document in Solr

Updating a nested document in Solr is a very tricky business. One has to re-index the complete BLOCK even if only a single SKU has to be updated or a new SKU has to be added. This is a significant limitation of nested documents in Solr. It follows from the underlying Lucene design, which requires all documents of a block to be stored contiguously in the index. The exact explanation, though available, is beyond the scope of this post.
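In practice this means changes are applied to an in-memory copy of the whole block, which is then sent to Solr again in full. A minimal Python sketch of that preparation step follows; the helper names update_sku and add_sku are hypothetical, and the actual re-indexing request is left out.

```python
import copy


def update_sku(block, sku_id, changes):
    """Return a new full block in which one child doc has been modified."""
    new_block = copy.deepcopy(block)
    for child in new_block["_childDocuments_"]:
        if child["id"] == sku_id:
            child.update(changes)
            return new_block
    raise KeyError("no child document with id %s" % sku_id)


def add_sku(block, sku):
    """Return a new full block with one more child doc appended."""
    new_block = copy.deepcopy(block)
    new_block["_childDocuments_"].append(dict(sku))
    return new_block


block = {
    "id": "100",
    "scope_ss": "product",
    "_childDocuments_": [
        {"id": "101", "scope_ss": "sku", "Color_ss": "Grey"},
        {"id": "102", "scope_ss": "sku", "Color_ss": "Green"},
    ],
}

# Both results are complete blocks that would be re-indexed as a whole.
recolored = update_sku(block, "102", {"Color_ss": "Blue"})
extended = add_sku(block, {"id": "106", "scope_ss": "sku", "Color_ss": "Dove"})
print(recolored["_childDocuments_"][1])
print(len(extended["_childDocuments_"]))
```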

Deleting a SKU/child document

The Solr documentation is a little ambiguous about this, but a simple proof of concept makes it clear that one can remove a child doc from Solr with an ordinary delete request that matches its ID. One can invoke the update handler of the core/collection as below:

Query: update?stream.body=<delete><query>id:101</query></delete>&commit=true

Response: This removes the child document with id:101 from the index.
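The same delete can be expressed as a JSON body posted to the update handler, which avoids putting XML in the URL. The Python sketch below only builds the request; the host, port, and core name "products" are assumptions, as is the helper name build_delete_request, and sending the request requires a running Solr.

```python
import json
from urllib.request import Request

# Assumption: a local Solr instance with a core named "products" (hypothetical).
SOLR_UPDATE_URL = "http://localhost:8983/solr/products/update?commit=true"


def build_delete_request(doc_id, url=SOLR_UPDATE_URL):
    """Prepare a delete-by-query request that targets a single document id."""
    body = json.dumps({"delete": {"query": "id:%s" % doc_id}}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_delete_request("101")
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would execute the delete and commit it in one step.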

So, that’s it for today. I will be back with another interesting post on Solr.