Efficiently Handling Deep Pagination In A Distributed Search Engine

“Hey, you told me you had 10 million results. What’s the deal!”

As users, we don’t always find what we’re looking for on the first page, so we click “next page”, sometimes repeatedly. In many search engines, performance drops off quickly if a user paginates deeply into search results. Search engines commonly deal with this problem by preventing users from navigating past a certain page. For example, Google will not return anything past the 1000th result.

This decision might make sense for a text document search engine, but the brain processes images more quickly than text blurbs, so it’s a bit more common for a user to look through a set of 1,000 images before selecting what they want. Therefore, for a media company like Shutterstock, it’s worth a. examining why deep pagination is slow and b. considering how we can improve it.

Deep page queries are expensive to perform for a few reasons:

  1. Caching: If you have a cache in front of your search engine, cache hits are less likely on higher pages, because relatively few users paginate deeply into results.
  2. Sorting: The search engine must build a data structure (typically a priority queue) sorted by document score containing every previous result to determine what belongs on the current page. Not only is this a non-linear time operation, but also it requires a lot of memory allocation and can increase heap pressure.
  3. Distributed Search: In a sharded index configuration, deep page query performance does not scale linearly because each shard must still produce a full document set (more on this later).

Let’s look at how these problems can be solved in the Solr world, keeping in mind that these solutions are generalizable to Elastic Search and other search engines:

  1. Caching: Many incremental improvements can be made by tinkering with the caching strategy, but if the system can’t withstand peak traffic without its cache, you’re setting yourself up for an outage if the cache is ever blown.
  2. Sorting: Solr 4.7+ offers a feature called cursor mark, which requires the client to pass a “cursor” representing the score of the last document on the last page fetched. Under the hood, this cursor is then used to filter out any documents with a lower score before inserting into the priority queue, meaning records on previous pages do not need to be sorted. However, there are two issues  with this approach:
    1. It requires the client to pass these cursors.
    2. A user cannot visit a page without first visiting the previous page.
  3. Distributed Search: The previous two techniques alleviate the symptoms of the problem with some caveats, but if we could find a way to make deep paging scale well across a distributed system, this whole problem should go away. In the rest of this article, we will focus on an easy way Shutterstock has found to make deep pagination more distributed-search-friendly.

Let’s first review how distributed search works in Solr:

  • Documents are assigned to shards (“cores”) according to a hash function:
    • Document Core Assignment = Hash(Document ID) % (Shard Count)
  • Each Solr Core holds roughly N / S documents, where N is the number of documents in the index, and S is the Shard Count. This balance is ensured by the randomness of the hash function.

A client request is processed as follows:

  1. The client sends a request to a random cluster node, designated the Aggregator Node.
  2. The aggregator node searches any cores it has locally and then forwards the request to other nodes to get results for the remaining cores.
  3. Each node scores the matching documents, sorts by a score, and returns its top results to the aggregator node.
  4. The aggregator merges all returned results, trims excess, and returns them to the user.

Here is a diagram of these steps:


Now, imagine we have a heavily sharded system with 100 cores, and we’re returning 100 documents per page. When a user asks for page 100, the system needs to sort at least 10,000 documents (100 items per page * 100 pages). Since the system does not know where the top 10,000 documents will be — they could all be on the same core, for example — the aggregator nodes request 10,000 documents from EACH core. This means it needs to retrieve 1 million documents (10,000 docs * 100 cores) to determine the results that should appear on page 100. After it does the merge, it will throw away the 990,000 documents that didn’t make the cut. This 99% discard rate is costly, performance-wise. The crux of the problem is each core is doing the same sorting work it would have to do in an unsharded system.

One might ask why every core needs to return 10K documents to the aggregator. Why not just have each core return 100 documents, combine them all, trim nothing, and return the results to the user? The problem is that these results would be incorrect. That’s because there’s a strong likelihood that the top 10K results for a given query would not be distributed perfectly evenly across the cluster. For example, below is an illustration of what can happen when a user requests the top 12 results from a system with 4 shards if we try to fetch 25% of the results from each shard:

To avoid this sort of mis-ranking, Solr’s out-of-the-box distributed search solution fetches a full result set from each shard. However, there is a scantily documented Solr feature, shards.rows, that allows the caller to customize the size of the result set fetched from each core by adding the parameters shards.rows and/or shards.start, in addition to the usual start and rows parameters. For example:


The above query tells the aggregator to return 100 results, starting at document 900, but only considering the first 250 rows from each shard. In other words, Solr gives us the tools necessary to trade accuracy for performance, by fetching fewer rows from each shard.

It’s worth noting that if we select the smallest possible shards.rows value which produces a full result set, as in the above illustration, we will have achieved our goal of reducing sort work linearly with the number of shards, at the expense of result accuracy.

So far we’ve looked at two solutions on opposite ends of the spectrum of the accuracy performance trade-off. Let’s define a new term to represent this concept:

Shard Factor = shards.rows / (start + rows)

Furthermore, let’s put shards.start aside, as we don’t really need this tool (assume it’s always 0).

So to recap, on one hand, we have the default Solr solution, which effectively uses Shard Factor=100%; on the other, he have a solution that heavily compromises result accuracy by setting Shard Factor = 1 / (Number of shards)

Next question: Is there a value for shard factor we can calculate that will guarantee a certain percent accuracy assuming random document distribution across shards?

In fact, there is! This has been discussed in the Solr community here, and we found a paper on this subject published by Yahoo:

Unfortunately, this formula is recursively defined, and in our implementation, got prohibitively expensive to calculate as the variables got large. However, we found a much faster, more intuitive way of estimating an appropriate shard factor using a Monte Carlo Simulation technique, as shown in this script which we’ve open sourced.

Using this tool, you can accurately estimate what shard factor is needed to guarantee that 99% of result pages are completely correct, for a given shard configuration. Due to the “law of large numbers”, this shard factor drops as you page deeper into the results, and the result set distribution evens out. Specifically:

  • With 4 shards, we found that for results [0-1,000], we needed a shard factor or 38% to guarantee 99% accuracy, but for results above 1,000, 28% was sufficient.
  • With 100 shards, we found that for results 0-1,000, we needed a shard factor or 6% to guarantee 99% accuracy, but for results above 1,000, 2% was sufficient.

It’s worth noting that these shard factors are quite close to the minimum possible values of 25% and 1%, which were highly inaccurate (in fact the probability of result correctness is nearly 0 for those value because they require an exactly even distribution of results).

This means we were able to get very near 100% accuracy, while still achieving effectively linear speedup as we distributed our search engine across more shards.

How we applied this work:

Historically, Shutterstock has sent deeply paginated queries to a separate “deep page cluster” to prevent these queries, which are often issued by spiders, from impacting more common low page user traffic. Our testing showed that when we tried to shard our deep page cluster, the performance worsened because the increased heap pressure caused by the additional sorting and aggregation work outweighed the benefits of needing to score fewer documents per shard. The shard factor trick enabled us to get the benefits of more sharding without the cost. Our performance test of adding this feature to a 4-shard deep page cluster showed a 60% drop in latency and a 95% drop in the error rate in slow-running high start queries, increasing our max throughput by 125%. In fact, since we took this action, “deep page” query performance has been roughly on-par with regular query performance.

Further Work:

Since we’ve mitigated performance issues with deep page queries, we’ve begun to consider combining our deep page cluster with our main cluster and increasing the shard count. Performance test results of a 12-shard combined pool show a 31% drop in average latency, a 37% drop in 90% latency, and the elimination of a 0.91% timeout rate:

We’re hoping this performance improvement will enable us to jettison the high start pool in the near future.

Another class of problematic queries we think this technique may help with is high result queries. These are queries which match nearly every document in our index and can be slow because they require scoring 100M+ matching documents. We’ve found that if we increase the number of shards, the scoring time is parallelized, allowing a roughly linear drop in query processing time. However, we haven’t been able to take advantage of this in the past because higher shard counts caused a lot of redundant sorting work to be done. With this technique, we hope to be able to move to a system with a much higher shard count without the negative sorting trade-offs, helping us solve the high result problem too.