Distributed Search with Index Sharding

  categories:搜索资料  tags:,   author:

This Guide covers Apache Solr v4.1.

When an index becomes too large to fit on a single system, or when a query takes too long to execute, an index can be split into multiple shards, and Solr can query and merge results across those shards. A single shard receives the query, distributes the query to other shards, and integrates the results. You can find additional information about distributed search on the Solr wiki: http://wiki.apache.org/solr/DistributedSearch.

The figure below compares a single server to a distributed configuration with two shards.

If single queries are currently fast enough and if one simply wants to expand the capacity (queries/sec) of the search system, then standard index replication (replicating the entire index on multiple servers) should be used instead of index sharding.

Update commands may be sent to any server with distributed indexing configured correctly. Document adds and deletes are forwarded to the appropriate server/shard based on a hash of the unique document id. commit commands and deleteByQuery commands are sent to every server in shards.

Update reorders (i.e., replica A may see update X then Y, and replica B may see update Y then X). deleteByQuery also handles reorders the same way, to ensure replicas are consistent.  All replicas of a shard are consistent, even if the updates arrive in a different order on different replicas.

Distributed Support for Date and Numeric Range Faceting

You can now use range faceting for anything that uses date math (both date and numeric ranges).  In addition, you can now use NOW by including FacetParams.FACET_DATE_NOW in the original request to sync remote shard requests to a common ‘now’ time.  For example, using range faceting is a convenient way to keep the rest of your request the same, but check how the current date affects your date boosting strategies.

FacetParams.FACET_DATE_NOW takes as a parameter a (stringified) long that is the number of milliseconds from 1 Jan 1970 00:00, i.e., the returned value from a System.currentTimeMillis() call. This delineates it from a ‘searchable’ time and avoids superfluous date parsing.

NOTE: This parameter affects date facet timing only. If there are other areas of a query that rely on ‘NOW’, these will not interpret this value.

For distributed facet_dates, Solr steps through each date facet, adding and merging results from the current shard.

Any time and/or time zone differences are NOT taken into account here. The issue of time zone/skew on distributed shards is currently handled by passing a facet.date.now=<epochtime> parameter in the search query. This is then used by the participating shards to use as ‘now’.

If you use the first encountered shard’s facet_dates as the basis for subsequent shards’ data to be merged in, if subsequent shards’ facet_dates are skewed in relation to the first by a >1 ‘gap’, these ‘earlier’ or ‘later’ facets will not be merged in.

There are two reasons for this:

  1. Performance: It is of course faster to check facet_date lists against a single map’s data, rather than against each other.
  2. If ‘earlier’ and/or ‘later’ facet_dates are added in, this makes the time range larger than that which was requested (e.g. a request for one hour’s worth of facets could bring back 2, 3, or more hours of data).

Distributing Documents across Shards

It is up to you to get all your documents indexed on each shard of your server farm. Solr does not include out-of-the-box support for distributed indexing, but your method can be as simple as a round robin technique. Just index each document to the next server in the circle. (For more information about indexing, see Indexing and Basic Data Operations.)

A simple hashing system would also work. The following should serve as an adequate hashing function.

uniqueId.hashCode() % numServers

One advantage of this approach is that it is easy to know where a document is if you need to update it or delete. In contrast, if you are moving documents around in a round-robin fashion, you may not know where a document actually is.

Solr does not calculate universal term/doc frequencies. For most large-scale implementations, it is not likely to matter that Solr calculates TD/IDF at the shard level. However, if your collection is heavily skewed in its distribution across servers, you may find misleading relevancy results in your searches. In general, it is probably best to randomly distribute documents to your shards.

You can directly configure aspects of the concurrency and thread-pooling used within distributed search in Solr. This allows for finer grained controlled and you can tune it to target your own specific requirements. The default configuration favors throughput over latency.

To configure the standard handler, provide a configuration like this:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- other params go here -->
     <shardHandlerFactory class="HttpShardHandlerFactory">
        <int name="socketTimeOut">1000</int>
        <int name="connTimeOut">5000</int>
      </shardHandler>
  </requestHandler>

The parameters that can be specified are as follows:

Parameter Default Explanation
socketTimeout 0 (use OS default) The amount of time in ms that a socket is allowed to wait.
connTimeout 0 (use OS default) The amount of time in ms that is accepted for binding / connecting a socket
maxConnectionsPerHost 20 The maximum number of connections that is made to each individual shard in a distributed search.
corePoolSize 0 The retained lowest limit on the number of threads used in coordinating distributed search.
maximumPoolSize Integer.MAX_VALUE The maximum number of threads used for coordinating distributed search.
maxThreadIdleTime 5 seconds The amount of time to wait for before threads are scaled back in response to a reduction in load.
sizeOfQueue -1 If specified, the thread pool will use a backing queue instead of a direct handoff buffer. High throughput systems will want to configure this to be a direct hand off (with -1). Systems that desire better latency will want to configure a reasonable size of queue to handle variations in requests.
fairnessPolicy false Chooses the JVM specifics dealing with fair policy queuing, if enabled distributed searches will be handled in a First in First out fashion at a cost to throughput. If disabled throughput will be favored over latency.

Executing Distributed Searches with the shards Parameter

If a query request includes the shards parameter, the Solr server distributes the request across all the shards listed as arguments to the parameter. The shards parameter uses this syntax:

host:port/base_url[,host:port/base_url]*

For example, the shards parameter below causes the search to be distributed across two Solr servers: solr1 and solr2, both of which are running on port 8983:

http://localhost:8983/solr/select?

shards=solr1:8983/solr,solr2:8983/solr&indent=true&q=ipod+solr

Rather than require users to include the shards parameter explicitly, it is usually preferred to configure this parameter as a default in the RequestHandler section of solrconfig.xml.

Do not add the shards parameter to the standard requestHandler; otherwise, search queries may enter an infinite loop. Instead, define a new requestHandler that uses the shards parameter, and pass distributed search requests to that handler.

Currently, only query requests are distributed. This includes requests to the standard request handler (and subclasses such as the DisMax RequestHandler), and any other handler (org.apache.solr.handler.component.searchHandler) using standard components that support distributed search.

Where shards.info=true, distributed responses will include information about the the shard (where each shard represents a logically different index or physical location), such as the following:

<lst name="shards.info">
  <lst name="localhost:7777/solr">
    <long name="numFound">1333</long>
    <float name="maxScore">1.0</float>
    <long name="time">686</long>
  </lst>
  <lst name="localhost:8888/solr">
    <long name="numFound">342</long>
    <float name="maxScore">1.0</float>
    <long name="time">602</long>
  </lst>
</lst>

The following components support distributed search:

  • The Query component, which returns documents matching a query
  • The Facet component, which processes facet.query and facet.field requests where facets are sorted by count (the default).
  • The Highlighting component, which enables Solr to include “highlighted” matches in field values.
  • The Stats component, which returns simple statistics for numeric fields within the DocSet.
  • The Debug component, which helps with debugging.

Limitations to Distributed Search

Distributed searching in Solr has the following limitations:

  • Each document indexed must have a unique key.
  • If Solr discovers duplicate document IDs, Solr selects the first document and discards subsequent ones.
  • Inverse-document frequency (IDF) calculations cannot be distributed.
  • Distributed searching does not support the QueryElevationComponent, which configures the top results for a given query regardless of Lucene’s scoring. For more information, see http://wiki.apache.org/solr/QueryElevationComponent.
  • The index for distributed searching may become momentarily out of sync if a commit happens between the first and second phase of the distributed search. This might cause a situation where a document that once matched a query and was subsequently changed may no longer match the query but will still be retrieved. This situation is expected to be quite rare, however, and is only possible for a single query request.
  • Distributed searching supports only sorted-field faceting, not date faceting
  • The number of shards is limited by number of characters allowed for GET method’s URI; most Web servers generally support at least 4000 characters, but many servers limit URI length to reduce their vulnerability to Denial of Service (DoS) attacks.
  • TF/IDF computations are per shard. This may not matter if content is well (randomly) distributed.
  • Shard information can be returned with each document in a distributed search by including fl=id, [shard] in the search request. This returns the shard URL.
  • In a distributed search, the data directory from the core descriptor overrides any data directory in solrconfig.xml.
  • Update commands may be sent to any server with distributed indexing configured correctly. Document adds and deletes are forwarded to the appropriate server/shard based on a hash of the unique document id. commit commands and deleteByQuery commands are sent to every server in shards.

Avoiding Distributed Deadlock

Each shard may also serve top-level query requests and then make sub-requests to all of the other shards. In this configuration, care should be taken to ensure that the max number of threads serving HTTP requests in the servlet container is greater than the possible number of requests from both top-level clients and other shards. If this is not the case, the configuration may result in a distributed deadlock.

For example,a deadlock might occur in the case of two shards, each with just a single thread to service HTTP requests. Both threads could receive a top-level request concurrently, and make sub-requests to each other. Because there are no more remaining threads to service requests, the servlet containers will block the incoming requests until the other pending requests are finished, but they will not finish since they are waiting for the sub-requests. By ensuring that the servlets are configured to handle a sufficient number of threads, you can avoid deadlock situations like this.

Testing Index Sharding on Two Local Servers

For simple functionality testing, it’s easiest to just set up two local Solr servers on different ports. (In a production environment, of course, these servers would be deployed on separate machines.)

  1. Make a copy of the solr example directory:
    cd solr
    cp -r example example7574

     

  2. Change the port number:
    perl -pi -e s/8983/7574/g example7574/etc/jetty.xml example7574/exampledocs/post.sh

     

  3. In the first window, start up the server on port 8983:
    cd examplejava -server -jar start.jar

     

  4. In the second window, start up the server on port 7574:
    cd example7574java -server -jar start.jar

     

  5. In the third window, index some example documents to each server:
    cd example/exampledocs./post.sh \[a-m\]\*.xmlcd ../../example7574/exampledocs./post.sh \[n-z\]\*.xml

     

  6. Now do a distributed search across both servers with your browser or curl:
    curl 'http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=ipod+solr'


快乐成长 每天进步一点点      京ICP备18032580号-1