Distributed Searching Basics

On a single machine, search struggles to keep up as the index grows larger and larger.

Solr lets us split the index into several suitably sized indexes, called shards, and distribute them across multiple "servers".

Solr accepts a search request on one server (a single shard), distributes it to every shard, and finally merges the results.

For details, see: http://wiki.apache.org/solr/DistributedSearch
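The distribute-then-merge flow can be sketched in a few lines of Python. This is only an illustration of the merge step, not Solr's actual code: the function name `merge_shard_results`, the sample documents, and their scores are all made up, and each shard is assumed to return its hits already sorted by descending score.

```python
import heapq

def merge_shard_results(shard_results, rows=10):
    """Merge per-shard hit lists into a single top-`rows` list.

    Assumes (hypothetically) that every shard returns its hits
    already sorted by descending score.
    """
    # heapq.merge lazily merges pre-sorted inputs; reverse=True tells it
    # the inputs are ordered from largest to smallest score.
    merged = heapq.merge(
        *shard_results, key=lambda hit: hit["score"], reverse=True
    )
    top = []
    for hit in merged:
        top.append(hit)
        if len(top) == rows:
            break
    return top

# Made-up hits from two shards, each sorted by descending score.
shard_a = [{"id": "doc1", "score": 2.4}, {"id": "doc3", "score": 0.9}]
shard_b = [{"id": "doc2", "score": 1.7}, {"id": "doc4", "score": 0.5}]

top_hits = merge_shard_results([shard_a, shard_b], rows=3)
print([hit["id"] for hit in top_hits])
```

The server that receives the request plays the role of `merge_shard_results` here: it only needs the top few hits from each shard to build the final page.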

 

1. Running a distributed search with the shards parameter

To run a distributed search, add the shards parameter to the search request. Its format is:

host:port/base_url[,host:port/base_url]*

For example:

http://localhost:8983/solr3.5/core1/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on&shards=localhost:8983/solr3.5/core0,localhost:8983/solr3.5/core1
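A request URL like the one above can also be assembled in code. The following is a minimal sketch using only Python's standard library; the helper name `build_distributed_query_url` is hypothetical, and the host, port, and core names are simply the ones from the example. Note that `urlencode` escapes `*` as `%2A` and, with the `safe` characters chosen here, leaves `:` unescaped, so the `q` value is spelled differently from the hand-written URL but decodes to the same `*:*`.

```python
from urllib.parse import urlencode

def build_distributed_query_url(base_url, shards, query="*:*", start=0, rows=10):
    """Build a select URL whose `shards` parameter lists every shard.

    `shards` is a list of host:port/base_url strings, joined with commas
    as described by the format above.
    """
    params = {
        "q": query,
        "version": "2.2",
        "start": start,
        "rows": rows,
        "indent": "on",
        "shards": ",".join(shards),
    }
    # safe="/,:" keeps the shard list readable instead of percent-encoding it.
    return base_url.rstrip("/") + "/select/?" + urlencode(params, safe="/,:")

url = build_distributed_query_url(
    "http://localhost:8983/solr3.5/core1/",
    ["localhost:8983/solr3.5/core0", "localhost:8983/solr3.5/core1"],
)
print(url)
```

Building the shard list from a Python list makes it easy to add or remove shards without editing the URL by hand.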

 

 

2. Components that support distributed searching

Only the following components support distributed searching:

  • The Query component that returns documents matching a query
  • The Facet component, for facet.query and facet.field requests where facets are sorted by count (the default). Solr 1.4 and later also support sorting by name.
  • The Highlighting component
  • The Stats component
  • The Spell Check Component
  • The Terms Component
  • The Term Vector Component
  • The Debug component

 

3. Limitations of distributed searching

Distributed searching also comes with a number of limitations:

  • Each document indexed must have a unique key.
    (Every doc needs a unique identifier, because Solr has to merge the results.)
  • If Solr discovers duplicate document IDs, Solr selects the first document and discards subsequent ones.
    (If Solr finds duplicate IDs, it keeps the first one!)
  • Inverse-document frequency (IDF) calculations cannot be distributed.
    (IDF calculation breaks down: IDF depends on the total document count, which is hard to compute when each shard is searched separately.)
  • Distributed searching does not support the QueryElevationComponent, which configures the top results for a given query regardless of Lucene's scoring. For more information, see http://wiki.apache.org/solr/QueryElevationComponent.
    (QueryElevationComponent ignores scoring: the top results are hand-edited by a user, so a simple merge of results is out of the question.)
  • The index for distributed searching may become out of date; for example, a document that once matched a query and was subsequently changed may no longer match the query but will still be retrieved.
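    (The index can go stale during a distributed search. The reason is not spelled out here; presumably the request runs in several stages and the index may change between them.)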
  • Distributed searching supports only sorted-field faceting, not date faceting.
    (Distributed searching supports only sorted-field faceting.)
  • The number of shards is limited by number of characters allowed for GET method's URI; most Web servers generally support at least 4000 characters, but many servers limit URI length to reduce their vulnerability to Denial of Service (DoS) attacks.
    (The number of shards is limited by the length of the GET URL.)
  • TF/IDF computations are per shard. This may not matter if content is well (randomly) distributed.
    (Similar to the third point: TF/IDF is computed on each shard separately, so the score-based ranking after merging is not entirely "fair".)
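To see why per-shard IDF can skew the merged ranking, here is a small worked sketch. It assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)); the shard names, document counts, and term frequencies are invented purely for illustration.

```python
import math

def idf(num_docs, doc_freq):
    """Classic Lucene-style IDF (assumed here): 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical shards: (total docs, docs containing the term).
# The term is rare on shard_a and common on shard_b.
shards = {"shard_a": (1000, 10), "shard_b": (1000, 500)}

# IDF as each shard computes it on its own.
per_shard_idf = {name: idf(n, df) for name, (n, df) in shards.items()}

# IDF as it would be computed over one combined index.
total_docs = sum(n for n, _ in shards.values())
total_freq = sum(df for _, df in shards.values())
global_idf = idf(total_docs, total_freq)

for name, value in per_shard_idf.items():
    print(f"{name}: idf = {value:.3f}")
print(f"global: idf = {global_idf:.3f}")
```

With this uneven distribution the same term is weighted much more heavily on shard_a than on shard_b, so otherwise similar documents get different scores depending on where they live. If documents are spread randomly across shards, the per-shard values stay close to the global one, which is why the limitation "may not matter" for well-distributed content.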

 

Source: http://docs.lucidworks.com/display/solr/Distributed+Search+with+Index+Sharding
