
An Overview of Solr Request Execution

Solr requests (both index updates and queries) are executed through the SolrCore method

execute(SolrRequestHandler handler, SolrQueryRequest req, SolrQueryResponse rsp).

The first parameter is the handler that performs the work; these handlers live in the handler package, for example XmlUpdateRequestHandler.

The second parameter is the request, which contains the request parameters (typically the parameters passed in through a form when querying) as well as the request's content streams (for example the XML data sent when updating the index).

The third parameter is the response; after processing the request, the handler stores its results there.
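To make the entry point concrete, here is a minimal, hedged sketch of driving SolrCore.execute directly; the handler name "/update" and the use of LocalSolrQueryRequest are my own assumptions for illustration, not something taken from this walkthrough:

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.core.SolrCore;
import org.apache.solr.request.LocalSolrQueryRequest;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.request.SolrRequestHandler;

public class ExecuteSketch {
  public static void runCommit(SolrCore core) {
    // Look up a registered handler by name (the name "/update" is an assumption here).
    SolrRequestHandler handler = core.getRequestHandler("/update");

    Map<String, String[]> args = new HashMap<String, String[]>();
    args.put("commit", new String[] {"true"});
    SolrQueryRequest req = new LocalSolrQueryRequest(core, args);
    SolrQueryResponse rsp = new SolrQueryResponse();
    try {
      core.execute(handler, req, rsp); // the handler fills rsp with its results
    } finally {
      req.close(); // releases resources (e.g. the searcher reference) held by the request
    }
  }
}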

Class hierarchy of the SolrQueryRequest interface

(class diagram omitted; methods shown with a strikethrough are deprecated)

Class hierarchy of the SolrQueryResponse interface

(class diagram omitted)

The Index Update Process

The XML update data is carried in the request's streams : Iterable<ContentStream> field. The class hierarchy of the ContentStream interface is shown below:

(class diagram omitted)

Note that GiantContentStream is a class I added myself. It extends ContentStreamBase and overrides the constructor and the InputStream getStream() method; getStream() is exactly what the handler calls to read the data during an update. A glance at the class outline makes its purpose clear.

(class outline omitted)

We then come across this code:
final NamedList<Object> responseHeader = new SimpleOrderedMap<Object>();
rsp.add("responseHeader", responseHeader);
What does it do? It creates a container for the response header and adds it to the response object, so that the response carries a header which, among other things, contains our request parameters.
As for NamedList, the Javadoc describes it as an ordered list of name/value pairs with the following properties (a short sketch follows the list):

  • names may repeat
  • the name/value pairs keep their insertion order
  • entries can be accessed by index
  • both names and values may be null
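Here is a small, hedged sketch that exercises those properties directly (the entry names are made up for illustration):

import org.apache.solr.common.util.NamedList;
import org.apache.solr.common.util.SimpleOrderedMap;

public class NamedListSketch {
  public static void main(String[] args) {
    NamedList<Object> header = new SimpleOrderedMap<Object>();
    header.add("status", 0);
    header.add("QTime", 12);
    header.add("QTime", 15);          // duplicate names are allowed
    header.add(null, "anonymous");    // names (and values) may be null

    // entries keep their insertion order and can be read by index
    for (int i = 0; i < header.size(); i++) {
      System.out.println(header.getName(i) + " = " + header.getVal(i));
    }
  }
}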

Here is its class hierarchy:

(class diagram omitted)
Now look at another piece of code:
NamedList toLog = rsp.getToLog();
//toLog.add("core", getName());
toLog.add("webapp", req.getContext().get("webapp"));
toLog.add("path", req.getContext().get("path"));
toLog.add("params", "{" + req.getParamString() + "}");
What is toLog? It is a field of the SolrQueryResponse class, declared as:

protected NamedList toLog = new SimpleOrderedMap();
We saw above that SimpleOrderedMap is a subclass of NamedList, and the field name makes it easy to guess that it holds the content that is to be written to the log. Indeed, the printed log readily shows lines such as:
INFO: [database] webapp=null path=null params={} status=0 QTime=1438

The key code for executing the update is here:

handler.handleRequest(req,rsp);
setResponseHeaderValues(handler,req,rsp);

The first line performs the update; the second merely writes the request parameters from req, together with the processed toLog contents of rsp, into the response header (responseHeader) of rsp.

Let us focus on the crucial first line: handler.handleRequest(req,rsp);

The handler normally used for updates is XmlUpdateRequestHandler, so we step into that class's handleRequest method to see what is going on. It turns out that XmlUpdateRequestHandler does not define this method itself; we guess it must be in the parent class, and indeed it is, so we study the parent's handleRequest method instead.

 

SolrPluginUtils.setDefaults(req,defaults,appends,invariants);
The code inside this method is as follows:
SolrParams p = req.getParams();
if (defaults != null) {
  p = new DefaultSolrParams(p,defaults);
}
if (appends != null) {
  p = new AppendedSolrParams(p,appends);
}
if (invariants != null) {
  p = new DefaultSolrParams(invariants,p);
}
req.setParams(p);
AppendedSolrParams is a subclass of DefaultSolrParams. As the code shows, the point of this block is to give req its effective, fully resolved set of parameters.
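To see what that layering buys us, here is a hedged sketch of how the wrapped parameters resolve; ModifiableSolrParams is used only as a convenient way to build the three parameter sets, and the public DefaultSolrParams constructor shown here matches the Solr version discussed in this article (newer Solr releases expose this through SolrParams.wrapDefaults instead):

import org.apache.solr.common.params.DefaultSolrParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;

public class ParamLayeringSketch {
  public static void main(String[] args) {
    ModifiableSolrParams request = new ModifiableSolrParams();
    request.set("q", "solr");                 // what the client sent

    ModifiableSolrParams defaults = new ModifiableSolrParams();
    defaults.set("rows", "10");               // used only when the request lacks the param

    ModifiableSolrParams invariants = new ModifiableSolrParams();
    invariants.set("wt", "xml");              // always wins, even over the request

    SolrParams p = request;
    p = new DefaultSolrParams(p, defaults);   // request values shadow the defaults
    p = new DefaultSolrParams(invariants, p); // invariants shadow everything else

    System.out.println(p.get("q"));    // solr (from the request)
    System.out.println(p.get("rows")); // 10   (filled in from defaults)
    System.out.println(p.get("wt"));   // xml  (forced by invariants)
  }
}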

Code: rsp.setHttpCaching(httpCaching);
Purpose: enables or disables HTTP caching for the response.

Code: handleRequestBody( req, rsp );
This is where the request body is actually processed. Since we have an instance of XmlUpdateRequestHandler, we jump to that class's handleRequestBody method. It does the following:

  • SolrParams params = req.getParams(); obtains the request parameters.
  • UpdateRequestProcessorChain processingChain =
    req.getCore().getUpdateProcessingChain( params.get( UpdateParams.UPDATE_PROCESSOR ) ); obtains the update request processing chain.
  • UpdateRequestProcessor processor = processingChain.createProcessor(req, rsp); obtains the update request processor.
  • Iterable<ContentStream> streams = req.getContentStreams(); obtains the content streams of the update request.
  • If streams is null, a commit is attempted and an error is thrown if that fails. Otherwise each stream is processed as follows:

for( ContentStream stream : req.getContentStreams() ) {
  Reader reader = stream.getReader();
  try {
    XMLStreamReader parser = inputFactory.createXMLStreamReader(reader);
    this.processUpdate( processor, parser );
  }
  finally {
    IOUtils.closeQuietly(reader);
  }
}
In essence, this loop takes each content stream in turn, obtains its Reader, builds an XMLStreamReader on top of it, and calls this.processUpdate( processor, parser ); to carry out the update. Afterwards a commit may be attempted (only "may", because we can pass a commit parameter set to true along with the update request).

  • processor.finish();

Let us walk through this process step by step.
First, obtaining the processing chain. The solrconfig.xml we are using is the stock one and contains no processing-chain configuration, so the argument passed to req.getCore().getUpdateProcessingChain is null; even so, a chain containing two UpdateRequestProcessorFactory instances comes back. We know a Map may hold an entry with a null key, but where does this entry come from?
In fact the SolrCore constructor calls SolrCore.loadUpdateProcessorChains, and that is where the chains are loaded. In that code we find the prototypes of the two factories:
new RunUpdateProcessorFactory(),
new LogUpdateProcessorFactory().
In other words, when no chain is configured, these two are used as the default.
RunUpdateProcessorFactory shows what an update processor factory really is. It is the factory for RunUpdateProcessor, and since a factory's job is to produce instances, its most important method is getInstance(SolrQueryRequest, SolrQueryResponse, UpdateRequestProcessor); the last parameter is the next update request processor, which is where the "chain" comes from. So what is RunUpdateProcessor?
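Before answering that, here is a hedged sketch of a custom factory/processor pair written in the same style, to make the role of the next parameter concrete; the class names are invented for illustration and the imports reflect the Solr version discussed here:

import java.io.IOException;

import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Hypothetical factory: produces a processor that is linked to the next one in the chain.
public class TimestampLoggingProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new TimestampLoggingProcessor(next);
  }
}

class TimestampLoggingProcessor extends UpdateRequestProcessor {
  public TimestampLoggingProcessor(UpdateRequestProcessor next) {
    super(next); // the base class remembers "next"; that link is what makes this a chain
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    System.out.println("add seen at " + System.currentTimeMillis());
    super.processAdd(cmd); // delegates to next.processAdd(cmd) when next != null
  }
}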

(class diagram omitted)

RunUpdateProcessor is where adds, commits and deletes are actually carried out; each of its update methods takes a subclass of UpdateCommand as its argument, so ultimately every update request ends up being processed here. LogUpdateProcessorFactory is different: it is the factory for the logging processor (LogUpdateProcessor).

Next, the chain we obtained is used to build the processor:
UpdateRequestProcessor processor = processingChain.createProcessor(req, rsp);
Inside createProcessor every processor factory is invoked in turn, so the processor that is finally returned is wired to the whole series of processors.

Finally, let us see how the processor and the content streams are used to handle the update request.
for( ContentStream stream : req.getContentStreams() ) {
  Reader reader = stream.getReader();
  try {
    XMLStreamReader parser = inputFactory.createXMLStreamReader(reader);
    this.processUpdate( processor, parser );
  }
  finally {
    IOUtils.closeQuietly(reader);
  }
}
In the for loop each ContentStream is handled the same way: obtain its Reader, then use the XML tooling to build a parser over that Reader:
XMLStreamReader parser = inputFactory.createXMLStreamReader(reader);
and call this.processUpdate( processor, parser ); to process the update.

 

What does XmlUpdateRequestHandler's processUpdate( processor, parser ) do? Follow along.
Note the while (true) loop: inside it the parser (an XMLStreamReader) is read and processed repeatedly, driven by a switch statement:
int event = parser.next();
switch (event){
case XMLStreamConstants.END_DOCUMENT:
  parser.close();
  return;
case XMLStreamConstants.START_ELEMENT:
  String currTag = parser.getLocalName(); // the name of the current tag

  if (currTag.equals(ADD)) {
    ............
    // mainly sets some properties on addCmd
  }
  else if ("doc".equals(currTag)) {
    ......
    processor.processAdd(addCmd);
    ......
  }
  else if ( COMMIT.equals(currTag) || OPTIMIZE.equals(currTag)) {
    ......
    processor.processCommit( cmd );
    ......
  }
  else if (DELETE.equals(currTag)) {
    ......
    processDelete( processor, parser);
    ......
  }
  break;
}

 

So as the XML is parsed, the handler looks at each tag it encounters and decides what to do with it. If the tag is add, a new command is created with addCmd = new AddUpdateCommand() and properties such as overwriteCommitted are set on it. If the tag is doc, the following code first clears addCmd and then reads a new document from the XML data into it:
addCmd.clear();
addCmd.solrDoc = readDoc( parser );
after which processor.processAdd(addCmd) is called; commit and delete are handled in much the same way.
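For reference, a minimal example (my own assumption, with made-up field names) of the kind of XML payload this loop walks through; the START_ELEMENT branches above fire on <add>, <doc>, <commit>/<optimize> and <delete> respectively:

public class UpdateXmlExamples {
  // Illustrative payloads only; real documents carry whatever fields the schema defines.
  static final String ADD_XML =
      "<add>"
    +   "<doc>"
    +     "<field name=\"id\">1</field>"
    +     "<field name=\"title\">hello solr</field>"
    +   "</doc>"
    + "</add>";
  static final String COMMIT_XML = "<commit/>";
  static final String DELETE_XML = "<delete><id>1</id></delete>";
}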

 

Now let us see what processor.processAdd(addCmd) does; once its nature is clear, commit and delete can be understood the same way.
Some simple trace output shows that the processAdd methods of three classes get called. RunUpdateProcessor's processAdd runs first; inside it the parent class (UpdateRequestProcessor) processAdd is called, and in the parent method we notice the line if (next != null) next.processAdd(cmd). It is exactly this call that makes LogUpdateProcessor's processAdd run, and it is the key to how the processing actually happens: building the chain earlier merely gave the processors the ability to be chained, while calling the parent class's processAdd is what makes the chain actually take effect.

 

Let us now follow the main line RunUpdateProcessor → UpdateRequestProcessor → LogUpdateProcessor step by step.
In RunUpdateProcessor's processAdd, the following line fills in cmd's doc field; in other words, the original SolrInputDocument is combined with the schema information to build a Lucene Document:
cmd.doc = DocumentBuilder.toDocument(cmd.getSolrInputDocument(), req.getSchema());
The Lucene document is then actually written with:
updateHandler.addDoc(cmd);
What is this updateHandler? Printing updateHandler.getClass().getName() quickly shows that it is org.apache.solr.update.DirectUpdateHandler2, the default update handler; a different one can be configured in the configuration file.
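As a side note, here is a hedged sketch of that SolrInputDocument-to-Lucene-Document step in isolation; the field names are assumptions for illustration:

import org.apache.lucene.document.Document;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.SolrCore;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.update.DocumentBuilder;

public class ToDocumentSketch {
  public static Document convert(SolrCore core) {
    SolrInputDocument sdoc = new SolrInputDocument();
    sdoc.addField("id", "1");
    sdoc.addField("title", "hello solr");

    IndexSchema schema = core.getSchema();           // supplies field types, analyzers, copyFields
    return DocumentBuilder.toDocument(sdoc, schema); // the same call RunUpdateProcessor makes
  }
}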
Let us step into that class and have a look.
A quick scan of the code shows that every kind of update ultimately ends up here, along with the two locks that so often cause headaches: the commit lock and the access lock. Setting all that aside for the moment, let us look at the addDoc method we care about.

 

 

We are now inside DirectUpdateHandler2's addDoc method; let us see what earth-shattering things happen there.
The first few lines do some bookkeeping (for example counting how many documents have been added) and allow duplicates when there is no ID field; we can skip that and start where the index-access lock is taken:
iwAccess.lock();
Once the lock is held, a try block contains the following code:
synchronized (this) {
  // adding document -- prep writer
  closeSearcher();
  openWriter();
  tracker.addedDocument();
}
What does this block do? One piece at a time.

closeSearcher()
Closes the searcher, so that no searcher is reading from the index files. I have tried to confirm this interpretation but have not found direct evidence yet, so I will leave it as an open question.

openWriter();
This goes through the following methods in turn:

DirectUpdateHandler2's openWriter()
UpdateHandler's createMainIndexWriter("DirectUpdateHandler2", false); the second parameter is removeAllExisting.
UpdateHandler's SolrIndexWriter(name, core.getIndexDir(), removeAllExisting, schema, core.getSolrConfig().mainIndexConfig); the parameters are the name, the index directory path, whether to remove everything that already exists, the object representing schema.xml, and the object holding the index-related configuration.
Finally Lucene's IndexWriter constructor is called.
The point of all this is to give the current DirectUpdateHandler2 instance a usable IndexWriter; note that this is now Lucene's IndexWriter.

tracker.addedDocument()
tracker is an instance of CommitTracker held as a field of DirectUpdateHandler2; I call it the commit tracker. The code comments describe it as a helper class for tracking auto-commits.

To keep the main thread of the discussion intact, let us set this class aside for now and look at the code that follows.
It reads the values of cmd.overwriteCommitted and cmd.overwritePending to decide how the document is added to the index: either an add that allows duplicate IDs or one that does not. The code is:
if (cmd.overwriteCommitted || cmd.overwritePending) {
  if (cmd.indexedId == null) {
    cmd.indexedId = getIndexedId(cmd.doc);
  }
  // unique-id add
  writer.updateDocument(idTerm.createTerm(cmd.indexedId), cmd.getLuceneDocument(schema));
} else {
  // allow duplicates
  writer.addDocument(cmd.getLuceneDocument(schema));
}

 

 

Now a look at Solr's commit tracker, starting with its constructor.
public CommitTracker() {
  docsSinceCommit = 0;
  pending = null;

  docsUpperBound = core.getSolrConfig().getInt("updateHandler/autoCommit/maxDocs", -1);
  timeUpperBound = core.getSolrConfig().getInt("updateHandler/autoCommit/maxTime", -1);

  SolrCore.log.info("AutoCommit: " + this);
}
docsSinceCommit records how many documents have been added since the last commit; pending is a ScheduledFuture, so this class clearly involves multiple threads. docsUpperBound and timeUpperBound are the commit thresholds read from the configuration file: the upper bound on documents and the upper bound on time.
Now the addedDocument method, which takes no parameters. Its first two lines increment the count of added documents and store the current system time in lastAddedTime, i.e. the time of the most recent add:
docsSinceCommit++;
lastAddedTime = System.currentTimeMillis();
Next comes an if statement with the condition docsUpperBound > 0 && (docsSinceCommit > docsUpperBound): when a document upper bound is configured and the number of documents since the last commit has exceeded it, the following happens: if a delayed commit is pending, that delayed task is cancelled, and pending is given a new task (the CommitTracker itself). In essence this triggers a commit as soon as the number of documents exceeds the allowed limit.
After that a time-triggered commit task is scheduled, i.e. a commit fires once the configured time has elapsed.
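Here is a deliberately simplified, hedged sketch of that bookkeeping pattern (it is not the real CommitTracker source; the thresholds and names are assumptions):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class AutoCommitSketch implements Runnable {
  private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
  private final int docsUpperBound = 1000;   // stands in for updateHandler/autoCommit/maxDocs
  private final long timeUpperBound = 60000; // stands in for updateHandler/autoCommit/maxTime (ms)
  private ScheduledFuture<?> pending;
  private long docsSinceCommit;

  public synchronized void addedDocument() {
    docsSinceCommit++;
    if (docsUpperBound > 0 && docsSinceCommit > docsUpperBound) {
      if (pending != null) pending.cancel(false);                   // drop the delayed commit
      pending = scheduler.schedule(this, 0, TimeUnit.MILLISECONDS); // commit right away
    } else if (timeUpperBound > 0 && pending == null) {
      pending = scheduler.schedule(this, timeUpperBound, TimeUnit.MILLISECONDS);
    }
  }

  @Override
  public synchronized void run() {
    try {
      System.out.println("commit after " + docsSinceCommit + " docs"); // stands in for commit(command)
      docsSinceCommit = 0;
    } finally {
      pending = null; // always cleared, whether the commit succeeded or failed
    }
  }
}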

Since the Runnable passed to schedule is this, i.e. the commit tracker itself, we need to look at its run method, which contains the entire commit process.

CommitUpdateCommand command = new CommitUpdateCommand( false );
command.waitFlush = true;
command.waitSearcher = true;
//no need for command.maxOptimizeSegments = 1; since it is not optimizing
commit( command );
autoCommitCount++;

It first creates a commit-update command object, passing false to the constructor, meaning optimize is false. It then sets the two properties waitFlush and waitSearcher to true, calls commit on the command object, and finally increments the auto-commit counter. Whether the auto-commit succeeds or fails, the finally block sets pending back to null, so pending no longer tracks the scheduled task.

The last few lines check whether new content has been added in the meantime (by testing lastAddedTime > started); if so, the same cycle repeats:

1. If the number of documents added exceeds the allowed upper bound, an auto-commit task is started.

2. If the time upper bound is greater than 0, an auto-commit task is also scheduled, with that upper bound as the delay.

So the real substance of run is the commit(command) method, and we need to know what it does. Going through the code in order:

if (cmd.optimize) {
  optimizeCommands.incrementAndGet();
} else {
  commitCommands.incrementAndGet();
}

Depending on whether this is an optimize, either the optimize counter or the ordinary commit counter is incremented.

————————————————————

Future[] waitSearcher = null;
if (cmd.waitSearcher) {
  waitSearcher = new Future[1];
}

A Future array reference is declared empty, and depending on waitSearcher (which decides whether to wait for the new Searcher or to continue immediately) an actual one-element array is allocated. If it is allocated, code further down will use waitSearcher to wait for the work it cares about to finish before moving on.

iwCommit.lock();
try {
log.info("start "+cmd);

if (cmd.optimize) {
closeSearcher();
openWriter();
writer.optimize(cmd.maxOptimizeSegments);
}

closeSearcher();
closeWriter();

callPostCommitCallbacks();
if (cmd.optimize) {
callPostOptimizeCallbacks();
}
// open a new searcher in the sync block to avoid opening it
// after a deleteByQuery changed the index, or in between deletes
// and adds of another commit being done.
core.getSearcher(true,false,waitSearcher);

// reset commit tracking
tracker.didCommit();

log.info("end_commit_flush");

error=false;
}
finally {
iwCommit.unlock();
//TEST
log.info("commit lock unlocked 2");
addCommands.set(0);
deleteByIdCommands.set(0);
deleteByQueryCommands.set(0);
numErrors.set(error ? 1 : 0);
}
The lock here synchronizes the code that follows it. If an optimize is requested, the associated searcher is closed, the index writer is opened, and writer.optimize(cmd.maxOptimizeSegments) is called; that method reduces the number of index segments to no more than maxOptimizeSegments.

closeSearcher();
closeWriter();

In case the if above did not run (i.e. no optimize took place) and the searcher and writer were therefore not closed, they are closed again here to make sure it really happens.

callPostCommitCallbacks();
if (cmd.optimize) {
callPostOptimizeCallbacks();
}

I do not yet know what this block is for, so let us leave it alone for now.

core.getSearcher(true,false,waitSearcher);
This clearly re-obtains the searcher, and it is the very essence of the commit: committing means making the preceding updates take effect, which means rebuilding the searcher, because all query results come from the searcher. Passing waitSearcher into getSearcher here also lays the groundwork for the later wait until the searcher has actually been obtained.

tracker.didCommit(); merely does the reset work, and the finally block likewise just resets a few variables.

Finally, this piece of code:

if (waitSearcher!=null && waitSearcher[0] != null) {
try {
waitSearcher[0].get();
} catch (InterruptedException e) {
SolrException.log(log,e);
} catch (ExecutionException e) {
SolrException.log(log,e);
}
}

This is the payoff of the groundwork laid above: waitSearcher is checked to find out whether the thread building the searcher has finished; if not, we wait, and once it is done we carry on.
We now know that a commit means making the updates visible to queries, i.e. rebuilding the searcher, so let us see how the searcher is re-obtained, which is exactly what this small line in DirectUpdateHandler2's commit method does: core.getSearcher(true,false,waitSearcher).
Here is what the parameters of SolrCore's getSearcher(boolean forceNew, boolean returnSearcher, final Future[] waitSearcher) mean.
forceNew: whether to force a new searcher to be built; an existing one could be reused, hence the flag. (For a commit this is of course true.)
returnSearcher: whether a Searcher needs to be returned. (For a commit false is enough, since there is no need to return the new one.)
waitSearcher: a Future is placed here that does not return until a searcher has been obtained; it is what provides the wait-for-searcher behaviour.

 

Now to the code. It opens with a synchronized block, annotated as follows:
// it may take some time to open an index.... we may need to make
// sure that two threads aren't trying to open one at the same time
// if it isn't necessary.
In other words: opening an index can take some time, so unless it is really necessary we must make sure that two threads do not try to open one at the same time.
synchronized (searcherLock) {
// Is there a searcher that can be returned directly? Precondition: a new searcher is not being forced.
if (_searcher!=null && !forceNew) {...
}

// Is a searcher currently being built? If so, wait for it.
if (onDeckSearchers>0 && !forceNew && _searcher==null) {
try {
searcherLock.wait();
} catch (InterruptedException e) {
log.info(SolrException.toStr(e));
}
}

// After the wait above, check once more whether a usable searcher can be returned.
if (_searcher!=null && !forceNew) {...
}

At this point there is nothing that can be returned directly, or a new searcher is being forced, so the build has to start.
First a signal is set to tell other threads that a new searcher is being built, starting with a sanity check on onDeckSearchers.
onDeckSearchers++;
if (onDeckSearchers < 1) {
// onDeckSearchers should never be below 1 here, since it was just incremented;
// this check is purely a defensive habit and should never actually trigger.
log.severe(logid+"ERROR!!! onDeckSearchers is " + onDeckSearchers);
onDeckSearchers=1; // reset
} else if (onDeckSearchers > maxWarmingSearchers) {
// more warming searchers than allowed: throw an error
} else if (onDeckSearchers > 1) { // only a warning, since building several searchers at once is not something to do casually
log.info(logid+"PERFORMANCE WARNING: Overlapping onDeckSearchers=" + onDeckSearchers);
}
}

If building the searcher were a construction project, the steps above are its feasibility, environmental-impact and resource-consumption assessments; now the real work begins.
First a new searcher is built. If that fails, the failure is handled inside a synchronized block (onDeckSearchers is decremented); there is nothing more to do after a failure, so an error is thrown and we return. If it succeeds, a series of further steps is still needed. In the code below, tmp = new SolrIndexSearcher.... is what obtains the searcher; it may fail, and even when it succeeds more work remains. Remember that the return type is RefCounted<SolrIndexSearcher> rather than SolrIndexSearcher, and that the freshly built searcher is still "bare", with nothing in it yet.
SolrIndexSearcher tmp;
try {
tmp = new SolrIndexSearcher(this, schema, "main", IndexReader.open(FSDirectory.getDirectory(getIndexDir()), true), true, true);
} catch (Throwable th) {
synchronized(searcherLock) {
onDeckSearchers--;
// notify another waiter to continue... it may succeed
// and wake any others.
searcherLock.notify();
}
// need to close the searcher here??? we shouldn't have to.
throw new RuntimeException(th);
}
For reference, here is the SolrIndexSearcher constructor; the parameter names make the meaning obvious.

SolrIndexSearcher(SolrCore core, IndexSchema schema, String name, IndexReader r, boolean closeReader, boolean enableCache);

Assume the new searcher was built successfully, i.e. tmp really does hold a searcher; further processing is then needed. Note that in the success case onDeckSearchers has not been decremented and searcherLock.notify() has not been called yet, so we can expect both to show up somewhere in the processing below.
What does that processing include?
final SolrIndexSearcher newSearcher=tmp; // from now on newSearcher refers to the newly built searcher

newSearchHolder=newHolder(newSearcher); // wrap the new searcher in newSearchHolder so that it is reference counted

Depending on whether returnSearcher is true, newSearchHolder.incref() may be called, i.e. its reference count may be incremented.

The following three statements record whether the on-deck searcher count still needs to be decremented and whether registration has already been completed. If a later check finds decrementOnDeckCount[0]==true, the decrement still has to be performed.

final boolean[] decrementOnDeckCount=new boolean[1];
decrementOnDeckCount[0]=true;
boolean alreadyRegistered = false;

Next: A. If _searcher is null and useColdSearcher is true (meaning there is no existing _searcher and the configuration says to use a "cold" searcher), newSearchHolder is registered (registration already performs onDeckSearchers-- and searcherLock.notifyAll()), and decrementOnDeckCount[0]=false; alreadyRegistered = true; (so later checks of these two variables show that registration is done and the build count has been decremented). B. If _searcher is not null, then currSearcherHolder=_searcher; currSearcherHolder.incref(); (i.e. currSearcherHolder takes over the existing _searcher).

The next line, currSearcher = currSearcherHolder==null ? null : currSearcherHolder.get();, essentially makes currSearcher hold the contents of the existing _searcher.

At this point we can be sure that if the old _searcher was not null then currSearcher has content, and if the old _searcher was null then currSearcher is null too. The processing that follows is:

A. If currSearcher is not null (remember it came from _searcher), it is used to warm newSearcher.

B. If currSearcher is null, there is no source to warm from directly; if a listener for the very first searcher (firstSearcherListeners, for the case where no searcher existed before) is configured, that listener is given newSearcher to handle.

C. If currSearcher is not null, i.e. the old _searcher had content, and a listener for newly built searchers (newSearcherListeners) is configured, that listener processes newSearcher based on currSearcher.

Finally there is a check for whether registration has already happened: the steps above only registered newSearcher when _searcher was null, so when _searcher was not null, newSearcher has most likely not been registered yet and is registered here.

All that remains is to decide whether to return the searcher, and whether to tie the Future associated with the searcher build to the waitSearcher array that was passed in.

Concepts such as the listeners (newSearcherListeners and firstSearcherListeners) and searcher registration have not been explained yet.

The goal now is to understand searcher registration in Solr 1.5, the new-searcher listener, the first-searcher listener, and what warming actually means.

 

 

Registering the searcher

SolrCore::registerSearcher(RefCounted<SolrIndexSearcher> newSearcherHolder)
Purpose: replace _searcher (a field of SolrCore) with newSearcherHolder; in addition, the following code inside the method registers the searcher held by newSearcherHolder.
SolrIndexSearcher newSearcher = newSearcherHolder.get();
newSearcher.register(); //

What does newSearcher.register() do? First the code:
core.getInfoRegistry().put("searcher", this);
core.getInfoRegistry().put(name, this);
for (SolrCache cache : cacheList) {
cache.setState(SolrCache.State.LIVE);
core.getInfoRegistry().put(cache.name(), cache);
}
registerTime=System.currentTimeMillis();
1. Add the searcher to the core's info registry under the key "searcher".
2. Add the searcher to the core's info registry under the searcher's own name.
3. Add every cache associated with this searcher to the core's info registry under the cache's name, marking it LIVE.
4. Record the registration time of the current searcher.

  • firstSearcherListeners

SolrCore holds a list of event listeners (SolrEventListener) called firstSearcherListeners.
SolrEventListener is an interface; although it has three methods, init(NamedList args), postCommit() and newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher), only newSearcher concerns us here. Its class hierarchy is shown below:

(class diagram omitted)

Both kinds of listener are configured in solrconfig.xml. What if they are not configured? One would expect a default, but it turns out that without any configuration this listening step simply does not happen; nothing at all is done.

  • newSearcherListeners

Analogous to firstSearcherListeners, so no further detail is needed.

  • Warming

SolrIndexSearcher::warm(SolrIndexSearcher old)
Warming copies the contents of the old searcher into the searcher the method is called on. Each searcher normally has three caches, filterCache, queryResultCache and documentCache: the filter cache, the query result cache and the document cache.
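To tie the listener and warming ideas together, here is a hedged sketch of a newSearcher listener that warms the freshly opened searcher from the one currently serving queries; the class name is invented, and the interface methods are the three listed above:

import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.search.SolrIndexSearcher;

public class WarmFromOldSearcherListener implements SolrEventListener {
  public void init(NamedList args) {
    // read listener arguments supplied in solrconfig.xml, if any
  }

  public void postCommit() {
    // not needed for warming
  }

  public void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher) {
    if (currentSearcher != null) {
      try {
        newSearcher.warm(currentSearcher); // regenerate cache entries from the old searcher
      } catch (Exception e) {
        // a warming failure should not stop the new searcher from being registered
      }
    }
  }
}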

solr-wiki: Solr Distributed Indexing

Solr is a standalone enterprise search server that exposes a Web-service-like API. Users submit XML documents in a prescribed format to the server over HTTP to build the index, and issue HTTP GET requests to search, receiving the results back as XML.
Solr is a high-performance full-text search server, written for Java 5 and built on Lucene. It extends Lucene with a richer query language, is configurable and extensible, optimizes query performance, and ships with a complete administration interface, which makes it an excellent full-text search engine.

What is Distributed Search?

When an index grows so large that a single machine can no longer meet its disk requirements, or a single simple query takes too long, Solr's distributed indexing can be used. In a distributed setup the original large index is split into several smaller indexes ("smaller" only relative to the whole; each piece is not necessarily small). Solr merges the results returned by these smaller indexes and returns the combined result to the client.

If individual queries are already handled quickly and you simply want to raise the throughput of the whole search system, have a look at http://wiki.apache.org/solr/CollectionDistribution instead.

 

Distributed Searching

The "shards" parameter causes a request to be distributed to the shards it lists.

Format of shards: host:port/base_url[,host:port/base_url]*
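As an illustration, here is a hedged SolrJ sketch of issuing such a distributed query; the host names are placeholders, and CommonsHttpSolrServer is assumed to be the SolrJ client of the Solr version discussed here:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedQuerySketch {
  public static void main(String[] args) throws Exception {
    // Any one node can aggregate the distributed query; the hosts below are placeholders.
    CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("ipod solr");
    q.set("shards", "localhost:8983/solr,localhost:7574/solr"); // host:port/base_url[,...]
    QueryResponse rsp = solr.query(q);
    System.out.println("hits: " + rsp.getResults().getNumFound());
  }
}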

At present only query requests can be distributed. The components able to handle such distributed requests include the standard request handler, its subclasses, and any other handler built from components that support distribution.

 

The components that currently support distributed queries are:

  • the query component, which returns a document set for a request
  • the facet component, which at the moment only supports sortable numeric fields; Solr 1.4 will add support for string fields
  • the highlighting component
  • the debug component.

For writing your own distributed search components, see WritingDistributedSearchComponents.

 

Distributed Searching Limitations

  • Indexed documents must have a unique key.
  • When documents with the same id are found, Solr keeps the first one and discards the rest.
  • No distributed idf.
  • The QueryElevationComponent ("sponsored results") is not supported.
  • The index may change during processing, i.e. between stages, so when the index is updated the search results may not match the index exactly.
  • Date faceting is not currently supported (it will be in Solr 1.4).
  • Only sortable fields can be faceted on.
  • The number of shards is limited by the GET method: most web servers only accept GET requests of around 4,000 characters.
  • Efficiency is poor when the "start" parameter is large. For example, a request with start=500000 and rows=25 against shards of 500,000 documents each causes on the order of 500,000 records to be transferred over the network.

Distributed Deadlock

Each shard may receive top-level query requests and may also send secondary requests to the other shards. Be careful here: the number of HTTP request threads the servlet container can handle must be greater than the total number of top-level and secondary requests a shard can receive; if it is not configured that way, a deadlock caused by the request fan-out can easily occur.

Consider the simplest possible case: two shards, each with only one thread available for handling HTTP requests. Both can receive top-level requests and forward sub-requests to the other. Since neither has a spare thread, the servlet container blocks the incoming sub-requests until the requests currently being processed finish; but those requests never finish, because they are waiting for the responses to the sub-requests they sent out, and those sub-requests are blocked.

As for the deadlock, the approach that comes to my mind is to set a timeout on the distributed sub-requests; that should not be hard to implement.

 

Distributed Indexing

How the documents are split across shards is entirely up to the user. A very simple way to decide which shard a record belongs on is a formula such as uniqueId.hashCode() % numServers.
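A minimal, hedged sketch of that routing rule follows; Math.abs guards against negative hashCode values, and the server URLs are placeholders:

public class ShardRouter {
  private final String[] servers = {
      "http://localhost:8983/solr",
      "http://localhost:7574/solr"
  };

  public String shardFor(String uniqueId) {
    int idx = Math.abs(uniqueId.hashCode() % servers.length);
    return servers[idx]; // the same id always routes to the same shard
  }

  public static void main(String[] args) {
    ShardRouter router = new ShardRouter();
    System.out.println(router.shardFor("doc-42"));
  }
}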

Example

For testing purposes we start Solr on two different ports of the same machine.

#make a copy
cd solr
cp -r example example7574
#change the port number
perl -pi -e s/8983/7574/g example7574/etc/jetty.xml example7574/exampledocs/post.sh
#in window 1, start up the server on port 8983
cd example
java -server -jar start.jar
#in window 2, start up the server on port 7574
cd example7574
java -server -jar start.jar
#in window 3, index some example documents to each server
cd example/exampledocs
./post.sh [a-m]*.xml
cd ../../example7574/exampledocs
./post.sh [n-z]*.xml
#now do a distributed search across both servers with your browser or curl
curl 'http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=ipod+solr'

Solr Index Replication


Source: https://cwiki.apache.org/confluence/display/solr/Index+Replication

The Lucene index format has changed with Solr 4. As a result, once you upgrade, previous versions of Solr will no longer be able to read the rest of your indices. In a master/slave configuration, all searchers/slaves should be upgraded before the master. If the master is updated first, the older searchers will not be able to read the new index format.

Index Replication distributes complete copies of a master index to one or more slave servers. The master server continues to manage updates to the index. All querying is handled by the slaves. This division of labor enables Solr to scale to provide adequate responsiveness to queries against large search volumes.

The figure below shows a Solr configuration using index replication. The master server's index is replicated on the slaves.


A Solr index can be replicated across multiple slave servers, which then process requests.

Index Replication in Solr

Solr includes a Java implementation of index replication that works over HTTP.

For information on the ssh/rsync based replication, see Index Replication using ssh and rsync.

The Java-based implementation of index replication offers these benefits:

  • Replication without requiring external scripts
  • The configuration affecting replication is controlled by a single file, solrconfig.xml
  • Supports the replication of configuration files as well as index files
  • Works across platforms with same configuration
  • No reliance on OS-dependent hard links
  • Tightly integrated with Solr; an admin page offers fine-grained control of each aspect of replication
  • The Java-based replication feature is implemented as a RequestHandler. Configuring replication is therefore similar to any normal RequestHandler.

Replication Terminology

The table below defines the key terms associated with Solr replication.

Collection: A Lucene collection is a directory of files. These files make up the indexed and returnable data of a Solr search repository.
Distribution: The copying of a collection from the master server to all slaves. The distribution process takes advantage of Lucene's index file structure.
Inserts and Deletes: As inserts and deletes occur in the collection, the directory remains unchanged. Documents are always inserted into newly created files. Documents that are deleted are not removed from the files. They are flagged in the file, deletable, and are not removed from the files until the collection is optimized.
Master and Slave: The Solr distribution model uses the master/slave model. The master is the service which receives all updates initially and keeps everything organized. Solr uses a single update master server coupled with multiple query slave servers. All changes (such as inserts, updates, deletes, etc.) are made against the single master server. Changes made on the master are distributed to all the slave servers which service all query requests from the clients.
Update: An update is a single change request against a single Solr instance. It may be a request to delete a document, add a new document, change a document, delete all documents matching a query, etc. Updates are handled synchronously within an individual Solr instance.
Optimization: A process that compacts the index and merges segments in order to improve query performance. New secondary segment(s) are created to contain documents inserted into the collection after it has been optimized. A Lucene collection must be optimized periodically to maintain satisfactory query performance. Optimization is run on the master server only. An optimized index will give you a performance gain at query time of at least 10%. This gain may be more on an index that has become fragmented over a period of time with many updates and no optimizations. Optimizations require a much longer time than does the distribution of an optimized collection to all slaves.
Segments: The number of files in a collection.
mergeFactor: A parameter that controls the number of files (segments) in a collection. For example, when mergeFactor is set to 3, Solr will fill one segment with documents until the limit maxBufferedDocs is met, then it will start a new segment. When the number of segments specified by mergeFactor is reached--in this example, 3--then Solr will merge all the segments into a single index file, then begin writing new documents to a new segment.
Snapshot: A directory containing hard links to the data files. Snapshots are distributed from the master server when the slaves pull them, "smartcopying" the snapshot directory that contains the hard links to the most recent collection data files.

Configuring the Replication RequestHandler on a Master Server

Before running a replication, you should set the following parameters on initialization of the handler:

replicateAfter: String specifying action after which replication should occur. Valid values are commit, optimize, or startup. There can be multiple values for this parameter. If you use "startup", you need to have a "commit" and/or "optimize" entry also if you want to trigger replication on future commits or optimizes.
backupAfter: String specifying action after which a backup should occur. Valid values are commit, optimize, or startup. There can be multiple values for this parameter. It is not required for replication, it just makes a backup.
maxNumberOfBackups: Integer specifying how many backups to keep. This can be used to delete all but the most recent N backups.
confFiles: The configuration files to replicate, separated by a comma.
commitReserveDuration: If your commits are very frequent and your network is slow, you can tweak this parameter to increase the amount of time taken to download 5Mb from the master to a slave. The default is 10 seconds.

The example below shows how to configure the Replication RequestHandler on a master server.

<requestHandler name="/replication" class="solr.ReplicationHandler" >
<lst name="master">
<str name="replicateAfter">optimize</str>
<str name="backupAfter">optimize</str>
<str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
<str name="commitReserveDuration">00:00:10</str>
</lst>
<int name="maxNumberOfBackups">2</int>
</requestHandler>

Replicating solrconfig.xml

In the configuration file on the master server, include a line like the following:

<str name="confFiles">solrconfig_slave.xml:solrconfig.xml,x.xml,y.xml</str>

This ensures that the local configuration solrconfig_slave.xml will be saved as solrconfig.xml on the slave. All other files will be saved with their original names.

On the master server, the file name of the slave configuration file can be anything, as long as the name is correctly identified in the confFiles string; then it will be saved as whatever file name appears after the colon ':'.

Configuring the Replication RequestHandler on a Slave Server

The code below shows how to configure a ReplicationHandler on a slave.

<requestHandler name="/replication" class="solr.ReplicationHandler" >
<lst name="slave">
<!--fully qualified url for the replication handler of master. It is possible to pass on this as
a request param for the fetchindex command-->
<str name="masterUrl">http://remote_host:port/solr/corename/replication</str>
<!--Interval in which the slave should poll master .Format is HH:mm:ss . If this is absent slave does not
poll automatically.
But a fetchindex can be triggered from the admin or the http API -->
<str name="pollInterval">00:00:20</str>
<!-- THE FOLLOWING PARAMETERS ARE USUALLY NOT REQUIRED-->
<!--to use compression while transferring the index files. The possible values are internal|external
if the value is 'external' make sure that your master Solr has the settings to honor the
accept-encoding header.
See here for details: http://wiki.apache.org/solr/SolrHttpCompression
If it is 'internal' everything will be taken care of automatically.
USE THIS ONLY IF YOUR BANDWIDTH IS LOW . THIS CAN ACTUALLY SLOWDOWN REPLICATION IN A LAN-->
<str name="compression">internal</str>
<!--The following values are used when the slave connects to the master to download the index files.
Default values implicitly set as 5000ms and 10000ms respectively. The user DOES NOT need to specify
these unless the bandwidth is extremely low or if there is an extremely high latency-->
<str name="httpConnTimeout">5000</str>
<str name="httpReadTimeout">10000</str>
<!-- If HTTP Basic authentication is enabled on the master, then the slave can be
configured with the following -->
<str name="httpBasicAuthUser">username</str>
<str name="httpBasicAuthPassword">password</str>
</lst>
</requestHandler>
If you are not using cores, then you simply omit the corename parameter above in the masterUrl. To ensure that the URL is correct, just hit the URL with a browser. You must get a status OK response.

Setting Up a Repeater with the ReplicationHandler

A master may be able to serve only so many slaves without affecting performance. Some organizations have deployed slave servers across multiple data centers. If each slave downloads the index from a remote data center, the resulting download may consume too much network bandwidth. To avoid performance degradation in cases like this, you can configure one or more slaves as repeaters. A repeater is simply a node that acts as both a master and a slave.

  • To configure a server as a repeater, the definition of the Replication requestHandler in the solrconfig.xml file must include file lists of use for both masters and slaves.
  • Be sure to set the replicateAfter parameter to commit, even if replicateAfter is set to optimize on the main master. This is because on a repeater (or any slave), a commit is called only after the index is downloaded. The optimize command is never called on slaves.
  • Optionally, one can configure the repeater to fetch compressed files from the master through the compression parameter to reduce the index download time.

Here is an example of a ReplicationHandler configuration for a repeater:

<requestHandler name="/replication" class="solr.ReplicationHandler">
<lst name="master">
<str name="replicateAfter">commit</str>
<str name="confFiles">schema.xml,stopwords.txt,synonyms.txt</str>
</lst>
<lst name="slave">
<str name="masterUrl">http://master.solr.company.com:8983/solr/replication</str>
<str name="pollInterval">00:00:60</str>
</lst>
</requestHandler>

Commit and Optimize Operations

When a commit or optimize operation is performed on the master, the RequestHandler reads the list of file names which are associated with each commit point. This relies on the replicateAfter parameter in the configuration to decide which types of events should trigger replication.

commit: Triggers replication whenever a commit is performed on the master index.
optimize: Triggers replication whenever the master index is optimized.
startup: Triggers replication whenever the master index starts up.

The replicateAfter parameter can accept multiple arguments. For example:

<str name="replicateAfter">startup</str>
<str name="replicateAfter">commit</str>
<str name="replicateAfter">optimize</str>

Slave Replication

The master is totally unaware of the slaves. The slave continuously keeps polling the master (depending on the pollInterval parameter) to check the current index version of the master. If the slave finds out that the master has a newer version of the index it initiates a replication process. The steps are as follows:

  • The slave issues a filelist command to get the list of the files. This command returns the names of the files as well as some metadata (for example, size, a lastmodified timestamp, an alias if any).
  • The slave checks with its own index if it has any of those files in the local index. It then runs the filecontent command to download the missing files. This uses a custom format (akin to the HTTP chunked encoding) to download the full content or a part of each file. If the connection breaks in between , the download resumes from the point it failed. At any point, the slave tries 5 times before giving up a replication altogether.
  • The files are downloaded into a temp directory, so that if either the slave or the master crashes during the download process, no files will be corrupted. Instead, the current replication will simply abort.
  • After the download completes, all the new files are moved to the live index directory and the file's timestamp is same as its counterpart on the master.
  • A commit command is issued on the slave by the Slave's ReplicationHandler and the new index is loaded.

Replicating Configuration Files

To replicate configuration files, list them using the confFiles parameter. Only files found in the conf directory of the master's Solr instance will be replicated.

Solr replicates configuration files only when the index itself is replicated. That means even if a configuration file is changed on the master, that file will be replicated only after there is a new commit/optimize on master's index.

Unlike the index files, where the timestamp is good enough to figure out if they are identical, configuration files are compared against their checksum. The schema.xml files (on master and slave) are judged to be identical if their checksums are identical.

As a precaution when replicating configuration files, Solr copies configuration files to a temporary directory before moving them into their ultimate location in the conf directory. The old configuration files are then renamed and kept in the same conf/ directory. The ReplicationHandler does not automatically clean up these old files.

If a replication involved downloading of at least one configuration file, the ReplicationHandler issues a core-reload command instead of a commit command.

Resolving Corruption Issues on Slave Servers

If documents are added to the slave, then the slave is no longer in sync with its master. However, the slave will not undertake any action to put itself in sync, until the master has new index data. When a commit operation takes place on the master, the index version of the master becomes different from that of the slave. The slave then fetches the list of files and finds that some of the files present on the master are also present in the local index but with different sizes and timestamps. This means that the master and slave have incompatible indexes. To correct this problem, the slave then copies all the index files from master to a new index directory and asks the core to load the fresh index from the new directory.

HTTP API Commands for the ReplicationHandler

You can use the HTTP commands below to control the ReplicationHandler's operations.

http://_master_host_:_port_/solr/replication?command=enablereplication
Enables replication on the master for all its slaves.

http://_master_host_:_port_/solr/replication?command=disablereplication
Disables replication on the master for all its slaves.

http://_host_:_port_/solr/replication?command=indexversion
Returns the version of the latest replicatable index on the specified master or slave.

http://_slave_host_:_port_/solr/replication?command=fetchindex
Forces the specified slave to fetch a copy of the index from its master.

If you like, you can pass an extra attribute such as masterUrl or compression (or any other parameter which is specified in the <lst name="slave"> tag) to do a one time replication from a master. This obviates the need for hard-coding the master in the slave.

http://_slave_host_:_port_/solr/replication?command=abortfetch
Aborts copying an index from a master to the specified slave.

http://_slave_host_:_port_/solr/replication?command=enablepoll
Enables the specified slave to poll for changes on the master.

http://_slave_host_:_port_/solr/replication?command=disablepoll
Disables the specified slave from polling for changes on the master.

http://_slave_host_:_port_/solr/replication?command=details
Retrieves configuration details and current status.

http://host:port/solr/replication?command=filelist&indexversion=<index-version-number>
Retrieves a list of Lucene files present in the specified host's index. You can discover the version number of the index by running the indexversion command.

http://_master_host_:_port_/solr/replication?command=backup
Creates a backup on master if there are committed index data in the server; otherwise, does nothing. This command is useful for making periodic backups. The numberToKeep request parameter can be used with the backup command unless the maxNumberOfBackups initialization parameter has been specified on the handler, in which case maxNumberOfBackups is always used and attempts to use the numberToKeep request parameter will cause an error.

Index Replication using ssh and rsync

Solr supports ssh/rsync-based replication. This mechanism only works on systems that support removing open hard links.

Solr distribution is similar in concept to database replication. All collection changes come to one master Solr server. All production queries are done against query slaves. Query slaves receive all their collection changes indirectly — as new versions of a collection which they pull from the master. These collection downloads are polled for on a cron'd basis.

A collection is a directory of many files. Collections are distributed to the slaves as snapshots of these files. Each snapshot is made up of hard links to the files so copying of the actual files is not necessary when snapshots are created. Lucene only significantly rewrites files following an optimization command. Generally, once a file is written, it will change very little, if at all. This makes the underlying transport of rsync very useful. Files that have already been transferred and have not changed do not need to be re-transferred with the new edition of a collection.

The Snapshot and Distribution Process

Here are the steps that Solr follows when replicating an index:

  1. The snapshooter command takes snapshots of the collection on the master. It runs when invoked by Solr after it has done a commit or an optimize.
  2. The snappuller command runs on the query slaves to pull the newest snapshot from the master. This is done via rsync in daemon mode running on the master for better performance and lower CPU utilization over rsync using a remote shell program as the transport.
  3. The snapinstaller runs on the slave after a snapshot has been pulled from the master. This signals the local Solr server to open a new index reader, then auto-warming of the cache(s) begins (in the new reader), while other requests continue to be served by the original index reader. Once auto-warming is complete, Solr retires the old reader and directs all new queries to the newly cache-warmed reader.
  4. All distribution activity is logged and written back to the master to be viewable on the distribution page of its GUI.
  5. Old versions of the index are removed from the master and slave servers by a cron'd snapcleaner.

If you are building an index from scratch, distribution is the final step of the process.

Manual copying of index files is not recommended; however, running distribution commands manually (that is, not relying on crond to run them) is perfectly fine.

Snapshot Directories

Snapshots are stored in directories whose names follow this format: snapshot.yyyymmddHHMMSS

All the files in the index directory are hard links to the latest snapshot. This design offers these advantages:

  • The Solr implementation can keep multiple snapshots on each host without needing to keep multiple copies of index files that have not changed.
  • File copying from master to slave is very fast.
  • Taking a snapshot is very fast as well.

Solr Distribution Scripts

For the Solr distribution scripts, the name of the index directory is defined either by the environment variable data_dir in the configuration file solr/conf/scripts.conf or the command line argument -d. It should match the value used by the Solr server which is defined in solr/conf/solrconfig.xml.

All Solr collection distribution scripts are bundled in a Solr release and reside in the directory solr/src/scripts. It's recommended that you install the scripts in a solr/bin/ directory.

Collection distribution scripts create and prepare for distribution a snapshot of a search collection after each commit and optimize request if the postCommit and postOptimize event listener is configured in solrconfig.xml to execute snapshooter.

The snapshooter script creates a directory snapshot.<ts>, where <ts> is a timestamp in the format, yyyymmddHHMMSS. It contains hard links to the data files.

Snapshots are distributed from the master server when the slaves pull them, "smartcopying" the snapshot directory that contains the hard links to the most recent collection data files.

snapshooter: Creates a snapshot of a collection. Snapshooter is normally configured to run on the master Solr server when a commit or optimize happens. Snapshooter can also be run manually, but one must make sure that the index is in a consistent state, which can only be done by pausing indexing and issuing a commit.
snappuller: A shell script that runs as a cron job on a slave Solr server. The script looks for new snapshots on the master Solr server and pulls them.
snappuller-enable: Creates the file solr/logs/snappuller-enabled, whose presence enables snappuller.
snapinstaller: Installs the latest snapshot (determined by the timestamp) into place, using hard links (similar to the process of taking a snapshot). Then solr/logs/snapshot.current is written and scp'd (secure copied) back to the master Solr server. snapinstaller then triggers the Solr server to open a new Searcher.
snapcleaner: Runs as a cron job to remove snapshots more than a configurable number of days old or all snapshots except for the most recent n number of snapshots. Also can be run manually.
rsyncd-start: Starts the rsyncd daemon on the master Solr server which handles collection distribution requests from the slaves.
rsyncd daemon: Efficiently synchronizes a collection--between master and slaves--by copying only the files that actually changed. In addition, rsync can optionally compress data before transmitting it.
rsyncd-stop: Stops the rsyncd daemon on the master Solr server. The stop script then makes sure that the daemon has in fact exited by trying to connect to it for up to 300 seconds. The stop script exits with error code 2 if it fails to stop the rsyncd daemon.
rsyncd-enable: Creates the file solr/logs/rsyncd-enabled, whose presence allows the rsyncd daemon to run, allowing replication to occur.
rsyncd-disable: Removes the file solr/logs/rsyncd-enabled, whose absence prevents the rsyncd daemon from running, preventing replication.

For more information about usage arguments and syntax see the SolrCollectionDistributionScripts page on the Solr Wiki.

Solr Distribution-related Cron Jobs

The distribution process is automated through the use of cron jobs. The cron jobs should run under the user ID that the Solr server is running under.

snapcleaner: The snapcleaner job should be run out of cron on a regular basis to clean up old snapshots. This should be done on both the master and slave Solr servers. For example, the following cron job runs every day at midnight and cleans up snapshots 8 days and older:

0 0 * * * <solr.solr.home>/solr/bin/snapcleaner -D 7

Additional cleanup can always be performed on-demand by running snapcleaner manually.
snappuller, snapinstaller: On the slave Solr servers, snappuller should be run out of cron regularly to get the latest index from the master Solr server. It is a good idea to also run snapinstaller with snappuller back-to-back in the same crontab entry to install the latest index once it has been copied over to the slave Solr server.

For example, the following cron job runs every 5 minutes to keep the slave Solr server in sync with the master Solr server:

0,5,10,15,20,25,30,35,40,45,50,55 * * * * <solr.solr.home>/solr/bin/snappuller;<solr.solr.home>/solr/bin/snapinstaller

Modern cron allows this to be shortened to */5 * * * *....

Performance Tuning for Script-based Replication

Because fetching a master index uses the rsync utility, which transfers only the segments that have changed, replication is normally very fast. However, if the master server has been optimized, then rsync may take a long time, because many segments will have been changed in the process of optimization.

  • If replicating to multiple slaves consumes too much network bandwidth, consider the use of a repeater.
  • Make sure that slaves do not pull from the master so frequently that a previous replication is still running when a new one is started. In general, it's best to allow at least a minute for the replication process to complete. But in configurations with low network bandwidth or a very large index, even more time may be required.

Commit and Optimization

On a very large index, adding even a few documents and then running an optimize operation causes the complete index to be rewritten. This consumes a lot of disk I/O and impacts query performance. Optimizing a very large index may even involve copying the index twice and calling optimize at the beginning and at the end. If some documents have been deleted, the first optimize call will rewrite the index even before the second index is merged.

Optimization is an I/O intensive process, as the entire index is read and re-written in optimized form. Anecdotal data shows that optimizations on modest server hardware can take around 5 minutes per GB, although this obviously varies considerably with index fragmentation and hardware bottlenecks. We do not know what happens to query performance on a collection that has not been optimized for a long time. We do know that it will get worse as the collection becomes more fragmented, but how much worse is very dependent on the manner of updates and commits to the collection. The setting of the mergeFactor attribute affects performance as well. Dividing a large index with millions of documents into even as few as five segments may degrade search performance by as much as 15-20%.

While optimizing has many benefits, a rapidly changing index will not retain those benefits for long, and since optimization is an intensive process, it may be better to consider other options, such as lowering the merge factor (discussed in this Guide in the section on Configuring the Lucene Index Writers).

Distribution and Optimization

The time required to optimize a master index can vary dramatically. A small index may be optimized in minutes. A very large index may take hours. The variables include the size of the index and the speed of the hardware.

Distributing a newly optimized collection may take only a few minutes or up to an hour or more, again depending on the size of the index and the performance capabilities of network connections and disks. During optimization the machine is under load and does not process queries very well. Given a schedule of updates being driven a few times an hour to the slaves, we cannot run an optimize with every committed snapshot.

Copying an optimized collection means that the entire collection will need to be transferred during the next snappull. This is a large expense, but not nearly as huge as running the optimize everywhere. Consider this example: on a three-slave one-master configuration, distributing a newly-optimized collection takes approximately 80 seconds total. Rolling the change across a tier would require approximately ten minutes per machine (or machine group). If this optimize were rolled across the query tier, and if each collection being optimized were disabled and not receiving queries, a rollout would take at least twenty minutes and potentially as long as an hour and a half. Additionally, the files would need to be synchronized so that the following rsync, snappull would not think that the independently optimized files were different in any way. This would also leave the door open to independent corruption of collections instead of each being a perfect copy of the master.

Optimizing on the master allows for a straight-forward optimization operation. No query slaves need to be taken out of service. The optimized collection can be distributed in the background as queries are being normally serviced. The optimization can occur at any time convenient to the application providing collection updates.

Nodes, Cores, Clusters and Leaders

Nodes and Cores

In SolrCloud, a node is a Java Virtual Machine instance running Solr, commonly called a server. Each Solr core can also be considered a node. Any node can contain both an instance of Solr and various kinds of data.

A Solr core is basically an index of the text and fields found in documents. A single Solr instance can contain multiple "cores", which are separate from each other based on local criteria. It might be that they are going to provide different search interfaces to users (customers in the US and customers in Canada, for example), or they have security concerns (some users cannot have access to some documents), or the documents are really different and just won't mix well in the same index (a shoe database and a dvd database).

When you start a new core in SolrCloud mode, it registers itself with ZooKeeper. This involves creating an Ephemeral node that will go away if the Solr instance goes down, as well as registering information about the core and how to contact it (such as the base Solr URL, core name, etc). Smart clients and nodes in the cluster can use this information to determine who they need to talk to in order to fulfill a request.

New Solr cores may also be created and associated with a collection via CoreAdmin. Additional cloud-related parameters are discussed in the Parameter Reference page. Terms used for the CREATE action are:

  • collection: the name of the collection to which this core belongs. Default is the name of the core.
  • shard: the shard id this core represents. (Optional: normally you want to be auto assigned a shard id.)
  • collection.<param>=<value>: causes a property of <param>=<value> to be set if a new collection is being created. For example, use collection.configName=<configname> to point to the config for a new collection.

For example:

curl  'http://localhost:8983/solr/admin/cores?
action=CREATE&name=mycore&collection=collection1&shard=shard2'

Clusters

A cluster is set of Solr nodes managed by ZooKeeper as a single unit. When you have a cluster, you can always make requests to the cluster and if the request is acknowledged, you can be sure that it will be managed as a unit and be durable, i.e., you won't lose data. Updates can be seen right after they are made and the cluster can be expanded or contracted.

Creating a Cluster

A cluster is created as soon as you have more than one Solr instance registered with ZooKeeper. The section Getting Started with SolrCloud reviews how to set up a simple cluster.

Resizing a Cluster

Clusters contain a settable number of shards. You set the number of shards for a new cluster by passing a system property, numShards, when you start up Solr. The numShards parameter must be passed on the first startup of any Solr node, and is used to auto-assign which shard each instance should be part of. Once you have started up more Solr nodes than numShards, the nodes will create replicas for each shard, distributing them evenly across the node, as long as they all belong to the same collection.

To add more cores to your collection, simply start the new core. You can do this at any time and the new core will sync its data with the current replicas in the shard before becoming active.

You can also avoid numShards and manually assign a core a shard ID if you choose.

The number of shards determines how the data in your index is broken up, so you cannot change the number of shards of the index after initially setting up the cluster.

However, you do have the option of breaking your index into multiple shards to start with, even if you are only using a single machine. You can then expand to multiple machines later. To do that, follow these steps:

  1. Set up your collection by hosting multiple cores on a single physical machine (or group of machines). Each of these shards will be a leader for that shard.
  2. When you're ready, you can migrate shards onto new machines by starting up a new replica for a given shard on each new machine.
  3. Remove the shard from the original machine. ZooKeeper will promote the replica to the leader for that shard.

Leaders and Replicas

The concept of a leader is similar to that of master when thinking of traditional Solr replication. The leader is responsible for making sure the replicas are up to date with the same information stored in the leader.

However, with SolrCloud, you don't simply have one master and one or more "slaves", instead you likely have distributed your search and index traffic to multiple machines. If you have bootstrapped Solr with numShards=2, for example, your indexes are split across both shards. In this case, both shards are considered leaders. If you start more Solr nodes after the initial two, these will be automatically assigned as replicas for the leaders.

Replicas are assigned to shards in the order they are started the first time they join the cluster. This is done in a round-robin manner, unless the new node is manually assigned to a shard with the shardId parameter during startup. This parameter is used as a system property, as in -DshardId=1, the value of which is the ID number of the shard the new node should be attached to.

On subsequent restarts, each node joins the same shard that it was assigned to the first time the node was started (whether that assignment happened manually or automatically). A node that was previously a replica, however, may become the leader if the previously assigned leader is not available.

Consider this example:

  • Node A is started with the bootstrap parameters, pointing to a stand-alone ZooKeeper, with the numShards parameter set to 2.
  • Node B is started and pointed to the stand-alone ZooKeeper.

Nodes A and B are both shards, and have fulfilled the 2 shard slots we defined when we started Node A. If we look in the Solr Admin UI, we'll see that both nodes are considered leaders (indicated with a solid blank circle).

  • Node C is started and pointed to the stand-alone ZooKeeper.

Node C will automatically become a replica of Node A because we didn't specify any other shard for it to belong to, and it cannot become a new shard because we only defined two shards and those have both been taken.

  • Node D is started and pointed to the stand-alone ZooKeeper.

Node D will automatically become a replica of Node B, for the same reasons why Node C is a replica of Node A.

Upon restart, suppose that Node C starts before Node A. What happens? Node C will become the leader, while Node A becomes a replica of Node C.

 

Source: https://cwiki.apache.org/confluence/display/solr/Nodes%2C+Cores%2C+Clusters+and+Leaders

lucene query time join (relational search)

Query time join has existed in Solr for a while, but having it in Lucene itself is still good news: it gives us one more option, and relating documents to one another like this is very practical, since it enables a limited form of join queries.
It also lets you design the index structure so that when attributes change, only part of the index needs to be updated.

This is a bit awkward to explain in words, so let us look at the code instead:

final String idField = "id";
final String toField = "productId";

Directory dir = newDirectory();
RandomIndexWriter w = new RandomIndexWriter(
    random,
    dir,
    newIndexWriterConfig(TEST_VERSION_CURRENT,
        new MockAnalyzer(random)).setMergePolicy(newLogMergePolicy()));

// 0
Document doc = new Document();
doc.add(new Field("description", "random text", TextField.TYPE_STORED));
doc.add(new Field("name", "name1", TextField.TYPE_STORED));
doc.add(new Field(idField, "1", TextField.TYPE_STORED));
w.addDocument(doc);

// 1
doc = new Document();
doc.add(new Field("price", "10.0", TextField.TYPE_STORED));
doc.add(new Field(idField, "2", TextField.TYPE_STORED));
doc.add(new Field(toField, "1", TextField.TYPE_STORED));
w.addDocument(doc);

// 2
doc = new Document();
doc.add(new Field("price", "20.0", TextField.TYPE_STORED));
doc.add(new Field(idField, "3", TextField.TYPE_STORED));
doc.add(new Field(toField, "1", TextField.TYPE_STORED));
w.addDocument(doc);

// 3
doc = new Document();
doc.add(new Field("description", "more random text", TextField.TYPE_STORED));
doc.add(new Field("name", "name2", TextField.TYPE_STORED));
doc.add(new Field(idField, "4", TextField.TYPE_STORED));
w.addDocument(doc);
w.commit();

// 4
doc = new Document();
doc.add(new Field("price", "10.0", TextField.TYPE_STORED));
doc.add(new Field(idField, "5", TextField.TYPE_STORED));
doc.add(new Field(toField, "4", TextField.TYPE_STORED));
w.addDocument(doc);

// 5
doc = new Document();
doc.add(new Field("price", "20.0", TextField.TYPE_STORED));
doc.add(new Field(idField, "6", TextField.TYPE_STORED));
doc.add(new Field(toField, "4", TextField.TYPE_STORED));
w.addDocument(doc);

IndexSearcher indexSearcher = new IndexSearcher(w.getReader());
w.close();

// Search for product
Query joinQuery =
    JoinUtil.createJoinQuery(idField, false, toField, new TermQuery(new Term("name", "name2")), indexSearcher);

TopDocs result = indexSearcher.search(joinQuery, 10);
assertEquals(2, result.totalHits);
assertEquals(4, result.scoreDocs[0].doc);
assertEquals(5, result.scoreDocs[1].doc);

The call above,

JoinUtil.createJoinQuery(idField, false, toField, new TermQuery(new Term("name", "name2")), indexSearcher);

is the Lucene 4.0 form; at the time of writing 4.0 is still an alpha release.

In 3.6 there is no second boolean parameter. The 4.0 query time join is said to be about three times faster than the 3.6 one, though I have not verified that.