Dev:APIpush

Document Push API

Documents can be pushed directly to YaCy using the servlet at /api/push_p.json. The servlet parses and analyses every pushed document with the YaCy built-in document parser and writes it to the Solr index.

Call Attributes

This servlet must be called at /api/push_p.json with an HTTP POST request that uses multipart/form-data parameter encoding. It can also be called with GET, but this should only be done for test cases, because the purpose of the servlet is the transmission of large, binary data.

count An integer denoting the number of documents pushed within this POST request.
synchronous Possible values are 'true' and 'false'. If synchronous=true, then all documents within this POST request are written to the search index before the servlet returns an HTTP success code to the caller. If the attribute is not present or synchronous=false, then the submitted documents are handed to a concurrent indexing job whose throughput scales with the number of CPU cores of the YaCy search server; in this case the servlet returns an HTTP success code immediately.
commit Possible values are 'true' and 'false'. If commit=true, then a Solr commit command is appended to the processing of synchronous document writes. As a result, documents pushed with this interface are searchable as soon as the servlet returns an HTTP success code. The commit option is only useful together with synchronous=true; if synchronous=false and commit=true are given, the synchronous flag is switched to true as well.
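
The interaction between the two flags can be written down as a small helper. This is only a sketch of the rule described above, not code taken from YaCy; the function names effective_flags and call_fields are hypothetical.

```python
def effective_flags(synchronous=False, commit=False):
    """Return the (synchronous, commit) pair the servlet is described to act on."""
    # Rule from the attribute description: commit=true forces synchronous=true.
    if commit:
        synchronous = True
    return synchronous, commit


def call_fields(count, synchronous=False, commit=False):
    """Return the call-level form fields as (name, value) string pairs."""
    sync, com = effective_flags(synchronous, commit)
    return [
        ("count", str(count)),
        ("synchronous", "true" if sync else "false"),
        ("commit", "true" if com else "false"),
    ]
```

Sending the already-normalised flags makes the client-side intent explicit, although the servlet is described as applying the same rule on its own.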

The request must also carry a set of <count> documents with their metadata. The documents are numbered from 0 to <count>-1, and this document number X is part of the attribute field names for each document. In the following list, replace the character X with the number of the respective document:

url-X The URL which is used in a search result as link to the submitted document.
data-X This is the binary data of the document.
collection-X A name for a collection which is assigned to the document. This can be an arbitrary word or a comma-separated list of terms. These words will be listed in the collection navigation for a search facet (if that facet is switched on).
responseHeader-X An HTTP response header line. This can be used to submit all kinds of metadata which YaCy is able to process.

You should submit the Content-Type and Last-Modified header fields using the responseHeader-X post attribute. This would look like this:

responseHeader-X=Last-Modified:<Last-Modified-String> The <Last-Modified-String> is the date which is assigned to the document. The date format must follow RFC 1123, like "EEE, dd MMM yyyy HH:mm:ss Z", with a time zone indicator according to RFC 5322.
responseHeader-X=Content-Type:<Content-Type-String> The <Content-Type-String> is the MIME type of the document.
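
A date string of this shape can be produced with the Python standard library. The sketch below assumes that a numeric zone offset such as +0000 is what the Z placeholder stands for; the helper names last_modified_header and content_type_header are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def last_modified_header(dt):
    """Return a responseHeader-X value carrying the Last-Modified date."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes are meant as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    # format_datetime emits e.g. "Tue, 15 Nov 1994 12:45:26 +0000"
    return "Last-Modified:" + format_datetime(dt)


def content_type_header(mime_type):
    """Return a responseHeader-X value carrying the MIME type."""
    return "Content-Type:" + mime_type


example = last_modified_header(datetime(1994, 11, 15, 12, 45, 26, tzinfo=timezone.utc))
# example == "Last-Modified:Tue, 15 Nov 1994 12:45:26 +0000"
```

Passing usegmt=True to format_datetime yields the literal zone name GMT instead of +0000, which matches the form used in the test call further down this page.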

Because media-type documents do not have a textual component which can be used for searching, it is possible to attach a title and keywords to the media document as well. To do this, the extra HTTP header fields X-YaCy-Media-Title and X-YaCy-Media-Keywords can be used.

responseHeader-X=X-YaCy-Media-Title:<Title> The <Title> will be used as document title
responseHeader-X=X-YaCy-Media-Keywords:<Keywords> <Keywords> is a list of keywords, separated by space characters.
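
Putting the attributes together, a POST request could be assembled roughly as follows. This is a minimal sketch using only the Python standard library. It assumes that every attribute, including data-X, can be sent as a plain form field without a filename (the page does not say whether a file-upload part is expected), and it omits any authentication a peer may require. The function names document_fields, encode_multipart and push are hypothetical.

```python
import json
import urllib.request
import uuid


def document_fields(index, url, data, collection=None, header_lines=()):
    """Return the (name, bytes) form fields for document number `index`."""
    fields = [
        (f"url-{index}", url.encode("utf-8")),
        (f"data-{index}", data),  # raw document bytes
    ]
    if collection:
        fields.append((f"collection-{index}", collection.encode("utf-8")))
    # One responseHeader-X field per header line, e.g. "Content-Type:text/plain"
    for line in header_lines:
        fields.append((f"responseHeader-{index}", line.encode("utf-8")))
    return fields


def encode_multipart(fields, boundary=None):
    """Encode (name, bytes) pairs as a multipart/form-data body."""
    boundary = boundary or uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    chunks = []
    for name, value in fields:
        chunks.append(delimiter + b"\r\n")
        chunks.append(
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode("ascii")
        )
        chunks.append(value + b"\r\n")
    chunks.append(delimiter + b"--\r\n")
    return b"".join(chunks), "multipart/form-data; boundary=" + boundary


def push(base_url, documents):
    """POST a list of document dicts (keyword arguments of document_fields)."""
    fields = [("count", str(len(documents)).encode("ascii"))]
    for index, doc in enumerate(documents):
        fields.extend(document_fields(index, **doc))
    body, content_type = encode_multipart(fields)
    request = urllib.request.Request(
        base_url + "/api/push_p.json",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as push("http://localhost:8090", [{"url": "http://nowhere.cc/example.txt", "data": b"hello world", "collection": "testpush", "header_lines": ["Content-Type:text/plain"]}]) would then submit one document; the host and port are taken from the test call on this page.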

Performance

When called with synchronous=false and commit=false, this API may be the fastest way to insert raw document data into the YaCy search index. It is strongly recommended to avoid synchronous=true or commit=true to get the best performance, since the pushed documents are then parsed and indexed by a concurrent, self-scaling, multi-core capable application infrastructure. It is also possible to call the /api/push_p.json servlet concurrently from more than one client task. The concurrent indexing option with synchronous=false and commit=false is recommended for typical initial-load tasks, where a large set of documents is indexed only once.

If the API is called with synchronous=true and/or commit=true, then the processing time grows linearly with the number of documents in one call, with an additional delay when commit=true. Forcing a commit too frequently will cause strong fragmentation of the index, which will slow down search result processing as well.

Test Environment

To make it easier to test this servlet, it is also possible to call the API with GET. A typical call looks like:

http://localhost:8090/api/push_p.json?count=1&url-0=http://nowhere.cc/example.txt&data-0=hello world&responseHeader-0=Last-Modified:Tue, 15 Nov 1994 12:45:26 GMT&responseHeader-0=Content-Type:text/plain&collection-0=testpush

Using a command-line script, you would call e.g. (using wget)

wget -qO- "http://localhost:8090/api/push_p.json?count=1&url-0=http://nowhere.cc/example.txt&data-0=hello world&responseHeader-0=Last-Modified:Tue, 15 Nov 1994 12:45:26 GMT&responseHeader-0=Content-Type:text/plain&collection-0=testpush"
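
The test URL contains spaces, colons and commas, which some clients may not escape on their own. A percent-encoded version of the same call can be built as sketched below; the helper name build_test_url is hypothetical, and the parameter values are the ones from the example call.

```python
from urllib.parse import urlencode


def build_test_url(base_url, params):
    """Return the GET test URL with all parameter values percent-encoded."""
    return base_url + "/api/push_p.json?" + urlencode(params)


# A list of pairs (not a dict) because responseHeader-0 occurs twice.
params = [
    ("count", "1"),
    ("url-0", "http://nowhere.cc/example.txt"),
    ("data-0", "hello world"),
    ("responseHeader-0", "Last-Modified:Tue, 15 Nov 1994 12:45:26 GMT"),
    ("responseHeader-0", "Content-Type:text/plain"),
    ("collection-0", "testpush"),
]
test_url = build_test_url("http://localhost:8090", params)
print(test_url)
```

urlencode writes spaces as + signs, which is the usual form encoding for query strings.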

The servlet then returns a JSON result which reports how successful the transmission was. A typical result looks like

{
 "count":"1",
 "successall": "true",
 "item-0":{
   "item":"0",
   "url":"http://nowhere.cc/example.txt",
   "success": "true",
   "message": "http://localhost:8090/solr/select?q=sku:%22http://nowhere.cc/example.txt%22"
 },
 "countsuccess":1,
 "countfail":0
}

The "message" attribute contains a link to a Solr search result which shows the pushed document in indexed metadata format. If a push is not successful, the "success" attribute turns to "false" and the "message" field contains the reason for the failure.
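
A client could evaluate such a result along the following lines. The sketch relies only on the field names visible in the sample result above and assumes that the per-document entries are always keyed item-0, item-1 and so on; the function name summarize_push_result is hypothetical.

```python
import json


def summarize_push_result(text):
    """Return (all_ok, failures) for a push_p.json response body.

    failures is a list of (url, message) pairs for items whose
    "success" attribute is not "true".
    """
    result = json.loads(text)
    failures = []
    for key, item in result.items():
        if not key.startswith("item-") or not isinstance(item, dict):
            continue
        # The sample shows "success" as a string; str() also covers a boolean.
        if str(item.get("success")).lower() != "true":
            failures.append((item.get("url"), item.get("message")))
    all_ok = str(result.get("successall")).lower() == "true"
    return all_ok, failures


sample = """{
 "count":"1",
 "successall": "true",
 "item-0":{
   "item":"0",
   "url":"http://nowhere.cc/example.txt",
   "success": "true",
   "message": "http://localhost:8090/solr/select?q=sku:%22http://nowhere.cc/example.txt%22"
 },
 "countsuccess":1,
 "countfail":0
}"""
all_ok, failures = summarize_push_result(sample)
```

For the sample result, all_ok is True and failures is an empty list.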