Dev:Index

Aus YaCyWiki
Wechseln zu: Navigation, Suche

A main element in web-search is the index. The index in general maps search requests to websites, that deal with the topic of the request. Before one discusses different indexing structures, wanted properties of said indexes must be defined.

Thus this article does not primarily deal with the index currently used by yacy. Instead it will list properties that are essential and useful for a distributed web-search index. Those properties will be useful to improve to understand the problems needed to be solved. And thus improve the current index.

Essential Properties

  • Performance: Search requests must be processed in an acceptable amount of time. This has the following implications:
    • Ping times are long. This means it is not performant for a search request to sequentially traverse many peers. But it is performant to talk to many peers in parallel.
    • Amount of data fetched must not be too big. It is not performant, if a search request results in 1000000 hits and all get send to the searcher, because it causes great amounts of traffic and the searcher might even have to sort all the results.
    • Local index processing time: In most cases the internet is much slower than the local index. Thus we have time there fetching the data. But if we have to deal with many search requests in parallel this might become a serious issue. There might also be other processes requesting the index, like crawling.
  • Resource usage on peers: There is no infinite ram and there is no infinite hard disk space.
  • Redundant storage: Most peers are not online most time. Thus data must be stored redundantly. If some data gets more requests than others, there should be several peers online that can deal with these requests.

Useful Properties

  • Quality of search results. Good matching hits should come first. Thus results must be ranked.
    • The ranking of pages depends on the search request!
    • Criteria for the relevance of a search result:
      • occurrence of words in search request
      • relevant pages referencing to the page
      • relevant pages getting referenced by the page
      • amount of users visiting the page
      • amount of users visiting the page after searching for similar requests
  • Transferability of index data. It should be possible to easily transport parts of one's own index to other peers to prevent data loss.
  • Maintainability of the index. The index should be repairable in case it gets corrupt.
  • Prevention against misuse. People will try to influence the visibility of specific webpages.
  • Illegal data: If some sites contain illegal data for some peers, those peers must not store such data on their hard disk unencrypted.

Ranking ideas

Resources as Points in high dimensional Vector Space

In general each resource could be classified as a list of words and their relevance. Thus each resource can also be seen as a point in a vector space, where each dimension is specified by the relevance to one word. Correspondingly each search request is a subspace of said vector-space and the results get ranked by their distance to the subspace of the search request.

Unfortunately it is probably not possible to store the data efficiently in such a way, because there are only very few points in that vector space relative to the number of dimensions. And I believe it is not possible to circumvent exponential growth in the dimensions at some stage.

A way to circumvent the problem would be to reduce the number of dimensions, by clustering the words first and then doing refined searches.

Ranking only for one word

Resources get only ranked according to single words. This is rather easy, because the index can just map each word to a sorted list of results. It has the main disadvantage, that it does not give a ranked list of URLs when searched by more than one word. Those results must then be postprocessed, which may very time consuming, if many results were found (Google sometimes finds more than 231 results!).

Current implementation

Icon work.png TODO: This information is not complete and might even be wrong at some places, please correct!

The current implementation consists of RWIs and URLs. RWIs are "Reverse Word Index" indexes that map Word-Hashes onto URLs. The index is a two dimensional table (indexed by RWIs and URLs). This table is mapped in ring order onto the peers. This is useful, because if peers join or leave, the words required to be stored for each peer doesn't change much.

Additionally every peer knows immediately when starting a search request, which peers she must ask to get the results. Thus no unnecessary routing is performed. The only routing is by opening connections to further distant peers. All requests can be done in parallel, thus network-lags are reduced to a minimum.

Also every peer only stores the URLs of the results, so this results in little space consumption and no illegal data can be stored by the peers.

The local RWI-index is stored as a table. This table is accessed via a binary tree in RAM. This results in fast access times (but rather huge RAM consumption).

Parts of the local index can be send to other peers.

Disadvantages

  • URLs can only be ranked for each word seperately. Thus many search results must be acquired and then post-processed.
  • Words containing searched words do not get found.

References

Alternate Implementations

For alternate implementation ideas, please see the discussion article to this article.