The Roadmap is an ordered list of features and their dependencies.
- 1 Challenges
- 2 Generalization of Date
- 3 RWI RAM-Cache organized by Peer-Hashes
- 4 Control of Search results according to DHT transmission
- 5 Security checks for remote crawls
- 6 Autoupdate
- 7 License-Techniques / Secure Peer Hashes
- 8 kelondroXML
- 9 YaCyBlog
- 10 Crawler
- 11 Internet Cafe functions
- 12 Todo Wishlist
- 12.1 Language support
- 12.2 More configuration options
- 12.3 Filetype Support
- 12.4 SOCKS5 support
- 12.5 No Search Results
- 12.6 Next.. (and previous) buttons
- 12.7 Help pages for "more options"
- 12.8 Search filtering
- 12.9 Blacklist-subscriptions
- 13 implemented and ready for next Release
Challenges
All challenges listed here have been achieved!
- 1000 PPM on a single peer: can be done using local intranet indexing (there will be no forced pauses to prevent DoS)
- > 10000 PPM for all peers permanently (as seen in Sciencenet)
- Local search < 1 second for any search (single word or combinations), no blocking: we can do > 10 searches per second now!
- Global search < 3 seconds: this is the default.
- Global search works properly for combinations of words: mostly
- Parsing/indexing of very large files without OutOfMemory: well, where is an example?
Generalization of Date
Currently the date that is stored and propagated through principals in seeds is system-dependent. This prevents peers in other time zones from participating. GMT shall be used everywhere. Fix needed.
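The fix could look like this in Java; the compact yyyyMMddHHmmss pattern and the class name are illustrative assumptions, not YaCy's actual code:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch: format and parse seed dates in GMT so that peers in
// different time zones agree on the same timestamp.
public class SeedDate {
    private static final String PATTERN = "yyyyMMddHHmmss"; // assumed pattern

    public static String format(Date d) {
        SimpleDateFormat f = new SimpleDateFormat(PATTERN);
        f.setTimeZone(TimeZone.getTimeZone("GMT")); // never the local zone
        return f.format(d);
    }

    public static Date parse(String s) throws ParseException {
        SimpleDateFormat f = new SimpleDateFormat(PATTERN);
        f.setTimeZone(TimeZone.getTimeZone("GMT"));
        return f.parse(s);
    }
}
```

With both sides pinned to GMT, a round trip through format and parse is lossless regardless of the peer's local time zone.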
RWI RAM-Cache organized by Peer-Hashes
This will make it possible to use the RAM cache to select RWIs for DHT transmission. This will reduce I/O load, since it prevents the cache from being flushed to disk, reloaded, and deleted again during DHT transmission.
Control of Search results according to DHT transmission
We must monitor if the selected peers for a remote search are in fact the most effective peers. To do that, a new menu item (besides the Performance-Menu) is needed that shows how many RWIs each peer has sent. This information is already available in the log and simply needs to be presented.
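A sketch of how such a per-peer tally could be computed from the log; the log-line format used here is invented for illustration and does not match YaCy's real log output:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: tally how many RWIs each remote peer contributed to a
// search, as raw material for the proposed monitoring menu item.
public class RwiTally {
    // expects hypothetical lines like "SEARCH peer=<hash> rwis=<count>"
    public static Map<String, Integer> tally(String[] logLines) {
        Map<String, Integer> perPeer = new LinkedHashMap<>();
        for (String line : logLines) {
            if (!line.startsWith("SEARCH peer=")) continue;
            String[] parts = line.split("\\s+");
            String peer = parts[1].substring("peer=".length());
            int rwis = Integer.parseInt(parts[2].substring("rwis=".length()));
            perPeer.merge(peer, rwis, Integer::sum); // sum across searches
        }
        return perPeer;
    }
}
```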
Security checks for remote crawls
Remote peers must check assigned crawls to see if they are in fact the best choice for this crawl. This is necessary since the remote crawl shall load robots.txt; however, forced loading of this file could be misused to make remote peers initiate DDoS attacks.
Autoupdate
Automatic update of YaCy should not be too hard:
- 1. Download compressed new YaCy package
- 2. Uncompress to temporary location
- 3. Stop YaCy
- 4. Delete everything except DATA
- 5. Move new YaCy files from temp location to destination folder
- 6. Start YaCy
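Step 4 is the delicate one: everything may go except the peer's DATA directory. A minimal sketch of that deletion filter (class and method names are made up):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of step 4 of the update procedure: decide which top-level
// entries of the installation directory may be deleted. Only DATA
// (the peer's index and configuration) must survive the update.
public class UpdateCleanup {
    public static List<String> deletable(List<String> topLevelEntries) {
        List<String> out = new ArrayList<>();
        for (String name : topLevelEntries) {
            if (!name.equals("DATA")) out.add(name); // keep DATA, drop the rest
        }
        return out;
    }
}
```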
Open questions:
- 1. How to quit YaCy?
- 2. Where to get the new version from?
- 3. How to ensure a "clean" new version?
Possible answers:
- 1. The update program has to be a second program
- 2. a) YaCy shares (more complicated, but the better solution)
     b) a website (yacy.net, devbin.yacy-forum.de) (easier)
- 3. If 2a: get the MD5 sum from a reliable source and compare it with the update package. Source for the MD5: a trustworthy peer (secure peer hashes needed!) or a website again, but then you could take 2b directly. If 2b: these versions are clean.
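The MD5 comparison from point 3 could look like this (a sketch; class and method names are illustrative):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of step 3: compare the MD5 of the downloaded update package
// against a checksum obtained from a trusted source.
public class UpdateCheck {
    public static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static boolean verify(byte[] packageBytes, String trustedMd5)
            throws NoSuchAlgorithmException {
        // case-insensitive: checksum files mix upper and lower case
        return md5Hex(packageBytes).equalsIgnoreCase(trustedMd5);
    }
}
```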
Positive effects are obvious:
- no outdated peers
- less work for peer owners
- no newbie questions
- always highest security/performance for peer owners
All points open to discussion (in forum, please).
License-Techniques / Secure Peer Hashes
- implement generation of a key pair
- computation of the hash from the keys
- protocol to check if a remote hash and key are authentic/well-formed
- migration from the old hash to the new hash for old peers without loss of DHT-collected RWIs
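A rough sketch of the first three points using the JDK crypto classes; the choice of RSA, SHA-256, and a 12-character hash are assumptions for illustration, not a decided design:

```java
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.MessageDigest;
import java.util.Base64;

// Sketch of the secure-peer-hash idea: derive the peer hash from the
// public key, so anyone can verify that a claimed hash belongs to the
// presented key.
public class SecurePeerHash {
    public static KeyPair generate() throws Exception {
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        return gen.generateKeyPair();
    }

    // peer hash = first 12 chars of base64url(SHA-256(public key))
    public static String peerHash(byte[] publicKeyEncoded) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(publicKeyEncoded);
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(digest).substring(0, 12);
    }

    // authenticity check: the claimed hash must match the presented key
    public static boolean verify(String claimedHash, byte[] publicKeyEncoded)
            throws Exception {
        return claimedHash.equals(peerHash(publicKeyEncoded));
    }
}
```

A forged hash then requires finding a key whose digest starts with the same 12 characters, instead of simply claiming any hash.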
kelondroXML
- XML object interface "kelondroXMLInterface" that must be implemented to store objects in a kelondroXML
- "kelondroXML" class that can generate an XML file using a kelondroRA and create objects from XML files through a kelondroRA
YaCyBlog
The News system lacks a broadcast function. We don't want to implement broadcasts with Messages. News would be most appropriate for broadcasts, but there is a need for a broadcast message tracking system. A YaCyBlog would be most appropriate for that, so new blog entries can be announced with YaCyNews.
- "infinite crawling" function
- Add a delay between each page YaCy fetches from each site. YaCy crawling is like a DDoS attack: crawlers that pull 4 pages a minute are not a problem, but three YaCy peers pulling 4 pages a second each are.
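A per-host delay could be sketched like this; the 15-second interval is only an example value, and the class name is made up:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a per-host politeness delay: before fetching a page,
// check when the host was last accessed and report how long the
// crawler must still wait.
public class HostThrottle {
    private final long minDelayMillis;
    private final Map<String, Long> lastAccess = new HashMap<>();

    public HostThrottle(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // returns 0 if the host may be fetched now (and records the access),
    // otherwise the number of milliseconds still to wait
    public synchronized long waitTime(String host, long nowMillis) {
        Long last = lastAccess.get(host);
        long wait = (last == null) ? 0 : Math.max(0, last + minDelayMillis - nowMillis);
        if (wait == 0) lastAccess.put(host, nowMillis);
        return wait;
    }
}
```

Because the map is keyed by host, many hosts can be crawled in parallel while each individual site sees only one request per interval.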
- Make sure YaCy loads robots.txt; the logs show YaCy peers pulling pages but no robots.txt being loaded. This is supposedly implemented, so perhaps those are old YaCy versions, or it is just a very easy change to the source.
Internet Cafe functions
- YaCy Bookmarksmanager
- Todo: del.icio.us API
Todo Wishlist
Language support
It's common for web pages to have something like:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
View the source of this wiki page; it says
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de" dir="ltr">
So, YaCy should be able to
- Tag pages it indexes with a language code; no code defaults to English
- Do localized searches that only include pages in the given language
- Perhaps also a search option for everything (since a search limited to German would probably miss some German pages that are not marked as such - unless or until YaCy can magically "detect" a site's language)
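Tagging by the declared lang attribute could be sketched like this (a simplified regex approach for illustration, not a real HTML parser):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: read the lang attribute from a page's <html> tag and fall
// back to "en" when none is declared, as proposed above.
public class LanguageTag {
    private static final Pattern LANG =
        Pattern.compile("<html[^>]*\\blang=\"([a-zA-Z-]+)\"");

    public static String detect(String html) {
        Matcher m = LANG.matcher(html);
        return m.find() ? m.group(1).toLowerCase() : "en"; // default: English
    }
}
```

This covers explicitly marked pages; unmarked ones would still need the "search everything" fallback or real language detection.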
See also: Discussion Forums: YACY Open Discussion, select language of search results
More configuration options
Default search options
The default number of search results, search local/global, time to wait, etc. are hardcoded in index.html and yacysearch.html. They should be configurable. For example, you may want to limit the default search time to make the search faster for "new" users, or show 50 results by default (I always fetch 100 search results when I use Scroogle; anything less is just too limited and prevents me from deciding myself which links are relevant, which is especially important when using YaCy).
Filetype Support
A "torrents" category next to "Text Images Audio Video Applications" would be nice, but I'd really like YaCy to support the standard filetype:.avi syntax (which would also work as filetype:avi) to look for pdf files, txt files, and so on.
Easy way to turn off all proxy functionality
Perhaps I missed something, but I found no way to turn off the proxyClient function. I only want YaCy for searching, so I don't want it to be able to proxy anything, and there has to be a better way to turn it off than setting proxyClient=127.128.129.130.
Public Search Engine Mode
The bookmark/recommend/delete actions on the search results require you to log in.
Clicking "Info" takes you to the admin interface.
These things are OK if you run your own YaCy and you are the only one using it, but they don't work for 5 or 10 or 100 users of one YaCy. It needs a public search engine mode: no login fuss, admin fuss, or whatnot. My old mother should be able to come visit, find YaCy in the browser she barely manages to open on her own, and not get utterly confused by being asked for passwords and presented with an admin interface full of information she doesn't need and doesn't understand.
A public search engine mode would be nice: no login links and no admin links - only search and search options.
SOCKS5 support
YaCy only supports HTTP proxies. It needs SOCKS5 support, including support for remote DNS lookups.
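For what it's worth, the JDK can already express this: passing an unresolved target address tells the SOCKS proxy to do the DNS lookup itself, so no query leaks to the local resolver. A sketch (not YaCy code; host and port are placeholders):

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Sketch: configuration pieces for outgoing connections through a
// SOCKS proxy with remote DNS resolution.
public class Socks5Config {
    public static Proxy socksProxy(String host, int port) {
        return new Proxy(Proxy.Type.SOCKS, new InetSocketAddress(host, port));
    }

    public static InetSocketAddress remoteDnsTarget(String host, int port) {
        // unresolved: the proxy, not the local resolver, looks up the name
        return InetSocketAddress.createUnresolved(host, port);
    }
}
```

Usage would be along the lines of `new Socket(socksProxy("localhost", 1080)).connect(remoteDnsTarget("example.org", 80))`.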
No Search Results
"No Results. If you think this is unsatisfactory then you may consider to support the global index by running your own proxy/peer. If everybody contributes, the results will get better."
At minimum, show the keyword(s) or string that was searched for. The text should at minimum be:
"No Results for <what was entered into the form>. If you think..(..)"
Also, search for something loosely similar that has results and show "Did you mean foo?".
Next.. (and previous) buttons
If YaCy knows about 200 URLs and you choose to show only 10 search results, then YaCy should have a typical
[<< Prev] 1 2 3 4 5 6 7 8 9 10 [Next >>]
bar at the bottom of the page, and the search form should also be displayed there at the bottom of the page.
Also, including rel links like:
<link rel="prev" title="results 11 to 20" href="..."/>
makes for better metadata and enables better navigation using this firefox extension: https://addons.mozilla.org/en-US/firefox/addon/2933 .
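A sketch of how such rel links and the page count could be produced; the offset parameter name in the href is an assumption, not YaCy's real URL scheme:

```java
// Sketch: build the <link rel="prev"/"next"> metadata for a result
// page, and compute how many pages the [Prev] 1 2 3 ... [Next] bar needs.
public class Pagination {
    // the href parameter name "offset" is an assumption for illustration
    public static String relLink(String rel, int from, int to) {
        return "<link rel=\"" + rel + "\" title=\"results " + from + " to " + to
             + "\" href=\"yacysearch.html?offset=" + (from - 1) + "\"/>";
    }

    // number of pages when showing pageSize results at a time (round up)
    public static int pageCount(int totalResults, int pageSize) {
        return (totalResults + pageSize - 1) / pageSize;
    }
}
```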
Help pages for "more options"
I had to ask on IRC to find out that the URL mask is a regex.
Entering "foo.com" to search only domain "foo.com" doesn't work. Entering "foo.com/*" don't work. Entering "http://foo.com/.*" works. It would be nice to just enter foo.com, and YaCy could figure out there was no regex there and turn it into "http://foo.com/.*" if there is no regex.
More importantly, all of the options on "more options" should be links to help pages (this could also be done with frames, so the text appears in a "help box" at the bottom of the page).
Here are some links where such help text could be added (and later added to YaCy).
That "foo.com" don't work and I had to ask on IRC to figure it out sucks bigtime. Users who don't instantly recognize regex and understand you have to use regex (and what regex is) should be able to use YaCy. I don't have any "best" solution for how to solve this problem, but links to help text and regex being added if there is none would help.
Search filtering
YaCy should know a minimum of the most commonly used CMS systems on the Internet today and behave accordingly.
- MediaWiki (this site) has all these "&action=edit" links, and YaCy should know to ignore them.
- It should also cut out phpBB's sid= session tracking.
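A sketch of such a parameter filter; the hardcoded key list would of course have to grow per CMS, and the class name is made up:

```java
// Sketch: drop URL query parameters that only track sessions or point
// at CMS edit views, so the crawler does not index the same page twice.
public class UrlCleaner {
    public static String clean(String url) {
        int q = url.indexOf('?');
        if (q < 0) return url; // no query string, nothing to do
        StringBuilder kept = new StringBuilder(url.substring(0, q));
        char sep = '?';
        for (String param : url.substring(q + 1).split("&")) {
            String key = param.split("=", 2)[0];
            // sid= is phpBB session tracking, action= is MediaWiki edit views
            if (key.equals("sid") || key.equals("action")) continue;
            kept.append(sep).append(param);
            sep = '&';
        }
        return kept.toString();
    }
}
```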
Blacklist-subscriptions
The current blacklist function allows you to import blacklists from other peers.
However, I'd prefer a feature where you can subscribe to a blacklist: you add the peer and that peer's blacklist as a subscription, and the blacklist and any new URLs added to it are imported automatically (so the blacklist's name from the original peer is kept, and the blacklist is automatically updated as entries are added or removed on the peer it originates from).
Note that these blacklists should be opt-in: there should be absolutely no way for one peer, or a group of peers, to censor site(s) from other peers that have not signed up for their list(s).
implemented and ready for next Release
The old list of done items can be found here
- Bookmarksmanager (parts of the del.icio.us API, too)