VMware Image
- Linux VMware Player image (security++, simplicity++)
Link Farm Crawling
To fight link farms full of spam, it might make sense to crawl such pages to a link depth of about 1-2, collect all of the linked domains, and greylist them. The user should then get a web page with links to those pages, so the user can check the result when unsure whether a specific link is really spam. Afterwards, the user can decide to remove some domains from this list. The rest get added to the blacklist. If a domain occurs about 1 + floor(sqrt(NumberOfPeers)) times in the peers' blacklists, the site might get blocked within the whole YaCy network. -- MovGP0 16:29, 4. Mai 2006 (CEST)
- Blocking across the whole network is not possible: we have no control over what a peer owner does. But we can send a news message, which could serve as a hint from our peer to other peer owners. If there is more than one page moderation per day, though, it is too much work for the other peer owners ...
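The threshold proposed above can be sketched as follows. This is only an illustration: the function names and the idea of representing each peer's blacklist as a set of domains are assumptions, not part of YaCy. In line with the reply, the result is treated as a list of candidates to hint at, not as something that can be enforced on other peers.

```python
import math
from collections import Counter


def blacklist_threshold(number_of_peers: int) -> int:
    """Return 1 + floor(sqrt(number_of_peers)), the count proposed above."""
    if number_of_peers < 0:
        raise ValueError("number_of_peers must not be negative")
    return 1 + math.isqrt(number_of_peers)


def network_wide_candidates(peer_blacklists):
    """List domains that reach the threshold across all peers' blacklists.

    peer_blacklists is a hypothetical input: one collection of domains
    per peer. Each peer counts at most once per domain.
    """
    counts = Counter(
        domain for blacklist in peer_blacklists for domain in set(blacklist)
    )
    threshold = blacklist_threshold(len(peer_blacklists))
    return sorted(d for d, n in counts.items() if n >= threshold)
```

With 100 peers, for example, a domain would need to appear in 11 blacklists before it becomes a candidate.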
Feedback to rate search-result quality
- At the end of a page with search results, I would like to be able to give "you" feedback: whether YaCy found my page or my information, and perhaps where I finally found it or which page is not yet part of our index. I think this could be a good way to improve the quality of YaCy... --GoogleFan 14:51, 2. Jun 2006 (CEST)
More from this page
- Just show a few results per domain and a link/button "more from this site", so that if I try to find information about a company/site (e.g. Microsoft), the results aren't flooded with results from their own site. Helpful if I do some research and don't want to get all the marketing crap.--Neo@NHNG 14:58, 15. Feb 2008 (CET)
- It would be very useful to include an external URL blacklist lookup feature in the crawler. URIBL and SURBL are probably the most well-known blacklists.
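A minimal sketch of the "few results per domain" idea, assuming the search results are available as a ranked list of URLs. The function name and the `per_domain` parameter are hypothetical, and the grouping uses the full host name, so `www.example.com` and `shop.example.com` would count as different sites.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_domain(result_urls, per_domain=2):
    """Split ranked results into shown ones and per-host overflow.

    Returns (shown, hidden): shown keeps the original ranking order with
    at most per_domain URLs per host; hidden maps each host to the URLs
    that would sit behind a "more from this site" link.
    """
    shown = []
    hidden = defaultdict(list)
    seen = defaultdict(int)
    for url in result_urls:
        host = urlsplit(url).hostname or ""
        if seen[host] < per_domain:
            shown.append(url)
        else:
            hidden[host].append(url)
        seen[host] += 1
    return shown, dict(hidden)
```

Hosts that appear as keys in `hidden` are the ones that would get a "more from this site" link.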
--Ott 13:27, 26. Jan 2009 (CET)
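Blacklists of this kind are usually queried over DNS: the domain is prepended to the list's zone, and an answer in the 127.x range means "listed". The sketch below assumes the zone names `multi.surbl.org` and `multi.uribl.com`. The helper names are hypothetical, and treating every 127.x answer as a hit is a simplification, since the services define specific return codes and usage policies that should be checked first.

```python
import socket


def dnsbl_query_name(domain, zone):
    """Build the DNS name to look up, e.g. 'example.com.multi.surbl.org'."""
    return f"{domain.strip('.').lower()}.{zone}"


def is_listed(domain, zone="multi.surbl.org", resolve=socket.gethostbyname):
    """Return True if the blacklist zone answers with a 127.x address.

    resolve is injectable so the check can be exercised without network
    access; by default it uses the system resolver.
    """
    try:
        answer = resolve(dnsbl_query_name(domain, zone))
    except OSError:
        # No record (or a lookup failure) is treated as "not listed".
        return False
    return answer.startswith("127.")
```

A crawler could call `is_listed(host)` before fetching a URL and skip or greylist hosts for which it returns `True`.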
- RDF storage based on the Jena framework.
If the crawler finds an RDF file (which means .rdf, .owl, and .foaf files) or RDF markup within an XHTML file, the content of this RDF should get copied into a distributed Jena-based semantic storage (as far as I know, Jena is not meant to support distributed computing/querying, so you might need to develop your own storage). It should also be possible to run global SPARQL queries on this storage. There is also a need for a timeout, so that semantic queries won't take too many resources.
This is also interesting for offering RSS 1.0 and RSS 1.1 support.
Note that I think this wish is a realistic goal for version 3.x. RSS 0.9, RSS 0.91, and RSS 0.92 should not be supported, because they are not compatible with RDF.
-- MovGP0 15:39, 4. Mai 2006 (CEST)
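The first step of this wish, recognizing RDF files by their extension, could look roughly like this. The function name is hypothetical, the extension list is taken from the text above, and the sketch does not cover RDF markup embedded in XHTML or the storage and SPARQL parts.

```python
import posixpath
from urllib.parse import urlsplit

# Extensions named in the wish above.
RDF_EXTENSIONS = {".rdf", ".owl", ".foaf"}


def looks_like_rdf(url):
    """Return True if the URL path ends in one of the RDF extensions."""
    path = urlsplit(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in RDF_EXTENSIONS
```

A real crawler would probably also look at the Content-Type header, since many RDF documents are served without one of these extensions.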