En:PhpBB
From YaCyWiki
phpBB 2.x appends a session ID parameter, "sid=", to its URLs to keep track of user sessions.
YaCy should learn to strip phpBB's sid= when doing crawls, so
http://www.rechenkraft.net/phpBB/index.php?sid=988f7cf9b9491ca5c258ca359fc67e85
simply becomes
http://www.rechenkraft.net/phpBB/index.php
I've never seen anyone use ?sid= or &sid= to actually specify or switch between content.
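The stripping step described above could be sketched roughly as follows. This is only an illustration in Python, not YaCy's actual crawler code: the function name strip_sid is hypothetical, and the rule that a session ID is exactly 32 lowercase hex characters is an assumption based on the example URL above.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: a phpBB 2.x session ID is 32 lowercase hex characters,
# as in the example URL. Other "sid" values are left untouched.
SID_VALUE = re.compile(r"^[0-9a-f]{32}$")


def strip_sid(url):
    """Return url without any sid=<32 hex chars> query parameter (hypothetical helper)."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not (key == "sid" and SID_VALUE.match(value))
    ]
    # urlunsplit omits the "?" when the remaining query string is empty.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_sid(
    "http://www.rechenkraft.net/phpBB/index.php"
    "?sid=988f7cf9b9491ca5c258ca359fc67e85"
))
```

Note that urlencode re-encodes the remaining parameters, so unusual escaping in the original query string may come out normalized. A crawler would apply such a function before checking whether a URL is already known.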
Solution for phpBB owners
Use the "enhance-google-indexing" MOD, http://www.phpbb.com/phpBB/viewtopic.php?t=32328
The MOD consists of a single change:
#-----[ OPEN ]------------------------------------------
includes/sessions.php

#-----[ FIND ]------------------------------------------
global $SID;

if ( !empty($SID) && !preg_match('#sid=#', $url) )

#-----[ REPLACE WITH ]------------------------------------------
global $SID, $HTTP_SERVER_VARS;

if ( !empty($SID) && !preg_match('#sid=#', $url)
    && !strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'] ,'Googlebot')
    && !strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'] ,'slurp@inktomi.com;'))

#
#-----[ SAVE/CLOSE ALL FILES ]------------------------------------------
#
# EoM
To exclude YaCy's crawler as well, just add this condition to the if statement:
&& !strstr($HTTP_SERVER_VARS['HTTP_USER_AGENT'] ,'yacybot')
and the phpBB installation is all set.
Forum discussion
There is a discussion in the German YaCy forum about detecting and removing session IDs: