[tahoe-dev] Modifying the robots.txt file on allmydata.org

David-Sarah Hopwood david-sarah at jacaranda.org
Thu Feb 25 03:40:32 UTC 2010

Zooko Wilcox-O'Hearn wrote:
> On Wednesday, 2010-02-24, at 1:50 , David-Sarah Hopwood wrote:
>> Allowing crawlers to index some of the dynamically generated pages  
>> under /trac could cause horrible breakage, given darcs+trac's  
>> performance problems. You'd have to look at what subsets of that  
>> are sufficiently static.
> The main thing to avoid is URLs that have "rev=XYZ" in them, like these:
> http://allmydata.org/trac/tahoe-lafs/browser/setup.cfg?rev=3996
> http://allmydata.org/trac/tahoe-lafs/browser/setup.cfg?annotate=blame&rev=3996
> Those are asking darcs to reconstruct what a particular file or  
> directory looked like at some point in the past, which is relatively  
> expensive.

I believe all of these are query URLs.

http://www.robotstxt.org/orig.html doesn't say anything about whether
robots are allowed to visit query URLs, and it doesn't support any
way to exclude URLs that contain '?'.

The major search engines do in practice support the following syntax
(according to http://www.webmasterworld.com/robots_txt/3845376.htm
and, in the case of Google,

Disallow: /*?

so it may be useful to put that in, but be prepared to revert to a
more restrictive robots.txt if you see problems.
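
As a rough illustration of how a crawler that implements the wildcard
extension might apply that rule, here is a minimal Python sketch. The
helper names (robots_pattern_to_regex, is_disallowed) are hypothetical,
and the sketch assumes that '*' matches any run of characters and that a
trailing '$' anchors the end of the URL. It ignores Allow lines and
rule precedence, so it is not a description of any particular engine.

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern with '*' and '$' into a regex.

    Assumption: '*' matches any sequence of characters and a trailing
    '$' anchors the match at the end of the URL; everything else is
    treated as a literal prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(path_and_query, disallow_patterns):
    """Return True if any Disallow pattern matches from the start of the path."""
    return any(
        robots_pattern_to_regex(p).match(path_and_query) is not None
        for p in disallow_patterns
    )

if __name__ == "__main__":
    rules = ["/*?"]
    for path in [
        "/trac/tahoe-lafs/browser/setup.cfg?rev=3996",
        "/trac/tahoe-lafs/browser/setup.cfg?annotate=blame&rev=3996",
        "/trac/tahoe-lafs/browser/setup.cfg",
    ]:
        print(path, is_disallowed(path, rules))
```

Under those assumptions the two "rev=" URLs quoted above are excluded
because they contain '?', while the plain browser URL is still allowed.
Crawlers that follow only the original prefix-matching rules would treat
"/*?" as a literal prefix and so would not be affected by it.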

> On the other hand the trac-darcs plugin caches the results of those  
> in its sqlite db, so perhaps letting a spider laboriously crawl the  
> whole thing is a way to fix the performance problems. :-)

If you don't mind the site being unusable for a few months while it
does that :-)

David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com
