Feature wishlist
This is the place where I gather ideas for future features. Feel free to add your suggestions.
robots.txt enforcement
I noticed that, at least on madwifi.org, not all search engine spiders obey robots.txt. Especially when Trac runs on a slow server, this causes additional CPU load for transactions that are useless from a search engine's point of view. So the idea is to block requests from identified spiders to website resources that robots.txt disallows.
Required ingredients:
- A database of well-known search engine spiders, their IP addresses or address ranges, their user-agent strings, and regular expressions that match those user-agents
- A script that reads a robots.txt file and generates a set of appropriate rules that help to enforce the limitations specified there
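One possible shape for a database entry is sketched below. This is only an illustration: the `Spider` class, its field names, the spider "examplebot", and its address range (taken from the 192.0.2.0/24 documentation block) are all made up for this example and do not describe a real crawler or an existing schema.

```python
import ipaddress
import re
from dataclasses import dataclass


@dataclass
class Spider:
    """One hypothetical database entry for a known spider."""
    name: str
    networks: list      # IP addresses or ranges in CIDR notation
    ua_examples: list   # user-agent strings seen in the wild
    ua_regex: str       # regular expression matching those user-agents

    def matches_ua(self, user_agent):
        # Case-insensitive search anywhere in the user-agent string.
        return re.search(self.ua_regex, user_agent, re.IGNORECASE) is not None

    def matches_ip(self, ip):
        # True if the address falls into any of the spider's known ranges.
        addr = ipaddress.ip_address(ip)
        return any(addr in ipaddress.ip_network(net) for net in self.networks)


# Made-up entry; 192.0.2.0/24 is reserved for documentation.
examplebot = Spider(
    name="examplebot",
    networks=["192.0.2.0/24"],
    ua_examples=["Mozilla/5.0 (compatible; ExampleBot/2.1)"],
    ua_regex=r"examplebot/\d",
)
```

Keeping the IP ranges and the regular expression in the same record means both the blacklist and the user-agent rules can be generated from one source.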
The database is fed with information already provided by various sources (see below) plus information retrieved from local access logs. It is then used as the source for a DNS blacklist of search engine crawlers.
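Querying such a blacklist could look roughly like the sketch below. It follows the usual DNSBL convention: the octets of the IPv4 address are reversed and prepended to the blacklist zone, and any answer to the lookup means "listed". The zone name `spiders.dnsbl.example` and both function names are placeholders, since no such blacklist exists yet.

```python
import ipaddress
import socket

# Placeholder zone; a real deployment would use its own blacklist domain.
BLACKLIST_ZONE = "spiders.dnsbl.example"


def dnsbl_query_name(ip, zone=BLACKLIST_ZONE):
    """Build the DNS name to look up for an IPv4 address."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on invalid input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone=BLACKLIST_ZONE):
    """Return True if the blacklist zone has an entry for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No record found (or the lookup failed): treat as not listed.
        return False
    return True
```

For example, `dnsbl_query_name("192.0.2.17")` yields `17.2.0.192.spiders.dnsbl.example`. Treating lookup failures as "not listed" is a design choice made here so that a DNS outage does not block regular visitors.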
The robots.txt-parsing script needs to distinguish two cases:
- Sections that apply to all spiders (User-agent: *):
  Rules generated for this case should query the blacklist rather than trying to match the user-agent against the known spider UAs.
- Sections that apply to a distinct spider (User-agent: googlebot):
  In this case it's easier/cheaper to match the user-agent strings against the RE that is listed in the database for the spider in question.
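The two cases could be handled along the lines of the following sketch. It is deliberately simplified: only `User-agent` and `Disallow` lines are evaluated, the output is an abstract list of rule dictionaries rather than real web server configuration, and the function names, the rule format, the sample paths, and the googlebot pattern are assumptions made for illustration.

```python
def parse_robots(text):
    """Return {user-agent token: [disallowed path prefixes]}."""
    sections = {}
    agents = []        # user-agents of the section being read
    in_rules = False   # True once the section has started listing rules
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:                      # a new section begins
                agents, in_rules = [], False
            agents.append(value.lower())
            sections.setdefault(value.lower(), [])
        elif key in ("disallow", "allow"):
            in_rules = True
            if key == "disallow" and value:   # empty Disallow allows everything
                for agent in agents:
                    sections[agent].append(value)
    return sections


def generate_rules(sections, ua_regexes):
    """Turn parsed sections into abstract blocking rules."""
    rules = []
    for agent, paths in sections.items():
        for path in paths:
            if agent == "*":
                # Applies to all spiders: identify them via the blacklist.
                rules.append({"path_prefix": path, "check": "dnsbl"})
            elif agent in ua_regexes:
                # Applies to one spider: match its user-agent regex.
                rules.append({"path_prefix": path, "check": "ua_regex",
                              "pattern": ua_regexes[agent]})
            # Spiders missing from the database are skipped.
    return rules


sample = """
User-agent: *
Disallow: /changeset
Disallow: /log

User-agent: googlebot
Disallow: /browser
"""
sections = parse_robots(sample)
rules = generate_rules(sections, {"googlebot": r"Googlebot/\d"})
```

With the sample input, the two paths in the `*` section become blacklist-based rules and `/browser` becomes a user-agent rule for googlebot. A later step would translate these dictionaries into whatever rule syntax the web server understands.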
Existing databases that could be helpful:
- http://www.user-agents.org/
- http://www.useragentstring.com/
- http://user-agent-string.info/
- http://www.nttdocomo.co.jp/english/service/imode/make/content/spec/useragent/index.html
- http://webcab.de/wapua.htm
- http://www.mobileopera.com/reference/ua
- http://test.waptoo.com/v2/skins/waptoo/user.asp
- http://www.handy-ortung.com/
- http://www.botsvsbrowsers.com/
- http://www.seehowitruns.org/pages/useragentlisting.php
- http://www.robotstxt.org/wc/active.html
Other links:
