ScallyWhack/SpamTypes

Analysis of different spam types

I started monitoring the spam hitting madwifi.org when the number of spam comments showing up in the ticket tracker rose. That was shortly after I fixed an issue which caused Google to dislike madwifi.org. Once that issue was resolved and the site was listed on Google and friends, the amount of spam increased... no real surprise here, I guess.

So far I have seen the following types of spam - on madwifi.org and other Trac installations I've checked:

Type 1: #preview spam

It seems that there is at least one spam bot out there that tries to post to, for example, /ticket/123#preview rather than just /ticket/123 as browsers usually do. Many of these posts don't show up in Trac, and looking at Trac's log it seems this happens because of an issue related to the timestamp the bot sends along with its submission:

2006-07-10 12:36:46,452 Trac[main] ERROR: Sorry, can not save your changes. This ticket has been modified by someone else since you started

Those submissions that do have a proper timestamp can be filtered out by blocking POST requests to something like /...#preview. This seems to be safe, as I have seen no legitimate user agent trying to submit content like this.
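The check described above could be sketched roughly as follows, in Python purely for illustration. The function name is_preview_spam and the idea of handing it the raw request target are assumptions made for this example; they are not part of Trac or of any particular filter. The sketch relies on the fact that browsers strip the fragment (the part after "#") before sending a request.

```python
def is_preview_spam(method: str, request_target: str) -> bool:
    """Return True for a POST whose raw request target ends in '#preview'.

    Hypothetical helper: browsers never send the fragment to the server,
    so a literal '#preview' in the request line hints at a non-browser client.
    """
    # Split the target at the first '#'; sep is '' when there is no fragment.
    _, sep, fragment = request_target.partition("#")
    return method.upper() == "POST" and sep == "#" and fragment == "preview"
```

A filter in front of Trac could reject any request for which this returns True, while ordinary POSTs to /ticket/123 pass through untouched.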

Type 2: no cookie spam

Trac requires user agents to support cookies, in order to store session and/or authentication information in them. Legitimate users who have cookies turned off in their browser will see a warning message from Trac once they enter the site.

In order to submit a comment or a new ticket, a legitimate user will have to hit the site with at least one GET request. That allows Trac to either set a trac_session (anonymous user) or a trac_auth (registered user who has logged in before) cookie or, if cookie support is disabled, warn the user as explained above.

On the other hand, many spam bots seem to use Google and other search engines to discover Trac installations and existing tickets. They then POST their spamvertisements directly to the tickets. Even if the underlying user agent (library) supports cookies, Trac is given no chance to set its cookies. So blocking POSTs which have no cookie set seems to be safe.
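As a rough illustration of that rule, the following Python sketch flags POST requests that carry neither of the two cookies named above. The function name is_no_cookie_spam and the way the Cookie header is passed in are assumptions for this example only; a real deployment would do the equivalent in whatever filter sits in front of Trac.

```python
from http.cookies import CookieError, SimpleCookie
from typing import Optional

# Cookie names as described in the text above.
TRAC_COOKIES = ("trac_session", "trac_auth")


def is_no_cookie_spam(method: str, cookie_header: Optional[str]) -> bool:
    """Return True for a POST that carries neither Trac cookie (hypothetical helper)."""
    if method.upper() != "POST":
        return False  # only submissions are of interest here
    cookies = SimpleCookie()
    if cookie_header:
        try:
            cookies.load(cookie_header)
        except CookieError:
            pass  # an unparsable header is treated like a missing one
    return not any(name in cookies for name in TRAC_COOKIES)
```

Note that the sketch only checks for the presence of a cookie, not whether its value is a session Trac actually issued.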

Type 3: html processor spam

Some of the spammers are quite "clever". They make use of the HTML processor that comes with Trac (see here and here) to hide the spam from users - I think they hope that this increases the chance that their spam remains unnoticed by the administrators.

Here's an example of such a comment:

{{{
#!html
<div style="overflow:auto; height: 1px;">
<a href="http://spammers-website.tld/faked-handbags-suck.html">faked handbags</a>
<a href="http://spammers-website.tld/so-do-faked-sunglasses.html">faked sunglasses</a>
...
</div>
}}}

The HTML processor "hides" the content of the comment due to the style settings, so it's easy to think it's just an empty comment.
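To spot comments like the one above, one could look for HTML processor blocks whose markup contains style settings that shrink or hide an element. The following Python sketch does that; the function name has_hidden_html and the list of suspicious style declarations are assumptions chosen for illustration and are certainly not exhaustive.

```python
import re

# A Trac processor block that starts with "#!html"; group 1 is its body.
HTML_BLOCK = re.compile(r"\{\{\{\s*#!html\b(.*?)\}\}\}", re.DOTALL | re.IGNORECASE)

# Style declarations that make content practically invisible (assumed list).
HIDING_STYLE = re.compile(
    r"""height\s*:\s*[01](?:px)?\s*[;"']|display\s*:\s*none|visibility\s*:\s*hidden""",
    re.IGNORECASE,
)


def has_hidden_html(comment: str) -> bool:
    """Return True if any #!html block in the comment uses a hiding style."""
    return any(HIDING_STYLE.search(m.group(1)) for m in HTML_BLOCK.finditer(comment))
```

A hit is only a hint: an administrator would still want to look at the flagged comment before deleting it.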

Type 4: markup spam

I can only guess, but it seems that some of the spam bots out there are either all-rounders or dumb (most probably both). There are spam comments which make use of various forms of "markup language", such as BBcode or HTML (without using the HTML processor). Some of them even contain the plain URLs without any markup, hoping that they will be converted to clickable links.

BBcode example:

... [url=http://spamvertised.tld/some-stupid-page.html]visit this site![/url] ...

HTML example:

... <a href="http://spamvertised.tld/some-stupid-page.html">visit this site!</a> ...

The results do not show up as the spammer expected, but these comments still do their unwanted job: Trac automatically turns plain-text URLs into fully working, clickable links that will be followed by search engine spiders. And that's all that counts for a spammer, isn't it?
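The markup variants shown above are easy to recognise mechanically. The Python sketch below collects a few signals from a comment; the function name markup_spam_signals and the choice of signals are assumptions for illustration. Legitimate comments contain URLs too, so the result is meant as input for a score or a manual review, not as a verdict.

```python
import re

BBCODE_URL = re.compile(r"\[url=[^\]]+\]", re.IGNORECASE)    # [url=...]
HTML_ANCHOR = re.compile(r"<a\s+[^>]*href\s*=", re.IGNORECASE)  # <a href=...
PLAIN_URL = re.compile(r"https?://[^\s\]\"'<>]+", re.IGNORECASE)


def markup_spam_signals(comment: str) -> dict:
    """Collect simple markup-spam signals from a comment (hypothetical helper)."""
    return {
        "bbcode": bool(BBCODE_URL.search(comment)),
        "html_anchor": bool(HTML_ANCHOR.search(comment)),
        "url_count": len(PLAIN_URL.findall(comment)),
    }
```

BBcode links in particular have no business in a Trac comment, so that signal alone is a fairly strong hint.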

Type 5: LED spam

If I remember correctly, this guy was the first spammer that hit our site, and he seems to be well-known in the Trac community (check this blog post, for example, and see what Google finds). From what I can tell this guy is a moron sitting somewhere in China, submitting the same post over and over again manually via his browser. My knowledge of the Chinese language is quite bad, but it seems that he spamvertises a site that is in the LED business.

Type 6: attachment description spam

When attaching files to a wiki page, users are allowed to describe their attachment. This description may contain WikiFormatting, which is parsed when the description is displayed. Thus the description can be misused to embed links to spamvertised pages - which makes it roughly related to type 4.

In Trac 0.9.x neither "recent changes" nor the timeline reports attachments, which makes it quite hard to spot this type of attack. Fortunately this has changed in Trac 0.10.

Type 7: HTML attachment spam

This type is a pretty tricky variant, and it's worth explaining some background for those who are not that familiar with Trac. Consider the following two things:

  1. Trac allows files to be attached to wiki pages and tickets, which is generally a pretty helpful feature. Being the clever tool that Trac is, it not only provides access to a highlighted rendering of the attachment but also allows visitors to download the attachment in its original/raw format. It takes care to present the appropriate MIME type for the downloaded attachment to make sure that user agents handle the download correctly.
  2. Remember: all the spammers have in mind is to abuse the (usually high) page rank of community sites to push their own page rank, making them appear high up in result lists for well-defined search terms.

With that theory in the back of your mind, let's head over to an example. Imagine three websites:

  • SiteA is the site of which the spammer wants to push the pagerank; he sells faked watches there.
  • SiteB is a Trac-driven site.
  • SiteC is a random collection of sites that allows users to post fully working links to other pages/sites; they could be guestbooks, forums, blogs, and so on.

The spammer creates a simple HTML file named spamvertisement.html. It contains tons of links to pages on SiteA along with popular search terms (terms for which SiteA should be listed as high up in the list of search results as possible) and could look like this:

<html>
<body>
  <a href="http://SiteA/fake-handbags.html">cheap handbags</a><br>
  <a href="http://SiteA/cheap-handbags.html">popular handbags</a><br>
  ...
</body>
</html>

The HTML file is then attached to, say, the WikiStart page of SiteB's Trac.

Last but not least the spammer starts to post comments with links to the "raw download link" http://SiteB/attachment/wiki/WikiStart/spamvertisement.html?format=raw to SiteC. This has two effects:

  1. It helps to push the pagerank of SiteB (which indirectly also increases the pagerank of SiteA)
  2. It increases the chances that spamvertisement.html is spotted by search engine spiders.

Spiders which follow the link to spamvertisement.html are led to believe that this file is part of SiteB's content, et voilà, SiteB's pagerank boosts the pagerank of SiteA.
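Since the whole trick depends on Trac serving the attachment as an HTML page, one conceivable way to notice it is to flag uploads whose file name suggests HTML content. The Python sketch below shows the idea; the function name is_html_attachment and the extension list are assumptions for illustration, and nothing here is an existing Trac feature.

```python
import os

# File extensions that a user agent would typically render as a web page
# (assumed list, not exhaustive).
HTML_EXTENSIONS = {".html", ".htm", ".xhtml"}


def is_html_attachment(filename: str) -> bool:
    """Return True if the attachment's name suggests HTML content."""
    # splitext returns ('name', '.ext'); compare the extension case-insensitively.
    _, ext = os.path.splitext(filename)
    return ext.lower() in HTML_EXTENSIONS
```

A name-based check is trivially evaded by renaming the file, so it is at best a first hint for an administrator reviewing new attachments.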

Type 8: visit my homepage vandalism

Some bots post comments to closed tickets, with the reporter name set to something like "black jack", "casino" or "free poker" and a comment like:

hello great work great site visit my homepage thank you

I guess the bots are expected to be pretty universal, as similar comments can be spotted not only on Trac-driven sites but also in guestbooks or blogs. They seem to fail to post their spamvertisement in the correct format, as their comments actually don't link anywhere. This is why I've decided to call that type of spam "vandalism" - although it's meant to be spam, it has no effect other than annoying the users and admins of affected sites.

Type 9: double-quote+slash vandalism

Another pretty strange phenomenon: new tickets with the summary set to ""/ while the actual description is empty. So far I couldn't make any sense of that, and I assume that this is yet another spambot that fails to do its job the way its creator intended. As far as I remember these tickets had no links inside, so it's basically annoying and thus called vandalism.
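The pattern is specific enough that it can be matched directly. A minimal Python sketch, with the function name is_quote_slash_ticket being an assumption for this example:

```python
def is_quote_slash_ticket(summary: str, description: str) -> bool:
    """Return True for a new ticket whose summary is '""/' and whose description is empty."""
    # Surrounding whitespace is ignored in both fields.
    return summary.strip() == '""/' and not description.strip()
```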