cipherdyne.org

Michael Rash, Security Researcher



Switched from subversion to git

After using subversion for several years, I've switched to git for all cipherdyne.org projects. Subversion has certainly served its purpose, but it is hard to look at git and not feel a compelling draw. Further, with easy-to-set-up web interfaces to git repositories such as gitweb, and free hosting services such as github, providing a public git repository is trivial. Git itself allows repositories to be cloned directly over HTTP without needing infrastructure like WebDAV, and here are links for the cipherdyne.org projects (github and gitweb links too):

The Trac interface will remain active for a little while so that the legacy svn repositories can still be browsed, but the git repositories were all converted from them in order to preserve the history, so Trac is no longer essential. If you are interested in the latest code changes in, say, fwsnort, just clone the repository and then you can make your own changes:

$ git clone http://www.cipherdyne.org/git/fwsnort.git
Initialized empty Git repository in /home/mbr/tmp/git/fwsnort/.git/
$ cd fwsnort
$ git status
# On branch master
nothing to commit (working directory clean)
$ git show --summary
commit 00c4379a69975097948ed9e5ba356eeba69c0c93
Author: Michael Rash <mbr@cipherdyne.org>
Date: Mon Jun 20 21:00:57 2011 -0400

Added the --Conntrack-state argument

Added the --Conntrack-state argument to specify a conntrack state in place of
the "established" state that commonly accompanies the Snort "flow" keyword.
By default, fwsnort uses the conntrack state of "ESTABLISHED" for this. In
certain corner cases, it might be useful to use "ESTABLISHED,RELATED" instead
to apply application layer inspection to things like ICMP port unreachable
messages that are responses to real attempted communications. (Need to add
UDP tracking for the _ESTAB chains for this too - coming soon.)
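
If you later want to pull in new upstream commits or experiment with local changes in your clone, the standard git workflow applies. A minimal sketch (the file name and commit message below are purely illustrative):

$ git pull
$ vi fwsnort.conf
$ git add fwsnort.conf
$ git commit -m "local tweak to fwsnort.conf"
$ git log origin/master..master

The last command lists any local commits that have not been published upstream.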

Google Indexing of Trac Repositories

There has been a Trac repository available on trac.cipherdyne.org since 2006, and in that time I've collected quite a lot of Apache log data. Today, the average number of hits per day against the fwknop, psad, fwsnort, and gpgdir Trac repositories combined is over 40,000. These hits come mostly from search engine crawlers such as Googlebot/2.1, msnbot/2.0b, Baiduspider+, and Yahoo! Slurp/3.0. However, not all crawlers are created equal. It turns out that Google is far and away the most dedicated indexer of the cipherdyne.org Trac repositories, which surprised me somewhat given Google's reputation for crawling the web efficiently; I would have expected the non-Google crawlers to hit the Trac repositories more often on average than Google. Perhaps Google is keen to get the latest code made available via Trac indexed as quickly as possible, and for svn repositories that are accessible only through Trac (such as the cipherdyne.org repositories), brute-force indexing of every possible Trac link is the surest way to do that. Or perhaps the other search engines are simply less interested in code within Trac repositories and don't bother to index it aggressively, or they are just not as thorough as Google.

Let's see some examples. The following graphs were produced with the webalizer project. This first graph shows the uptick in hits against Trac shortly after it was made available in 2006:

[Graph: cipherdyne.org Trac usage, 2006]

The average number of hits goes from 803 starting out in May and jumps rapidly to nearly 17,000 by December. Here are the top five User-Agents and associated hit counts:

Hits     Percentage   User-Agent
5062     37.06%       Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
5004     36.63%       noxtrumbot/1.0 (crawler@noxtrum.com)
1283      9.39%       msnbot/0.9 (+http://search.msn.com/msnbot.htm)
475       3.48%       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.13) Gecko/20060418...
457       3.35%       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.12) Gecko/20051229...

Right off the bat Google is the top crawler, but only just barely. In December, this changes to the following:

Hits     Percentage   User-Agent
478063   90.94%       Mozilla/5.0 (compatible; Googlebot/2.1; ...
11299     2.15%       Mozilla/5.0 (compatible; BecomeBot/3.0; ...
8287      1.58%       msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)
6423      1.22%       Mozilla/5.0 (compatible; Yahoo! Slurp; ...
4029      0.77%       Mozilla/2.0 (compatible; Ask Jeeves/Teoma; ...

Now things are drastically different - Google's crawler accounts for over 90% of all hits against Trac, with over 42 times as many hits as the second-place crawler "BecomeBot/3.0" (a shopping-related crawler - maybe they like the price of "zero" for the cipherdyne.org projects).
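
Breakdowns like the ones above can also be approximated straight from a combined-format Apache access log with standard shell tools. A quick sketch (the log file name is just an example):

$ awk -F'"' '{print $6}' trac_access_log | sort | uniq -c | sort -rn | head -5

This splits each log line on the double-quote character so that the sixth field is the User-Agent string, then counts and ranks the top five.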

Let's fast forward to 2009 and take a look at how things are shaping up (note that data from 2008 is not included in this graph):

[Graph: cipherdyne.org Trac usage, 2009]

The month of May was certainly an aberration with over 56,000 hits per day, and August topped out at 42,000 hits per day. In May, the top five crawlers were:

Hits     Percentage   User-Agent
1510435  86.14%       Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
73740     4.21%       Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/
33482     1.91%       Mozilla/5.0 (compatible; Charlotte/1.1; http://www.searchme.com/support/)
22699     1.29%       Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; ...
20980     1.20%       Mozilla/5.0 (compatible; MJ12bot/v1.2.4; ...

Google maintains a crawling rate of over 20 times as many hits as the second-place crawler "DotBot/1.1". In August 2009 (some of the most recent data), the crawler hit counts were:

Hits     Percentage   User-Agent
947703   86.02%       Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
43876     3.98%       msnbot/2.0b (+http://search.msn.com/msnbot.htm)
30951     2.81%       Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/
11578     1.05%       Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; ...)
10880     0.99%       msnbot/1.1 (+http://search.msn.com/msnbot.htm)

This time "msnbot/2.0b" makes it to second place, but it is still far behind Google in terms of hit counts. So, what is Google looking at that requires so many hits? One clue is that Google likes to re-index older data (probably to ensure that a content update has not been missed). Here is an example of all hits against the link that contains the "diff" formatted output for changeset 353 in the fwknop project. The output is organized by year since 2006; the first command counts the number of hits from Google, and the second command shows all of the non-Googlebot hits:

[trac.cipherdyne.org]$ for f in 200*; do echo $f; grep diff $f/trac_access* |grep "/trac/fwknop/changeset/353" | grep Googlebot |wc -l; done
2006
6
2007
4
2008
6
2009
2

[trac.cipherdyne.org]$ for f in 200*; do echo $f; grep diff $f/trac_access* |grep "/trac/fwknop/changeset/353" | grep -v Googlebot ; done
2006
74.6.74.211 - - [10/Oct/2006:07:38:21 -0400] "GET /trac/fwknop/changeset/353?format=diff HTTP/1.0" 200 - "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
205.209.170.161 - - [19/Nov/2006:14:17:57 -0500] "GET /trac/fwknop/changeset/353?format=diff HTTP/1.1" 200 - "-" "MJ12bot/v1.0.8 (http://majestic12.co.uk/bot.php?+)"
2007
2008
38.99.44.102 - - [27/Dec/2008:17:51:34 -0500] "GET /trac/fwknop/changeset/353/?format=diff&new=353 HTTP/1.0" 200 - "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"
2009
208.115.111.250 - - [14/Jul/2009:20:42:32 -0400] "GET /trac/fwknop/changeset/353?format=diff&new=353 HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
208.115.111.250 - - [28/Jun/2009:10:00:32 -0400] "GET /trac/fwknop/changeset/353?format=diff&new=353 HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)"
So, Google keeps hitting the "changeset 353" link multiple times each year, whereas the other crawlers (except for DotBot) each hit the link once and never came back. Further, many crawlers have not hit the link at all, so perhaps they are not nearly as thorough as Google.

A few questions come to mind to conclude this blog post. Please contact me if you would like to discuss any of these:
  • For any other people out there who also run Trac, what crawlers have you seen in your logs, and does Google stand out as a more dedicated indexer than the other crawlers?
  • For anyone who runs Trac but also makes the svn repository directly accessible, does Google continue to aggressively index Trac? Does the svn access imply that whatever code is versioned within is used as a more authoritative source than Trac itself?
  • It would seem that any crawler could implement an optimization around the Trac timeline feature so that source code change links are indexed only when a timeline update is made. But, perhaps this is too detailed for crawlers to worry about? It would require additional machinery to interpret the Trac application, so search engines most likely try to avoid such customizations.
  • Why do the non-Google crawlers lag so far behind in terms of hit counts? Are the differences in resources that Google can bring to bear on crawling the web vs. the other search engines so great that the others just cannot keep up? Or, maybe the others are just not so interested in code that is made available in Trac?

Analyzing a Trac SPAM Attempt

One of the best web interfaces for visualizing Subversion repositories (as well as providing integrated project management and ticketing functionality) is the Trac Project, and all Cipherdyne software projects use it. With Trac's success, it has also become a target for those who would try to subvert it for purposes such as creating spam. Because I deploy Trac with the Roadmap, Tickets, and Wiki features disabled, my deployment essentially does not accept user-generated content (like comments in tickets, for example) for display. This minimizes my exposure to Trac spam, which has become a significant enough problem that various spam-fighting strategies have been developed, such as configuring mod-security or using a plugin from the Trac project itself.

Even though my Trac deployment does not display user-generated content, there are still places where Trac accepts query strings from users, and spammers try to use these fields for their own ends. Let's see if we can find a few examples. Trac web logs are informative, but sifting through huge logfiles can be tedious. Fortunately, simply sorting the logfiles by line length (and therefore by web request length) allows many suspicious web requests to bubble up to the top. Below is a simple perl script, sort_len.pl, that sorts any ascii text file by line length and precedes each printed line with the line length followed by the number of lines of that length. Not every line is printed, since the script is designed to handle large files - we want the unusually long lines to be printed, but the many shorter lines (which represent the vast majority of legitimate web requests) to be summarized. This is an important feature considering that at this point there is over 2.5GB of log data from my Trac server.

$ cat sort_len.pl
#!/usr/bin/perl -w
#
# prints out a file sorted by longest lines
#
# $Id: sort_len.pl 1739 2008-07-05 13:44:31Z mbr  $
#

use strict;

my %url       = ();
my %len_stats = ();
my $mlen = 0;
my $mnum = 0;

open F, "< $ARGV[0]" or die $!;
while (<F>) {
    my $len = length $_;
    $url{$len} = $_;
    $len_stats{$len}++;
    $mlen = $len if $mlen < $len;
    $mnum = $len_stats{$len}
        if $mnum < $len_stats{$len};
}
close F;

$mlen = length $mlen;
$mnum = length $mnum;

for my $len (sort {$b <=> $a} keys %url) {
    printf "[len: %${mlen}d, tot: %${mnum}d] %s",
            $len, $len_stats{$len}, $url{$len};
}

exit 0;
To illustrate how it works, below is the output of the sort_len.pl script run against itself. Note that the more interesting code appears at the top of the output, whereas the most uninteresting code (such as blank lines and lines that contain only closing "}" characters) is summarized away at the bottom:
$ ./sort_len.pl sort_len.pl
[len: 51, tot: 1] # $Id: sort_len.pl 1739 2008-07-05 13:44:31Z mbr  $
[len: 50, tot: 1]     printf "[len: %${mlen}d, tot: %${mnum}d] %s",
[len: 48, tot: 1]             $len, $len_stats{$len}, $url{$len};
[len: 44, tot: 1] # prints out a file sorted by longest lines
[len: 43, tot: 1] for my $len (sort {$b <=> $a} keys %url) {
[len: 37, tot: 1]         if $mnum < $len_stats{$len};
[len: 34, tot: 1]     $mlen = $len if $mlen < $len;
[len: 32, tot: 1] open F, "< $ARGV[0]" or die $!;
[len: 29, tot: 1]     $mnum = $len_stats{$len}
[len: 25, tot: 1]     my $len = length $_;
[len: 24, tot: 1]     $len_stats{$len}++;
[len: 22, tot: 2] $mnum = length $mnum;
[len: 21, tot: 1]     $url{$len} = $_;
[len: 20, tot: 2] my %len_stats = ();
[len: 19, tot: 1] #!/usr/bin/perl -w
[len: 14, tot: 3] while (<F>) {
[len: 12, tot: 1] use strict;
[len:  9, tot: 1] close F;
[len:  8, tot: 1] exit 0;
[len:  2, tot: 5] }
[len:  1, tot: 6] 
Now, let's execute the sort_len.pl script against the trac_access_log file and look at one of the longest web requests. (The sort_len.pl script was able to reduce the 12,000,000 web requests in my Trac logs to a total of 610 interesting lines.) This particular request is 888 characters long, but there were some other, similar suspicious requests of over 4,000 characters that are not displayed for brevity:

[len: 888, tot: 1] 195.250.160.37 - - [02/Mar/2008:00:30:17 -0500] "GET
/trac/fwsnort/anydiff?new_path=%2Ffwsnort%2Ftags%2Ffwsnort-1.0.3%2Fsnort_rules%2Fweb-cgi.rules
&old_path=%2Ffwsnort%2Ftags%2Ffwsnort-1.0.3%2Fsnort_rules%2Fweb-cgi.rules
&new_rev=http%3A%2F%2Ff1234.info%2Fnew5%2Findex.html%0Ahttp%3A%2F%2Fa1234.info%2Fnew4%2Fmap.html
%0Ahttp%3A%2F%2Ff1234.info%2Fnew2%2Findex.html%0Ahttp%3A%2F%2Fs1234.info%2Fnew9%2Findex.html
%0Ahttp%3A%2F%2Ff1234.info%2Fnew6%2Fmap.html%0A
&old_rev=http%3A%2F%2Ff1234.info%2Fnew5%2Findex.html%0Ahttp%3A%2F%2Fa1234.info%2Fnew4%2Fmap.html
%0Ahttp%3A%2F%2Ff1234.info%2Fnew2%2Findex.html%0Ahttp%3A%2F%2Fs1234.info%2Fnew9%2Findex.html
%0Ahttp%3A%2F%2Ff1234.info%2Fnew6%2Fmap.html%0A HTTP/1.1" 200 3683 "-" "User-Agent: Mozilla/4.0
(compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ;
.NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2)"

My guess is that the above request comes from a bot that is trying to do one of two things: 1) force Trac to accept the content in the request (which contains a bunch of links to pages like "http://f1234.info/new/index.html" - note that I altered the domain so as to not legitimize the original content) and display it to other Trac users or to search engines, or 2) force Trac itself to generate web requests to the provided links (perhaps as a way to increase hit or referrer counts from domains - like mine - that are not affiliated with the spammer). Either way, the strategy is flawed because the request is against the Trac "anydiff" interface, which doesn't accept user content other than svn revision numbers, and (at least in Trac-0.10.4) such requests do not cause Trac to issue any external DNS or web requests - I verified this with tcpdump on my Trac server after generating similar requests against it.
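
The embedded links are easier to read after URL-decoding the query string. A quick way to do this, assuming the URI::Escape module from the Perl URI distribution is installed (the encoded string below is just a fragment of the request above):

$ echo 'http%3A%2F%2Ff1234.info%2Fnew5%2Findex.html%0Ahttp%3A%2F%2Fa1234.info%2Fnew4%2Fmap.html' | perl -MURI::Escape -ne 'print uri_unescape($_)'
http://f1234.info/new5/index.html
http://a1234.info/new4/map.html

The %0A sequences decode to newlines, which is why each embedded link ends up on its own line.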

Still, in all of my Trac web logs, the most suspicious web requests are against the "anydiff" interface, and specifically against the "web-cgi.rules" file bundled within the fwsnort project. However, the requests never come from the same IP address, the "anydiff" spam attempts never hit any link besides the web-cgi.rules page, and they started appearing with regularity in March 2008. This makes a stronger case for the activity coming from a bot that is unable to infer that its activities are not actually working (no surprise there). Finally, I left the original IP address 195.250.160.37 of the web request above intact so that you can look for it in your own web logs. Although 195.250.160.37 is not listed in the Spamhaus DNSBL service, a rudimentary Google search indicates that 195.250.160.37 has been noticed by other sites as a comment spammer.
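
If you want to check your own logs for the same source address, a one-line grep is enough (adjust the log path for your server):

$ grep '195\.250\.160\.37' /var/log/apache2/access*log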

Deploying the Trac Timeline and Browse Source Features

The Trac Project provides an excellent web interface for project documentation, ticket-based project management, and visualization of source code (from a Subversion repository). All of the Cipherdyne open source projects use Trac as the code display mechanism, which allows users to readily see how the projects develop over time. However, the Cipherdyne projects are not large software engineering efforts with many dedicated developers, and I prefer not to allow my webserver to accept user registrations and tickets, even through Trac. The project mailing lists as well as personal email correspondence have been (so far) completely sufficient for bug reporting, and development tasks are tracked within TODO files in the top-level source directory of each project. Trac also offers a Wiki, but I write all of my own documentation for my projects, and I accept enhancements via email as well.

So, why the blog post? Well, it is not immediately obvious how to enable just the Timeline, Browse Source, and Search features in Trac and disable the remaining features such as the Roadmap, Tickets, and the Wiki. (All of these features can be seen in the navigation bar off to the right on the Trac site.) It turns out that Trac is database driven, and disabling these features can be accomplished from the command line with the trac-admin command:

$ trac-admin /path/to/trac_directory permission remove anonymous \
    TICKET_CREATE TICKET_MODIFY TICKET_VIEW ROADMAP_VIEW REPORT_VIEW \
    MILESTONE_VIEW REPORT_SQL_VIEW WIKI_CREATE WIKI_MODIFY WIKI_VIEW

Also, you will need to set the default_handler variable in the conf/trac.ini file to BrowserModule instead of WikiModule. With the above command in place, the Trac navigation bar includes only the Timeline, Browse Source, and Search features (as seen on the cipherdyne.org Trac site), and this is a valuable configuration for small open source projects. However, if you would like additional functionality in Trac to be enabled for the Cipherdyne projects, please email me; perhaps there are benefits that would justify the change.
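
For reference, the default_handler change is a one-line edit to conf/trac.ini. A minimal sketch of the relevant section (based on a Trac-0.10.x style configuration; verify against your installed version):

[trac]
default_handler = BrowserModule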