Rietta: Web Apps Where Security Matters
You are reading The Rietta Blog, a publication about the web since 2005.

Find Top Referral Sources With Raw Apache Access Log

In today’s issue of the Mastering the Terminal series, I present to you the easy way to find your top website referral sources using only tools available on the Linux (or Unix) command line and your raw Apache access file.

The command to run

If you want a list of the top referrers that send traffic to your website, you can use a tool like Google Analytics or the fantastic, self-hosted Piwik, an open source web analytics platform. Or if, like me, you just want a quick peek at the latest trends, you can get this yourself, for free, using nothing more than SSH and your raw Apache access file.

SSH into your web server, go to your log directory, and run this:

grep "200 " access.log \
  | cut -d '"' -f 4 \
  | sort \
  | uniq -c \
  | sort -rn \
  | grep -v "YOURDOMAIN.COM" \
  | less

You can also condense this to a single line by removing the backslashes \, which simply tell the Bash shell to continue processing the command on the next line rather than executing the command right then.
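If you run this report often, the same pipeline can be wrapped in a shell function. The following is only a sketch: the name top_referrers, its arguments, and the default of 20 rows are my own assumptions, and it ends in head rather than less so the output can be piped onward. It also uses grep -F so the domain is matched as a literal string instead of a regular expression.

```shell
# top_referrers LOGFILE DOMAIN [COUNT]
# Hypothetical helper (not part of any standard tool): prints the COUNT most
# frequent referrers in LOGFILE, skipping referrers that contain DOMAIN.
top_referrers() {
  log=$1
  domain=$2
  count=${3:-20}   # assumed default: show the top 20 rows

  grep "200 " "$log" \
    | cut -d '"' -f 4 \
    | sort \
    | uniq -c \
    | sort -rn \
    | grep -v -F "$domain" \
    | head -n "$count"
}
```

Once the function is defined, something like top_referrers access.log YOURDOMAIN.COM 10 | less would show the ten biggest external referrers.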

I just ran this report on this blog’s access log, and a sample of the results is:

 19893 -
   468 https://www.google.com/
   135 https://www.google.co.in/
    57 https://www.google.co.uk/
    ...
    19 http://www.reddit.com/r/MechanicalKeyboards/
    19 https://www.google.it/
    18 http://stackoverflow.com/questions/2872792/is-it-easy-to-develop-a-simple-firefox-plugin-js-injection
    ...

Since Google has switched to HTTPS, more entries are simply stamped with “-”, which just tells you that the request arrived without a referrer, so the source is not known.
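Because the “-” entries and the many Google country domains can crowd the list, it may help to drop the unknown referrers and group the rest by host name. This is a sketch under my own assumptions: the function name referrer_hosts is hypothetical, and the host is taken to be whatever sits between the "//" and the next "/" in the referrer URL.

```shell
# referrer_hosts LOGFILE
# Hypothetical helper: counts referrers by host name only, ignoring
# entries whose referrer is "-" (unknown).
referrer_hosts() {
  grep "200 " "$1" \
    | cut -d '"' -f 4 \
    | grep -v '^-$' \
    | awk -F/ 'NF >= 3 { print $3 }' \
    | sort \
    | uniq -c \
    | sort -rn
}
```

Splitting on "/" turns https://www.google.com/ into the fields "https:", an empty string, and "www.google.com", which is why the third field is printed.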

Let’s break this down

The raw Apache log entry

An example of an entry in the Apache raw access file is:

192.168.1.1 - - [04/Feb/2014:00:44:26 -0500] "GET / HTTP/1.1" 200 5241 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.73.11 (KHTML, like Gecko) Version/7.0.1 Safari/537.73.11"

It tells you the IP address of the visitor, the time of the request, the request line (method, path, and protocol), the HTTP status code, the bytes transferred, the HTTP referrer, and the user agent string.
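To see where those pieces sit, you can split the sample entry on the double quote character, the same delimiter the pipeline hands to cut. This is just an illustrative sketch: describe_entry is a hypothetical name, and the labels it prints are my own.

```shell
# The sample entry from above, stored in a variable for the demonstration.
entry='192.168.1.1 - - [04/Feb/2014:00:44:26 -0500] "GET / HTTP/1.1" 200 5241 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.73.11 (KHTML, like Gecko) Version/7.0.1 Safari/537.73.11"'

# describe_entry: hypothetical helper that reads log entries on standard
# input and prints the main parts of each one, one per line.
describe_entry() {
  awk -F'"' '{
    split($1, head, " ")     # text before the first quote: IP, identity, user, timestamp
    split($3, status, " ")   # text between request and referrer: status code, bytes
    print "ip: " head[1]
    print "request: " $2
    print "status: " status[1]
    print "bytes: " status[2]
    print "referrer: " $4
    print "agent: " $6
  }'
}

printf '%s\n' "$entry" | describe_entry
```

With the quote as the delimiter, the request line is field 2, the referrer is field 4, and the user agent is field 6, which is why the pipeline asks cut for the fourth field.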

The shell pipeline

This command is a composite that uses the Unix pipe to sort through your log file. Pipelines like this are one of the most powerful features of a Unix-based system, and I use them daily in my work.

The commands used are listed below, with descriptions from their respective man pages (the links point to FreeBSD, but these commands work the same way on Linux):

  • grep “searches the named input FILEs (or standard input if no files are named, or if a single hyphen-minus (-) is given as file name) for lines containing a match to the given PATTERN. By default, grep prints the matching lines.”, with flags used:
    • -v “Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)”
  • cut “removes sections from each line of files”
  • sort to “sort lines of text files”, with the flags:
    • -n “compare according to string numerical value”
    • -r “reverse the result of comparisons”
  • uniq “report or omit repeated lines”, with flags:
    • -c “prefix lines by the number of occurrences”
  • less “is a program similar to more (1), but which allows backward movement in the file as well as forward movement. Also, less does not have to read the entire input file before starting, so with large input files it starts up faster than text editors like vi (1).”
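The sort, uniq -c, and sort -rn combination is easiest to understand on a tiny input. The sketch below uses made-up letters instead of referrer URLs, and count_lines is a hypothetical name for the three-command core of the pipeline.

```shell
# count_lines: hypothetical helper that counts how often each distinct
# line appears on standard input, most frequent first.
count_lines() {
  sort | uniq -c | sort -rn
}

# Six input lines: "a" three times, "b" twice, "c" once.
printf '%s\n' a b a c a b | count_lines
```

This lists a with a count of 3, then b with 2, then c with 1. The first sort matters because uniq only compares adjacent lines; without it, the scattered a lines would each be counted separately.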

The order of commands in the pipeline:

  1. grep "200 " access.log: Scan the raw Apache log file, called access.log here, for successful requests, meaning those with a 200 HTTP status code. The pattern matches “200 ” anywhere in the line, so an occasional entry with a different status (for example, one whose byte count ends in 200) can slip through.
  2. cut -d '"' -f 4: Split each line on the " character and pick out the fourth field. This isolates the referrer URL in Apache's combined access log format.
  3. sort: This is easy, it sorts the output! It is needed because uniq only collapses duplicate lines that are next to each other.
  4. uniq -c: Filter out duplicates and indicate the count of each.
  5. sort -rn: Sort from the most frequent referrer to the least frequent.
  6. grep -v "YOURDOMAIN.COM": Filter out this website's own domain name, which is going to be the biggest referrer for most requests. Leave this out if you want to analyze internal referral traffic.
  7. less: It’s a pager that will let you scroll through the results interactively. You could redirect the output to a file instead if you prefer.
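If you need the status check to be exact, one possible variation replaces the first two steps with awk so that only the status code itself is tested. This is a sketch, not the command from this post: strict_referrers is a hypothetical name, and it assumes the combined log format, where the text between the second and third quote begins with the status code.

```shell
# strict_referrers LOGFILE
# Hypothetical helper: prints referrer counts for entries whose status
# code is exactly 200, instead of any line that contains "200 ".
strict_referrers() {
  # Field 3 (quote-delimited) looks like " 200 5241 ", so anchor the match
  # at the start of that field; field 4 is the referrer.
  awk -F'"' '$3 ~ /^ 200 / { print $4 }' "$1" \
    | sort \
    | uniq -c \
    | sort -rn
}
```

To keep the results rather than page through them, redirect the output to a file, for example strict_referrers access.log > referrers.txt.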

Conclusion

The command line tools available to you are very, very flexible. It’s easy to use them to gain insight into your website’s traffic without the need for a heavier analytics package.

It may well be wise to additionally use an analytics package, which I do on my own sites. But from my point of view, using grep, cut, and sort is my preferred go-to for getting specific information out of my logs.

About Frank Rietta

Frank Rietta specializes in working with startups and new Internet businesses, and in developing with the Ruby on Rails platform to build scalable businesses. He is a computer scientist with a Master's in Information Security from the College of Computing at the Georgia Institute of Technology. He teaches about security topics and is a contributor to the security chapter of the 7th edition of the "Fundamentals of Database Systems" textbook published by Addison-Wesley.
