Trace your referrers in real-time

One of the magic things about Dreamhost is the ability to login to your account using ssh. Apart from standard things that you can do with it, like compile and install any program that you like (excluding those, which require root access, of course), you can also experience a little bit of magic if you run this command:

tail -f current-httpd-accesslog

There is really something special about tracking your website’s visitors as they come. And, of course, the most interesting thing is actually knowing where they come from (also known as referrers). The problem with this command is that it prints lots of garbage on the screen (like timestamps, response codes, sizes, etc.). The problem is that you cannot grep your real-time tailed log. Why? Don’t ask me, I’m not a Linux guru. You just can’t and that’s it. You can, however, write a script, which deals with that in its own way. And this is what I’ve been writing for the whole day. It was both fun and painful to learn for the n-th time all those shell hacks and quirks. Was it worth it? Sure it was! As a result I came up with this little bash script to trace your referrers in real-time. You can download it or view it below. I must warn you, though. It is highly addictive. Really…

#!/bin/bash

# ========================================================================
# REAL-TIME WEBSITE REFERERS TRACER
# ========================================================================
#
# What?
#   Real-time website referrers tracer is a shell script that lets you
#   trace your visitors as they come. Script works in an ultra compact
#   four-columns view :)
#
# Why?
#   Because you cannot do 'tail -f access_log | grep something' and you
#   really want to grep out most of the stuff that your httpd puts in
#   the logs.
#
# Requirements:
#   - website (with not too low and not too high traffic),
#   - shell account on the server where your website is hosted,
#   - access to httpd logs that use the COMBINED format.
#
# Installation:
#   - copy anywhere in your home directory,
#   - edit the script and set the 'log' variable so it actually
#     points to your current httpd log,
#   - make sure the script has execute rights (chmod +x trace-referers).
#
# Running:
#   - just run the script and watch the screen.
#
#
# Version: 0.2 (2005-06-18)
# Author: Paweł Gościcki, http://pawelgoscicki.com
#
# No copyright rights. You can do whatever you want with this. You may even
# claim this scrip has been written by you from the very beginning ;)
#
# If you, however, improve it, send me a copy (paul_AT_pawelgoscicki.com).
#
# Based heavily on the tgrep script by Ed Morton (morton_at_lsupcaemnt.com):
# http://unix.derkeiler.com/Newsgroups/comp.unix.shell/2004-01/0818.html


# CONFIGURATION
# =============

# Where your httpd log file is
log="current-httpd-accesslog"

# What files to exclude (request for those files won't be shown, regexp syntax)
exclude="\.gif|\.jpg|\.png|\.ico|\.css|\.js"

# Width of request and referrer columns (set it to match your terminal's width)
col_width=35


# MAIN SCRIPT
# ===========

# Check if log file actually exists (and is readable)
if [ ! -r "${log}" ]; then
echo "Cannot access log file: $log"
exit 0
fi

# After startup we will output few lines
start=`wc -l < "${log}"`
start=$(( $start - 30 ))
if (( ${start} < 0 ))
then start=$((0))
fi

# Main loop
while :
do
  end=`wc -l < "${log}"`
  end="${end##* }"
  if (( ${end} > ${start} ))
  then
    start=$(( $start + 1 ))
    sed -n "${start},${end}p" "${log}" | egrep -v "${exclude}" | \
    awk -v col_width=$col_width '{

      # we are only interested in GET/POST requests
      if ( match($0, /\"(GET|POST).*?\"/) > 0 )
      {
        split($0, fields, "\"")

        # IP_ADDRESS
        tmp = $1
        while ( length(tmp) < 15 ) tmp = tmp " "
        printf "%s", tmp " "
    
        # HTTP_REQUEST (GET/POST)
        tmp = substr(fields[2], 0, index(fields[2], "HTTP/") - 1 )
        tmp = substr(tmp, index(tmp, " ") + 1, col_width)
        while ( length(tmp) < col_width ) tmp = tmp " "
        printf "%s", tmp " "
    
        # REFERER (the juice)
        tmp = fields[4]
        while ( length(tmp) < col_width ) tmp = tmp " "
        printf "%s", tmp " "
    
        # USER_AGENT
        printf "%s", fields[6]
    
        # new line at the end
        printf "\n"
      }
    }'

    start=${end}
  fi

  # this is an endless loop executed every second
  sleep 1
done

Your current hosting provider does not support ssh access? You might then want to read my other post about hosting with dreamhost for as little as 9$/year. Have fun!