Use AWStats with Amazon S3 / CloudFront

I’ve recently written two posts about AWS and processing its logs on an external server, so here comes a final post to tie it all together.

Goal

To have an automated cron job that pulls Amazon S3 and/or CloudFront log files down to your own server and then uses logresolvemerge to combine them for processing with AWStats.

Requirements

An Amazon Web Services account where your log files are collected in a bucket. Your own dedicated server, VPS or other host with shell access, so that you can install Python and boto and add your own scripts. AWStats should also already be installed and running there.
I built this setup on Ubuntu 10.04, which I run on a VPS over at Linode where I currently host a couple of projects.

Setup

I’m currently collecting logs from one CloudFront distribution and one S3 bucket which I have mapped my own CNAMEs to. Let’s call them s3.example.com and cdn.example.com. Then I have another bucket which is not public but where I store other things. So in this bucket I’ve made a folder named logs and in that folder I have a folder for each domain.

So the log prefixes for me in this case become

/logs/s3.example.com
/logs/cdn.example.com

Download AWS Logs

To be able to download the log files from Amazon to a local directory, check out my earlier post about downloading AWS logs with boto and Python.

Configure AWStats

Now we need to set up some AWStats configuration to prepare it for handling the AWS logs. I’ll create two configuration files for this example:

/etc/awstats/awstats.cdn.example.com.conf
/etc/awstats/awstats.s3.example.com.conf

I assume that you already know your way around configuring AWStats, so I’ll focus on the specifics for AWS compatibility. The final log files that will be created later on for AWStats to use will be stored in /var/log/apache2/, so I point the LogFile option to that location. Then we just have to set up the LogFormat correctly. Below is the setup for S3 log files, followed by the one for CloudFront log files.

S3 AWStats LogFormat

LogFile="/var/log/apache2/s3.example.com.log"
LogFormat="%other %extra1 %time1 %host %logname %other %method %url %otherquot %code %extra2 %bytesd %other %extra3 %extra4 %refererquot %uaquot %other"

CloudFront AWStats LogFormat

LogFile="/var/log/apache2/cdn.example.com.log"
LogFormat="%time2 %cluster %bytesd %host %method %virtualname %url %code %referer %ua %query"
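To make the CloudFront LogFormat easier to check against your own files, here is a small sketch that prints each column of a log line next to the token it is mapped to. The sample line is made up, the column order is simply the one the LogFormat above implies, and map_cloudfront_fields is a hypothetical helper name.

```shell
#!/bin/bash
# Made-up CloudFront log line for illustration; real files are tab-separated.
sample=$'2010-11-29\t04:43:10\tARN1\t3251\t192.0.2.10\tGET\tcdn.example.com\t/img/logo.png\t200\t-\tMozilla/5.0\t-'

# LogFormat tokens in column order; %time2 covers both the date and time columns.
tokens=(%time2 %time2 %cluster %bytesd %host %method %virtualname %url %code %referer %ua %query)

# Hypothetical helper: print each column of one log line next to its token.
map_cloudfront_fields() {
    local -a fields
    local i
    IFS=$'\t' read -r -a fields <<< "$1"
    for i in "${!fields[@]}"; do
        printf '%-13s %s\n' "${tokens[$i]}" "${fields[$i]}"
    done
}

map_cloudfront_fields "$sample"
```

If a line from your own logs produces more or fewer columns than there are tokens, the LogFormat probably needs adjusting for your distribution.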

If you also want the CloudFront statistics to display information about the edges you can check out my post about CloudFront Edges in AWStats.

Automate everything

Now that we have all components in place, we just need to automate them so we can add the whole thing to cron later on. I’ve made a bash script which takes care of the automation. The script is not very complicated, but I’ll do a quick walkthrough of it so it can be modified for specific needs and setups. I added a number to the comment for each section in the script, which I use as a reference in the list below.

  1. A few variables used in the script. The date variable just holds the current date. I don’t really use this information at the moment other than appending it to a temporary directory name, but it could be handy if I want to expand the script in the future to keep the archives around. Then I create a variable for each log I want to process. In this example I process two logs, one S3 and one CloudFront, so I have two variables here containing paths to temp directories where the log files will be downloaded.
  2. Here we use the boto Python script I created earlier to download all log files from Amazon to our local temp directories.
  3. Now that all the log files have been downloaded, we need to combine them into a format that AWStats can understand. The first line combines the CloudFront logs. They are very straightforward, so they just need to be combined into one large file and AWStats is ready to process it. The second line processes the S3 log files into one large log file. S3 is a bit more tricky as its logs contain a few things AWStats doesn’t understand, so I use a regexp to remove the things that would give AWStats a headache. I store my final AWS log files in /var/log/apache2/, which is the path I defined in the LogFile option for AWStats earlier.
  4. Our log files are now downloaded and combined into the final log files that are stored in /var/log/apache2/, so I simply delete the temporary downloaded files, as I don’t need to keep them around anymore.
  5. And finally we execute AWStats to update the statistics with the log files we have just processed.
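To see what the regexp in step 3 actually does, the two sed expressions can be tried on a single line. This is only a sketch: the log line is made up for illustration and normalize_s3_operation is a hypothetical wrapper name, but the sed expressions are the same ones used in the script.

```shell
#!/bin/bash
# The same two sed expressions as in get-aws-logs.sh, wrapped in a
# hypothetical helper so they can be tried on their own.
normalize_s3_operation() {
    sed -e 's/SOAP\.\([A-Z]*\)/\1/' -e 's/REST\.\([A-Z]*\)\.[A-Z]*/\1/'
}

# Made-up S3 log line for illustration only. Split on whitespace, the
# operation (REST.GET.OBJECT) is the eighth word because the timestamp
# contains a space.
sample='0123abcd example-bucket [29/Nov/2010:04:43:10 +0000] 192.0.2.10 - 3E57427F3EXAMPLE REST.GET.OBJECT img/logo.png "GET /img/logo.png HTTP/1.1" 200 - 3251 3251 12 11 "-" "Mozilla/5.0" -'

# Prints: GET
printf '%s\n' "$sample" | normalize_s3_operation | awk '{ print $8 }'
```

So REST.GET.OBJECT is reduced to GET, which leaves a plain verb in the position that the LogFormat maps to %method, and a SOAP operation just loses its SOAP. prefix.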

get-aws-logs.sh

#!/bin/bash
# Initial, cron script to download and merge AWS logs
# 29/11 - 2010, Johan Steen

# 1. Setup variables
date=$(date +%Y-%m-%d)
cdn_folder="/tmp/log_cdn_$date/"
static_folder="/tmp/log_static_$date/"

# 2. Call the python script to download log folders from Amazon to local folders
python /home/johan/get-aws-logs.py --prefix=logs/cdn.example.com/ --local="$cdn_folder"
python /home/johan/get-aws-logs.py --prefix=logs/s3.example.com/ --local="$static_folder"

# 3. Merge and add the downloaded log files to the local log file
/usr/local/bin/logresolvemerge.pl "${cdn_folder}"* >> /var/log/apache2/cdn.example.com.log
/usr/local/bin/logresolvemerge.pl "${static_folder}"* | sed -e 's/SOAP\.\([A-Z]*\)/\1/' -e 's/REST\.\([A-Z]*\)\.[A-Z]*/\1/' >> /var/log/apache2/s3.example.com.log

# 4. Delete the downloaded log files
rm -rf "$cdn_folder"
rm -rf "$static_folder"

# 5. Update the AWStats Logs
/usr/lib/cgi-bin/awstats.pl -config=cdn.example.com -update
/usr/lib/cgi-bin/awstats.pl -config=s3.example.com -update
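One possible hardening, offered as a sketch and not as part of the original setup: if a download produces no files, the glob in step 3 matches nothing and bash passes the pattern on literally, so logresolvemerge.pl is asked to read a file that does not exist. A small guard can skip the merge in that case. The helper name has_log_files is hypothetical, and the -maxdepth and -quit options assume GNU find, as shipped with Ubuntu.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if the directory exists and
# contains at least one regular file.
has_log_files() {
    local dir="$1"
    [ -d "$dir" ] && [ -n "$(find "$dir" -maxdepth 1 -type f -print -quit)" ]
}

# Demonstration with a throwaway directory instead of real AWS logs.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT

if has_log_files "$demo_dir"; then
    echo "would merge logs from $demo_dir"
else
    echo "no log files downloaded, skipping merge"
fi
```

In get-aws-logs.sh the same test could wrap each line of step 3, for example by running the CloudFront merge only when has_log_files succeeds for the CloudFront temp directory.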

Cron it!

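And finally, add the bash script to cron and run it as often as you feel is appropriate for your setup. The entry below includes a user field (root), which is the format used in /etc/crontab and in files under /etc/cron.d; leave that field out if you add the line to a personal crontab with crontab -e.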

# Process the AWS Logs at 4:43 every night
43 4 * * * root /home/johan/get-aws-logs.sh >/dev/null

And that’s it. Feel free to leave a comment if you have any questions or suggestions for improvements.