
Where are the log analyzers?

I was doing some evaluation of log analyzers, and was a bit disappointed by what I found. It seems like the only serious open source log analyzers are Analog, AWStats, and Webalizer. They all seem okay, but not great. They don't seem any different than they were five years ago.

Am I missing something? This doesn't seem like a hard problem. It's also something that would be useful to solve well. All it should take is an egoist who is obsessed with their logs. Isn't there a glut of such people out there, with the necessary programming skill? That described me at one point, though I never really carried it to completion. Why hasn't something better come along? Or has something better come along, and whoever made it forgot to tell everyone?

Sometimes it's easy to understand why some holes in open source exist, but this one doesn't make any sense to me.

Created 19 Oct '04
Modified 14 Dec '04

Comments:

Two points:

* Most of the interesting things to do in log analysis are of more interest to "marketing people" than to "technical people".

* The "okay" analyzers are obviously "good enough" for "technical people". :)

# Phillip J. Eby

1. What are you looking for but not finding?

2. Would you *stop* working on exactly what I'm working on every day? ;) I just picked Analog from the same candidates on Saturday.
# Robert Brewer

AWStats can be a pain to install, but I think it's more useful than the others. It gives you quite a bit of information, though what counts as useful depends on what you're looking for!
# Eric Radman

Anyone with a website who is interested in how that website is read should be interested in log analysis. There's a reason most blog software has referrer tracking built in, and many have other kinds of tracking as well. Log analysis provides everything the blog software can do, but applies more generally to any website. So I really don't think it's a lack of interest by technical people.

Maybe I'll write a list of features I think should exist but don't, or that don't work as well as they should.
# Ian Bicking

Phillip was dead on when he said log analysis is only of interest to marketing people. Writing tools for marketing people, when the problems aren't hard or interesting to solve, is something open source is bad at. The person who can scratch isn't the one who itches.

I won't say this is a completely uninteresting problem; my bread and butter is web analytics (written in Python). It is sometimes interesting: getting standard x86 boxes to each handle hundreds of thousands of hits a day, and to report on the information in real time, has its challenges. But I certainly wouldn't be doing it as a hobby.

Additionally, picking the right metrics and presenting them in an understandable way takes at least one statistics guy and one UI guy. Open source apps have trouble with consistency here: you either end up with a mountain of uselessly specific reports or a few generic ones.
# Jack Diederich

I would expect something general enough to make defining reports easy, and to display those reports reasonably well. The actual detailed reports are something that takes time and thought to figure out, and they aren't very general, so I wouldn't expect them to come out of open source quickly; but we're great at making tools. For instance, Yet Another Advanced Logfile Analyser is an effort in that direction, defining a query language and a generic report based on that.
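To make that concrete, here's a minimal sketch of the kind of generic tool I mean: parse each log line into named fields, and let a "report" just be "count hits grouped by whatever field you name". (The regex and field names are illustrative, not taken from any particular project.)

```python
import re
from collections import Counter

# Apache common log format, broken into named fields.
LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)

def report(logfile, field):
    """Count hits grouped by one named field, e.g. 'path' or 'status'."""
    counts = Counter()
    for line in logfile:
        m = LINE.match(line)
        if m:
            counts[m.group(field)] += 1
    return counts.most_common()

# Usage:
#   with open("access.log") as f:
#       for value, hits in report(f, "path")[:20]:
#           print(hits, value)
```

Everything interesting then lives in which fields and groupings you expose, which is exactly the part that takes time and thought.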
# Ian Bicking

I tried my hand at log analysis, but finally figured that's not where it's at for the more advanced stuff.

I'd suggest looking at how phpOpenTracker uses images/javascript to gather the information you'd likely need, and to generate reports from it (incredibly slowly, at least on one of my previous company's about-a-million-users-a-month web sites). I'm not sure phpOpenTracker is the one that does it (I don't think it is), but there's a tracker which uses the same trick that lets you dynamically replay "click-trails" for individual users, and that generates "average stay per page", "common paths", and so forth. Now if I could only recall the name...

A nice-ish bonus is that the images/javascript trick has a very good chance of not picking up bots. And with an additional white-list, you might still be able to pick up Lynx/Links/w3m users. I spend way too much time removing bots - I was considering going white-list only at one stage, but lots of the bots fake user agents to mimic browsers. And doing additional analysis to pick up bots at log-file-analysis time is quite hard.
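Roughly, the image trick is just a tiny script that logs the request and returns a 1x1 transparent GIF. A minimal CGI sketch (the log path and script name are made up, and a real tracker would also set a visitor cookie):

```python
#!/usr/bin/env python
import os
import sys
import time

# A 1x1 transparent GIF, byte for byte.
PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff"
         b"!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01"
         b"\x00\x00\x02\x02D\x01\x00;")

# Log who asked for the pixel; the referer is the page being tracked.
with open("/var/log/tracker.log", "a") as log:
    log.write("%s %s %s %s\n" % (
        time.strftime("%Y-%m-%dT%H:%M:%S"),
        os.environ.get("REMOTE_ADDR", "-"),
        os.environ.get("HTTP_REFERER", "-"),
        os.environ.get("HTTP_USER_AGENT", "-")))

sys.stdout.write("Content-Type: image/gif\r\n\r\n")
sys.stdout.flush()
sys.stdout.buffer.write(PIXEL)
```

Each page then embeds something like `<img src="/cgi-bin/track.py" width="1" height="1">`. Since most bots fetch the HTML but never the embedded images or javascript, the tracker's log ends up being mostly humans.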
# Neil Blakey-Milner

Ian's got a point. Look at the feature set of commercial log analysis tools, and you see how they track customers: why shopping carts are abandoned, where the traffic is coming from, what search words were used, how much time people spend on a page before moving on. Intranets could do with measuring how effective certain pages are, and which resources are going unused because of inattention. Log analysis could answer these questions.
# Chui

Part of the reason is that once you have a sufficient understanding of HTTP, you realise that virtually all the questions you wanted log analysis to answer can't actually be answered reliably from the logs. Things like caching make it practically impossible to get visitor counts, etc. There are tricks you can do (e.g. cookies), but there are always things that make the tricks unreliable (e.g. people refusing cookies); the sketch below shows both the trick and the hole in it.

HTTP log analysis is good for server optimisation, and not a lot else.
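For what it's worth, the cookie trick looks something like this: hand each browser a random id, log it on every hit, and count distinct ids as "visitors". A sketch (it assumes you've configured the server to log the cookie value as the last field of each line, with "-" when the browser refused the cookie):

```python
import random

def assign_visitor_id(cookies):
    """Return the existing visitor id, or mint a new random one."""
    if "visitor" in cookies:
        return cookies["visitor"]
    return "%016x" % random.getrandbits(64)

def unique_visitors(logfile):
    ids = set()
    refused = 0
    for line in logfile:
        fields = line.split()
        if not fields:
            continue
        if fields[-1] == "-":
            refused += 1   # cookie refused: invisible to the count
        else:
            ids.add(fields[-1])
    return len(ids), refused
```

Everyone who refuses cookies collapses into that "-" bucket, and requests served out of shared caches never reach your server at all, so the number is a floor at best.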
# Jim