Ian Bicking: the old part of his blog

Mailing list archivers

I'm getting ready to do a project that involves setting up Mailman lists and archives and a simplified web interface and whatnot. Seems like basic stuff, though we've investigated it enough so far not to expect it to be basic. But I've looked through Mailman source in the past, and it's not that hard to grok.

But what's really starting to make me wonder is the archiving -- I can't find any good archivers. Everyone seems to agree that Pipermail is obsolete. MHonArc is still alive, though it doesn't seem to be lively, and Hypermail is... I don't know what Hypermail is. Looks just like Pipermail to me. They all produce rather boring static HTML pages, without a whole lot of added value. And what's up with MailBoxer? I'm a little nervous about an all-in-one package (and Zope, and DTML).

I must be missing something. Don't people want nice mail archives? I know this isn't rocket science, so why hasn't someone done something cool? I'm starting to feel like I should just code something myself -- the email module does the work I'd be scared to do (parsing real-world email messages), and I could use dbmail (if I fully trusted a big database archive, which I'm unsure of). For search I'm still uncertain, but I think PyLucene is probably easier to build and use than the first time I tried it, and everyone seems to like its results.
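To make the "email module does the hard part" point concrete, here is a minimal sketch of pulling the fields an archiver would index out of one raw message. It assumes the modern Python 3 `email` API (`policy=default`, `get_body`), which is newer than this post; the sample message and the `summarize` helper are made up for illustration.

```python
from email import message_from_string
from email.policy import default

# A made-up message; the Subject uses an RFC 2047 encoded-word on purpose.
RAW = """\
From: Alice <alice@example.org>
To: list@example.org
Subject: =?utf-8?q?Caf=C3=A9_archive?=
Message-ID: <1@example.org>
Date: Mon, 04 Apr 2005 10:00:00 +0000
Content-Type: text/plain; charset="utf-8"

Hello list.
"""


def summarize(raw):
    """Hypothetical helper: return the fields an archive page might show."""
    # policy=default gives decoded headers and the get_body() convenience API.
    msg = message_from_string(raw, policy=default)
    # Prefer a text/plain part; for a non-multipart message this is msg itself.
    body = msg.get_body(preferencelist=("plain",))
    return {
        "subject": str(msg["Subject"]),
        "from": str(msg["From"]),
        "message_id": str(msg["Message-ID"]),
        "text": body.get_content() if body is not None else "",
    }


if __name__ == "__main__":
    for key, value in summarize(RAW).items():
        print(f"{key}: {value!r}")
```

Real-world mail is much messier than this (broken charsets, missing headers, odd MIME nesting), so a real archiver would need error handling around every one of these calls.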

Maybe I should look at this as an "opportunity". But dear lazyweb, I am very willing to ride on other people's coattails, just tell me whose...

Created 04 Apr '05

Comments:

I wrote a mailing list archive thing for the css-discuss list in PHP a few years ago, but the source was never released. It took the best part of 12 hours - they really aren't very complicated pieces of software. In Python it would be even easier thanks not only to the email module but also to AMK's threading module: http://www.amk.ca/python/code/jwz

For search, I'm a huge fan of http://www.swish-e.org . It has good Python bindings (if you hunt around a bit) and it's easy to set it up to index pretty much anything. I used it for the search engines on http://www.ljworld.com and http://www.lawrence.com .

# Simon Willison

I had a similar feeling about pipermail and wrote a tiny app for the quixote and durus user lists (at http://mail.mems-exchange.org/durusmail/). Like you said, this is not rocket science.

I think there might not be an obvious consensus about what features are desirable for this purpose. For me, I wanted threaded presentation, but no searching-- I'm satisfied to let Google take care of that.

# David Binger

Instead of PyLucene I would suggest Xapian (http://www.xapian.org). The indexer library is written in C++, is damn fast, scales well and has nice Python bindings. I have written a small desktop search app in only one evening.

The email package is very helpful for parsing mail, but you still have a lot of stuff to do. See http://www.strakt.com/docs/ep04_email.pdf (Real-world email handling in python).

# Henning

I just tried Xapian out -- builds easily, looks like fairly straightforward Python bindings, and okay documentation. Very interesting. PyLucene still drives me nuts to build, so it's just not worth pursuing at the moment, and I'm not even sure if it offers anything over Xapian. I'd looked at Swish-E quite a bit before, but Xapian looks a little more modern and with better-defined layers.

# Ian Bicking

Aside from Mailboxer you could also look into PlonePostOffice. It may not be what you want (being a Zope/Plone product) but at least it does not use DTML ;-) The largest difference with Mailboxer is that it stores the email as content in the ZODB. This enables the normal Zope catalog and Plone workflow to do interesting things with it.

# Jeroen Vloothuis

If I'm scared of an RDBMS archive, I'm doubly scared of ZODB ;) But thanks for the link -- if I was in Plone I could see how this would be very nice.

# Ian Bicking

You wrote: "I'm starting to feel like I should just code something myself". It's exactly my feeling some time ago! And I also managed to stay lazy.

But the big differences in our approaches are technical details. I'd like to introduce MailML, a format for specifying the structure in mail messages. The possible logical blocks are titles, attachments, cites, signatures, urls, addresses and other. The MailML documents can be stored in an XML database and queried through the web, something like Syncato (http://www.syncato.org/). It should be an ultimately powerful system.

I'm sure it's a business opportunity. But I don't have time to try it. Unfortunately.

# olpa

IIRC, someone (was it AMK?) wanted to do a whole new archive web interface as well. I don't think it got real, though.

I am interested in a good mailman archiver as i am still looking for one for codespeak's mailing lists. However, i probably want something where i can stuff in source code, documentation, IRC-logs and mailing-list archives and have a nice search interface for all that (displaying search results in content-specific ways but just so that it reads nicely).

# holger krekel

> I know this isn't rocket science, so why hasn't someone done something cool?

I think that to myself. All the time. About everything.

Anyway, as you probably know, i want to release a tool that does this.

(Typing in this comment using ReStructuredText was excruciating.)

# Ping

Take a look at lurker. Have been using it for quite a while and it's pretty good.

http://lurker.sourceforge.net/

# Meng Kuan

Thanks Meng -- I hadn't seen that, but looking at it I'm quite impressed. The C++ underpinnings scare me a little -- C++ web applications seem weird -- but I expect the XSL layer will give me the flexibility I need. And the actual feature set looks great, so I think we'll end up using that. (I still like Xapian/Omega, so I may look into using that for website searches.)

# Ian Bicking

I am working on mod_mbox: http://svn.apache.org/repos/asf/httpd/mod_mbox/trunk/

The current focus is on adding searching via Lucene.

I am also not happy with the alternatives :)

# Paul Querna