Wed, 15 Oct 2008
I had an old project, which never quite succeeded, to mine Wikipedia for style edits. From those edits, one could learn what makes a style improvement and attempt to generalize that to other texts. Think of it as the minimal bootstrapping of a purely statistical "grammar checker." It was originally going to be my master's project under the awesome Jason Eisner. (Instead, I contributed to a storage layer branch of Dyna's compiler.)
It has some nice filters (written using SAX, so they don't chew up all your RAM on pages with 2GB of history) for cutting MediaWiki page dumps down to just what we want. They can also optionally modify the page text, so that the output contains the results of data processing.
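To give a flavor of the approach, here is a minimal sketch of a streaming filter in Python. It is not the code from the repository: it assumes the usual MediaWiki export layout (page, title, revision, comment, text elements), and the "keyword in the edit comment" test is a made-up stand-in for a real style-edit heuristic. The point is only that a SAX handler keeps at most two revision texts in memory at a time, however long the page history is.

```python
import io
import xml.sax


class RevisionFilter(xml.sax.ContentHandler):
    """Stream a MediaWiki XML dump and report revisions whose comment matches.

    Hypothetical sketch: the keyword test stands in for a real heuristic.
    """

    CAPTURED = ("title", "comment", "text")

    def __init__(self, keyword, on_match):
        super().__init__()
        self.keyword = keyword.lower()
        self.on_match = on_match  # called as on_match(title, before, after)
        self.buf = None           # pieces of the element currently captured
        self.title = ""
        self.comment = ""
        self.text = ""
        self.prev_text = ""       # text of the previous revision of this page

    def startElement(self, name, attrs):
        if name == "page":
            self.title, self.prev_text = "", ""
        elif name == "revision":
            self.comment, self.text = "", ""
        elif name in self.CAPTURED:
            self.buf = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate.
        if self.buf is not None:
            self.buf.append(content)

    def endElement(self, name):
        if name in self.CAPTURED:
            setattr(self, name, "".join(self.buf))
            self.buf = None
        elif name == "revision":
            if self.keyword in self.comment.lower():
                self.on_match(self.title, self.prev_text, self.text)
            # Only the latest revision is kept for the next comparison.
            self.prev_text = self.text


# Tiny made-up dump, just to exercise the handler.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision><comment>created</comment><text>Teh cat sat.</text></revision>
    <revision><comment>copyedit: style</comment><text>The cat sat.</text></revision>
  </page>
</mediawiki>"""


def show(title, before, after):
    print(f"{title}: {before!r} -> {after!r}")


xml.sax.parse(io.BytesIO(SAMPLE), RevisionFilter("style", show))
```

For a real dump you would pass an open file (or a decompressing stream) to xml.sax.parse instead of the in-memory sample, and have the callback write each before/after pair out rather than hold on to it.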
The code also has some hilarious (and useful!) Makefiles that treat a bunch of heterogeneous computers as a compute cluster. The higher the "-j" you pass to make, the more machines it will SSH into, copy your code onto, run your code on, and rescue the output from.
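The Makefiles themselves aren't reproduced here, but the pattern can be sketched in Python. Everything in this sketch is an assumption for illustration: the host names, the remote directory, the run.sh entry point, and the guess that the parallelism level simply caps how many machines get used. It only builds the command lines (copy code over, run remotely, fetch the output) and does not execute anything.

```python
import itertools
import shlex


def plan_jobs(hosts, jobs, parallel, code_dir="code", remote_dir="/tmp/wikijob"):
    """Assign jobs round-robin to at most `parallel` hosts.

    Returns a list of (host, [commands]) pairs. All paths and the
    ./run.sh entry point are hypothetical placeholders.
    """
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    active = hosts[:parallel]  # like a higher -j: more machines in play
    plan = []
    for host, job in zip(itertools.cycle(active), jobs):
        out = f"{job}.out"
        remote_cmd = f"cd {remote_dir} && ./run.sh {job} > {out}"
        plan.append((host, [
            # 1. copy the code onto the machine
            shlex.join(["rsync", "-a", f"{code_dir}/", f"{host}:{remote_dir}/"]),
            # 2. run it there
            shlex.join(["ssh", host, remote_cmd]),
            # 3. rescue the output
            shlex.join(["scp", f"{host}:{remote_dir}/{out}", out]),
        ]))
    return plan


for host, commands in plan_jobs(["alpha", "beta", "gamma"], ["part1", "part2", "part3"], parallel=2):
    print(host)
    for command in commands:
        print("   ", command)
```

In the make version, each of these three-step sequences would presumably be a recipe, with make's own job scheduler deciding how many run at once.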
I'm putting it in my git repository, having dug it out of my JHU NLP Subversion repository. Check it out in my gitweb.
It's not the prettiest thing ever....