Skip to main content.

Fri, 08 Feb 2013

Notes from attempting to despam a wiki with git-remote-mediawiki

I just tried to despam a mediawiki instance with git-remote-mediawiki.

The idea is as follows:

git log --since='Wed Dec 5 22:57:06 2012 +0000'

(You can process that with either 'grep ^Author' and so on, or you can use an overwrought Python script I wrote.)

git log --author=bad_user_1 --author=bad_user_2 --pretty="format:%H"

Here's where things start to go wrong.

You might try to revert them all:

git log --author=bad_user_1 --author=bad_user_2 --pretty="format:%H" |
xargs -n1 git revert

That works great until the first merge conflict.

So then you write a wrapper script that does "git revert $1 || git revert --abort", and you can still only revert the first few hundred (out of ~800) spam edits because one of the commits causes a conflict when you try to revert it.

Why a conflict? I suspect it's because there are spam edits that I neglected to include in the revert stream. (Update: The conflict was actually a real conflict -- some kind soul on the web had already reverted a bunch of the spam edits!)

In our case, there are fairly few pages getting spammed, so it'd be simpler to 'git log' the pages we care about and revert back to the commit IDs that look clean. 'git revert' could still be useful in the case of tangled history, but (apparently) there is a limit to how useful it can be, anyway.

Oh, also:

It'd be useful to be able to create MediaWiki dump files from git-remote-mediawiki exports. That way, I could use 'git rebase -i' to clean up history. (That would break links *unless* the MediaWiki revision IDs somehow stayed constant for the revisions with the same content. Maybe that's feasible. Actually, the simplest way might be to write a tool that filters the dump file itself, rather than exporting straight from git-remote-mediawiki.)

Also also, I fixed a format string bug in git mergetool, one of my favorite little pieces of git.

P.S. In this corpus, of the IP address editors (i.e., not logged in), 0 (of 16) are spammers. About 80% of the logged-in editors are spammers. (Admittedly our wiki does require you to log in if you are posting new URLs to a page.)

Update: It is way faster if you run it with low latency to the MediaWiki server in question. It probably could be adjusted to make fewer API calls, and to make more of them in parallel.

[] permanent link and comments

Comment form

  • The following HTML is supported: <a href>, <em>, <i>, <b>, <blockquote>, <br/>, <p>, <abbr>, <acronym>, <big>, <cite>, <code>, <dfn>, <kbd>, <pre>, <small> <strong>, <sub>, <sup>, <tt>, <var>
  • I do not display your email address. It is for my personal use only.

Your email address: 
Your website: