Tue, 21 Oct 2008
Hixie Limerick
RDFa is a standard now. Meanwhile, Ian Hixie is considering how (and if) to make this part of HTML5. To celebrate, I wrote this Limerick as a vision from the future.
There once was a fellow named Hixie,
Our pal Joi lent him a fixie.
He took HTML5,
Added RDF jive,
And returned positively a-rixie!
Sun, 19 Oct 2008
Packaging, and other joys of Debconf
I was trying to explain to my friend Emily (C.) some of the fun things about Debconf.
On one of the first days I attended, I was standing around while some people I didn't yet know discussed piuparts, an automated Debian package tester.
At this point when talking to Emily, I thought, Maybe I shouldn't bother explaining what piuparts is. If I do explain it, it will make me much more interested in the telling of the story, as well as let her make sense of the story. Or I could be vague to avoid boring her, but then I'll bore myself by only telling the skeleton of a story.
I know Emily well enough that she'll forgive me boring her, I decided. So I'll give it a try.
"The bulk of work in Debian is packaging, which means finding up-to-date open source software and bundling it up into a nice installer," I began. "Windows installers, if you're lucky, will create an entry in Add/Remove Programs. But Debian installers, to comply with Debian Policy, have to do a lot more."
"Let's say you already had the Safari web browser installed and you wanted to install Google Chrome, their new browser based on the same core as Safari. When you upgrade Safari, it would be nice if Google Chrome also benefitted from the upgrade."
"In Debian, it would." I continued with another obscure fact about the Debian Policy. "Another element of the Policy is that when a package is fully removed, it must leave no files and leave no programs running."
Suddenly she was interested! For a moment I didn't understand why. Then I realized what I had said: Something I take for granted in Debian, the "leave no trace" element, is something Windows users often wish they had.
I continued, "There is an automated tool called piuparts which takes a package, creates a small virtual install of Debian, runs the package's installer, then uninstalls it and verifies that the package does in fact leave no trace."
Explaining the rest was easy: The first day I was at Debconf, I ran into some people discussing piuparts. Lucas explained he was slow to program in Python, the language piuparts is written in, and Emily rightly picked up on the fact that Python is my favorite programming language. Lucas explained that piuparts needed a machine-readable report format so that you could automatically run it on the whole Debian archive and get a list of which packages have problems. I volunteered to add that.
After a few days of hardly working on this, I finally was sitting with some new friends Thursday night. They left, and I worked on everything I could possibly justify working on. Then it was 5 a.m., and I knew there was no more time to waste if I wanted to actually finish the modification to piuparts. So I began it, ate breakfast, and finished it.
It was really great having a comfortable environment to work all night in. It was even better that I had people to stay up late talking to about geeky things that came from a shared interest in programming, system administration, and Free Software principles. When people left, there was always a great reason to stay awake: more great people to talk to, or finally the assignment I gave myself at the start of Debconf. I had that joyous feeling from the people every evening at Debconf, and Thursday night the feeling brought me all the way to morning.
Wed, 15 Oct 2008
Mining Wikipedia for style edits
I had an old project, which never quite succeeded, to mine Wikipedia for style edits. Through this, one could learn what made a style improvement, and attempt to generalize that to other texts. Think of it as the minimal bootstrapping of a purely-statistical "grammar checker." It was originally to be my master's project under the awesome Jason Eisner. (Instead, I contributed to a storage layer branch of Dyna's compiler.)
It has some nice filters (written using SAX, so they don't chew up all your RAM for pages with 2GB of history) for filtering down MediaWiki page dumps into just what we want, also optionally modifying their text so that the new outputted versions contain the results of data processing.
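To illustrate why SAX keeps memory use flat, here is a minimal sketch in Python using the standard xml.sax module. It is not the project's actual filter: the class and function names are made up for this example, and it assumes the usual dump layout of page elements containing a title and a series of revision elements. It only counts revisions per page, but a real filter would hook into the same callbacks.

```python
import xml.sax


class RevisionCounter(xml.sax.ContentHandler):
    """Count revisions per page, holding only the current title in memory."""

    def __init__(self):
        super().__init__()
        self.counts = {}          # page title -> number of revisions seen
        self._in_title = False
        self._title_parts = []
        self._title = None

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._title_parts = []
        elif name == "revision":
            self.counts[self._title] = self.counts.get(self._title, 0) + 1

    def characters(self, content):
        # The parser may deliver one text node in several pieces.
        if self._in_title:
            self._title_parts.append(content)

    def endElement(self, name):
        if name == "title":
            self._in_title = False
            self._title = "".join(self._title_parts)


def count_revisions(stream):
    """Stream-parse a dump from a file-like object; never builds a tree."""
    handler = RevisionCounter()
    xml.sax.parse(stream, handler)
    return handler.counts
```

Because the handler sees one event at a time, a page with gigabytes of history costs no more memory than a page with a single revision.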
The code also has some hilarious (and useful!) Makefiles that treat a bunch of heterogeneous computers as a compute cluster. The higher "-j" you pass into make, the more machines it will SSH into, copy your code onto, run your code, and rescue the output.
I'm putting it in my git repository, having dug it out of my JHU NLP Subversion repository. Check it out in my gitweb.
It's not the prettiest thing ever....
Sun, 28 Sep 2008
Colorizing standard error: Adventures in LD_PRELOAD
Kristian again asked an interesting question on the SF-LUG mailing list. This time, it was: "How can one get stderr and stdout to appear in different colors?" He was asking on behalf of someone, in turn on behalf of a Java programmer.
I thought about this and discussed it with Jesse Zbikowski, who I happened to be sitting next to at the Tenderloin Computer Help Day that Christian Einfeldt invited the list to (which turned out to be a lot more interesting and orderly than I had imagined!).
Jesse and I talked and we thought of named pipes, which Jesse got to work on and produced a nice Perl tool for. I thought about LD_PRELOAD and got off to a few false starts, and finally came up with a tool I called stderred (tarball of v1.2). It includes a demo program in Java and a README.
LD_PRELOAD
LD_PRELOAD wrappers are a way to change the way a program executes by replacing library functions, like write() or gettimeofday(), with your own homebrew versions. You can think of the dynamic linker as allowing you to stack your own things "above" the C library, but "below" the actual program that runs. So in looking for a symbol (a function name, typically), the program searches down until it finds it, and uses that.
"stderred" is a C program and a Makefile that you can demonstrate works properly; it includes a sample Java program and a README. Because it intercepts the Java JRE's calls to write() to write out messages to stdout, stderr, or whatever, and only modifies the ones to stderr, it should be safe to use everywhere. Plus there are no race conditions; it runs right in the context of the program, so it also avoids the performance penalty of context switches.
This LD_PRELOAD wrapper is interesting, I think, because (thanks to Eric Northup for the idea) it calls the real system write() function by yanking it out of libc using dlopen()+dlsym(). I was also (you can see this in the first few revisions) trying a #define hack to get access to libc definitions without the real symbols; however, this failed at link time. I don't see how it could work.
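The dlopen()+dlsym() step can be tried out from Python, whose ctypes module wraps the same two calls. This is only an analogy to stderred, which is written in C and loaded via LD_PRELOAD; the function name colorizing_write and its stderr_fd parameter are invented for this sketch, and it assumes a POSIX system where dlopen(NULL) exposes the C library's symbols.

```python
import ctypes

RED = b"\033[31m"
RESET = b"\033[0m"

# ctypes.CDLL(None) is dlopen(NULL): a handle on the running process and
# the libraries it has loaded, which on Linux includes the C library.
process = ctypes.CDLL(None, use_errno=True)

# Attribute access is the dlsym("write") lookup.
real_write = process.write
real_write.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_size_t]
real_write.restype = ctypes.c_ssize_t


def colorizing_write(fd, data, stderr_fd=2):
    """Pass data to the C library's write(), in red if headed to stderr_fd."""
    if fd == stderr_fd:
        data = RED + data + RESET
    return real_write(fd, data, len(data))
```

Calling colorizing_write(2, b"oops\n") prints a red line on a terminal, while any other descriptor gets its bytes untouched.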
The problem with named pipes: Buffering can change the order of outputted lines
Jesse pointed out to me that the named pipe approach has a serious buffering issue related to timing: if the process writes to stderr and stdout in quick succession, the lines could appear colorized in the wrong order. Jesse showed me some variations of his script that changed which wrong order it generated, but we couldn't quite figure out how to make it always right. This seems like a race condition to me.
That's because when the named pipe in question is read from, the Perl script doesn't know *how much* to read. So in this case:
one line to stderr
one line to stdout
one line to stderr
After Jesse explained this to me a few times, I understood it would get printed as either:
one line to stdout
one line to stderr
one line to stderr
or the same with stderr's lines on top. Note that the interweaving is gone; this is because the information of how *much* was printed each time is thrown away by the OS. Because the read()s are happening out-of-process in both the ZSH and Perl ways to do this, I don't see how they could get around this issue. An implementation based on select() or epoll() would have the same issues, I believe.
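The loss of interleaving can be reproduced in a few lines of Python. This sketch makes some simplifying assumptions: it uses anonymous pipes from os.pipe() instead of named pipes, and it plays both the program and the colorizer in one process so the result is repeatable. The function name is made up for the example.

```python
import os


def demo_lost_interleaving():
    """Write err/out/err into two pipes, then read them as a colorizer would."""
    out_r, out_w = os.pipe()
    err_r, err_w = os.pipe()

    # The "program": three writes, interleaved in time.
    os.write(err_w, b"one line to stderr\n")
    os.write(out_w, b"one line to stdout\n")
    os.write(err_w, b"one line to stderr\n")
    os.close(out_w)
    os.close(err_w)

    # The "colorizer": each read() returns whatever has piled up in that
    # pipe. Nothing records where one write ended and the next began.
    from_err = os.read(err_r, 4096)
    from_out = os.read(out_r, 4096)
    os.close(err_r)
    os.close(out_r)
    return from_err, from_out
```

The single read from the stderr pipe comes back with both stderr lines glued together, so the reader has no way to tell that a stdout line was written between them.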
Why my solution doesn't work for "ls"
stderred is as simple as it is because it only overrides write(). The JRE only seems to use write(), not any of the helper functions like straight-up printf(), or error(), or fprintf(), that also write to file descriptors. Unfortunately, if you try to stderred-ify "ls", none of stderr appears red! That's because ls uses fprintf_unlocked() and error(), which themselves *inside libc* call write().
If you think of ls as standing on top of a library stack that looks like this:
ls
[stderred]
[libc]
if you know that symbol resolution only looks "down," it's clear that the functions *inside libc* don't go back *up* to stderred to find my hacked write(). So they use the libc write(), which doesn't colorize.
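Here is a toy model of that picture in Python. It models the simplified "only look down" description given here, not the real dynamic linker, and every name in it (Layer, the three layer objects, the fake write and fprintf) is invented for illustration.

```python
RED, RESET = "\033[31m", "\033[0m"


class Layer:
    """One entry in the stack: a name, its functions, and the layer below."""

    def __init__(self, name, below=None):
        self.name = name
        self.below = below
        self.functions = {}

    def call(self, symbol, *args):
        # Search this layer first, then only downward, never back up.
        layer = self
        while layer is not None:
            if symbol in layer.functions:
                return layer.functions[symbol](layer, *args)
            layer = layer.below
        raise LookupError(symbol)


def libc_write(here, fd, text):
    return text                      # stands in for the real system call


def libc_fprintf(here, fd, text):
    # A call made inside libc starts its search at libc itself.
    return here.call("write", fd, text)


def stderred_write(here, fd, text):
    if fd == 2:
        text = RED + text + RESET
    return here.below.call("write", fd, text)


libc = Layer("libc")
libc.functions["write"] = libc_write
libc.functions["fprintf"] = libc_fprintf

stderred = Layer("stderred", below=libc)
stderred.functions["write"] = stderred_write

program = Layer("program", below=stderred)
```

In this model program.call("write", 2, "oops") comes back wrapped in the red escape codes, as Java's output does, while program.call("fprintf", 2, "oops") comes back plain, as with ls.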
Therefore, I started down the long road of modifying "all the important" functions to colorize if the output was going to stderr. Trying to colorize "ls" is where I started, so I wrote quite a few of those before actually checking what Java used. "ls" nearly gets colorized properly; you can look through the with_error branch for the latest work down that path. But I stopped once I figured out Java seems okay with just write(), and for cleanliness's sake I left that out of the released version (currently 1.1). Patches welcome!
zsh, python, and further reading
According to the Gentoo-Wiki, zsh users have an easy way to enable colorizing stderr. Knowing little about zsh but something about UNIX, it seems to me when they fork to run the new program, they close() fd #2 (stderr) and open it as a pipe to this program. I don't see how they solve the races brought up by the Perl thing; it seems to me they'd have the same race.
This is the same path that Jesse and I started down in the beginning; we read http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-3.html and noticed it didn't discuss setting stderr to a pipe, and then we talked about named pipes....
The Pythonic way to do this would have been to "simply" globally override what "sys.stderr" is. I don't know if such a thing is possible in Java.
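A rough sketch of what that override could look like follows. The RedStream class and colorize_stderr function are names made up for this example. Note the limitation: it only catches writes that go through Python's sys.stderr object, so child processes and C extensions writing straight to file descriptor 2 are not colorized.

```python
import sys

RED = "\033[31m"
RESET = "\033[0m"


class RedStream:
    """File-like wrapper that surrounds every write with ANSI red codes."""

    def __init__(self, wrapped):
        self.wrapped = wrapped

    def write(self, text):
        if text:
            self.wrapped.write(RED + text + RESET)
        return len(text)

    def flush(self):
        self.wrapped.flush()

    def __getattr__(self, name):
        # Delegate everything else (fileno, isatty, ...) to the real stream.
        return getattr(self.wrapped, name)


def colorize_stderr():
    """Swap in the wrapper globally; returns the original stream."""
    original = sys.stderr
    sys.stderr = RedStream(original)
    return original
```

After calling colorize_stderr(), anything printed with print(..., file=sys.stderr) or raised as an uncaught traceback should show up in red on a terminal that understands ANSI codes; assigning the returned stream back to sys.stderr undoes it.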
You can read a quick tutorial on LD_PRELOAD in the IBM DeveloperWorks article, "Override the GNU C Library -- Painlessly." You can read a lot more about dynamic linking in the exhaustive "How To Write Shared Libraries" by Ulrich Drepper.
Fri, 06 Jun 2008
Build failure
A package I am working on fails to build. Mako helps me understand why:
<mako> problem seems to be liboobs
<paulproteus> I'm afraid you're trolling me.
<mako> i wish i was
Sun, 16 Mar 2008
Finding duplicate files
Every once in a while, I know one file is duplicated in many places. This happens, for example, when I have imported photos from my camera into a photo management program and also stored a copy of them somewhere else. Sometimes I have downloaded files twice from the web.
Detecting duplicate files is not hard - you just compare the file contents. The problem is that with large files, and a large number of files, it can take a long time if you compare every file to every other file.
Because I needed to do this for a few gigabytes of photos, and everything I found I either didn't trust or ran too slowly, I wrote my own. Once you detect duplicate files, you generally want to either delete all but one, or to "merge them" via hardlinks so that all the files exist, but they share storage space on disk.
Summary: I had a fairly good approach, but everyone should use rdfind instead of my code.
My approach
You can check out (using Subversion or a web browser) my code at http://svn.asheesh.org/svn/public/code/merge_dups/ .
- Organize all the files grouped by size (since only files of equal size can have equal contents).
- For each size that contains more than one file, calculate a hash (MD5) of all the files.
- If any of the files have the same size and MD5, delete the one with a longer filename.
- Continue to the next file size.
This approach has to stat() every file at least once, but many files don't have to be read at all. For my photos, this was a huge time-saver.
(Why delete the one with a longer filename? Usually that's the one in some obscure directory named "camera-backup" or "recovered-from-some-dying-computer".)
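The steps above can be sketched in Python roughly as follows. This is not the merge_dups code itself: the function names are invented for this example, "longer filename" is taken to mean the longer full path, and instead of deleting anything it only returns which copy it would keep and which it would consider redundant.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_redundant_copies(root):
    """Return (keep, redundant) path pairs for duplicate files under root."""
    # Step 1: group by size; this needs only a stat() per file.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    pairs = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size: the contents are never read
        # Step 2: hash only the files that share a size.
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5_of(path)].append(path)
        # Step 3: keep the shortest path, flag the longer ones.
        for same in by_hash.values():
            same.sort(key=lambda p: (len(p), p))
            for redundant in same[1:]:
                pairs.append((same[0], redundant))
    return pairs
```

A caller could print the pairs for review first, and only then pass each redundant path to os.remove() or replace it with a hardlink to the kept copy.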
I trust my code. Plus, it is verbose, printing out what it is doing and why. And the entire program with comments, status message print-outs, and vertical spacing easily fits on my screen.
Other implementations
Today, I decided to go through Freshmeat to see if I could retire my code and just rely on someone else's. So I checked out the reasonable contenders from this search.
find_duplicates by Fredrik Hubinette
- Homepage: http://fredrik.hubbe.net/hacks/
- License: GPL v2 (good)
- Efficiency: Good (uses file sizes the way I do)
- Language: Pike (weird, but seems okay)
- Strategy: Check sizes; hash; verify by reading file; merge via hardlinks
- Sanity: High
- Rating: Good
It uses the first few kilobytes of the string as a hash, which is probably more efficient than reading the whole thing. It is safe and reads the whole files before marking them as duplicates.
dmerge.cpp by Jonathan H Lundquist
- Homepage: http://www.fluxsmith.com/cgi-bin/twiki/view/Jonathan/DMerge
- License: X11-like (good)
- Efficiency: Good (uses file sizes the way I do)
- Language: C++ (bearable)
- Strategy: Check sizes; hash by calling an external program; verify by calling "cmp"; ...
- Sanity: Low
- Rating: Don't use
I stopped caring when I realized it calls external programs. I doubt it does it in a correct/secure way, so forget it.
duff by Camilla Berglund
- Homepage: http://duff.sourceforge.net/
- License: zlib/libpng (MIT-esque) (good)
- Efficiency: Good
- Language: C (okay)
- Strategy: Check sizes; hash with first few bytes; verify by SHA1 or actual
- Sanity: High (comes with a man page; very tunable; great web site)
- Rating: Don't use
This looks really good, but it doesn't actually do the merging. It relies on a shell script to do the merging, and I don't trust the correctness of the shell script's handling of filenames (due to the whitespace-separated output format of duff itself).
Note to Camilla: If you provided a -z option (like find -print0) to duff, and made sure the shell script respected it, then it would be practically perfect.
fslint 2.14
- Efficiency: Seemed lame
- Rating: Don't use
- Explanation: I tried it, and then I said, "That's it, I'm writing my own."
It was so little fun to use I don't even want to talk about it. The benchmarks on the rdfind web page confirm this with data.
rdfind by Paul Sundvall (WINNER!)
- Homepage: http://www2.paulsundvall.net/rdfind/rdfind.html
- License: GPL (v2, probably) (good)
- Efficiency: Excellent
- Language: C++ (okay)
- Strategy: Check sizes; check first bytes; calculate SHA1s; delete dups or create symlinks or create hardlinks or print report
- Sanity: High - object-oriented, well-commented, includes man page, includes benchmarks
- Self-importance: High, but seems deserved
- Rating: Excellent, use it
Finding software like this is why I look for software not written by me.
Other tools I didn't fully review
- finddup by Heiner Steven <http://www.shelldorado.com/scripts/cmds/finddup>
- Language: Shell, which probably means it has problems with complicated filenames
- clink by Michael Opdenacker <http://free-electrons.com/community/tools/utils/clink/>
- Language: Python (yay!)
- Does not support hard links, only symlinks, and thereby (by the author's own admission) creates permissions problems
- dupfinder by Matthias Böhm <http://doubles.sourceforge.net/>
- Sanity: Moderate to Low - thinks that not using hash functions makes it "much faster" than other programs
- dupmerge2 by Rolf Freitag (continuation of work from Phil Karn) <http://sourceforge.net/projects/dupmerge/>
- Sanity: Moderate to Low: Bundles a pre-compiled binary, which is just weird
- dupseek by Antonio Bellezza <http://www.beautylabs.net/software/dupseek.html>
- Focus on interactive duplicate file removal. Probably good at that; I want correct, unattended operation.
- freedup by William Stearns <http://freedup.org/>
- Looks fairly good, even though it's written in bash (freaks me out)
- Offers an option to strip metadata and compare only file *contents* for MP3, MPEG4, MPC, JPEG, and Ogg (not FLAC, I guess), which is great.
Conclusion
rdfind looks great. Every once in a while, two hours are better spent doing research than re-inventing the wheel. This is one of those times when I was more useful to my life as a secretary than by trying to be a programmer.
Tue, 26 Feb 2008
Repository
I run a Debian package repository, with packages for Debian at http://www.asheesh.org/debian/ and packages for Ubuntu at http://www.asheesh.org/ubuntu/.
If you would like to use my repository, here is what you need to add:
For Ubuntu
- Edit sources.list and add these lines:
deb http://www.asheesh.org/ubuntu/ hardy main
deb-src http://www.asheesh.org/ubuntu/ hardy main
- gpg --keyserver pgpkeys.mit.edu --recv-key 0x70096AD1 ; gpg -a --export 0x70096AD1 | sudo apt-key add -
- sudo apt-get update
For Debian
- Edit sources.list and add these lines:
deb http://www.asheesh.org/debian/ sid main
deb-src http://www.asheesh.org/debian/ sid main
- gpg --keyserver pgpkeys.mit.edu --recv-key 0x70096AD1 ; gpg -a --export 0x70096AD1 | sudo apt-key add -
- sudo apt-get update
And then?
You're free to apt-get install whatever you want from my repositories now. If you would like to compile my packages from source, just do:
- apt-src -bi install $package
Fri, 11 Jan 2008
GUIs with PyGTK and Glade
Today I had to write a simple GUI program on a deadline, so I thought I would try Glade and PyGTK.
Wow, that was easy. I read through a straightforward LinuxJournal article and now I feel fairly comfortable with this for single-window applications. The thing I'm not really sure of is how to change the window the program displays, but maybe one day I will. (-:
Update: There's a GLADE to Python code generator called Kefir that looks nice too.
Mon, 24 Dec 2007
Git repository for Qtopia
(Cross-posted at http://qtopia.net/modules/newbb_plus/viewtopic.php?topic_id=593&forum=1 and on the OpenMoko device owners list).
In order to make it easier to track updates to the Qtopia 4.3.1 snapshots, I made a git repository out of them.
What I'm doing is, automatically (every night), untarring the snapshots into a git repository at git://git.asheesh.org/qtopia_snapshot.git , which is readable in a gitweb at http://git.asheesh.org/?p=qtopia_snapshot.git .
Note that many snapshots contain the same contents; the automatic script only commits if the snapshot contains some new data.
My current primary interest in the Qtopia GPL edition is for my HTC Universal, which runs it very nicely including sleep and wake-up, SMS, and voice calls. My interest in the git repositories comes from wanting to publish a modified version of Qtopia that's easy to merge changes into as Trolltech updates their code. I'm sure there are Neo1973 users who would like to hack on the Trolltech code or have an easy source repository from which to get updates.
http://www.handhelds.org/moin/moin.cgi/Qtopia is where I'll be posting any updates I have. If you're working on a fork of Qtopia, I'd love to give out git commit access so you can publish a branch on my git if you like.
Similarly, if people want to publish their Qtopia-based applications in a git repository, just ask me!
Fri, 14 Dec 2007
Beautiful attacks
In Python, if you use smart libraries like SQLObject, SQLAlchemy, and Kid, you can't generate invalid SQL or HTML, so you're not vulnerable to SQL injection, cross-site scripting, and other attacks derived from input validation problems.
Unless you're really smart:
This is a heavily wrapped, heavily abstracted version of SQL injection attacks.
Just a reminder that attacks against syntax aren't all you need to stop; attacks against semantics can be bad, too.