Fri, 21 Mar 2008

The ninth graders

On Thursday, March 6, I spoke to a room full of ninth graders at the French-American International High School in San Francisco.

Two months ago, some faculty there emailed the San Francisco Linux Users Group (really, they emailed Jim Stockford, who passed it along). They explained that they were having a full day off classes for students on what they termed "Internet Day," bringing in outside speakers to talk about issues related to computer technology.

One of the most interesting things is that the event was organized by a remarkable high school senior, Joseph Harder. But long story short, I emailed them with my background and said I could talk, and they invited me.

So two Thursdays ago, I woke up at 8:30 a.m. and thought, "Man, I should make a presentation for these guys. I have to be at lunch at 11:30 to meet the other presenters." Then I thought better of it. "I'll go to sleep for another half hour."

Now fully prepared (at least as fully as I was going to be), I arrived at lunch only a few minutes late. I met some other presenters: Craig Newmark (famous for his List), a Google lawyer, and a Boalt Hall faculty member, to name a few. I felt pretty clearly outclassed, but I figured if I didn't let them know how outclassed I was then at least I could have a normal conversation with them.

My presentation was to (I think) all the ninth graders in the school. There were between 60 and 80 of them in the room, I'd guess. I was co-presenting with Christian Einfeldt, Producer of the Digital Tipping Point film about Free and Open Source Software. Through the Socratic method, he spoke about "ownership," explaining to the students that when you run proprietary software, you are not in control of your computer.

I then spoke about copyright law and what Creative Commons is. I began by giving props to Christian, pointing out that the same Richard Stallman Christian referred to had signed my laptop. (It does turn out that my involvement in Free Software goes deeper than that, but it didn't seem important to list all of my million hats.)

I gave what looks like a very Larry-inspired presentation - sparse slides, and alignment tricks to make my points clearer. It was a total blast. To name only one difference between this presentation and a real presentation by Larry, I treated the slides and my presentation as one half of a conversation; Larry's rock-star Free Culture talks have him whip through things so fast you're mesmerized, but there's no time for questions in the middle. I let the kids ask me questions all through the talk, which was oodles of fun.

You can see the slides here, but the real joy of the event was in the interaction between me and the students. Some highlights:

A beginning

I began by pointing out:

It's just the way the law is written right now.


A conclusion

I explained the exact license NiN chose with the CC symbols. As I finished up, I summarized this as:

Nine Inch Nails: Ghosts I–IV
promises not to sue you
for making a music video and
putting it on YouTube.

One kid asked me, "Is that a legal promise?" This question reaches to the heart of what Creative Commons tries to do - decrease uncertainty about using other people's work when they don't mind. I answered, "Yes," and explained a little of the history of CC.

My last few slides were:

Nine Inch Nails: Ghosts I–IV
respects you.

Which was followed by:

Metallica
does not.

So I got some great applause at the end, which felt marvelous, and then we had about five minutes more for questions. Two questions took up three minutes, and I realized I had forgotten to ask them this question that occurred to me earlier in the morning. So I said:

"Will the people follow the law, or will the law follow the people?"

I tried to get them to take some charge in changing laws that do them more harm than good. Some copyright may be useful, but it's hard to argue we haven't gone too far.

I got just about the same thunder of applause, and then they

Epilogue

As literary convention would have it, this story has an epilogue. I had some good conversations at the end, including discussing Free Software with one Mac user student who pretty clearly knew what he was talking about. Christian spent quite some time talking about his Digital Tipping Point film with the staff member whose name I have still, sadly, forgotten.

I also received an email that looked like this:

Date: Fri, 7 Mar 2008 05:55:24 +0000 (GMT)
From: Student <something@yahoo.fr>
To: asheesh@creativecommons.org
Subject: tech seminar at ihs
hey i rly enjoyed your talk today and i did wat u said and got the NIN
cds... there rly good and i just wanna say thank u a lot and i hope ull be
at the next seminar

Well, that was nice. I'm left with warm fuzzies and the desire to do something like this again.

[/note/free-culture] permanent link and comments

Sun, 16 Mar 2008

Finding duplicate files

Every once in a while, I know one file is duplicated in many places. This happens, for example, when I have imported photos from my camera into a photo management program and also stored a copy of them somewhere else. Sometimes I have downloaded files twice from the web.

Detecting duplicate files is not hard - you just compare the file contents. The problem is that with large files, and a large number of files, it can take a long time if you compare every file to every other file.

Because I needed to do this for a few gigabytes of photos, and everything I found was either something I didn't trust or too slow, I wrote my own. Once you detect duplicate files, you generally want either to delete all but one or to "merge them" via hardlinks, so that all the files still exist but share storage space on disk.
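The hardlink "merge" can be sketched in a few lines of Python. This is only an illustration, not the code from my repository: the function name and the ".merge-tmp" suffix are made up for the example, and it assumes both paths live on the same filesystem and that the caller has already verified the contents are identical.

```python
import os


def merge_via_hardlink(keep, duplicate):
    """Replace `duplicate` with a hard link to `keep`.

    The link is first created under a temporary name and then renamed
    over the duplicate, so the duplicate path never disappears.
    """
    if os.path.samefile(keep, duplicate):
        return  # already the same file on disk; nothing to do
    temp = duplicate + ".merge-tmp"  # hypothetical temporary name
    os.link(keep, temp)  # fails if temp exists or on a different filesystem
    os.replace(temp, duplicate)  # swaps the link into place
```

Afterwards both names still exist, but they point at one copy of the data, so the space is only used once.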

Summary: I had a fairly good approach, but everyone should use rdfind instead of my code.

My approach

You can check out (using Subversion or a web browser) my code at http://svn.asheesh.org/svn/public/code/merge_dups/ .

  • Organize all the files grouped by size (since only files of equal size can have equal contents).
  • For each size that contains more than one file, calculate a hash (MD5) of all the files.
    • If any of the files have the same size and MD5, delete the one with a longer filename.
    • Continue to the next file size.

This approach has to stat() every file at least once, but many files don't have to be read at all. For my photos, this was a huge time-saver.

(Why delete the one with a longer filename? Usually that's the one in some obscure directory named "camera-backup" or "recovered-from-some-dying-computer".)
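For illustration, the steps above might look roughly like this in Python. It is a minimal sketch rather than the code in my repository: the function names are invented, "longer filename" is taken to mean the longer full path, and it only reports the redundant copies instead of deleting them.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_redundant_copies(root):
    """Return paths under `root` that duplicate a file with a shorter path."""
    # Step 1: group regular files by size (one stat per file, no reads).
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    redundant = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size: this file is never read
        # Step 2: hash only the files that share a size with another file.
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5_of(path)].append(path)
        # Step 3: keep the shortest path in each group, report the rest.
        for same in by_hash.values():
            same.sort(key=lambda p: (len(p), p))
            redundant.extend(same[1:])
    return redundant
```

Printing the returned list before deleting anything is a cheap way to get the same "what and why" verbosity.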

I trust my code. Plus, it is verbose, printing out what it is doing and why. And the entire program with comments, status message print-outs, and vertical spacing easily fits on my screen.

Other implementations

Today, I decided to go through Freshmeat to see if I could retire my code and just rely on someone else's. So I checked out the reasonable contenders from this search.

find_duplicates by Fredrik Hubinette

  • Homepage: http://fredrik.hubbe.net/hacks/
  • License: GPL v2 (good)
  • Efficiency: Good (uses file sizes the way I do)
  • Language: Pike (weird, but seems okay)
  • Strategy: Check sizes; hash; verify by reading file; merge via hardlinks
  • Sanity: High
  • Rating: Good

It uses the first few kilobytes of the file as a hash, which is probably more efficient than reading the whole thing. It is safe: it reads the whole files before marking them as duplicates.

dmerge.cpp by Jonathan H Lundquist

  • Homepage: http://www.fluxsmith.com/cgi-bin/twiki/view/Jonathan/DMerge
  • License: X11-like (good)
  • Efficiency: Good (uses file sizes the way I do)
  • Language: C++ (bearable)
  • Strategy: Check sizes; hash by calling an external program; verify by calling "cmp"; ...
  • Sanity: Low
  • Rating: Don't use

I stopped caring when I realized it calls external programs. I doubt it does it in a correct/secure way, so forget it.

duff by Camilla Berglund

  • Homepage: http://duff.sourceforge.net/
  • License: zlib/libpng (MIT-esque) (good)
  • Efficiency: Good
  • Language: C (okay)
  • Strategy: Check sizes; hash with first few bytes; verify by SHA1 or actual
  • Sanity: High (comes with a man page; very tunable; great web site)
  • Rating: Don't use

This looks really good, but it doesn't actually do the merging. It relies on a shell script to do the merging, and I don't trust the correctness of the shell script's handling of filenames (due to the whitespace-separated output format of duff itself).

Note to Camilla: If you provided a -z option (like find -print0) to duff, and made sure the shell script respected it, then it would be practically perfect.
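To make the filename concern concrete, here is a small Python sketch comparing the two ways of parsing a list of names. It is generic: it does not reproduce duff's actual output format, and the -z option is only my suggestion, not something duff is known to offer.

```python
def split_whitespace(output):
    """Naive parsing: breaks any filename that contains whitespace."""
    return output.split()


def split_nul(output):
    """Parse NUL-terminated filenames, as produced by find -print0."""
    return [name for name in output.split(b"\0") if name]


names = [b"photo 1.jpg", b"b.jpg"]

# Newline-separated output, split on whitespace: the space in the
# first name turns two files into three bogus ones.
broken = split_whitespace(b"\n".join(names) + b"\n")

# NUL-terminated output: a NUL byte cannot appear inside a filename,
# so the names come back exactly as they went in.
intact = split_nul(b"\0".join(names) + b"\0")
```

Here `broken` ends up with three entries and `intact` with the original two, which is the whole argument for a NUL separator.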

fslint 2.14

  • Efficiency: Seemed lame
  • Rating: Don't use
  • Explanation: I tried it, and then I said, "That's it, I'm writing my own."

It was so little fun to use I don't even want to talk about it. The benchmarks on the rdfind web page confirm this with data.

rdfind by Paul Sundvall (WINNER!)

  • Homepage: http://www2.paulsundvall.net/rdfind/rdfind.html
  • License: GPL (v2, probably) (good)
  • Efficiency: Excellent
  • Language: C++ (okay)
  • Strategy: Check sizes; check first bytes; calculate SHA1s; delete dups or create symlinks or create hardlinks or print report
  • Sanity: High - object-oriented, well-commented, includes man page, includes benchmarks
  • Self-importance: High, but seems deserved
  • Rating: Excellent, use it
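The strategy listed above is a cheapest-test-first filter: each stage only looks at files that survived the previous one. A rough Python sketch of that idea follows. It is not rdfind's code, the function names are invented, and the 64-byte prefix is an arbitrary choice for the example.

```python
import hashlib
import os
from collections import defaultdict


def refine(groups, key):
    """Split each group by key(path); drop groups left with one file."""
    refined = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        refined.extend(b for b in buckets.values() if len(b) > 1)
    return refined


def first_bytes(path, count=64):
    """Return the first `count` bytes of the file (count is arbitrary)."""
    with open(path, "rb") as handle:
        return handle.read(count)


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of the whole file."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_groups(paths):
    """Return lists of paths with matching contents, cheapest test first."""
    groups = [list(paths)]
    for key in (os.path.getsize, first_bytes, sha1_of):
        groups = refine(groups, key)
    return groups
```

What to do with each resulting group (delete, symlink, hardlink, or just report) is a separate decision from finding it.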

Finding software like this is why I look for software not written by me.

Other tools I didn't fully review

  • finddup by Heiner Steven <http://www.shelldorado.com/scripts/cmds/finddup>
    • Language: Shell, which probably means it has problems with complicated filenames
  • clink by Michael Opdenacker <http://free-electrons.com/community/tools/utils/clink/>
    • Language: Python (yay!)
    • Does not support hard links, only symlinks, thereby (by the author's own admission) creates permissions problems
  • dupfinder by Matthias Böhm <http://doubles.sourceforge.net/>
    • Sanity: Moderate to Low - thinks that not using hash functions makes it "much faster" than other programs
  • dupmerge2 by Rolf Freitag (continuation of work from Phil Karn) <http://sourceforge.net/projects/dupmerge/>
    • Sanity: Moderate to Low: Bundles a pre-compiled binary, which is just weird
  • dupseek by Antonio Bellezza <http://www.beautylabs.net/software/dupseek.html>
    • Focus on interactive duplicate file removal. Probably good at that; I want correct, unattended operation.
  • freedup by William Stearns <http://freedup.org/>
    • Looks fairly good, even though it's written in bash (freaks me out)
    • Offers an option to strip metadata and compare only file *contents* for MP3, MPEG4, MPC, JPEG, and Ogg (not FLAC, I guess), which is great.

Conclusion

rdfind looks great. Every once in a while, two hours are better spent doing research than re-inventing the wheel. This is one of those times when I was more useful to my life as a secretary than as a programmer.

[/note/software] permanent link and comments

Sat, 15 Mar 2008

Crunchy

"Crunchy is an application that formats and delivers html-written Python tutorials inside a browser window, adding interactive elements and snazzy navigation."

It looks good, and it seems at least more complete than "half-baked". I haven't tried it.


[/scribble/code] permanent link and comments

Fri, 07 Mar 2008

Herbert and pajamas

Three nights ago, I went to Pancho Villa at 11 at night for a burrito. I was in my pajamas, and an originally-Polish girl (whose name I forget, sadly) struck up a conversation with me. She explained a minute or so in:

I was thinking of talking to you as soon as I saw you, but I thought you looked very focused on the menu.

She was sweet, and I suggested she wear hers to Pancho Villa. "But they're pink!" she objected.

On the walk back, a boy and a girl who looked in their mid-twenties drove down Fourteenth Street and turned onto Valencia while I was waiting at the corner. The girl's face lit up when she saw my pajamas.

Two nights ago, I was at a cocktail party at the home of the principal of the French-American International High School that I will be speaking at tomorrow. I spoke with the director of maths [sic] at the school, and after about twenty minutes of good conversation, he asked me if I knew I had this little guy in my pocket, or if my kids had put him there while I was unaware. Half an hour later, someone else wondered the same thing to me. Regardless, he was well-received by the school's staff, and in a car ride home, one of the teachers who organized the Internet Day joyfully played with him a little.

I know Herbert makes people happy, but who knew pajamas might be comparable?

[/note/people] permanent link and comments

Wed, 05 Mar 2008

Reasons for doing things

<asheesh> It goes something like, "Because we're good human beings."
<venkatesh> okay. i don't agree with that, particularly

[/note/people] permanent link and comments

Mon, 03 Mar 2008

Interactive ext3 performance

In 2001, drobbins published an article on IBM DeveloperWorks remarking that the data=journal mount option improved interactive performance on one test from ca. 70 seconds to 7 seconds.

Even today, the openSUSE wiki echoes this advice. I wonder if it still holds.

[/note/sysop] permanent link and comments

The 2000-Year-Old Computer (and Other Achievements of Ancient Science)

I went to Ask A Scientist on February 26, and heard an interesting pair of presentations by Richard Carrier at Columbia. The topic was the various achievements of ancient science.

I wrote some notes on the back of my receipt. Direct quotes from him are between quotation marks.

  • He likes the word "kooky".
  • "chunk of junk"
  • Archimedes' Codex is an exemplar of science being overwritten with hymns in the Middle Ages.
  • On Ptolemy's system of epicycles: "The model worked really well. That's why they were so seduced by it."
  • When asked about politics of science: "I can't off the top of my head think of an interesting story." (Note that he did, in fact, come up with one.)
  • He pronounces "dissection" as if it were "disection."

I also wrote this down, purely my own creation:

  • "Amazing what middle schoolers know that the Ancients did not."

[/note/ask a scientist] permanent link and comments