Fri, 21 Mar 2008
The ninth graders
On Thursday, March 6, I spoke to a room full of ninth graders at the French-American International High School in San Francisco.
Two months ago, some faculty there emailed the San Francisco Linux Users Group (really, they emailed Jim Stockford, who passed it along). They explained that they were having a full day off classes for students on what they termed "Internet Day," bringing in outside speakers to talk about issues related to computer technology.
One of the most interesting things is that the event was organized by a remarkable high school senior, Joseph Harder. But long story short, I emailed them with my background and said I could talk, and they invited me.
So two Thursdays ago, I woke up at 8:30 a.m. and thought, "Man, I should make a presentation for these guys. I have to be at lunch at 11:30 to meet the other presenters." Then I thought better of it. "I'll go to sleep for another half hour."
Now fully prepared (at least as fully as I was going to be), I arrived at lunch only a few minutes late. I met some other presenters: Craig Newmark (famous for his List), a Google lawyer, and a Boalt Hall faculty member, to name a few. I felt pretty clearly outclassed, but I figured if I didn't let them know how outclassed I was then at least I could have a normal conversation with them.
My presentation was to (I think) all the ninth graders in the school. There were between 60 and 80 of them in the room, I'd guess. I was co-presenting with Christian Einfeldt, Producer of the Digital Tipping Point film about Free and Open Source Software. Through the Socratic method, he spoke about "ownership," explaining to the students that when you run proprietary software, you are not in control of your computer.
I then spoke about copyright law and what Creative Commons is. I began by giving props to Christian, pointing out that the same Richard Stallman Christian referred to had signed my laptop. (It does turn out that my involvement in Free Software goes deeper than that, but it didn't seem important to list all of my million hats.)
I gave what looks like a very Larry-inspired presentation - sparse slides, and alignment tricks to make my points clearer. It was a total blast. To name only one difference between this presentation and a real presentation by Larry, I treated the slides and my presentation as one half of a conversation; Larry's rock-star Free Culture talks have him whip through things so fast you're mesmerized, but there's no time for questions in the middle. I let the kids ask me questions all through the talk, which was oodles of fun.
You can see the slides here, but the real joy of the event was in the interaction between me and the students. Some highlights:
A beginning
I began by pointing out:
- Copyright is Federal law
- Copyright is international law
- Copyright is not necessarily a system of morals
It's just the way the law is written right now.
- I asked, "How did we end up in this terrible mess of copyright law?"
- I then flipped right through the words "disney + world" on their own slide, and switched to a picture of the Magic Kingdom at Disneyworld.
- The kids are oohing and aahing and cheering, "Disney, yay!". I know I won't have their attention until they get this out of their systems, so I just stand around smiling for twenty seconds or so.
- Then, naturally, I explain how Disney movies are largely fairy tales retold, and that Disney lobbies Congress to make sure no one can do to them what they did to the Brothers Grimm. (A total Larry line, but it's a good one.)
- Kids in the audience actually remembered the original Napster. They must have been in second grade!
- One asked, "Is that the cat?"
- (I said, I guess it's a cat....)
- And many of them actually said they used it, which I found amazing.
- Somewhere toward the middle of the presentation, just as I was finishing explaining how copyright was "worthless" to people who wanted to use and reuse (and participate in!) culture, a staff member whose name sadly escapes me asked me the obvious question: How will people make money?
- I told her I'd get to her question toward the end, when I covered Nine Inch Nails. Turns out that while I said they made $750K of revenue, they actually made 1.6 million dollars in the first week.
A conclusion
- The way I explained Creative Commons was through the comparison of Metallica and Nine Inch Nails.
- Metallica sued Napster, whereas Nine Inch Nails just released a full album (or four, depending how you see it) under a Creative Commons license.
I explained the exact license NiN chose with the CC symbols. As I finished up, I summarized this as:
Nine Inch Nails: Ghosts I-IV promises not to sue you for making a music video and putting it on YouTube.
One kid asked me, "Is that a legal promise?" This question reaches to the heart of what Creative Commons tries to do - decrease uncertainty about using other people's work when they don't mind. I answered, "Yes," and explained a little of the history of CC.
My last few slides were:
Nine Inch Nails: Ghosts I-IV respects you.
Which was followed by:
Metallica does not.
So I got some great applause at the end, which felt marvelous, and then we had about five minutes more for questions. Two questions took up three minutes, and I realized I had forgotten to ask them this question that occurred to me earlier in the morning. So I said:
"Will the people follow the law, or will the law follow the people?"
I tried to get them to take some charge in changing laws that do them more harm than good. Some copyright may be useful, but it's hard to argue we haven't gone too far.
I got just about the same thunder of applause, and then they
Epilogue
As literary convention would have it, this story has an epilogue. I had some good conversations at the end, including discussing Free Software with one Mac user student who pretty clearly knew what he was talking about. Christian spent quite some time talking about his Digital Tipping Point film with the staff member whose name I have still, sadly, forgotten.
I also received an email that looked like this:
Date: Fri, 7 Mar 2008 05:55:24 +0000 (GMT)
From: Student <something@yahoo.fr>
To: asheesh@creativecommons.org
Subject: tech seminar at ihs
hey i rly enjoyed your talk today and i did wat u said and got the NIN cds... there rly good and i just wanna say thank u a lot and i hope ull be at the next seminar
Well, that was nice. I'm left with warm fuzzies and the desire to do something like this again.
Sun, 16 Mar 2008
Finding duplicate files
Every once in a while, I know one file is duplicated in many places. This happens, for example, when I have imported photos from my camera into a photo management program and also stored a copy of them somewhere else. Sometimes I have downloaded files twice from the web.
Detecting duplicate files is not hard - you just compare the file contents. The problem is that with large files, and a large number of files, it can take a long time if you compare every file to every other file.
Because I needed to do this for a few gigabytes of photos, and every tool I found was either one I didn't trust or one that ran too slowly, I wrote my own. Once you detect duplicate files, you generally want to either delete all but one, or to "merge them" via hardlinks so that all the files exist, but they share storage space on disk.
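To make "merging via hardlinks" concrete, here is a minimal Python sketch. It is not taken from my program or from any of the tools reviewed below; the function name merge_via_hardlink and the ".hardlink-tmp" suffix are made up for illustration, and it assumes both files live on the same filesystem and have already been verified as identical.

```python
import os


def merge_via_hardlink(keep, duplicate):
    """Replace `duplicate` with a hard link to `keep`.

    Assumes the two files were already verified to have identical
    contents and that they live on the same filesystem.
    """
    # Hypothetical temporary name; os.link() raises FileExistsError if
    # something with this name is already there.
    tmp_name = duplicate + ".hardlink-tmp"
    # Create the new link first, so `duplicate` never goes missing if
    # something fails part-way through.
    os.link(keep, tmp_name)
    # Rename the link over the old copy; this drops the old file's data.
    os.replace(tmp_name, duplicate)
```

After the call, both names still exist, but they point at one copy of the data, so the space used by the second copy is freed.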
Summary: I had a fairly good approach, but everyone should use rdfind instead of my code.
My approach
You can check out (using Subversion or a web browser) my code at http://svn.asheesh.org/svn/public/code/merge_dups/ .
- Organize all the files grouped by size (since only files of equal size can have equal contents).
- For each size that contains more than one file, calculate a hash (MD5) of all the files.
- If any of the files have the same size and MD5, delete the one with a longer filename.
- Continue to the next file size.
This approach has to stat() every file at least once, but many files don't have to be read at all. For my photos, this was a huge time-saver.
(Why delete the one with a longer filename? Usually that's the one in some obscure directory named "camera-backup" or "recovered-from-some-dying-computer".)
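The steps above can be sketched roughly as follows. This is an illustration written for this post, not the code in my Subversion repository: the names md5_of and plan_deletions are hypothetical, "longer filename" is interpreted as "longer full path", and the sketch only reports what it would delete instead of deleting anything.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_deletions(root):
    """Return (keep, delete) pairs for duplicate files under `root`."""
    # Step 1: group by size; this only needs a stat() per file.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    plan = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size is never read at all
        # Step 2: hash only the files that share a size.
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5_of(path)].append(path)
        # Step 3: within each group, keep the shortest path.
        for same in by_hash.values():
            same.sort(key=lambda p: (len(p), p))
            keep = same[0]
            plan.extend((keep, dup) for dup in same[1:])
    return plan
```

Calling plan_deletions("/path/to/photos") returns a list you can read over before deleting anything.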
I trust my code. Plus, it is verbose, printing out what it is doing and why. And the entire program with comments, status message print-outs, and vertical spacing easily fits on my screen.
Other implementations
Today, I decided to go through Freshmeat to see if I could retire my code and just rely on someone else's. So I checked out the reasonable contenders from this search.
find_duplicates by Fredrik Hubinette
- Homepage: http://fredrik.hubbe.net/hacks/
- License: GPL v2 (good)
- Efficiency: Good (uses file sizes the way I do)
- Language: Pike (weird, but seems okay)
- Strategy: Check sizes; hash; verify by reading file; merge via hardlinks
- Sanity: High
- Rating: Good
It uses the first few kilobytes of the file as a hash, which is probably more efficient than reading the whole thing. It is safe and reads the whole files before marking them as duplicates.
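The idea of a cheap key on the first few kilobytes followed by a full comparison might look like this in Python. This is only a sketch of the strategy as described, not a translation of the Pike code; the 4096-byte prefix, the use of SHA1 for the key, and the names prefix_key and are_duplicates are all assumptions made for the example.

```python
import filecmp
import hashlib
import os

PREFIX_BYTES = 4096  # assumed prefix length, chosen only for illustration


def prefix_key(path):
    """Hash only the first PREFIX_BYTES of the file as a cheap filter."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(PREFIX_BYTES)).hexdigest()


def are_duplicates(a, b):
    """Cheap checks first; a full byte-by-byte comparison comes last."""
    if os.path.getsize(a) != os.path.getsize(b):
        return False
    if prefix_key(a) != prefix_key(b):
        return False
    # shallow=False forces filecmp to compare the actual file contents.
    return filecmp.cmp(a, b, shallow=False)
```

The final comparison is what makes the approach safe: two files that merely share a size and a prefix are never reported as duplicates.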
dmerge.cpp by Jonathan H Lundquist
- Homepage: http://www.fluxsmith.com/cgi-bin/twiki/view/Jonathan/DMerge
- License: X11-like (good)
- Efficiency: Good (uses file sizes the way I do)
- Language: C++ (bearable)
- Strategy: Check sizes; hash by calling an external program; verify by calling "cmp"; ...
- Sanity: Low
- Rating: Don't use
I stopped caring when I realized it calls external programs. I doubt it does it in a correct/secure way, so forget it.
duff by Camilla Berglund
- Homepage: http://duff.sourceforge.net/
- License: zlib/libpng (MIT-esque) (good)
- Efficiency: Good
- Language: C (okay)
- Strategy: Check sizes; hash with first few bytes; verify by SHA1 or actual
- Sanity: High (comes with a man page; very tunable; great web site)
- Rating: Don't use
This looks really good, but it doesn't actually do the merging. It relies on a shell script to do the merging, and I don't trust the correctness of the shell script's handling of filenames (due to the whitespace-separated output format of duff itself).
Note to Camilla: If you provided a -z option (like find -print0) to duff, and made sure the shell script respected it, then it would be practically perfect.
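To show why a NUL-separated format matters, here is a small Python sketch. The -z option for duff is only my suggestion above, so none of this reflects an actual duff feature; format_nul_separated and parse_nul_separated are hypothetical helper names.

```python
import os


def format_nul_separated(paths):
    """Encode filenames the way `find -print0` does: each one ends in NUL."""
    return b"".join(os.fsencode(p) + b"\0" for p in paths)


def parse_nul_separated(data):
    """Split NUL-terminated output back into the original filenames."""
    return [os.fsdecode(chunk) for chunk in data.split(b"\0") if chunk]


names = ["holiday photo 1.jpg", "weird\nname.jpg"]
encoded = format_nul_separated(names)
# NUL cannot appear in a filename, so the round trip is lossless.
round_tripped = parse_nul_separated(encoded)
# Splitting on whitespace instead tears both names into pieces.
whitespace_split = " ".join(names).split()
```

Here round_tripped equals the original list of two names, while whitespace_split contains five fragments, none of which is a real filename.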
fslint 2.14
- Efficiency: Seemed lame
- Rating: Don't use
- Explanation: I tried it, and then I said, "That's it, I'm writing my own."
It was so little fun to use I don't even want to talk about it. The benchmarks on the rdfind web page confirm this with data.
rdfind by Paul Sundvall (WINNER!)
- Homepage: http://www2.paulsundvall.net/rdfind/rdfind.html
- License: GPL (v2, probably) (good)
- Efficiency: Excellent
- Language: C++ (okay)
- Strategy: Check sizes; check first bytes; calculate SHA1s; delete dups or create symlinks or create hardlinks or print report
- Sanity: High - object-oriented, well-commented, includes man page, includes benchmarks
- Self-importance: High, but seems deserved
- Rating: Excellent, use it
Finding software like this is why I look for software not written by me.
Other tools I didn't fully review
- finddup by Heiner Steven <http://www.shelldorado.com/scripts/cmds/finddup>
- Language: Shell, which probably means it has problems with complicated filenames
- clink by Michael Opdenacker <http://free-electrons.com/community/tools/utils/clink/>
- Language: Python (yay!)
- Does not support hard links, only symlinks, and thereby (by the author's own admission) creates permissions problems
- dupfinder by Matthias Böhm <http://doubles.sourceforge.net/>
- Sanity: Moderate to Low - thinks that not using hash functions makes it "much faster" than other programs
- dupmerge2 by Rolf Freitag (continuation of work from Phil Karn) <http://sourceforge.net/projects/dupmerge/>
- Sanity: Moderate to Low: Bundles a pre-compiled binary, which is just weird
- dupseek by Antonio Bellezza <http://www.beautylabs.net/software/dupseek.html>
- Focus on interactive duplicate file removal. Probably good at that; I want correct, unattended operation.
- freedup by William Stearns <http://freedup.org/>
- Looks fairly good, even though it's written in bash (freaks me out)
- Offers an option to strip metadata and compare only file *contents* for MP3, MPEG4, MPC, JPEG, and Ogg (not FLAC, I guess), which is great.
Conclusion
rdfind looks great. Every once in a while, two hours are better spent doing research than re-inventing the wheel. This is one of those times where I was more useful to my life as a secretary than I would have been as a programmer.
Sat, 15 Mar 2008
Crunchy
It looks good, and it seems at least more complete than "half-baked". I haven't tried it.
Fri, 07 Mar 2008
Herbert and pajamas
Three nights ago, I went to Pancho Villa at 11 at night for a burrito. I was in my pajamas, and an originally-Polish girl (whose name I forget, sadly) struck up a conversation with me. She explained a minute or so in:
I was thinking of talking to you as soon as I saw you, but I thought you looked very focused on the menu.
She was sweet, and I suggested she wear hers to Pancho Villa. "But they're pink!" she objected.
On the walk back, a boy and a girl who looked in their mid-twenties drove down Fourteenth Street and turned onto Valencia while I was waiting at the corner. The girl's face lit up when she saw my pajamas.
Two nights ago, I was at a cocktail party at the home of the principal of the French-American International High School that I will be speaking at tomorrow. I spoke with the director of maths [sic] at the school, and after about twenty minutes of good conversation, he asked me if I knew I had this little guy in my pocket, or if my kids had put him there while I was unaware. Half an hour later, someone else wondered the same thing to me. Regardless, he was well-received by the school's staff, and in a car ride home, one of the teachers who organized the Internet Day joyfully played with him a little.
I know Herbert makes people happy, but who knew pajamas might be comparable?
Wed, 05 Mar 2008
Reasons for doing things
<asheesh> It goes something like, "Because we're good human beings."
<venkatesh> okay. i don't agree with that, particularly
Mon, 03 Mar 2008
Interactive ext3 performance
In 2001, drobbins published an article on IBM DeveloperWorks remarking that the data=journal mount option improved interactive performance on one test from ca. 70 seconds to 7 seconds.
Even today, the openSUSE wiki echoes this advice. I wonder if it still holds.
The 2000-Year-Old Computer (and Other Achievements of Ancient Science)
I went to Ask A Scientist on February 26, and heard an interesting pair of presentations by Richard Carrier at Columbia. The topic was the various achievements of ancient science.
I wrote some notes on the back of my receipt. Direct quotes from him are between quotation marks.
- He likes the word "kooky".
- "chunk of junk"
- Archimedes' Codex is an exemplar of science being overwritten with hymns in the Middle Ages.
- On Ptolemy's system of epicycles: "The model worked really well. That's why they were so seduced by it."
- When asked about politics of science: "I can't off the top of my head think of an interesting story." (Note that he did, in fact, come up with one.)
- He pronounces "dissection" as if it were "disection."
I also wrote this down, purely my own creation:
- "Amazing what middle schoolers know that the Ancients did not."