Tue, 22 Dec 2009
This year's Python conference takes place February 19-21 in Atlanta, Georgia, USA.
This year is the first year PyCon is holding a poster session. My poster is on open source and Free Software for the Python community, focusing on how you can get involved.
It's a plenary session. This means that for 90 minutes, a dozen of us presenters will be standing in front of our posters, hoping PyCon attendees will talk to us. Everyone at PyCon will be milling about, since there will be no talks during the poster session. So stop by!
Web scraping tutorial
I had lots of fun last year talking to a packed room about programming the web, and I'm giving the tutorial again this year. A tutorial is a paid three-hour course (with refreshments) in a classroom setting. The World-Wide Web is the world's most widely used distributed computing system; if you're only using it from a web browser, you're missing out. Based on what last year's attendees said afterward at lunch, they enjoyed themselves too!
From Python, there's a host of choices for pulling information from the web, and a few choices for pushing data back (usually through forms). Here are some topics we'll cover:
- The sorry (if humorous) state of standards on the web
- Evaluating web page parsing engines
- Why you don't want to use regular expressions (and how you can anyway)
- Submitting to forms
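To give a flavor of the parsing and form topics above, here is a minimal sketch using only Python's standard library. It is not taken from the tutorial materials: the sample page, the `LinkAndFormParser` class, and the field names are all made up for illustration. It pulls links and form fields out of some sloppy HTML with a real parser instead of regular expressions, then URL-encodes the data the way a form submission would.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# A made-up page with inconsistent case and quoting, as is common on the real web.
SAMPLE_PAGE = """
<html><body>
<a href="/about">About</a>
<A HREF='/contact' >Contact</a>
<form action="/comment" method="post">
  <input type="text" name="author" value="">
  <input type="hidden" name="token" value="abc123">
  <textarea name="body"></textarea>
</form>
</body></html>
"""


class LinkAndFormParser(HTMLParser):
    """Hypothetical helper: collects link targets and form fields."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.form_action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "form":
            self.form_action = attrs.get("action")
        elif tag in ("input", "textarea") and "name" in attrs:
            # Keep any preset value (hidden tokens matter when submitting).
            self.fields[attrs["name"]] = attrs.get("value") or ""


parser = LinkAndFormParser()
parser.feed(SAMPLE_PAGE)

# Fill in the visible fields, keep the hidden ones, and encode for a POST body.
fields = dict(parser.fields, author="Ada", body="Nice post!")
encoded = urlencode(fields)

print(parser.links)
print(parser.form_action)
print(encoded)
```

The parser copes with the uppercase tag and single-quoted attribute without any special handling, which is the kind of variation that tends to trip up a hand-written regular expression.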
I think the most exciting part is the discussion of getting around anti-scraping countermeasures. This is where the rubber hits the road. We'll:
- Write Python code that automatically submits comments to a WordPress blog protected by WP Hashcash
- Choose a user-agent so Google doesn't block us immediately
- Look at deployed software that has built-in CAPTCHA solving
- Compare the effectiveness of different countermeasures
- Automate a completely AJAX website by mechanizing Firefox itself
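As a small illustration of the user-agent item above, here is a sketch of setting the header with the standard library. By default, `urllib` identifies itself as `Python-urllib/x.y`, which some sites reject. The `build_request` helper, the URL, and the user-agent string are hypothetical, and the code only constructs the request; it does not contact any server.

```python
from urllib.request import Request


def build_request(url, user_agent):
    """Hypothetical helper: build a GET request with an explicit User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


# Both values below are placeholders for illustration.
req = build_request(
    "http://example.com/search?q=python",
    "Mozilla/5.0 (compatible; ExampleScraper/0.1)",
)

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))

# To actually send it, you would pass req to urllib.request.urlopen(req).
```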
Last year's version is online as a video. If you missed it last year, register for PyCon and sign up for my tutorial, "Scrape the Web." You're likely to learn a lot, and I'm always happy to answer questions during and afterward.
Brian Gershon, one of last year's attendees, explained it best:
Why use an API, when you can just grab it off the page? :)