Monthly Archives: August 2013

Teaching Python at Harvard with Software Carpentry


Mike teaching Hamlet in Python. Photo copyright Chris Erdmann:

I’m part of an organization called Software Carpentry in NYC. This uses volunteers to teach programming at varying levels to universities, large governmental organizations and other interested groups of people. I previously taught at Columbia and this past weekend it was held at Harvard, organized by Chris Erdmann, the head librarian at the Harvard-Smithsonian Center for Astrophysics.

Before Software Carpentry, my teaching experience was limited to explaining aspects of programming to friends and family, as well as part of a year spent teaching English and French to children and adults in Japan. Teaching is hard. It’s very easy to be critical of a teacher – I’ve often found myself being so without thinking about the effort and stress behind conveying a complex concept to a group of students all with varying backgrounds and motivations. I’ve come up with a few conclusions about how to optimize teaching style from my last 2 SWC events:

Saturday’s Teacher line-up

Things that worked well

  • Humor. Mike sprinkled his tutorial with funny anecdotes which kept the class very lively.
  • Relevant and interesting subject matter. Hamlet was a good choice, as was the theme of cheating at scrabble due to the librarian-oriented audience. The dictionary brought up several amusing entries for searches like:  grep ".*s.*s.*s.*s.*s.*s" words | less
  • Adding anecdotes to save people googling things. I reckon that a large amount of any programmer’s activities are in simply finding someone who’s done what you want to do before, and slightly modifying things – or connecting up the building blocks. So at the end of talking about the benefits of things like append()  vs concatenating with plus signs like first+second , I mentioned things like deque()  and  format() .

Things to remember for next time

  • Typing SLOWLY. I work a lot with MongoDB, so end up typing from pymongo import Connection; c = Connection()  20+ times a day into the terminal. This can become so fast, things like that can seem bewildering to newcomers.
  • Using a high contrast terminal with large font and dimmed lights, to make it super easy to see from the back of the room.

What can advanced programmers get out of teaching such basic things?

  • You’ll learn a lot from the instructors and student’s questions
  • Community involvement is a great asset on your resume and shows potential employers that you have the ability/drive to train future co-workers
  • It helps to have on-hand analogies and anecdotes developed during teaching when explaining technical matters to non-technical people, socially or business-wise.
  • You’ll meet many like minded people and it feels great to get involved in the community.

What did I learn?

  • The requests library. I normally use urllib2 to grab html from web pages. Urllib2, it turns out, is simply a more extensible library for HTTP requests as shown in this stackoverflow explanation.
  • More about Git. I use SVN at work and thus don’t really submit anything to github. Git is HARD. Erik was an excellent instructor and calmly went from the basics right through to the minutiae of things like .gitignore and diff.
  • What “immutable” really means. I hear this thrown around quite a lot and it basically just means things can’t be assigned to an object. E.g. the . split()  of myString.split()  can’t become a variable. Very simple.

Review of Data Science for Business (O’Reilly, 2013)

Book cover

I’m currently participating in the O’Reilly Blogger Review Program – where bloggers are given ebooks of recent publications. 

Data Science for Business fits an interesting gap in the market – managers who want to be able to understand what Data Science is, how to recruit Data Scientists or how to manage a data-oriented team. It says it is also for aspiring Data Scientists, but I would probably recommend Andrew Ng’s Machine Learning course and Codecademy’s intro Python course instead if you’re serious about getting your teeth into the field.

Somewhere between an introduction and an encyclopedia, it gives fairly comprehensive overviews of each sub-field, including distinctions that I hadn’t previously thought of so clearly. The authors are mostly unafraid to explain the maths behind the subjects. It dips into some probability and linear algebra – admittedly with simplified notation. There’s no real mention of implementation (i.e. programming the examples) as one would usually expect with O’Reilly; but most competent readers will now at least know what they’re “looking for” perhaps in terms of packages to install or if they want to try and implement a system from scratch. It is certainly designed for the intelligent, professional and far from popular science.

Whilst it is very thorough and interesting it could touch a nerve among Data Scientists, since should a manager of a Data Scientist really have to read a book such as this – surely in such a position of authority they should know of these techniques already? (an extreme example would be one footnote which even contains a description of what Facebook is, and what it is used for). Often, such unbalanced hierarchies are the cause of much unnecessary stress and complication in the workplace. However, this is often the case so perhaps this will be useful in that context.

I think, overall, I was hoping for a slightly different book – with more in-depth case studies of how to implement existing Data Science knowledge into Business scenarios. Nevertheless, it’s an interesting, intelligent guide in an encyclopedic sense and fairly unique in its clarity of explanation and accessibility – I highly doubt I could write a better guide in that respect. Existing Data Scientists will find many clear analogies to explain their craft to those less technical than themselves and I reckon that by itself justifies taking a look 🙂

When literal_eval fails: using asterisk notation to read in a datetime object

One great function in python is the ast  (Abstract Syntax Tree) library’s literal_eval . This lets you read in a string version of a python datatype:

Importing a dictionary such as this is similar to parsing JSON using Python’s  json.loads decoder. But it also comes with the shortcoming’s of JSON’s restrictive datatypes, as we can see here when the dictionary contains, for example, a datetime object:

So you might try and write some code to parse the dictionary data-type yourself. This gets very tricky, but eventually you could probably accommodate for all common data-types:

But this still doesn’t truly fix our datetime object problem:

Which is where we get to the crux of this post. I thought at first that I could deal with datetime’s formatting by extracting the class  datetime.datetime(2013, 8, 10, 21, 46, 52, 638649) as a tuple by spotting the brackets, then feeding the tuple back into datetime like: 

But apparently not. The tuple must be extracted – not by a lambda or perhaps list comprehension, but in fact by using asterisk notation:

Asterisk ( * ) unpacks an iterable such as x into positional arguments for the function. Simple!