Monthly Archives: June 2013

Really simple multi-processing in Python – and then linking to MongoDB

Multiprocessing in Python is really easy! You can spawn processes by using the Pool class from the multiprocessing module.

Here, we spawn 4 processes, and use the map() function to send a number to each of them.
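A minimal sketch of that idea, assuming a made-up task (squaring a number) and hypothetical names square and run_pool; it is not the original listing:

```python
from multiprocessing import Pool


def square(number):
    """Work done inside a worker process (a made-up example task)."""
    return number * number


def run_pool(numbers, processes=4):
    """Spawn `processes` workers and hand each number to one of them."""
    with Pool(processes=processes) as pool:
        # map() splits the inputs across the workers and keeps the order.
        return pool.map(square, numbers)


if __name__ == "__main__":
    # The guard stops child processes from re-running this part on import.
    print(run_pool([1, 2, 3, 4]))
```

The if __name__ == "__main__" guard matters on platforms that start workers by re-importing the script.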

This is a trivial example, but gets much more powerful when each process does something like making a remote connection:
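One way to give every worker its own connection is Pool's initializer argument. The sketch below uses a hypothetical FakeConnection class as a stand-in for a real database client, so the names and the returned dictionary are assumptions for illustration only:

```python
import os
from multiprocessing import Pool

_connection = None  # one per worker process, set by init_worker()


class FakeConnection:
    """Hypothetical stand-in for a real client such as a database connection."""

    def __init__(self):
        self.owner_pid = os.getpid()

    def save(self, document):
        # A real client would send the document over the network here.
        return {"saved": document, "worker_pid": self.owner_pid}


def init_worker():
    """Runs once in each worker: open that worker's own connection."""
    global _connection
    _connection = FakeConnection()


def store(number):
    """Use the current worker's connection to store one item."""
    return _connection.save({"value": number})


def run_pool(numbers, processes=4):
    with Pool(processes=processes, initializer=init_worker) as pool:
        return pool.map(store, numbers)


if __name__ == "__main__":
    for result in run_pool(range(8)):
        print(result)
```

Opening the connection inside the worker, rather than in the parent, avoids sharing one socket between processes.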

Markov Transition Matrix Equilibrium made simple in Python

As a refresher on Markov Models, I’ve been watching Professor Scott E. Page’s excellent videos on YouTube. He does all the computation by hand – and now I’ve written some code to perform it faster.

In video 2, we are given a transition matrix of one hypothetical student in a classroom transitioning between alertness and boredom:

              Alert: t    Bored: t
Alert: t+1    0.8         0.25
Bored: t+1    0.2         0.75

This can be represented in Python as:
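One plausible representation is a list of rows, mirroring the table. The variable name transition_matrix is an assumption; the column check is only there to confirm the numbers were typed in the right orientation:

```python
# Rows are the state at t+1, columns are the state at t.
transition_matrix = [
    [0.8, 0.25],  # Alert at t+1: coming from Alert, coming from Bored
    [0.2, 0.75],  # Bored at t+1: coming from Alert, coming from Bored
]

# Each column holds the probabilities of leaving one state, so it sums to 1.
column_sums = [sum(column) for column in zip(*transition_matrix)]
print(column_sums)
```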

The vector of students is, again, a simple list:

Let’s calculate one stage:
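A sketch of a single stage, assuming a class of 100 students who all start out alert (the starting vector is an assumption, not taken from the video). Each new value is the dot product of one matrix row with the current vector, so 80 students stay alert and 20 become bored:

```python
def markov_stage(matrix, vector):
    """Apply the transition matrix once: one row-by-vector dot product per state."""
    return [sum(p * v for p, v in zip(row, vector)) for row in matrix]


transition_matrix = [
    [0.8, 0.25],  # Alert at t+1
    [0.2, 0.75],  # Bored at t+1
]
students = [100, 0]  # assumed start: 100 alert, 0 bored

next_students = markov_stage(transition_matrix, students)
print(next_students)
```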

Calling this produces:

Or, let’s try to find the equilibrium state by looping markov_stage until the values stop changing at a given number of decimal places:
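A sketch of such a loop, under the same assumed starting vector of 100 alert students. The places and max_stages parameters are hypothetical names; the stopping rule compares rounded values from consecutive stages. For this matrix the values settle near 5/9 alert and 4/9 bored, roughly 55.6 and 44.4 students:

```python
def markov_stage(matrix, vector):
    """Apply the transition matrix once."""
    return [sum(p * v for p, v in zip(row, vector)) for row in matrix]


def markov_equilibrium(matrix, vector, places=2, max_stages=1000):
    """Repeat markov_stage until no value changes at `places` decimals."""
    for stage in range(1, max_stages + 1):
        new_vector = markov_stage(matrix, vector)
        unchanged = all(
            round(new, places) == round(old, places)
            for new, old in zip(new_vector, vector)
        )
        if unchanged:
            return new_vector, stage
        vector = new_vector
    raise RuntimeError("no equilibrium reached within max_stages")


transition_matrix = [
    [0.8, 0.25],  # Alert at t+1
    [0.2, 0.75],  # Bored at t+1
]
equilibrium, stages = markov_equilibrium(transition_matrix, [100, 0])
print([round(value, 2) for value in equilibrium], "after", stages, "stages")
```

The max_stages cap is a safety net for matrices that oscillate instead of converging.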

This produces, with an accuracy set at 2 decimal places:

I probably could have written this in NumPy – which would calculate faster using less memory (and has built-in functions for the matrix-vector multiplication) – but it was fun just doing this. I’ll try and extend markov_equilibrium to give some more detailed stats such as the “churn” mentioned by Prof. Page.

Custom DefaultDicts with Lambdas in Python for creating detailed frequency tables, or anything else

In Python, there is the dictionary datatype. This is basically a look-up table:
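For illustration, with made-up names and ages as the data:

```python
# A dictionary maps keys to values, like a look-up table.
ages = {"alice": 31, "bob": 27}  # hypothetical data

print(ages["alice"])   # look up a value by its key
ages["carol"] = 45     # add a new entry
print("dave" in ages)  # membership test on the keys

try:
    ages["dave"]       # looking up a missing key raises KeyError
except KeyError:
    print("no entry for dave")
```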

Let’s try and create a frequency table for words though:
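A sketch of the plain-dictionary approach, assuming a hypothetical function name frequency_table and a list of words as input:

```python
def frequency_table(words):
    """Count occurrences with a plain dict, handling missing keys by hand."""
    counts = {}
    for word in words:
        try:
            counts[word] += 1   # fails the first time a word is seen
        except KeyError:
            counts[word] = 1    # so create the entry instead
    return counts


print(frequency_table("the cat sat on the mat".split()))
```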

This bad code will eventually return a dictionary with entries like:

thing: how_many_times_it_occurred

However, we have to do this try-except statement in case the key doesn’t exist. Defaultdicts get rid of this stage by returning a blank entry (0, empty string, empty list) instead, which is really awesome!
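The same counting loop with collections.defaultdict might look like this (frequency_table is again a hypothetical name):

```python
from collections import defaultdict


def frequency_table(words):
    """Count occurrences; missing keys start at int() == 0 automatically."""
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1  # no try/except needed
    return counts


print(dict(frequency_table("the cat sat on the mat".split())))
```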

If the key didn’t already exist in our look-up table, then the defaultdict calls int() to create the starting value, which is 0! This defaultdict(int) could be replaced with defaultdict(list) or any other callable that takes no arguments.

And now to the crux of the post! We can replace this variable type with a lambda instead, like this:
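A sketch of the idea, assuming hypothetical field names count and last_seen. The lambda is called with no arguments every time a missing key is looked up, so each key gets its own fresh dictionary:

```python
from collections import defaultdict

# The lambda is the "default factory": it builds the value for a new key.
table = defaultdict(lambda: {"count": 0, "last_seen": None})

entry = table["new_key"]  # the key is missing, so the lambda runs
print(entry)

# Each key gets a separate dictionary, not a shared one.
table["other_key"]["count"] += 1
print(table["new_key"]["count"], table["other_key"]["count"])
```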

Now, when the key doesn’t exist, the dictionary will create a new dictionary within! So we can bring another metric into our analysis:
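A sketch of such a function, assuming the input is a sequence of (timestamp, thing) pairs in chronological order. The function name, field names, and sample events are all made up for illustration:

```python
from collections import defaultdict


def detailed_frequency_table(events):
    """Track how often each thing occurred and when it was last seen."""
    table = defaultdict(lambda: {"count": 0, "last_seen": None})
    for timestamp, thing in events:
        entry = table[thing]            # created on first sight
        entry["count"] += 1
        entry["last_seen"] = timestamp  # later events overwrite earlier ones
    return table


# Hypothetical sample data, already sorted by time.
events = [("09:00", "login"), ("09:05", "search"), ("09:07", "login")]
result = detailed_frequency_table(events)
for thing, stats in result.items():
    print(thing, stats)
```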

Now our function will return a dictionary that not only lets you know how many times something occurred, but also when it last occurred! Try it out with the following data:

 

Two quick fixes to make your web browsing more anonymous

Two very easy ways to make your browsing more private are to install the following two extensions in Google Chrome:

Why?

  • These block the (usually third-party) components which let companies log your browsing data. Check out the Collusion extension to see who knows what. Blocking them makes it harder for companies to do things like (allegedly) increasing airfares and hotel prices based on your browsing history.
  • A nice side effect is that pages may load faster due to there being fewer third-party components to load. Often, sites will add these components to the head of the page, before any other page elements, to make sure that the advertising is given priority over the actual page content.

Caveats:

  • You can still be easily monitored at the ISP level, by a toolbar, or if the website is selling off internally logged data.
  • Blocking components using these may “break” some websites by removing essential page-formatting style-sheets and scripts.
  • Blocking advertising on your favorite websites may reduce their advertising income, so they are not rewarded for their traffic/efforts.

Hypocrisy:

  • I use “statcounter” on this website, which lets me see how many page views there are. StatCounter also aggregates all of its clients’ data into a very interesting public global stats page. In my opinion, it’s more useful and trustworthy than Google Analytics. You can block it if you’d like.
  • Yes, I realize that as a data scientist who works with data not dissimilar to that being collected by these tracking scripts, I’m sort-of reducing the amount of data I get to analyze and putting myself out of a job. But it’s ethically essential to publish the fact that opt-out is available, for my own peace of mind and also to rebuke scare-mongers such as this ridiculous article.

Reading really large files in Python

There are many ways to read files in Python, which is kinda un-intuitive and goes against the “one obvious way to do it” line in the Zen of Python (PEP 20). However, this is the best:
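A sketch of the pattern, looping directly over the file object inside a with block. The count_lines helper and the small temporary file are assumptions added so the example runs on its own:

```python
import os
import tempfile


def count_lines(path):
    """Walk through a file one line at a time without loading it all."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:  # the file object yields lines lazily
            total += 1
    return total        # the file is already closed here


# Build a tiny demo file; a real use case would point at a large file.
with tempfile.NamedTemporaryFile(
    mode="w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first\nsecond\nthird\n")
    demo_path = tmp.name

line_count = count_lines(demo_path)
print(line_count)
os.remove(demo_path)
```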

Iterating over the file object reads lazily, so only one line at a time is held in memory (plus a small read buffer). I’ve used this code to process multiple-gigabyte files with ease. Using with is useful since it automatically closes the file after the code indented below it has finished.

Often people make the mistake of using:

  • for line in f.read() – this loads the entire file into one string, and the loop then steps through it character by character
  • for line in f.readlines() – this loads the entire file into a list in memory
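The difference between the three loops can be seen on a tiny input. This sketch uses io.StringIO as a stand-in for an open text file, which is an assumption made to keep the example self-contained:

```python
import io

text = "ab\ncd\n"

# f.read() returns one big string, so the loop yields single characters.
chars = [item for item in io.StringIO(text).read()]

# f.readlines() builds the complete list of lines before the loop starts.
lines_list = io.StringIO(text).readlines()

# Iterating over the file object yields the same lines, one at a time.
lazy_lines = [line for line in io.StringIO(text)]

print(chars)
print(lines_list)
print(lazy_lines)
```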