Category Archives: Python

Really simple multi-processing in Python – and then linking to MongoDB

Multiprocessing in Python is really easy! You can spawn processes by using the Pool class from the multiprocessing module.

Here, we spawn 4 processes, and use the pool's map() method to hand a list of numbers out to them.
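A minimal sketch of that idea, assuming a stand-in task called square (a hypothetical name) and the numbers 1 to 4 as input:

```python
from multiprocessing import Pool


def square(number):
    """Stand-in for the work done inside each worker process."""
    return number * number


if __name__ == "__main__":
    # Spawn 4 worker processes; the context manager cleans them up afterwards.
    with Pool(processes=4) as pool:
        # map() distributes the inputs among the workers
        # and returns the results in the original order.
        results = pool.map(square, [1, 2, 3, 4])
    print(results)  # [1, 4, 9, 16]
```

The `if __name__ == "__main__":` guard matters on platforms that start workers by re-importing the script, since it stops each worker from trying to create a pool of its own.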

This is a trivial example, but the approach gets much more powerful when each process does something slower, like making a remote connection:

Markov Transition Matrix Equilibrium made simple in Python

As a refresher on Markov Models, I’ve been watching Professor Scott E. Page’s excellent videos on YouTube. He does all the computation by hand – and now I’ve written some code to perform it faster.

In video 2, we are given a transition matrix for a hypothetical student in a classroom transitioning between alertness and boredom:

             Alert: t   Bored: t
Alert: t+1   0.8        0.25
Bored: t+1   0.2        0.75

This can be represented in Python as:
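One way to write it down is a list of rows, sketched here with the assumed name transition_matrix and the state order (alert, bored):

```python
# Rows are the state at t+1, columns are the state at t (order: alert, bored).
transition_matrix = [
    [0.8, 0.25],  # alert at t+1
    [0.2, 0.75],  # bored at t+1
]

# Sanity check: each column lists where one state can go next,
# so every column should sum to 1.
column_sums = [sum(column) for column in zip(*transition_matrix)]
print(column_sums)
```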

The vector of students is, again, a simple list:

Let’s calculate one stage:
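A rough sketch of a single stage, which is just the matrix multiplied by the vector. The starting vector of 100 alert and 0 bored students is an assumption for illustration, and the body of markov_stage is one possible implementation:

```python
def markov_stage(matrix, vector):
    """Advance the population vector by one time step (matrix times vector)."""
    return [
        sum(probability * count for probability, count in zip(row, vector))
        for row in matrix
    ]


transition_matrix = [[0.8, 0.25], [0.2, 0.75]]
students = [100, 0]  # assumed starting point: [alert, bored]

next_students = markov_stage(transition_matrix, students)
print(next_students)
```

With that starting vector, 80% of the alert students stay alert and 20% become bored, so the result is 80 alert and 20 bored.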

Calling this produces:

Or, let’s try to find the equilibrium state by calling markov_stage in a loop until the values stop changing at a chosen number of decimal places:
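A sketch of such a loop, under a few assumptions: the decimals and max_stages parameters are hypothetical names, markov_stage is re-implemented here so the snippet runs on its own, and the starting vector is again 100 alert, 0 bored.

```python
def markov_stage(matrix, vector):
    """Advance the population vector by one time step (matrix times vector)."""
    return [
        sum(probability * count for probability, count in zip(row, vector))
        for row in matrix
    ]


def markov_equilibrium(matrix, vector, decimals=2, max_stages=1000):
    """Repeat markov_stage until every value is unchanged after rounding."""
    for _ in range(max_stages):  # safety cap in case the values never settle
        next_vector = markov_stage(matrix, vector)
        settled = all(
            round(new, decimals) == round(old, decimals)
            for new, old in zip(next_vector, vector)
        )
        vector = next_vector
        if settled:
            break
    return [round(value, decimals) for value in vector]


transition_matrix = [[0.8, 0.25], [0.2, 0.75]]
equilibrium = markov_equilibrium(transition_matrix, [100, 0])
print(equilibrium)
```

Solving a = 0.8a + 0.25b by hand gives an exact equilibrium of 5/9 alert and 4/9 bored (about 55.56 and 44.44 out of 100). Because the loop stops as soon as two consecutive stages round to the same values, its answer can sit a hundredth or so away from that.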

This produces, with an accuracy set at 2 decimal places:

I probably could have written this in NumPy, which would calculate faster using less memory and has built-in matrix-vector multiplication (numpy.dot or the @ operator), but it was fun just doing this. I’ll try and extend markov_equilibrium to give some more detailed stats, such as the “churn” mentioned by Prof. Page.

Custom DefaultDicts with Lambdas in Python for creating detailed frequency tables, or anything else

In Python, there is the dictionary datatype. This is basically a look-up table:
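For instance, with made-up names and ages as the data:

```python
# Keys map to values; here hypothetical names map to ages.
ages = {"alice": 31, "bob": 27}

print(ages["alice"])   # look up an existing key
ages["carol"] = 45     # add a new key
print("dave" in ages)  # membership test; ages["dave"] would raise KeyError
```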

Let’s try and create a frequency table for words though:
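A sketch of the plain-dictionary approach, assuming a function called frequency_table (a hypothetical name) that takes a list of words:

```python
def frequency_table(words):
    """Count occurrences with a plain dict and try/except."""
    counts = {}
    for word in words:
        try:
            counts[word] += 1
        except KeyError:
            # First time we see this word: nothing to increment yet.
            counts[word] = 1
    return counts


word_counts = frequency_table(["spam", "eggs", "spam"])
print(word_counts)  # {'spam': 2, 'eggs': 1}
```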

This bad code will eventually return a dictionary with entries like:

thing: how_many_times_it_occurred

However, we need the try-except statement in case the key doesn’t exist yet. A defaultdict gets rid of this step: when a key is missing, it creates a blank entry (0, empty string, empty list), stores it and returns it instead, which is really awesome!
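The same counter with collections.defaultdict might look like this (frequency_table is again a hypothetical name):

```python
from collections import defaultdict


def frequency_table(words):
    """Count occurrences without any try/except."""
    counts = defaultdict(int)  # missing keys start at int(), i.e. 0
    for word in words:
        counts[word] += 1
    return counts


word_counts = frequency_table(["spam", "eggs", "spam"])
print(dict(word_counts))  # {'spam': 2, 'eggs': 1}
```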

If the key doesn’t already exist in our look-up table, the defaultdict calls int() to create a starting value of 0, which we can then increment! The int in defaultdict(int) could be replaced with list, or with any other callable that takes no arguments.

And now to the crux of the post! We can replace this type with a lambda instead, like this:
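A small sketch, where the field names count and last_seen are assumptions picked for illustration:

```python
from collections import defaultdict

# The lambda is called with no arguments whenever a key is missing,
# so every new key gets its own fresh dictionary.
stats = defaultdict(lambda: {"count": 0, "last_seen": None})

stats["spam"]["count"] += 1
print(dict(stats))  # {'spam': {'count': 1, 'last_seen': None}}
```

A lambda is needed because defaultdict expects something it can call; passing the dictionary itself would not work, and sharing one dictionary between keys would mix their counts together.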

Now, when the key doesn’t exist, the dictionary will create a new dictionary within! So we can bring another metric into our analysis:
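Here is one way that could look, assuming "when it last occurred" means the position in the input list. The function name, the field names and the sample words are all hypothetical:

```python
from collections import defaultdict


def detailed_frequency_table(words):
    """Track how often each word occurs and the index where it was last seen."""
    stats = defaultdict(lambda: {"count": 0, "last_seen": None})
    for position, word in enumerate(words):
        entry = stats[word]  # created on first access
        entry["count"] += 1
        entry["last_seen"] = position
    return stats


table = detailed_frequency_table(["spam", "eggs", "spam", "ham"])
print(table["spam"])  # {'count': 2, 'last_seen': 2}
```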

Now our function will return a dictionary that not only lets you know how many times something occurred, but also when it last occurred! Try it out with the following data:

 

Reading really large files in Python

There are many ways to read files in Python, which is kinda un-intuitive and goes against the “one obvious way to do it” idea in the Zen of Python (PEP 20). However, this is the best:
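A minimal sketch of the pattern: open the file with with and loop over the file object directly. The count_lines helper and the small temporary file are stand-ins for real processing and a real multi-gigabyte file:

```python
import os
import tempfile


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:  # the file object yields one line per iteration
            total += 1
    return total


# Demo on a small temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first\nsecond\nthird\n")

line_total = count_lines(tmp.name)
print(line_total)  # 3
os.remove(tmp.name)
```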

Iterating over the file object reads the file lazily, so only one line (plus a small internal buffer) is held in memory at a time. I’ve used this code to process multiple-gigabyte files with ease. Using with is useful since it automatically closes the file once the indented block has finished, even if an exception is raised.

Often people make the mistake of using:

  • for line in f.read() – this loads the entire file into memory as one string, and the loop then iterates over it character by character
  • for line in f.readlines() – this loads the entire file into a list of lines in memory
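The difference is easy to see on a tiny input. This sketch uses io.StringIO as an in-memory stand-in for a real file:

```python
import io

f = io.StringIO("ab\ncd\n")

# read() returns one big string, so looping over it yields characters.
chars = [item for item in f.read()]

f.seek(0)
# readlines() builds the complete list of lines up front.
all_lines = f.readlines()

f.seek(0)
# Looping over the file object yields the same lines, one at a time.
lazy_lines = [line for line in f]

print(chars)       # ['a', 'b', '\n', 'c', 'd', '\n']
print(all_lines)   # ['ab\n', 'cd\n']
print(lazy_lines)  # ['ab\n', 'cd\n']
```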