Using Web Workers in Firefox Extensions

Web Workers allow you to run code in the background in browsers such as Firefox. This post shows how to build one into a Firefox Extension, which is slightly different from creating one on a normal web page. The documentation for doing this is basically non-existent, so hopefully you’ll find this useful.

Please make sure you have a development environment set up similar to the one described in my previous post.

How do workers work?

  • Workers in /data/ are not directly connected to scripts in /lib/
  • However, they can communicate by sending messages to each other
  • These messages are text only, so they can contain serialized JSON, but nothing else
  • You’ll notice below that we are basically just slinging messages between two scripts

The code for the worker

Navigate to the /data/ directory and create a file called hello_world.js

Now paste the following in there (new users of vim, press i  to start typing and Esc  followed by :wq  to save):
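
Something along these lines (a minimal sketch, going by the description below):

```javascript
// hello_world.js -- reply to every message with "Hello " prepended
onmessage = function (event) {
  postMessage("Hello " + event.data);
};
```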

This says that whenever the worker receives a message from the client, it sends a message back with the word “Hello” prepended.

One note here: in workers, you can’t use the useful console.log("message") function; use dump("message") instead.

Let’s call the worker from the main code

Let’s navigate back to the /lib/  folder and edit the main.js  file, which is the first thing that runs in the extension.

Paste in the following code:
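
A minimal sketch of this first attempt (the exact original snippet may have differed; “Matthew” is the name that shows up in the output later):

```javascript
// main.js -- naive first attempt: create a worker and send it a message
var worker = new Worker("hello_world.js");
worker.onmessage = function (event) {
  console.log(event.data);  // print whatever the worker sends back
};
worker.postMessage("Matthew");
```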

And run  cfx run . You’ll notice a messy error:

Aha! The key line here is:  ReferenceError: Worker is not defined . This is because Firefox Extensions use something called a ChromeWorker instead. We need to import this in main.js by pasting this at the top:
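
```javascript
var { ChromeWorker } = require("chrome");
```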

and changing the line that references the hello_world.js file to call a ChromeWorker instead:
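
```javascript
var worker = new ChromeWorker("hello_world.js");
```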

OK, let’s try running it again with cfx run. Wtf, another error?!

The key line here is: Malformed script URI: hello_world.js. This cryptic error appears because Firefox can’t yet access anything in the /data/ folder. We have to use another part of the SDK to enable access to it.

Open main.js  and put this at the top:
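
In SDK 1.17 the module is sdk/self (older examples use require("self")):

```javascript
var self = require("sdk/self");
```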

Now we can use the function self.data.url() . When you put a filename as the first argument, it will return a string like  resource://jid1-zmowxggdley0aa-at-jetpack/test/data/whatever_file.js which properly refers to it in the case of extensions. Modify the worker import line as follows:
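
```javascript
var worker = new ChromeWorker(self.data.url("hello_world.js"));
```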

Now let’s run the extension again using cfx run :

Yay it works! The Worker returned the message “Hello Matthew”.

FAQ

  • What does this {notation}  mean?

It is shorthand for:
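
```javascript
// The destructuring form is equivalent to:
var chromeModule = require("chrome");
var ChromeWorker = chromeModule.ChromeWorker;
```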

Basically this means that require("chrome") returns an object, and we just need the value referenced by the key “ChromeWorker”. This is a very succinct way of extracting things from JavaScript objects that will come in handy in the future.

  • Why is Worker now called ChromeWorker? Are we doing something with Google Chrome?

This is a naming coincidence and has nothing to do with Chrome the browser. “Chrome” in this case refers to Firefox addon internals.

Setting up a development environment for Firefox Extensions

This is the method I use to create simple Firefox extensions. This tutorial is a precursor to the next one, which is about using Web Workers (i.e. allowing code to run on background threads).

Setting up the environment

We’re going to need the Firefox Addon SDK. This is a collection of Python files that will let you run a test (optionally blank) version of Firefox. To download it:
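
Something like this should work (the exact URL for the 1.17 tarball is an assumption based on Mozilla’s archive layout):

```bash
wget https://ftp.mozilla.org/pub/mozilla.org/labs/jetpack/addon-sdk-1.17.tar.gz
```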

Now extract it and remove the tarball:
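
For the 1.17 tarball downloaded above:

```bash
tar -xzf addon-sdk-1.17.tar.gz
rm addon-sdk-1.17.tar.gz
```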

Go to the directory and start up the special shell:
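
The SDK shell is activated by sourcing its activate script:

```bash
cd addon-sdk-1.17
source bin/activate
```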

You’ll see that the shell has prepended (addon-sdk-1.17) in brackets to the prompt. By this point the window is probably half-filled with text, so we can clean that up with the command:
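
```bash
clear
```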

Much cleaner! :)

Setting up the extension template

Now that we have this special addon-sdk shell, navigate back to your documents and create a new folder for our extension.
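
For example (the folder name here is just a placeholder):

```bash
cd ~/Documents
mkdir my-extension   # hypothetical name -- call it whatever you like
cd my-extension
```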

This special shell has various useful commands included, which all look like  cfx xyz . For more about them, see here. In this case we use cfx init:
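
```bash
cfx init
```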

Let’s inspect what was created:
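
Running ls in the new folder shows the generated layout (matching the list below):

```bash
ls
# data  lib  package.json  test
```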

  •  lib  contains a file called main.js  which is the main handler file for all extension code
  • data  is empty but can be used to store things like workers (which we will come to later) or large data files
  • test  can contain unit tests (quite hard to set up but useful for test driven development later)
  • package.json  contains metadata about the extension – version number, name of the creator, description, licensing, etc.

You can start writing code in main.js and it will run in the browser. Once finished, use  cfx run to test it!

See the next tutorial on how to write a Firefox extension using web workers!

MongoDB aspirin

Basically this is a work-in-progress list of common errors/tasks/gripes I run into when using Mongo. I’ve noted down what usually works. Maybe you’ll find it useful :-)

Why has my remote MongoDB connection been refused?

    • Delete the mongod.lock  file from your main MongoDB storage folder and restart [SO]
    • If on Red Hat, check  sestatus
    • Modify /etc/sysconfig/iptables  to have the correct firewall rules according to the MongoDB docs.

How can you iterate through all MongoDB collections in PyMongo?

I normally access collections as object attributes like  conn.Database.Collection.find_one() , but databases and collections can also be accessed as keys in a dictionary:
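
A sketch, assuming a default local connection (database_names() and collection_names() were the PyMongo calls of this era):

```python
from pymongo import Connection

conn = Connection()  # local mongod on the default port
for db_name in conn.database_names():
    db = conn[db_name]                      # dict-style database access
    for coll_name in db.collection_names():
        coll = db[coll_name]                # dict-style collection access
        print("%s.%s" % (db_name, coll_name))
```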

Why is mongod terminating whenever I close the shell, even when using &  at the end?

When starting mongod, use mongod --fork (note: fork must be right after the word mongod) and it will start as a background process instead. Or just add fork = true  to your config.
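
For example (note that mongod refuses to fork unless it is also given somewhere to log; the log path here is just an example):

```bash
mongod --fork --logpath /var/log/mongodb/mongod.log
```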

I just created an authenticated database and can’t even use show dbs !

Create a new user with all 4 of the following permissions: userAdminAnyDatabase, readWriteAnyDatabase, dbAdminAnyDatabase, clusterAdmin:
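
In the mongo shell of that era it would look something like this (addUser was later replaced by createUser; the credentials are placeholders):

```js
use admin
db.addUser({
    user: "admin",
    pwd: "s3cret",   // placeholder credentials
    roles: ["userAdminAnyDatabase", "readWriteAnyDatabase",
            "dbAdminAnyDatabase", "clusterAdmin"]
})
```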

Shared Variables in Python Multiprocessing to pre-map/reduce

I’ve been using the multiprocessing library in Python quite a bit recently and started using the shared variable functionality. It can change something like this from my previous post:
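
A sketch of that map/reduce pattern (the counting task here is invented for illustration): worker processes each return a partial result, and a reduce step combines them afterwards.

```python
from multiprocessing import Pool

def count_even(chunk):
    # map stage: each worker counts matches in its own chunk
    return sum(1 for n in chunk if n % 2 == 0)

if __name__ == "__main__":
    chunks = [range(0, 1000), range(1000, 2000)]
    pool = Pool()
    partials = pool.map(count_even, chunks)
    pool.close()
    pool.join()
    total = sum(partials)   # reduce stage: combine the partial results
    print("total: %d" % total)
```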

Into a much nicer:
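
The same invented task with a shared Value: every process increments one shared counter, so there is nothing left to reduce.

```python
from multiprocessing import Process, Value

def count_even(chunk, total):
    for n in chunk:
        if n % 2 == 0:
            with total.get_lock():   # guard against lost updates
                total.value += 1

if __name__ == "__main__":
    total = Value("i", 0)            # "i" is a typecode, not a name!
    procs = [Process(target=count_even, args=(chunk, total))
             for chunk in (range(0, 1000), range(1000, 2000))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print("total: %d" % total.value)  # no reduce stage needed
```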

Thus eliminating the reduce stage. This is especially useful if you have a shared dictionary which you’re updating from multiple processes. There’s another shared datatype called Array, which, as it suggests, is a shared array. Note: one pitfall (that I fell for) is thinking that the "i"  in Value("i", 0)  is the name of the variable. Actually, it’s a typecode which stands for “integer”.

There are other ways to do this, however, each with its own trade-offs:

| # | Solution | Advantages | Disadvantages |
|---|----------|------------|---------------|
| 1 | Shared file | Easy to implement and to access afterwards | Very slow |
| 2 | Shared MongoDB document | Easy to implement | Slow to constantly query for it |
| 3 | Multiprocessing Value/Array (this example) | Very fast, easy to implement | On one PC only; can’t be accessed after the process is killed |
| 4 | Memcached shared value | Networked aspect is useful for big distributed databases; a shared.set() function is already available | TCP could slow you down a bit |

A quick solution to OrderedDict’s limitations in Python with O(1) index lookups

Background to the Problem

I work regularly with gigantic machine learning datasets. One very versatile format, for use in WEKA, is “ARFF” (Attribute-Relation File Format). This essentially creates a nicely structured, rich CSV file which can easily be used in Logistic Regression, Decision Trees, SVMs etc. In order to solve the problem of very sparse CSV data, there is a sparse ARFF format that lets users convert sparse lines in each file such as:

```
f0 f1 f2 f3 fn
1  0  1  0  0
```

Into a more succinct version where you have a list of features and simply specify the feature’s index and value (if any):

```
@ATTRIBUTE f0 NUMERIC
@ATTRIBUTE f1 NUMERIC
@ATTRIBUTE f2 NUMERIC
@ATTRIBUTE f3 NUMERIC

@ATTRIBUTE fn NUMERIC
@DATA
{0 1, 2 1}
```

i.e. {feature-index-zero is 1, feature-index-two is 1}, simply omitting all the zero-values.

The Implementation Problem

This is easy enough if you have, say, 4 features, but what if you have over 1 million features and need to find the index of each one? Searching for a feature in a list is O(n), and if your training data is huge too, then creating the sparse ARFF is going to be hugely inefficient:
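
A sketch of the naive list-based approach:

```python
features = []

def feature_index(name):
    if name not in features:        # O(n) membership test
        features.append(name)
    return features.index(name)     # another O(n) scan
```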

I thought I could improve this by using an OrderedDict. This is, very simply, a dictionary that maintains the order of its items – so you can pop() items from the end in a stack-like manner. However, after some research on StackOverflow, it turns out that this disappointingly doesn’t provide any efficient way to calculate the index of a key:
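
A sketch of the OrderedDict attempt; membership tests are now O(1), but finding a key’s position still means walking the keys:

```python
from collections import OrderedDict

features = OrderedDict()

def feature_index(name):
    if name not in features:            # O(1) now...
        features[name] = None
    return list(features).index(name)   # ...but this is still O(n)
```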

The solution

What can we do about this? Enter my favorite thing ever, defaultdicts with lambdas:
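
The default factory hands every unseen key the next index, so a lookup is a plain O(1) dict access:

```python
from collections import defaultdict

feature_index = defaultdict(lambda: len(feature_index))

print(feature_index["f0"])   # 0 -- first unseen key gets index 0
print(feature_index["f2"])   # 1
print(feature_index["f0"])   # 0 -- already assigned, unchanged
```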

Assigning items values in addition to the index is fairly straightforward with a slightly modified lambda:
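
For example, storing [index, value] pairs:

```python
from collections import defaultdict

features = defaultdict(lambda: [len(features), None])

features["f0"][1] = 1           # fill in the value slot later
index, value = features["f0"]   # -> 0, 1
```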

Limitations

This is a fun fix, but it doesn’t support full dictionary functionality – deleting items won’t reorder the index, and you can’t easily iterate through it in order. However, since creating this ARFF file requires no deletions or ordered iteration, that’s not a problem.

WNYC Radio: “Are Hackathons Worth It?”

I was recently contacted by Jeff Coltin, a journalist at WNYC Radio, who asked me to participate in a show about hackathons in NYC.

He featured a snippet from our conversation, specifically about problems that the hacker community could solve. I said (vaguely accurate transcription):

“…There are so many problems that hackathons could fix. I think some big issues at the moment in the media, things like the NSA spying scandals and stuff like that. I think one thing the tech community has slightly failed to do is to make encryption really easy. There’s a sort-of inverse relationship between simplicity and security, so the more secure an app, often the more inconvenient it is to use. So we have things like TOR, extra-long passwords (TOR slows down your connection a lot), VPNs and a lot of very secure services are incompatible with mainstream services. So this level of security and privacy that users want or need is just so inconvenient to achieve its really up to the hacker community to make them much easier to use…”

There have been efforts such as Cryptocat, but its adoption rate still needs to grow. HTTPS is probably the best example of seamless encryption, but it often fails when people either ignore, or are at a loss as to what to do about, HTTPS certificates being flagged as invalid by the browser.

Cryptography is an incredibly tough field of Computer Science, so creating reliably secure apps is hard. Educating oneself about this can require a fairly super-human effort and I have a lot of respect for people who contribute modules in this field to PyPI. I’m hoping to start the Crypto course on Coursera once I have some more free time, but beating the security-simplicity inverse relationship I mentioned is certainly easier said than done.

Teaching Python at Harvard with Software Carpentry


Mike teaching Hamlet in Python. Photo copyright Chris Erdmann: https://twitter.com/libcce/status/371281901191196672

I’m part of an organization called Software Carpentry in NYC, which uses volunteers to teach programming at varying levels to universities, large governmental organizations and other interested groups of people. I previously taught at Columbia, and this past weekend it was held at Harvard, organized by Chris Erdmann, the head librarian at the Harvard-Smithsonian Center for Astrophysics.

Before Software Carpentry, my teaching experience was limited to explaining aspects of programming to friends and family, as well as part of a year spent teaching English and French to children and adults in Japan. Teaching is hard. It’s very easy to be critical of a teacher – I’ve often found myself being so without thinking about the effort and stress behind conveying a complex concept to a group of students all with varying backgrounds and motivations. I’ve come up with a few conclusions about how to optimize teaching style from my last 2 SWC events:

Saturday’s Teacher line-up

Things that worked well

  • Humor. Mike sprinkled his tutorial with funny anecdotes which kept the class very lively.
  • Relevant and interesting subject matter. Hamlet was a good choice, as was the theme of cheating at scrabble due to the librarian-oriented audience. The dictionary brought up several amusing entries for searches like:  grep ".*s.*s.*s.*s.*s.*s" words | less
  • Adding anecdotes to save people googling things. I reckon that a large part of any programmer’s activity is simply finding someone who’s done what you want to do before and slightly modifying things – or connecting up the building blocks. So at the end of talking about the benefits of things like append() vs concatenating with plus signs like first+second, I mentioned things like deque() and format().

Things to remember for next time

  • Typing SLOWLY. I work a lot with MongoDB, so I end up typing from pymongo import Connection; c = Connection()  20+ times a day into the terminal. I can now type this so fast that it can seem bewildering to newcomers.
  • Using a high contrast terminal with large font and dimmed lights, to make it super easy to see from the back of the room.

What can advanced programmers get out of teaching such basic things?

  • You’ll learn a lot from the instructors and from students’ questions
  • Community involvement is a great asset on your resume and shows potential employers that you have the ability/drive to train future co-workers
  • It helps to have on-hand analogies and anecdotes developed during teaching when explaining technical matters to non-technical people, socially or business-wise.
  • You’ll meet many like-minded people and it feels great to get involved in the community.

What did I learn?

  • The requests library. I normally use urllib2 to grab HTML from web pages. Urllib2, it turns out, is simply a more extensible library for HTTP requests, as shown in this stackoverflow explanation (see the short comparison after this list).
  • More about Git. I use SVN at work and thus don’t really submit anything to GitHub. Git is HARD. Erik was an excellent instructor and calmly went from the basics right through to the minutiae of things like .gitignore and diff.
  • What “immutable” really means. I hear this thrown around quite a lot, and it basically means that an object can’t be changed after creation – e.g. you can’t reassign the .split of myString, and operations on a string return new strings rather than modifying the original. Very simple.
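
A quick side-by-side of the two libraries mentioned above (Python 2 era, matching the post; the URL is just an example):

```python
# urllib2: lower-level, more extensible
import urllib2
html = urllib2.urlopen("http://example.com").read()

# requests: the convenient high-level equivalent
import requests
html = requests.get("http://example.com").text
```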

Review of Data Science for Business (O’Reilly, 2013)

Book cover

I’m currently participating in the O’Reilly Blogger Review Program – where bloggers are given ebooks of recent publications. 

Data Science for Business fits an interesting gap in the market – managers who want to be able to understand what Data Science is, how to recruit Data Scientists or how to manage a data-oriented team. It says it is also for aspiring Data Scientists, but I would probably recommend Andrew Ng’s Machine Learning course and Codecademy’s intro Python course instead if you’re serious about getting your teeth into the field.

Somewhere between an introduction and an encyclopedia, it gives fairly comprehensive overviews of each sub-field, including distinctions that I hadn’t previously thought of so clearly. The authors are mostly unafraid to explain the maths behind the subjects – it dips into some probability and linear algebra, admittedly with simplified notation. There’s no real mention of implementation (i.e. programming the examples) as one would usually expect from O’Reilly, but most competent readers will at least know what they’re “looking for”, perhaps in terms of packages to install or whether they want to try to implement a system from scratch. It is certainly designed for the intelligent professional, and is far from popular science.

Whilst it is very thorough and interesting, it could touch a nerve among Data Scientists: should the manager of a Data Scientist really have to read a book such as this? Surely someone in such a position of authority should know these techniques already (an extreme example is one footnote which even contains a description of what Facebook is and what it is used for). Such unbalanced hierarchies are often the cause of much unnecessary stress and complication in the workplace – but since they are common, perhaps the book will be useful in exactly that context.

I think, overall, I was hoping for a slightly different book – with more in-depth case studies of how to implement existing Data Science knowledge into Business scenarios. Nevertheless, it’s an interesting, intelligent guide in an encyclopedic sense and fairly unique in its clarity of explanation and accessibility – I highly doubt I could write a better guide in that respect. Existing Data Scientists will find many clear analogies to explain their craft to those less technical than themselves and I reckon that by itself justifies taking a look :-)

When literal_eval fails: using asterisk notation to read in a datetime object

One great function in Python is the ast  (Abstract Syntax Tree) library’s literal_eval . This lets you read in a string version of a Python datatype:
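
A minimal sketch of literal_eval in action (the dictionary here is invented):

```python
from ast import literal_eval

data = literal_eval("{'name': 'Matthew', 'scores': [1, 2, 3]}")
print(data["scores"])   # [1, 2, 3] -- a real list, not a string
```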

Importing a dictionary such as this is similar to parsing JSON using Python’s  json.loads decoder. But it also comes with the shortcomings of JSON’s restrictive datatypes, as we can see here when the dictionary contains, for example, a datetime object:
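
literal_eval only accepts literals, so the constructor call makes it raise:

```python
from ast import literal_eval

s = "{'created': datetime.datetime(2013, 8, 10, 21, 46, 52, 638649)}"
literal_eval(s)   # raises ValueError -- not a literal
```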

So you might try to write some code to parse the dictionary datatype yourself. This gets very tricky, but eventually you could probably accommodate all common datatypes:
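
A rough sketch of what the hand-rolled approach looks like: strip the braces, split on commas, and guess each piece’s type. It handles ints, floats and strings, and is already getting tricky:

```python
def guess_type(raw):
    raw = raw.strip()
    for caster in (int, float):
        try:
            return caster(raw)
        except ValueError:
            pass
    return raw.strip("'\"")   # fall back to a plain string

def parse_dict(s):
    result = {}
    for pair in s.strip().lstrip("{").rstrip("}").split(","):
        key, _, value = pair.partition(":")
        result[guess_type(key)] = guess_type(value)
    return result

print(parse_dict("{'a': 1, 'b': 2.5, 'c': 'hi'}"))
# {'a': 1, 'b': 2.5, 'c': 'hi'}
```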

But this still doesn’t truly fix our datetime object problem:
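
The commas inside the datetime call get shredded by the naive comma split:

```python
print(parse_dict("{'created': datetime.datetime(2013, 8, 10)}"))
# garbage -- the datetime is split into meaningless fragments
```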

Which is where we get to the crux of this post. I thought at first that I could deal with datetime’s formatting by extracting the class  datetime.datetime(2013, 8, 10, 21, 46, 52, 638649) as a tuple by spotting the brackets, then feeding the tuple back into datetime like: 
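
Passing the tuple in whole hands datetime a single argument instead of seven:

```python
import datetime

args = (2013, 8, 10, 21, 46, 52, 638649)
datetime.datetime(args)   # TypeError: an integer is required
```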

But apparently not. The tuple must be extracted – not by a lambda or perhaps list comprehension, but in fact by using asterisk notation:
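
Unpacking the tuple with * spreads it across the positional parameters:

```python
import datetime

args = (2013, 8, 10, 21, 46, 52, 638649)
print(datetime.datetime(*args))   # 2013-08-10 21:46:52.638649
```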

The asterisk ( * ) unpacks an iterable such as args into positional arguments for the function. Simple!

Using ACORA to process hundreds of stopwords at once

“80% of data analysis is spent cleaning data, 20% of it is spent complaining about cleaning data” – Chang She and BigDataBorat

This is one of the best quotes I heard at PyData 2013. When dealing with huge amounts of data, often only a fraction of it is relevant to one’s analysis, and it can be a total pain trying to clean it. But this is also an essential stage, so let’s make it as painless as possible.

One example is gigantic log files. Say we’re dealing with a multi-terabyte Apache log file as follows:
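
For instance, hypothetical lines in the combined log format (the addresses and paths are invented):

```
10.0.0.1 - - [10/Aug/2013:21:46:52 -0400] "GET /index.html HTTP/1.1" 200 1043
10.0.0.2 - - [10/Aug/2013:21:46:53 -0400] "GET /favicon.ico HTTP/1.1" 404 209
10.0.0.3 - - [10/Aug/2013:21:46:54 -0400] "GET /robots.txt HTTP/1.1" 200 68
```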

This is useful data with thousands of lines, and we’d like to analyze it using the big file processing script I mentioned before. However, there are certain lines that you’re not concerned about – so you can write a simple conditional:
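
A sketch (log_file, process() and the keywords here are all stand-ins):

```python
for line in log_file:
    if "favicon.ico" not in line:
        process(line)
```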

What if you have 2 things that you don’t want in each line?
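
Add another not in test:

```python
for line in log_file:
    if "favicon.ico" not in line and "robots.txt" not in line:
        process(line)
```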

What if you have 3 things that you don’t want in each line?
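
And another:

```python
for line in log_file:
    if ("favicon.ico" not in line
            and "robots.txt" not in line
            and "/admin" not in line):
        process(line)
```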

But this is getting super inefficient and a bit silly. Each extra keyword requires yet another pass through the line, so with this code basically every line is a worst-case scenario.

Bring on ACORA!

ACORA is Stefan Behnel’s library based on the Aho-Corasick string matching algorithm. Without diving too deep into the maths behind it, it basically compiles all the stopwords you have into a single über-stopword, meaning one scan of this stopword over your log-file line will check for all stopwords. For example:
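
A sketch using the library’s AcoraBuilder/findall API (the keywords are the stand-ins from earlier):

```python
from acora import AcoraBuilder

ac = AcoraBuilder("favicon.ico", "robots.txt", "/admin").build()

line = '10.0.0.2 - - "GET /favicon.ico HTTP/1.1" 404 209'
print(ac.findall(line))   # [(keyword, offset)] pairs for every match
```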

But how do we integrate this into the line scanner from before? Just like this!
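
Same stand-in names as before:

```python
stopwords = ("favicon.ico", "robots.txt", "/admin")
ac = AcoraBuilder(*stopwords).build()

for line in log_file:
    if not ac.findall(line):   # no stopword found anywhere in the line
        process(line)
```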

We’ve replaced the entire multiple stopword matching for-loop with a single ACORA matcher.

A note on performance

ACORA is fantastic, but performance may dip if there are only a few stopwords, or only a few lines. It performs best when you have about 20+ stopwords and at least 1,000 or so log-file lines to scan through.