
Setting up a development environment for Firefox Extensions.

This is the method I use to create simple Firefox extensions. This tutorial is a precursor to the next one, which covers Web Workers (i.e. letting code run on background threads).

Setting up the environment

We’re going to need the Firefox Add-on SDK. This is a collection of Python files that will let you run a test (optionally blank) version of Firefox. To download it:

Now extract it and remove the tarball:

Go to the directory and start up the special shell:

You can now see that the shell has prepended (addon-sdk-1.17) in brackets to the prompt. The window is probably half filled with prompt text by now, so we can shorten it with the command:

Much cleaner! 🙂

Setting up the extension template

Now that we have this special addon-sdk shell, navigate back to your documents and create a new folder for our extension.

This special shell includes various useful commands, which all look like cfx xyz. For more about them, see the cfx documentation. In this case we use cfx init.

Let’s inspect what was created:

  • lib contains a file called main.js, which is the main handler file for all extension code
  • data is empty but can be used to store things like workers (which we will come to later) or large data files
  • test can contain unit tests (quite hard to set up but useful for test-driven development later)
  • package.json contains metadata about the extension: version number, name of the creator, description, licensing etc.

You can start writing code in main.js and it will run in the browser. Once finished, use cfx run to test it!

See the next tutorial on how to write a Firefox extension using Web Workers!

A quick solution to OrderedDict’s limitations in Python with O(1) index lookups

Background to the Problem

I work regularly with gigantic machine learning datasets. One very versatile format, used by WEKA, is “ARFF” (Attribute-Relation File Format). It is essentially a nicely structured, rich CSV file that can easily be used in Logistic Regression, Decision Trees, SVMs etc. To handle very sparse CSV data, there is a sparse ARFF format that lets users convert sparse rows such as:

f0 f1 f2 f3 fn
1 0 1 0 0

Into a more succinct version, where you declare the list of features and then specify only the index and value of each non-zero feature:

@ATTRIBUTE f0 NUMERIC
@ATTRIBUTE f1 NUMERIC
@ATTRIBUTE f2 NUMERIC
@ATTRIBUTE f3 NUMERIC

@ATTRIBUTE fn NUMERIC
@DATA
{0 1, 2 1}

i.e. {feature-index-zero is 1, feature-index-two is 1}, simply omitting all the zero-values.
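The conversion described above can be sketched in a few lines of Python. The function name `to_sparse_arff_row` is made up for illustration and is not part of WEKA or any library; it only formats the data line, not the @ATTRIBUTE header.

```python
def to_sparse_arff_row(values):
    """Format one dense row as a sparse ARFF data line, omitting zeros."""
    # Keep only "index value" pairs whose value is non-zero.
    pairs = [f"{index} {value}" for index, value in enumerate(values) if value != 0]
    return "{" + ", ".join(pairs) + "}"


dense_row = [1, 0, 1, 0, 0]  # the example row: f0 f1 f2 f3 fn
sparse_row = to_sparse_arff_row(dense_row)
print(sparse_row)  # {0 1, 2 1}
```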

The Implementation Problem

This is easy enough if you have, say, 4 features, but what if you have over 1 million features and need to find the index of each one? Searching for a feature in a list is O(n), and if your training data is huge too, then creating the sparse ARFF is going to be hugely inefficient.
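A minimal sketch of that slow approach, assuming the feature names live in a plain list. The names `feature_names` and `sparse_row_with_list` are hypothetical, and every present feature is given the value 1 for simplicity:

```python
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical; imagine a million entries


def sparse_row_with_list(present_features, names):
    """Build a sparse ARFF line by looking each feature up in a list."""
    # list.index() scans from the start of the list, so every lookup is O(n).
    indices = sorted(names.index(name) for name in present_features)
    return "{" + ", ".join(f"{i} 1" for i in indices) + "}"


naive_row = sparse_row_with_list(["f2", "f0"], feature_names)
print(naive_row)  # {0 1, 2 1}
```

With n features and m non-zero values across the training data, this does on the order of n × m comparisons, which is the cost the rest of this post tries to avoid.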

I thought I could improve this by using an OrderedDict. This is, very simply, a dictionary that maintains the order of its items, so you can popitem() entries from the end in a stack-like manner. However, after some research on StackOverflow, I found that it disappointingly doesn’t offer any efficient way to get the index of a key.
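To illustrate the gap: OrderedDict has no index-of-key method, so a common workaround is to copy the keys into a list and scan it, which is O(n) again. This is a sketch of that workaround, not necessarily the snippet originally shown here.

```python
from collections import OrderedDict

features = OrderedDict((name, None) for name in ["f0", "f1", "f2", "f3"])

# No features.index("f2") exists; copying the keys and scanning them is O(n).
position = list(features).index("f2")
print(position)  # 2

# Stack-like removal from the end does work, via popitem().
last_item = features.popitem()
print(last_item)  # ('f3', None)
```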

The solution

What can we do about this? Enter my favorite thing ever: defaultdicts with lambdas.
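A minimal sketch of the trick, assuming the idea is to let the default factory hand out the next free index. The factory is called before the missing key is inserted, so `len()` at that moment is exactly the new key's position. The helper name `make_feature_index` is only for illustration.

```python
from collections import defaultdict


def make_feature_index():
    """Return a dict that assigns each new key the next integer index."""
    index = defaultdict(lambda: len(index))
    return index


feature_index = make_feature_index()
for name in ["f0", "f1", "f2", "f3"]:
    feature_index[name]  # merely accessing a new name registers it

print(feature_index["f2"])  # 2, found by an average O(1) hash lookup
```

One thing to watch: any lookup of an unknown name with square brackets silently adds it. Use `name in feature_index` or `feature_index.get(name)` to check for a feature without registering it.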

Assigning each item a value in addition to its index is fairly straightforward with a slightly modified lambda.

Limitations
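One possible shape for such a lambda, as an assumption rather than the original snippet: store a small `[index, value]` list per feature, so the index is still assigned automatically and the value can be filled in afterwards. `make_feature_table` and `default_value` are hypothetical names.

```python
from collections import defaultdict


def make_feature_table(default_value=0):
    """Return a dict mapping each new key to [next_index, default_value]."""
    table = defaultdict(lambda: [len(table), default_value])
    return table


table = make_feature_table()
table["f0"][1] = 1  # registers f0 at index 0, then sets its value
table["f1"]         # registers f1 at index 1 with the default value 0
table["f2"][1] = 1  # registers f2 at index 2, then sets its value

# Emit the non-zero entries as a sparse ARFF data line.
row = "{" + ", ".join(f"{i} {v}" for i, v in sorted(table.values()) if v != 0) + "}"
print(row)  # {0 1, 2 1}
```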

This is a fun fix, but it doesn’t support full dictionary functionality: deleting items won’t renumber the remaining indexes, and you can’t easily iterate through it in order. However, since creating this ARFF file requires neither deletions nor ordered iteration, that’s not a problem.
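Both limitations can be seen in a short sketch, reusing the same hypothetical `make_feature_index` helper. Ordered iteration is possible by sorting on the stored index, and on Python 3.7+ plain iteration already follows insertion order. Deletion is the real trap: the next new key reuses an index that is still taken.

```python
from collections import defaultdict


def make_feature_index():
    """Return a dict that assigns each new key the next integer index."""
    index = defaultdict(lambda: len(index))
    return index


index = make_feature_index()
for name in ["f0", "f1", "f2"]:
    index[name]

# Ordered iteration: sort the names by their stored index.
ordered_names = sorted(index, key=index.get)

# Deleting leaves a gap and shrinks len(), so the next new key collides.
del index["f1"]
collision = index["f3"] == index["f2"]  # both are 2 now
```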

WNYC Radio: “Are Hackathons Worth It?”

I was recently contacted by Jeff Coltin, a journalist at WNYC Radio, who asked me to participate in a show about hackathons in NYC.

He featured a snippet from our conversation, specifically about problems that the hacker community could solve. I said (vaguely accurate transcription):

“…There are so many problems that hackathons could fix. I think some big issues at the moment in the media, things like the NSA spying scandals and stuff like that. I think one thing the tech community has slightly failed to do is to make encryption really easy. There’s a sort-of inverse relationship between simplicity and security, so the more secure an app, often the more inconvenient it is to use. So we have things like TOR, extra-long passwords (TOR slows down your connection a lot), VPNs and a lot of very secure services are incompatible with mainstream services. So this level of security and privacy that users want or need is just so inconvenient to achieve, it’s really up to the hacker community to make them much easier to use…”

There have been efforts such as Cryptocat, but its adoption rate still needs to grow. HTTPS would probably be the best example of seamless encryption, but this often fails when people either ignore the warning or are at a loss as to what to do when HTTPS certificates are flagged as invalid by the browser.

Cryptography is an incredibly tough field of Computer Science, so creating reliably secure apps is hard. Educating oneself about this can require a fairly super-human effort and I have a lot of respect for people who contribute modules in this field to PyPI. I’m hoping to start the Crypto course on Coursera once I have some more free time, but beating the security-simplicity inverse relationship I mentioned is certainly easier said than done.

Hackathon Report – Greener Neighbor – Neighborhood Green-ness rankings in NYC

  

Background

In July 2012 I attended a hackathon called “Reinvent Green”, organized by NYC.gov at NYU Poly, and won the Judge’s Pick. There I met and teamed up with Anton Granik, a great graphic designer and digital director.

ReinventGreen Hackathon Contestants Presenting GreenerNeighbor

Idea – “Green, Greener, Greenest”!

Our idea was to create a site that could promote competition between boroughs, zip codes or city blocks to be the greenest. I imported all the data we needed from NYCGov (electricity consumption, trees planted etc.) into a MySQL database and indexed it by zip code. Then Anton designed a front end, and I built some Django functionality to let users explore an individual zip code, or see a heat map and ranking of the top zip codes. It took about 15 hours to build.

What could make this idea turn into a real service?

Fresh, reliable data is the biggest problem for this service. We were using NYC OpenData’s datasets, such as “Electric Consumption by Zip Code 2010”. This data contains numerous duplicates and, at the time, was 2 years old. For this to work, we would need regularly refreshed (perhaps monthly) data. The site is currently offline to conserve processing power on my server, but if you want to take a look, tweet to me and I’ll re-enable it 🙂

Press

The hackathon was featured on Channel 25, NYC.gov, NYC Digital, TheNextWeb and in the Huffington Post.

Meeting Mayor Bloomberg