MongoDB aspirin

This is a work-in-progress list of common errors, tasks and gripes I hit when using Mongo, with notes on what usually works. Maybe you’ll find it useful :-)

Why has my remote mongodb connection been refused?

    • Delete the mongod.lock file from your main mongodb storage folder and restart [SO]
    • If on Red Hat, check sestatus (SELinux may be blocking the port)
    • Modify /etc/sysconfig/iptables to add the correct firewall rules according to the mongodb docs.

How can you iterate through all MongoDB collections in pymongo?

I normally access collections as object attributes, like conn.Database.Collection.find_one(), but databases and collections can be accessed as keys in a dictionary as well:

Why is mongod terminating whenever I close the shell, even when I put & at the end?

When starting mongod, use mongod --fork (note: --fork must come right after mongod) and it will start as a background process instead. Or just add fork = true to your config.
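For reference, the config-file equivalent looks something like this (paths are illustrative; note that forking also requires a log path):

```
# mongod.conf (old flat-style syntax)
fork = true
logpath = /var/log/mongodb/mongod.log
```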

I just created an authenticated database and can’t even use show dbs!

Create a new user with all 4 of the following permissions: userAdminAnyDatabase, readWriteAnyDatabase, dbAdminAnyDatabase, clusterAdmin:
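From memory, the shell incantation is something like this (run against the admin database; the user name and password are placeholders, and older shells used db.addUser instead of db.createUser):

```
use admin
db.createUser({
    user: "superuser",
    pwd: "changeme",
    roles: ["userAdminAnyDatabase",
            "readWriteAnyDatabase",
            "dbAdminAnyDatabase",
            "clusterAdmin"]
})
```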

Shared Variables in Python Multiprocessing to pre-map/reduce

I’ve been using the multiprocessing library in Python quite a bit recently and started using the shared variable functionality. It can change something like this from my previous post:

Into a much nicer:

Thus eliminating the reduce stage. This is especially useful if you have a shared dictionary which you’re updating from multiple processes. There’s another shared datatype called Array, which, as the name suggests, is a shared array. Note: one pitfall (that I fell for) is thinking that the "i" in Value("i", 0) is the name of the variable. It’s actually a typecode that stands for “integer”.

There are other ways to do this, however, each of which has its own trade offs:

1. Shared file. Advantages: easy to implement and to access afterwards. Disadvantages: very slow.
2. Shared mongoDB document. Advantages: easy to implement. Disadvantages: slow to constantly query for it.
3. Multiprocessing Value/Array (this example). Advantages: very fast, easy to implement. Disadvantages: on 1 PC only; can’t be accessed after the process is killed.
4. Memcached shared value. Advantages: the networked aspect is useful for big distributed databases, and a shared set() function is already available. Disadvantages: TCP could slow you down a bit.

A quick solution to OrderedDict’s limitations in Python with O(1) index lookups

Background to the Problem

I work regularly with gigantic machine learning datasets. One very versatile format, for use in WEKA, is “ARFF” (Attribute-Relation File Format). This is essentially a nicely structured, rich CSV file which can easily be used for Logistic Regression, Decision Trees, SVMs etc. To solve the problem of very sparse CSV data, there is a sparse ARFF format that lets users convert sparse lines such as:

f0 f1 f2 f3 fn
1 0 1 0 0

Into a more succinct version where you have a list of features and simply specify each feature’s index and value (if any):

@ATTRIBUTE f0 NUMERIC
@ATTRIBUTE f1 NUMERIC
@ATTRIBUTE f2 NUMERIC
@ATTRIBUTE f3 NUMERIC

@ATTRIBUTE fn NUMERIC
@DATA
{0 1, 2 1}

i.e. {feature-index-zero is 1, feature-index-two is 1}, simply omitting all the zero-values.

The Implementation Problem

This is easy enough if you have, say, 4 features, but what if you have over 1 million features and need to find the index of each one? Searching for a feature in a list is O(n), and if your training data is huge too, then creating the sparse ARFF is going to be hugely inefficient:
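The original snippet is gone, but the naive approach looks roughly like this (my reconstruction):

```python
features = []                    # ordered list of feature names

def feature_index(name):
    if name not in features:     # O(n) scan
        features.append(name)
    return features.index(name)  # another O(n) scan

print(feature_index("f0"), feature_index("f2"), feature_index("f0")) 
```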

I thought I could improve this by using an OrderedDict. This is, very simply, a dictionary that maintains the order of its items, so you can pop() items from the end in a stack-like manner. However, after some research on StackOverflow, it turns out this disappointingly doesn’t offer any efficient way to calculate the index of a key:
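To see the problem: to find the position of a key in an OrderedDict, the best you can do is still a linear scan:

```python
from collections import OrderedDict

d = OrderedDict((f, None) for f in ["f0", "f1", "f2"])
print(list(d).index("f2"))   # 2, but still O(n), no better than the list
```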

The solution

What can we do about this? Enter my favorite thing ever, defaultdicts with lambdas:
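A sketch of the trick (the exact snippet from the post didn’t survive, but this is the standard form of it):

```python
from collections import defaultdict

# every unseen key is assigned the next free index, in O(1)
index = defaultdict(lambda: len(index))

index["f0"]          # 0
index["f2"]          # 1
index["f0"]          # still 0
print(dict(index))   # {'f0': 0, 'f2': 1}
```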

Assigning items values in addition to the index is fairly straightforward with a slightly modified lambda:
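One way to carry a value alongside the auto-assigned index (my variant; the original lambda may have differed):

```python
from collections import defaultdict

# each new key maps to [index, value]; the value starts as None
table = defaultdict(lambda: [len(table), None])

table["f0"][1] = 1.0      # set a value for f0
table["f2"][1] = 0.5
print(dict(table))        # {'f0': [0, 1.0], 'f2': [1, 0.5]}
```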

Limitations

This is a fun fix, but it doesn’t support full dictionary functionality: deleting items won’t reorder the index, and you can’t easily iterate through it in order. However, since creating this ARFF file requires neither deletions nor ordered iteration, that’s not a problem here.

WNYC Radio: “Are Hackathons Worth It?”

I was recently contacted by Jeff Coltin, a journalist at WNYC Radio, who asked me to participate in a show about hackathons in NYC.

He featured a snippet from our conversation, specifically about problems that the hacker community could solve. I said (vaguely accurate transcription):

“…There are so many problems that hackathons could fix. I think some big issues at the moment in the media, things like the NSA spying scandals and stuff like that. I think one thing the tech community has slightly failed to do is to make encryption really easy. There’s a sort-of inverse relationship between simplicity and security, so the more secure an app, often the more inconvenient it is to use. So we have things like TOR, extra-long passwords (TOR slows down your connection a lot), VPNs and a lot of very secure services are incompatible with mainstream services. So this level of security and privacy that users want or need is just so inconvenient to achieve its really up to the hacker community to make them much easier to use…”

There have been efforts such as Cryptocat, but its adoption rate still needs to grow. HTTPS is probably the best example of seamless encryption, but it often fails when people either ignore, or are at a loss as to what to do with, HTTPS certificates that the browser flags as invalid.

Cryptography is an incredibly tough field of Computer Science, so creating reliably secure apps is hard. Educating oneself about it can require a fairly superhuman effort, and I have a lot of respect for people who contribute modules in this field to PyPI. I’m hoping to start the Crypto course on Coursera once I have some more free time, but beating the security-simplicity inverse relationship I mentioned is certainly easier said than done.

Teaching Python at Harvard with Software Carpentry


Mike teaching Hamlet in Python. Photo copyright Chris Erdmann: https://twitter.com/libcce/status/371281901191196672

I’m part of an organization called Software Carpentry in NYC, which uses volunteers to teach programming at varying levels to universities, large governmental organizations and other interested groups. I previously taught at Columbia, and this past weekend’s workshop was held at Harvard, organized by Chris Erdmann, the head librarian at the Harvard-Smithsonian Center for Astrophysics.

Before Software Carpentry, my teaching experience was limited to explaining aspects of programming to friends and family, as well as part of a year spent teaching English and French to children and adults in Japan. Teaching is hard. It’s very easy to be critical of a teacher – I’ve often found myself being so without thinking about the effort and stress behind conveying a complex concept to a group of students all with varying backgrounds and motivations. I’ve come up with a few conclusions about how to optimize teaching style from my last 2 SWC events:

Saturday’s Teacher line-up

Things that worked well

  • Humor. Mike sprinkled his tutorial with funny anecdotes which kept the class very lively.
  • Relevant and interesting subject matter. Hamlet was a good choice, as was the theme of cheating at scrabble due to the librarian-oriented audience. The dictionary brought up several amusing entries for searches like:  grep ".*s.*s.*s.*s.*s.*s" words | less
  • Adding anecdotes to save people googling things. I reckon a large part of any programmer’s activity is simply finding someone who’s done what you want to do before and slightly modifying things, or connecting up the building blocks. So after talking about the benefits of things like append() vs concatenating with plus signs like first+second, I mentioned things like deque() and format().

Things to remember for next time

  • Typing SLOWLY. I work a lot with MongoDB, so I end up typing from pymongo import Connection; c = Connection() 20+ times a day into the terminal. I’ve become so fast at it that things like this can seem bewildering to newcomers.
  • Using a high contrast terminal with large font and dimmed lights, to make it super easy to see from the back of the room.

What can advanced programmers get out of teaching such basic things?

  • You’ll learn a lot from the instructors and the students’ questions
  • Community involvement is a great asset on your resume and shows potential employers that you have the ability/drive to train future co-workers
  • It helps to have on-hand analogies and anecdotes developed during teaching when explaining technical matters to non-technical people, socially or business-wise.
  • You’ll meet many like minded people and it feels great to get involved in the community.

What did I learn?

  • The requests library. I normally use urllib2 to grab HTML from web pages. urllib2, it turns out, is a lower-level, more extensible library, while requests makes common HTTP tasks much simpler, as shown in this stackoverflow explanation.
  • More about Git. I use SVN at work and thus don’t really submit anything to github. Git is HARD. Erik was an excellent instructor and calmly went from the basics right through to the minutiae of things like .gitignore and diff.
  • What “immutable” really means. I hear this thrown around quite a lot, and it simply means that an object can’t be changed in place. E.g. myString.split() doesn’t modify myString; it returns a new list, because a string can never change once created. Very simple.
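In other words, calling a method on a string never changes the string itself:

```python
s = "to be or not to be"
words = s.split()            # returns a new list; s itself is untouched
assert s == "to be or not to be"
print(words[:3])             # ['to', 'be', 'or']
```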

Review of Data Science for Business (O’Reilly, 2013)

Book cover

I’m currently participating in the O’Reilly Blogger Review Program – where bloggers are given ebooks of recent publications. 

Data Science for Business fits an interesting gap in the market – managers who want to be able to understand what Data Science is, how to recruit Data Scientists or how to manage a data-oriented team. It says it is also for aspiring Data Scientists, but I would probably recommend Andrew Ng’s Machine Learning course and Codecademy’s intro Python course instead if you’re serious about getting your teeth into the field.

Somewhere between an introduction and an encyclopedia, it gives fairly comprehensive overviews of each sub-field, including distinctions that I hadn’t previously thought of so clearly. The authors are mostly unafraid to explain the maths behind the subjects: it dips into some probability and linear algebra, admittedly with simplified notation. There’s no real mention of implementation (i.e. programming the examples) as one would usually expect with O’Reilly, but most competent readers will at least know what they’re “looking for”, perhaps in terms of packages to install, or if they want to try to implement a system from scratch. It is certainly aimed at the intelligent professional and is far from popular science.

Whilst it is very thorough and interesting, it could touch a nerve among Data Scientists: should the manager of a Data Scientist really have to read a book such as this? Surely someone in such a position of authority should know these techniques already (an extreme example is one footnote that explains what Facebook is and what it is used for). Such unbalanced hierarchies are often the cause of unnecessary stress and complication in the workplace; but they are also common, so perhaps the book will be useful in exactly that context.

I think, overall, I was hoping for a slightly different book – with more in-depth case studies of how to implement existing Data Science knowledge into Business scenarios. Nevertheless, it’s an interesting, intelligent guide in an encyclopedic sense and fairly unique in its clarity of explanation and accessibility – I highly doubt I could write a better guide in that respect. Existing Data Scientists will find many clear analogies to explain their craft to those less technical than themselves and I reckon that by itself justifies taking a look :-)

When literal_eval fails: using asterisk notation to read in a datetime object

One great function in python is the ast  (Abstract Syntax Tree) library’s literal_eval . This lets you read in a string version of a python datatype:
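A minimal example (the original is lost; the values here are illustrative):

```python
from ast import literal_eval

# parse a string that looks like a Python dict back into a real dict
record = literal_eval("{'name': 'sam', 'scores': [1, 2, 3]}")
print(record["scores"])   # [1, 2, 3]
```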

Importing a dictionary such as this is similar to parsing JSON using Python’s json.loads decoder. But it also comes with the shortcomings of JSON’s restrictive datatypes, as we can see here when the dictionary contains, for example, a datetime object:
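For instance (a reconstruction of the failure):

```python
from ast import literal_eval

s = "{'created': datetime.datetime(2013, 8, 10, 21, 46, 52, 638649)}"
try:
    literal_eval(s)
except ValueError as err:     # literal_eval refuses anything but literals
    print("failed:", err)
```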

So you might try to write some code to parse the dictionary data-type yourself. This gets very tricky, but eventually you could probably accommodate all the common data-types:

But this still doesn’t truly fix our datetime object problem:

Which is where we get to the crux of this post. I thought at first that I could deal with datetime’s formatting by extracting the class  datetime.datetime(2013, 8, 10, 21, 46, 52, 638649) as a tuple by spotting the brackets, then feeding the tuple back into datetime like: 
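i.e. something along these lines, which doesn’t work:

```python
import datetime

t = (2013, 8, 10, 21, 46, 52, 638649)
try:
    datetime.datetime(t)      # passing the whole tuple as one argument
except TypeError as err:
    print("nope:", err)
```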

But apparently not. The tuple must be extracted – not by a lambda or perhaps list comprehension, but in fact by using asterisk notation:
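The working version:

```python
import datetime

t = (2013, 8, 10, 21, 46, 52, 638649)
dt = datetime.datetime(*t)    # * unpacks the tuple into positional arguments
print(dt.isoformat())         # 2013-08-10T21:46:52.638649
```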

Asterisk ( * ) unpacks an iterable such as x into positional arguments for the function. Simple!

Using ACORA to process hundreds of stopwords at once

“80% of data analysis is spent cleaning data, 20% of it is spent complaining about cleaning data” – Chang She and BigDataBorat

This is one of the best quotes I heard at PyData 2013. When dealing with huge amounts of data, often only a fraction of it is usually relevant to one’s analysis and it can be a total pain trying to clean it. But this is also an essential stage, so let’s make it as painless as possible.

One example is gigantic log files. Say we’re dealing with a multi-terabyte Apache log file as follows:

This is useful data with thousands of lines, and we’d like to analyze it using the big file processing script I mentioned before. However, there are certain lines that you’re not concerned about – so you can write a simple conditional:
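Something like this (the file handling is elided and the log lines and stopword are placeholders of my own):

```python
def keep_lines(lines, stopword="favicon.ico"):
    # drop any line containing the one string we don't care about
    return [line for line in lines if stopword not in line]

log = ['GET /index.html 200', 'GET /favicon.ico 404']
print(keep_lines(log))   # ['GET /index.html 200']
```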

What if you have 2 things that you don’t want in each line?

What if you have 3 things that you don’t want in each line?
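The naive extension just chains the checks, with one scan of the line per stopword (placeholder data again):

```python
stopwords = ["favicon.ico", "robots.txt", "heartbeat"]

def keep_lines(lines):
    kept = []
    for line in lines:
        # each `in` test is a separate pass over the line
        if all(sw not in line for sw in stopwords):
            kept.append(line)
    return kept

log = ['GET /index.html 200', 'GET /robots.txt 200']
print(keep_lines(log))   # ['GET /index.html 200']
```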

But this is getting super inefficient and a bit silly. Each extra keyword requires yet another pass through the line, so with this code basically every line is a worst-case scenario.

Bring on ACORA!

ACORA is Stefan Behnel’s library based on the Aho-Corasick string matching algorithm. Without diving too deep into the maths behind it, it basically compiles all the stopwords you have into a single über-stopword, meaning one scan of this stopword over your log-file line will check for all stopwords. For example:

But how do we integrate this into the line scanner from before? Just like this!

We’ve replaced the entire multiple stopword matching for-loop with a single ACORA matcher.

A note on performance

ACORA is fantastic, but performance may dip if there are only a few stopwords, or only a few lines. It has best performance when you have about 20+ stopwords and at least 1000 or so log file lines to scan through.

Extracting TLDs (top-level domains) – and weird quirks that will upset you

I’ve been using John Kurkowski‘s excellent Python domain extraction library “tldextract” recently. TLDextract can extract the domain name from a URL very easily, for example:

Why is this useful?

This has many applications – for example, if you want to create a summary of the top domains linking to your site, you might have a very large list of referring URLs:

And you could write some simple code to output the domain:

And use the word frequency calculator from my previous post to compile a list of the top referring domains! See that I’ve modified line 10 to instead add the domain as the key:

Which returns:

Why can’t you just take everything before the third slash and split it on fullstops?

This is what I tried to do at the start:
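A reconstruction of that first attempt, showing why it fails:

```python
url = "http://www.bbc.co.uk/news/uk-23693910"
host = url.split("/")[2]                    # 'www.bbc.co.uk'
naive = ".".join(host.split(".")[-2:])      # 'co.uk', which is not the domain at all!
print(naive)
```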

But since the domain name system is a miasma of top-level domains (e.g. .com), second-level domains (e.g. .gov.uk), standard subdomains (e.g. i.imgur.com) and people with too many fullstops (e.g. www.dnr.state.oh.us), this becomes much trickier and it’s impossible to accommodate everything. So TLDextract actually maintains a local copy of Mozilla’s list of ICANN domains on your system, downloaded from:

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

And basically finds matches on the ends of URLs from that. Very nice!

So what’s the problem mentioned in the title?

Unfortunately, the caveat of using Mozilla’s list is that you get some seemingly odd behavior. A bunch of sites and companies have requested that their subdomains be treated as TLDs, and they are included in the list, from Amazon:

To DynDNS stuff:

And more… So you’ll trip up if you put in something like:

Rather than the expected “.com” as the TLD.

Succinct way to build a frequency table of a Python iterable

This is an interesting and often tackled task in programming, and especially prevalent in NLP and Data Science. Often one has a list of “things” with many repeats and wants to know which ones are the most popular.

Data Examples

or:

Which is the most popular number? Which is the most popular letter?

The Code
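The original snippet wasn’t preserved, but collections.Counter is one succinct way to do it. The list below is a stand-in in which 5 appears nine times, matching the result described next:

```python
from collections import Counter

numbers = [5] * 9 + [1, 1, 2, 3, 3, 4, 6, 7]   # stand-in data
freq = Counter(numbers)
print(freq.most_common(3))
```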

For our data, we now get:

So 5 is the most popular item, with 9 appearances.

So the space is the most popular, followed by a, e and t.