Background to the Problem
I work regularly with gigantic machine learning datasets. One very versatile format, used by WEKA, is ARFF (Attribute-Relation File Format). It is essentially a nicely structured, richer CSV file that can easily be fed into logistic regression, decision trees, SVMs, etc. To deal with very sparse data, there is a sparse ARFF format that lets you rewrite rows that are mostly zeros, such as:
f0 | f1 | f2 | f3 | … | fn |
1 | 0 | 1 | 0 | … | 0 |
Into a more succinct version, where you declare the list of features up front and then simply specify each feature's index and value (if non-zero):
@ATTRIBUTE f0 NUMERIC
@ATTRIBUTE f1 NUMERIC
@ATTRIBUTE f2 NUMERIC
@ATTRIBUTE f3 NUMERIC
…
@ATTRIBUTE fn NUMERIC
@DATA
{0 1, 2 1}
i.e. {feature-index-zero is 1, feature-index-two is 1}, simply omitting all the zero-values.
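To make the conversion concrete, here's a minimal sketch of how one dense row could be rendered as a sparse data line; to_sparse_arff_line is just an illustrative helper, not part of WEKA or any ARFF library:

def to_sparse_arff_line(row):
    #Keep only the non-zero entries, as "index value" pairs
    pairs = ['%d %s' % (i, v) for i, v in enumerate(row) if v != 0]
    return '{' + ', '.join(pairs) + '}'

>>> to_sparse_arff_line([1, 0, 1, 0])
'{0 1, 2 1}'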
The Implementation Problem
This is easy enough if you have, say, 4 features, but what if you have over a million features and need to find the index of each one? Searching for a feature in a list is O(n), and if your training data is huge too, creating the sparse ARFF is going to be hugely inefficient:
features = ['name', 'some_metric', 'an_attribute', 'some_boolean']

#Searching for the existence of a feature is O(n)
>>> 'some_metric' in features
True

#Retrieving the index of a feature in the list is also O(n)
>>> features.index('some_metric')
1
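To see how that cost compounds, here's a toy sketch using the features list above; rows is an invented two-row dataset standing in for real training data, and every index() call rescans the whole feature list:

rows = [['name', 'some_metric'], ['an_attribute', 'some_boolean']]

sparse_rows = []
for row in rows:
    #One O(n) scan per feature, so roughly O(rows * row_width * n) overall
    sparse_rows.append(sorted(features.index(f) for f in row))

>>> sparse_rows
[[0, 1], [2, 3]]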
I thought I could improve this by using an OrderedDict. This is, very simply, a dictionary that maintains the insertion order of its items, so you can popitem() entries from the end in a stack-like manner. However, after some research on StackOverflow, it turns out that, disappointingly, it doesn't provide any efficient way to calculate the index of a key:
from collections import OrderedDict

features = ['name', 'some_metric', 'an_attribute', 'some_boolean']
od = OrderedDict()
for f in features:
    od[f] = 0

#Searching for the existence of a feature is O(1)
>>> 'some_metric' in od
True

#Retrieving the index of a key is still O(n) though!
>>> od.keys().index('some_metric')
1

#keys() has to create an entire list of all the keys in memory to
#retrieve the index. You could use iterkeys() to improve memory
#usage, but it's still pretty ridiculous.
The Solution
What can we do about this? Enter my favorite thing ever, defaultdicts with lambdas:
from collections import defaultdict

features = ['name', 'some_metric', 'an_attribute', 'some_boolean']
dd = defaultdict(lambda: len(dd))
for f in features:
    dd[f]  #no need to give it a value, the key just needs to be accessed

#So now we can do lookups in O(1)
>>> 'some_metric' in dd
True

#And also get its index in O(1)
>>> dd['some_metric']
1
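Tying this back to ARFF generation, here's a quick sketch of how the same dd could drive the sparse @DATA lines; the rows list of (feature, value) pairs is an invented toy dataset:

rows = [
    [('name', 1), ('an_attribute', 1)],
    [('some_metric', 1)],
]

for row in rows:
    #Unseen features get the next index automatically; known ones are O(1) lookups
    pairs = sorted((dd[f], v) for f, v in row)
    print('{' + ', '.join('%d %s' % p for p in pairs) + '}')

#prints:
#{0 1, 2 1}
#{1 1}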
Assigning values to items, in addition to the index, is fairly straightforward with a slightly modified lambda:
dd = defaultdict(lambda: {'index': len(dd)})

#Then more information can seamlessly be added:
dd['some_metric']['info'] = 1
dd['some_attribute']['info'] = 2

#Whilst maintaining the O(1) lookup of the auto-generated index:
>>> dd['some_attribute']['index']
1
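One detail worth spelling out, sketched below with a throwaway dd2: the lambda evaluates len() before the new key is inserted, so keys are numbered 0, 1, 2, … in order of first access, and repeated access never changes an index. The flip side is that merely indexing an unknown key creates it, so pure membership tests should use in rather than indexing:

dd2 = defaultdict(lambda: {'index': len(dd2)})

>>> dd2['first']['index']
0
>>> dd2['second']['index']
1
>>> dd2['first']['index']   #repeated access keeps the original index
0
>>> 'typo_key' in dd2       #checking with 'in' does not insert anything
False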
Limitations
This is a fun fix, but it doesn't support full dictionary functionality: deleting items won't re-number the indices, and you can't easily iterate through the keys in index order. However, since creating this ARFF file requires neither deletions nor ordered iteration, that isn't a problem.
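For completeness, if you ever did need the features back in index order (say, to write out the @ATTRIBUTE header lines), one sketch, using the 'index' entries stored in the second version of dd above, is simply to sort on them:

for name, meta in sorted(dd.items(), key=lambda item: item[1]['index']):
    print('@ATTRIBUTE %s NUMERIC' % name)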