I’ve been using the multiprocessing library in Python quite a bit recently and started using the shared variable functionality. It can change something like this from my previous post:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
from multiprocessing import pool from pymongo import Connection def how_many(server_number): """Returns how many documents in the collection""" c = Connection("192.168.0." + server_number) #connects to remote DB return c.MyDB.MyCollection.count() pool = Pool(processes=4) servers = [1,2,3,4] result = pool.map(how_many,servers) #map stage pool.close() #reduce stage result = sum(result) print "You have {0} docs across all MongoDB servers!".format(result) |
Into a much nicer:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
from multiprocessing import Value, pool from pymongo import Connection def how_many(server_number): """Returns how many documents in the collection""" c = Connection("192.168.0." + server_number) #connects to remote DB shared_count.value += c.MyDB.MyCollection.count() #setup pool = Pool(processes=4) servers = [1,2,3,4] shared_count = Value("i", 0) result = pool.map(how_many,servers) #map stage pool.close() print "You have {0} docs across all MongoDB servers!".format(shared_count) |
Thus eliminating the reduce stage. This is especially useful if you have a shared dictionary which you’re updating from multiple servers. There’s another possible shared datatype called Array, which, as it suggests, is a shared array. Note: One pitfall (that I fell for) is thinking that the "i" in Value("i", 0) is the name of the variable. Actually, its a typecode which stands for “integer”.
There are other ways to do this, however, each of which has its own trade offs:
# | Solution | Advantages | Disadvantages |
1 | Shared file | Easy to implement and access after | Very slow |
2 | Shared mongoDB document | Easy to implement | Slow to constantly query for it |
3 | Multiprocessing Value/Array (this example) | Very fast, easy to implement | On 1 PC only, can’t be accessed after process is killed |
4 | Memcached Shared Value | Networked aspect is useful for big distributed databases, shared.set() function is already available | TCP could slow you down a bit |