# Python: Introducing ppipe: Parallel Pipe
> /!\ this was highly experimental code written in 2011.
> Today you should _NOT_ use it, just look at it if the subject amuses you.
> If you want this kind of tool, take a look at `asyncio` instead.
I'll talk about my pipe Python module, so if you don't know it yet, you
should first read [the first article about
pipes](http://dev-tricks.net/pipe-infix-syntax-for-python).

The idea behind `ppipe` (parallel pipe) is to transparently make a
pipe asynchronous, multithreaded or not. Multithreading isn't an easy
piece of code for everybody: it brings loads of pollution like locks
and queues, taking your code far away from the simple task you were
actually trying to do. But one kind of multithreading can be nicely
handled with a simple design pattern, well implemented in Python: the
queue. The queue handles all the locking so you don't have to worry
about it; just make your workers work, enqueue, dequeue, work... but
you still have to create the workers!

A pipe is not far from the concept of a queue: every part of a pipe
command works on a piece of data and then gives it to the next
worker. So it's not hard to imagine an asynchronous pipe in which
every part of a pipe command works at the same time, and then that n
threads could be started for a single step of a pipe command, leading
to a completely multithreaded application with ~0 lines of code bloat
for thread creation / synchronization in your actual code.

So I tried to implement it, keeping the actual contract, which is
very simple: every pipe takes an iterable as input. (I was tempted to
change it to 'every pipe must take a Queue as input'... but by keeping
the contract unchanged, normal pipes and parallel pipes can be
mixed.) I created a branch you'll find on
[github](https://github.com/julienpalard/pipe/tree/parallel_pipe) with
a single new file, `ppipe.py`, which is not actually 'importable';
it's only a proof of concept that can be launched. Here is the test I
wrote using ppipe:
:::python
print "Normal execution :"
xrange(4) | where(fat_big_condition1) \
          | where(fat_big_condition2) \
          | add | lineout
print "Parallel with 1 worker"
xrange(4) | parallel_where(fat_big_condition1) \
          | where(fat_big_condition2) \
          | add | lineout
print "Parallel with 2 workers"
xrange(4) | parallel_where(fat_big_condition1, qte_of_workers=2) \
          | parallel_where(fat_big_condition2, qte_of_workers=2) \
          | add | stdout
print "Parallel with 4 workers"
xrange(4) | parallel_where(fat_big_condition1, qte_of_workers=4) \
          | parallel_where(fat_big_condition2, qte_of_workers=4) \
          | add | stdout
The idea is to compare a normal pipe ("Normal execution") with an
asynchronous pipe ("Parallel with 1 worker"), since 1 worker is the
default, and then with 2 and 4 workers, which can be given to a ppipe
using `qte_of_workers=`. `fat_big_condition1` and `fat_big_condition2`
are just very long-running pieces of code, like fetching something far
far away on the internet... but for our tests, let's use `time.sleep`:
:::python
def fat_big_condition1(x):
    log(1, "Working...")
    time.sleep(2)
    log(1, "Done !")
    return 1

def fat_big_condition2(x):
    log(2, "Working...")
    time.sleep(2)
    log(2, "Done !")
    return 1
They always return 1... and they log using a simple log function that
makes `fat_big_condition1` log in the left column and
`fat_big_condition2` log in the right column:
:::python
stdoutlock = Lock()

def log(column, text):
    stdoutlock.acquire()
    print ' ' * column * 10,
    print str(datetime.now().time().strftime("%S")),
    print text
    stdoutlock.release()
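In today's Python 3 that helper would read roughly as follows (my own translation of the snippet above; I build the line first and return it, which the original doesn't do, just to make it easy to check):

```python
from datetime import datetime
from threading import Lock

stdoutlock = Lock()

def log(column, text):
    # Build the whole line first, then print under the lock so output
    # from the two conditions never interleaves mid-line.
    line = ' ' * (column * 10) + datetime.now().strftime("%S") + ' ' + text
    with stdoutlock:
        print(line)
    return line
```

The `with stdoutlock:` block replaces the explicit `acquire()`/`release()` pair and releases the lock even if the print raises.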
And here is the output (the integers are the current second, so the
times don't start at 0):
Normal execution :
57 Working...
59 Done !
59 Working...
01 Done !
01 Working...
03 Done !
03 Working...
05 Done !
05 Working...
07 Done !
07 Working...
09 Done !
09 Working...
11 Done !
11 Working...
13 Done !
// As you can see here, only one condition is executed at a time,
// which is the normal behavior for a non-threaded program.
Parallel with 1 worker
13 Working...
15 Done !
15 Working...
15 Working...
17 Done !
17 Done !
17 Working...
17 Working...
19 Done !
19 Done !
19 Working...
19 Working...
21 Done !
21 Done !
21 Working...
23 Done !
// Just by adding parallel_ to the first where, you can see that it's
// now asynchronous and that the two conditions can work at the
// same time, interleaving the output a bit.
Parallel with 2 workers
23 Working...
23 Working...
25 Done !
25 Working...
25 Done !
25 Working...
25 Working...
25 Working...
27 Done !
27 Done !
27 Done !
27 Working...
27 Done !
27 Working...
29 Done !
29 Done !
Parallel with 4 workers
55 Working...
55 Working...
55 Working...
55 Working...
57 Done !
57 Done !
57 Done !
57 Done !
57 Working...
57 Working...
57 Working...
57 Working...
59 Done !
59 Done !
59 Done !
59 Done !
// And now with 2 and 4 workers you can clearly see what
// happens: with 2 workers, the input is computed in pairs,
// and with 4 workers, all the input can be computed at once,
// but the 4 workers of the 2nd condition have to wait for data
// before starting to work. So in the last test you have 8 threads:
// only the first 4 work during the first 2 seconds, then only the
// other 4 work.
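The note at the top of this article recommends `asyncio`; for comparison, here is a minimal Python 3 sketch (my code, not part of ppipe) of the same "4 workers at once" overlap, without any threads and with the sleep shortened:

```python
import asyncio

async def fat_big_condition(x):
    # Stand-in for the article's 2-second sleep, shortened for the demo.
    await asyncio.sleep(0.1)
    return True

async def main():
    # All four checks run concurrently, like the 4-worker case above:
    # total wall time stays around one sleep instead of four.
    return await asyncio.gather(*(fat_big_condition(i) for i in range(4)))

print(asyncio.run(main()))
# → [True, True, True, True]
```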
To keep the creation of a ppipe simple, I extracted all the
'threading' part into a function usable as a decorator, so writing a
parallel\_where gives:
:::python
@Pipe
@Threaded
def parallel_where(item, output, condition):
    if condition(item):
        output.put(item)
You can see the queue here! :-) Enjoy!
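Since `ppipe.py` is only a launchable proof of concept, here is a hypothetical Python 3 reconstruction of what such a `Threaded` decorator could do with `queue.Queue`. This is my guess at the mechanics (worker count, sentinel handling), not the branch's actual code:

```python
import queue
import threading
from functools import wraps

_DONE = object()  # sentinel marking the end of the stream

def Threaded(func):
    # Wrap func(item, output, *args) into a stage that feeds items
    # to qte_of_workers threads through an input queue and yields
    # whatever the workers put on the output queue.
    @wraps(func)
    def stage(iterable, *args, qte_of_workers=1, **kwargs):
        inq, outq = queue.Queue(), queue.Queue()

        def worker():
            while True:
                item = inq.get()
                if item is _DONE:
                    outq.put(_DONE)  # tell the consumer this worker is done
                    break
                func(item, outq, *args, **kwargs)

        threads = [threading.Thread(target=worker) for _ in range(qte_of_workers)]
        for t in threads:
            t.start()
        for item in iterable:
            inq.put(item)
        for _ in threads:
            inq.put(_DONE)  # one sentinel per worker
        finished = 0
        while finished < len(threads):
            out = outq.get()
            if out is _DONE:
                finished += 1
            else:
                yield out
        for t in threads:
            t.join()
    return stage

@Threaded
def parallel_where(item, output, condition):
    if condition(item):
        output.put(item)

# Order is not guaranteed across workers, hence the sorted().
print(sorted(parallel_where(range(4), lambda x: x % 2 == 0, qte_of_workers=2)))
# → [0, 2]
```

A real chained version would take the upstream stage's output queue as its input instead of a plain iterable, which is exactly the contract question discussed at the start of the article.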