written by Eric J. Ma on 2015-11-28
As part of my PhD dissertation, I wanted to investigate the role of host ecology in the generation of reassortant viruses. Knowing myself to be a fairly algebra-blind person, I decided that an agent-based model (ABM) was going to be much more manageable than writing ODEs. (Actually, the real reason is that I'm modelling discrete states rather than continuous states, but yes, I will admit that I take longer than your average programmer with algebra.)
Starting with our intuition of host-pathogen interactions, I implemented a custom ABM using Python classes: "Hosts" and "Viruses".
"Viruses" have two segments, representing a segmented virus (like influenza or Lassa virus), each with a color (red or blue), and can infect Hosts (which are likewise red or blue). Viruses of a particular color prefer to infect hosts of the same color, but can still infect hosts of a different color, just at a lower probability. If two viruses are present in the same host, then there can be, at some small probability, the opportunity for gene sharing to occur.
One of the virus' segments determines host immunity; if the virus encounters a host which has immunity against its color, then the probability of infection drastically decreases, and it is likely that the virus will eventually be cleared.
"Hosts" are where viruses replicate. Hosts gain immunity to one of the segment's colors, after a set number of days of infection. When a host gains immunity to a particular virus color, it can much more successfully fend off a new infection with that same color. Hosts also interact with one another. They may have a strong preference for a host of the same color, a.k.a. homophily.
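A minimal sketch of what such classes might look like (the class names follow the post, but all attribute names and probability values here are hypothetical, not the ones in the actual repository):

```python
class Virus:
    """A two-segment virus; each segment carries a color."""
    def __init__(self, segments=("red", "red")):
        self.segments = list(segments)

class Host:
    """A host with a color and a set of acquired immunities."""
    def __init__(self, color):
        self.color = color
        self.immunity = set()  # colors this host is immune to

    def infection_probability(self, virus, base_p=0.9,
                              mismatch_p=0.3, immune_p=0.01):
        # Segment 0 is taken (hypothetically) as the one that drives immunity.
        key_color = virus.segments[0]
        if key_color in self.immunity:
            # Immunity drastically decreases the infection probability.
            return immune_p
        # Same-color infections happen at a higher probability than mismatches.
        return base_p if key_color == self.color else mismatch_p
```

The actual simulation layers host-host contact (with homophily) and within-host reassortment on top of a probability structure like this one.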
My code for the simulations can be found on this Github repository. The details of the simulation are still a work in progress, as these ideas are still early stage. My point on this blog post here will be to try to compare PyPy against CPython on performance. However, I do welcome further comments on the modelling, if you've taken the time to read through my code.
Code for the statistical draws can be found on this other Github repository.
My setup: `conda` (highly recommended! Download here. Make sure to get Python 3!). I installed `pypy` and `pypy3` under my home directory on Ubuntu Linux, and ensured that my bash shell `$PATH` variable also pointed to `~/pypy[3]/bin`.
Let's take a look at the performance of CPython vs. PyPy using pure-Python code.
I first started with 1000 agents in the simulation, with the simulation running for 150 time steps.
Under these circumstances, on an old Asus U30J (Core i3 2.27GHz, 8GB RAM, SSD), executing the simulation with PyPy required only 13.4 seconds, while executing with CPython required 110.5 seconds: roughly an 8x speedup.
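The timings themselves are plain wall-clock measurements; a sketch of the harness (where `run` is a stand-in for the actual simulation entry point, which isn't named here):

```python
from time import time

def time_run(run, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call to `run`."""
    start = time()
    result = run(*args, **kwargs)
    # time() - start gives wall-clock seconds, which is what the tables report.
    return result, time() - start
```

Each cell in the tables below corresponds to one such timed run.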
I wanted to measure the time complexity of the simulation as a function of the number of hosts. Therefore, I varied the number of hosts from 100 to 1600, in steps of 300.
Partial (mostly because of laziness) results are tabulated below. (Yes, this degree of laziness would never fly in grad school.)
All times are in seconds.

Agents | PyPy Trial 1 | PyPy Trial 2 | PyPy Trial 3 | CPython Trial 1 | CPython Trial 2 | CPython Trial 3 |
---|---|---|---|---|---|---|
1000 | 13.4 | 12.8 | 12.9 | 110.5 | | |
700 | 8.63 | 9.02 | 8.65 | 53.7 | | |
400 | 4.35 | 4.33 | 4.66 | 18.2 | 18.2 | |
100 | 1.03 | 1.00 | 1.17 | 1.47 | 1.48 | 1.45 |
As we can see, PyPy wins when the number of iterations is large.
I use statistical Bernoulli trials (biased coin flips) extensively in the simulation. Yet, one thing that is conspicuously unavailable to PyPy users (in an easily installable format) is the scientific Python stack; most of that boils down to `numpy`. Rather than fiddle with trying to get `numpy`, `scipy`, and other packages installed, I re-implemented my own `bernoulli` class.
```python
from random import random

class bernoulli(object):
    """
    A minimal stand-in for scipy.stats.bernoulli.
    """
    def __init__(self, p):
        super(bernoulli, self).__init__()
        self.p = p

    def rvs(self, num_draws):
        # A draw succeeds (returns 1) with probability p, hence `<`.
        draws = []
        for i in range(num_draws):
            draws.append(int(random() < self.p))
        return draws
```
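As a quick sanity check on the draw logic (a draw should return 1 with probability p, i.e. `int(random() < p)`), the empirical mean of many draws should approach p. A functional version of the same idea, hypothetical and not from the repository:

```python
from random import random, seed

def bernoulli_rvs(p, num_draws):
    # int(random() < p) is 1 with probability p, 0 otherwise.
    return [int(random() < p) for _ in range(num_draws)]

seed(0)  # fixed seed for reproducibility
draws = bernoulli_rvs(0.3, 100000)
print(sum(draws) / len(draws))  # empirical mean; should be close to 0.3
```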
This is almost a drop-in replacement for `scipy.stats.bernoulli`. (The API isn't exactly the same.) I wanted to know whether the `bernoulli` class I wrote performed better than the `scipy.stats` version. I therefore set up a series of small tests to determine at what scale of function calls it makes more sense to use PyPy vs. CPython.
I then wrote a simple block of code that times the Bernoulli draws. For the PyPy version:

```python
from stats.bernoulli import bernoulli
from time import time

start = time()
bern_draws = bernoulli(0.5).rvs(10000)
mean = sum(bern_draws) / len(bern_draws)
end = time()
print(end - start)
```
And for the CPython/scipy version:

```python
from scipy.stats import bernoulli
from time import time

start = time()
bern_draws = bernoulli(0.5).rvs(10000)
mean = sum(bern_draws) / len(bern_draws)
end = time()
print(end - start)
```
Times are in seconds.

Bernoulli Draws | PyPy + Custom (1) | PyPy + Custom (2) | PyPy + Custom (3) | CPython + SciPy (1) | CPython + SciPy (2) | CPython + SciPy (3) |
---|---|---|---|---|---|---|
1000000 | 0.271 | 0.241 | 0.206 | 0.486 | 0.513 | 0.481 |
100000 | 0.0437 | 0.0421 | 0.0473 | 0.0534 | 0.0794 | 0.0493 |
10000 | 0.0311 | 0.0331 | 0.0345 | 0.00393 | 0.00410 | 0.00387 |
As we can see, `scipy` is quite optimized, and outperforms the custom PyPy version at lower numbers of statistical draws. Things only get better for PyPy as the number of draws increases.
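As an aside on the CPython side: for batched biased coin flips without `numpy`, the standard library's `random.choices` (added in Python 3.6, so after this post was written) avoids the Python-level loop entirely:

```python
import random

random.seed(1)  # fixed seed for reproducibility
p = 0.5
# k draws in a single call: 1 with probability p, 0 with probability 1 - p.
draws = random.choices([1, 0], weights=[p, 1 - p], k=10000)
print(sum(draws) / len(draws))  # empirical mean; should be close to 0.5
```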
Some things that I've learned from this exercise:

- `numpy` is not, right now, easily `pip`-installable under PyPy. Because of this, the rest of the scientific Python stack is also not `pip`-installable in a PyPy environment. (I will admit to still being a learner here; I wouldn't be able to articulate why `numpy` doesn't work with PyPy out-of-the-box. Experts, chime in please?)

Some things I hope will happen:

- That the `numba` project allows JIT compilation when using Python objects instead.

As is the usual case for me, starting a new project idea gave me the impetus to try out a new thing, as I wouldn't have to risk breaking workflows that have worked for existing projects. If you find yourself in the same spot, I would encourage you to try out PyPy, especially for pure-Python code (i.e. where no external libraries are used).
@article{
ericmjl-2015-profiling-simulation,
author = {Eric J. Ma},
title = {Profiling PyPy vs. Python for Agent-Based Simulation},
year = {2015},
month = {11},
day = {28},
howpublished = {\url{https://ericmjl.github.io}},
journal = {Eric J. Ma's Blog},
url = {https://ericmjl.github.io/blog/2015/11/28/profiling-pypy-vs-python-for-agent-based-simulation},
}