100 Games of Pong
Challenge
Given Pong......how do we run 100 games as quickly as possible without changing the game?
Rules
 Complete 100 games in the least amount of time.
 All of the basic components and rules must remain and the statistics
must resemble the original, implying:
 First player to 11 wins.
 The paddles must have a chance at returning a given ball.
 The average points per game must resemble the original.
 The maximum speed of any component is 10 (specific to this simulator, we'll see this later).
Initial Impression
Looks pretty slow, there should be lots of lowhanging fruit. A cursory glance suggests that the speed of the ball is the limiting factor.Profile
We start out by profiling obvious factors that relate to the Rules, beginning with the mean game time so we have quantitative feedback what effect our changes have. Since it was explicitly mentioned that we must preserve statistical outcomes we'll also profile statistics such as mean total score and mean paddle touches per point to ensure that our quantitative optimizations don't break the game.
profile:true
So now that we've got profiling we've got a baseline to compare our optimizations to.
Obvious Stuff First
This should be a quick fix, so let's just dive in. The obvious optimization is to make the ball faster.
profile:true
Ok, so it took longer than we thought but in the end we've optimized the lowlevel timing and ball code and made the ball move twice as fast! We expect this to double the speed of the game. In fact, it more than doubles the speed because the paddles are now too slow and have difficulty returning the ball at all. No problem, we'll just speed up the paddles.
paddle:{speed:10}
profile:true
The original paddle speed was 3, so our best guy stayed up late over the weekend to tweak the paddle and timing code to eek out the maximum possible paddle speed. They should be plenty fast.
They worked extremely well... too well. The first point never ended. The unexpected 0 games completed also happened to caused a divisionbyzero in our reporting, but that's beside the point. It looks like those paddle optimizations went a little overboard, since we can't speed up the ball any more we'll have to slow the paddles back down...
paddle:{speed:5}
profile:true
So we just threw a sleep();
in there to slow the paddles down to 5
so they have a reasonable chance of returning a ball served at speed:10,
which results in a more balanced outcome.
Our paddle speedup had thrown off the delicate timing balance that
we didn't fully appreciate was originally there when we started.
However, this new balance comes at the cost of time. The longer points result in more time per game and slower overall performance.
In fact, we notice that for all our lowlevel uber speedhacks we're more or less back to where we started(!). Oops.
Since the ball moves at the maximum possible speed we now have an effectively optimal single thread of execution — right? Are our opportunities for improvement exhausted?
Since we know we want to play 100 games, we could run multiple games at once. Because our games don't depend on each other whatsoever they are embarassingly parallel; little effort is required to run them in parallel. We can run at least as many games at once as we have CPUs, maybe multiple per CPU.
paddle:{speed:5}
profile:false
By using N CPUs we get an Nwise speedup which is great, but because our game generates very little actual I/O (a few stats must be saved at the end of each game) our app is bound by the speed of our memory bus, CPU cache and clockspeed so we don't benefit from running multiple games per CPU.
But there's a limit to the free lunch a CPU can give our serial software — beyond a certain point we need to look beyond the hardware and within the model our software implements. So let's put aside the parallel hardware stuff for now and work on optimizing a single instance — we'll parallelize it at the end.
We are still limited by time. How can we do more work in less time? Look at the paddles in the game above — they spend ~75% of their time idle and only ~1% actually hitting (or missing) the ball. Can we reduce the amount of time the paddles sit idle?
Parallelizing a Serial Algorithm
Like a realworld ping pong game our games progress one point after another in a
serial fashion:
The serial nature of this algorithm mean that even though we know what we need to do in advance we can't do it because we're waiting for the previous step to finish. 
Ideally, we could work on multiple points at once.
If we did our chart should look more like this:

So let's try running multiple points at once in the same game.
Synchronous vs. Asynchronous
paddle:{speed:10}
profile:true
This is the asynchronous or event model — because of the huge latency differential between moving the paddle and moving the balls we can multiplex multiple balls between each paddle.
This greatly reduces the paddle idle time but comes at a cost in the form of greater code complexity — asynchronous code is longer, harder to reason about and more difficult to maintain than equivalent synchronous code.
Is this complexity overhead acceptable? It makes the code more errorprone and is thus relatively more expensive to maintain. However, it's also a lot faster and our goal is speed, so maybe it's worth the cost after all. How can we tell?
Amdahl's law can be used to describe expected parallel speedup using the following factors:
 \(P\): the fraction of the work that's parallelizable from \(0..1\). In our case 100%, represented as \(1\).
 \(S\): the amount we can speed \(P\) up by. in our case it's the number of balls we play at once: \(11\).
\(\frac{1}{(1P)+(\frac{P}{S})}\) Amdahl's equation \(\frac{1}{(11)+(\frac{1}{11})}\) plug in our numbers \(\frac{1}{(\frac{1}{11})}\) simplify \(11\) done
Hmm, it just reduces to \(S\). Embarassingly parallel problems, by definition, are completely parallelizable and can always by spedup by \(S\). So, we can expect our parallel version to run 11 times faster than the original, and we see that this is indeed reflected in the sec/game statistic between the serial and parallel versions.
Ultimately in this case the cost of complexity is acceptable because of the significant performance gains.
Are there any more opportunities for improvement?
The limiting factor is still waiting for the balls. But they're already travelling at the maximum rate — is there anything else we can do?
Let's consider the term rate (velocity) defined as:
\(\bar v = \frac{\Delta d}{\Delta t}\)
The only way to increase \(\bar v\) for a given \(\Delta t\) is to increase \(\Delta d\), but we're already moving at the stated maximum speed (10, as defined in the Rules). But what if we flipped this problem on its head and instead of trying to go faster we reduced the distance we needed to travel in the first place?
\(\sqrt{\max(x,y)^2 + \max(x,y)^2}\)\(\sqrt{2 \times \max(x,y)^2}\)
\(\sqrt{2} \times \max(x,y)\)
\(\approx 1.4142 \times \max(x,y)\)
Given \(x \gets 360, y \gets 240\)
\(\approx 1.4142 \times \max(360,240)\)\(d \approx 509.11\)
If we reduce the ball's angle of incidence we can decrease the distance they need to travel to reach the other side. The lowest possible angle is 0. Let's try that.
paddle:{speed:10}
profile:true
With angle=0 distance from one side of the rectangle to the other simply becomes \(x\)
Given \(x \gets 360, y \gets 240\)
\(d = 360\)We have reduced the constant factor \(\sqrt{2}\) to \(1\). This is backed up empirically by our profiler.
This optimization has the sideeffect of eliminating the familiar floor/ceiling ricochet effect altogether. Though the ricochet is an important part of gameplay we see ppg stats on par with our original, so we can probably do without it.
Our increasingly optimal software looks less and less familiar and less intuitive, but this is ultimately the course that optimization above all else inevitably leads.
We've taken it pretty far — single games run 10× faster and a parallel implementation of an embarassingly parallel problem over N CPUs will, by Amdahl's law, further improve that remaining 10% by N.
So, that' probably it, right?
Despite our work, the balls are still the limiting factor. The speed of the balls is limited by x so let's reduce x!
ball:{speed:10, count:11, angle:0}
paddle:{speed:10}
profile:true
We can of course reduce x even further.
Because the width is so narrow we also increase the paddle height to block more balls to achieve a similar ppg.
interGamePause:1sec
ball:{speed:10, count:11, angle:0}
paddle:{speed:10, height:35px, width:5px}
profile:true
The gameplay here progresses much faster not because we've sped up the balls but because we've reduced the distance they must travel. Our simulations still vaguely resemble Pong and they run so much faster that the 1 second pause inbetween games displaying the WINNER now represents a significant portion of the total time. We can now play 100 games in about the same time as we can play 1 game of the original.
Next we eliminate the intergame ("Winner") pause, which speed us up ~2x:
interGamePause:0sec
ball:{speed:10, count:11, angle:0}
paddle:{speed:10, height:35px, width:5px}
profile:true
This produces a game that looks very fast, and indeed it is. Yet the balls here move no faster than they did originally; their speed has not changed. Instead we have put that speed to better use.
Now that we're much closer to optimal we can look at parallelizing...
We now have approximately 300 times as many games playing themselves out in the same space/time as the original. All by bending the original rules.
Are we there yet? In the real world the answer is yes! We've done a good job.
Conclusions
Through Pong we've visually demonstrated the benefits of program analysis, profiling, concurrency and parallel algorithm choice associated with modeling and understanding risks and benefits of optimizing computer programs.Even within the trivially simple model of Pong we discovered multiple different opportunities for concurrency (gamelevel and pointlevel) and a wide variety of separate factors, some subtle, that affect outcome.
Ultimately an understanding of the software's model and choice of algorithm to reduce the limiting factor of the idle paddles were far more fruitful than focusing on raw lowlevel speed improvement.
Focusing on the most obvious factor at the outset (ball speed) ultimately yielded far less speedup than the easily overlooked paddle speed, but this in turn only became obvious after we looked at the overall algorithm.
The decisions we've made for the sake of optimization result in faster software, but comes at the cost of being harder to understand and maintain.
See Also
 jquerypong
 Herb Sutter The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, Dr. Dobb's Journal, March 2005
 M. D. Hill and M. R. Marty. Amdahl's Law in the Multicore Era IEEE Computer, 41(7):33–38, 2008.
 Speedup, Wikipedia
 Amdahl's law, Wikipedia
 Kas Thomas Go fast, do less Mar 23 2009
 Profiling, Wikipedia
 CPU bound, Wikipedia
 Multiplexing, Wikipedia
 Asynchronous I/O, Wikipedia
 Embarassingly Parallel, Wikipedia
 Quadtree, Wikipedia