Posts Tagged ‘parallel programming’

Digital Logic Simulation with the Intel® TBB Flow Graph, Part 2: Building bigger components

In Part 1 , I described how to put together a basic logic gate using the Intel® Threading Building Blocks flow graph nodes or_node and multifunction_node . In this blog, I will assume the basic logic gates and_gate , or_gate and xor_gate exist, and use them to construct a four-bit adder. To begin with, I’ll first construct a one-bit full adder as in Figure 2 below: Figure 2 The inputs are A and B, and a Carry-in bit, and the output is the sum S, and a Carry-out bit. Here is the code for the one_bit_adder class: class one_bit_adder { broadcast_node < signal_t > A_port; broadcast_node < signal_t > B_port; broadcast_node < signal_t > CI_port; xor_gate < two_input > FirstXOR; xor_gate < two_input > SecondXOR; and_gate < two_input > FirstAND; and_gate < two_input > SecondAND; or_gate < two_input > FirstOR; graph& my_graph; void make_connections() { make_edge(A_port, FirstXOR.get_in(0)); make_edge(A_port, FirstAND.get_in(0)); make_edge(B_port, FirstXOR.get_in(1)); make_edge(B_port, FirstAND.get_in(1)); make_edge(CI_port, SecondXOR.get_in(1)); make_edge(CI_port, SecondAND.get_in(1)); make_edge(FirstXOR.get_out(), SecondXOR.get_in(0)); make_edge(FirstXOR.get_out(), SecondAND.get_in(0)); make_edge(SecondAND.get_out(), FirstOR.get_in(0)); make_edge(FirstAND.get_out(), FirstOR.get_in(1)); } public: one_bit_adder(graph& g) : my_graph(g), A_port(g), B_port(g), CI_port(g), FirstXOR(g), SecondXOR(g), FirstAND(g), SecondAND(g), FirstOR(g) { make_connections(); } one_bit_adder(const one_bit_adder& src) : my_graph(src.my_graph), A_port(src.my_graph), B_port(src.my_graph), CI_port(src.my_graph), FirstXOR(src.my_graph), SecondXOR(src.my_graph), FirstAND(src.my_graph), SecondAND(src.my_graph), FirstOR(src.my_graph) { make_connections(); } ~one_bit_adder() {} receiver < signal_t > & get_A() { return A_port; } receiver < signal_t > & get_B() { return B_port; } receiver < signal_t > & get_CI() { return CI_port; } sender < signal_t > & get_out() { return SecondXOR.get_out(); } sender < signal_t > & get_CO() { return FirstOR.get_out(); } }; This implementation is almost a straightforward translation of the gates and their connections into the flow graph format. The one complication is the addition of the broadcast_node s for each of the input ports. The reason for this is simply to enable connection to a single port from outside of the adder. Since each of the inputs is connected to two gates inside of the one_bit_adder object, there is no single port associated with them automatically. Adding the broadcast_node s enables us to provide the methods get_A , get_B and get_CI that each return a single port capable of receiving data. So, in looking at the diagram above, you can think of the three broadcast_node s as standing in for the black junction circles that the three inputs are connected to directly. To make the four_bit_adder class, simply chain together a set of four one_bit_adder s and connect the Carry-out port of each adder to the Carry-in port of the next adder, as shown in Figure 3 below: Figure 3 This time, the class is even more straightforward to implement, because no broadcast_node s are needed; every input already has exactly one internal connection. class four_bit_adder { graph& my_graph; std::vector < one_bit_adder > four_adders; void make_connections() { make_edge(four_adders[0].get_CO(), four_adders[1].get_CI()); make_edge(four_adders[1].get_CO(), four_adders[2].get_CI()); make_edge(four_adders[2].get_CO(), four_adders[3].get_CI()); } public: four_bit_adder(graph& g) : my_graph(g), four_adders(4, one_bit_adder(g)) { make_connections(); } four_bit_adder(const four_bit_adder& src) : my_graph(src.my_graph), four_adders(4, one_bit_adder(src.my_graph)) { make_connections(); } ~four_bit_adder() {} receiver < signal_t > & get_A(size_t bit) { return four_adders[bit].get_A(); } receiver < signal_t > & get_B(size_t bit) { return four_adders[bit].get_B(); } receiver < signal_t > & get_CI() { return four_adders[0].get_CI(); } sender < signal_t > & get_out(size_t bit) { return four_adders[bit].get_out(); } sender < signal_t > & get_CO() { return four_adders[3].get_CO(); } }; Here, the constructor makes a vector of exactly four adders, and connects the Carry-out ports to the Carry-in ports as appropriate. The multi-bit inputs and outputs have port access methods that take a bit as a parameter. So for example, to get the input port for bit 2 of input B, you would use get_B(2) . In Part 3 , I will present some interesting input and output devices to add to the logic simulation library, and with those, I’ll put together a small simulation that shows the four_bit_adder in action.

Deterministic Reduction: a new Community Preview Feature in Intel® Threading Building Blocks

Computer Arithmetic has a lot of peculiarities [1] . One of these pitfalls is associativity failure in floating point arithmetic. For example, the two sums of fractions calculations below will not produce the same result when using float s: In a sequential program, it is not a big problem since the calculation order is exactly specified so the result is predictable and repeatable. The situation is not so clear in parallel programming. To make the example parallel, I used the parallel_reduce template function from Intel® Threading Building Blocks (Intel® TBB): std::vector arr( N, 1.0f/(float)N ); float sum = tbb::parallel_reduce( tbb::blocked_range( arr.begin(), arr.end() ), 0.0f, []( const tbb::blocked_range& r, float sum ) { return std::accumulate( r.begin(), r.end(), sum ); }, std::plus () ); std::cout

Intel Announces the New Intel® SDK for OpenCL* Applications 2012

In support of the recent announcement of the 3 rd Generation Intel® Core™ Processors , Intel has released the Intel® SDK for OpenCL* Applications 2012. For the first time, OpenCL* developers using Intel® architecture can utilize compute resources across both Intel® Processors and Intel® HD Graphics Driver 4000/2500 From a person who, for the last couple of years has closely followed the emergence of the OpenCL standard, this announcement was something worth waiting for.  Less than a year ago, on this blog, I posted the news that the Intel® OpenCL SDK 1.1 gold  was released ,  This was the first production OpenCL implementation from Intel targeting Intel® processors on Windows* OS. This current announcement is special, the Intel SDK for OpenCL Applications 2012 now supports not only the CPU but also the Intel HD Graphics 4000/2500 for Windows* 7 users.  We’ve come a long way in a year. OpenCL on the 3 rd Generation Intel® Core Processor Family extends Intel’s line of tools and APIs on Intel platforms and adds interoperability with other graphics APIs like DirectX*, OpenGL* and Intel® Media SDK, directly on the Intel HD Graphics device. So what else is new in this release? A Single OpenCL* platform enables shared context for OpenCL applications running on both the CPU and Intel HD Graphics 4000/2500. The OpenCL platform with both CPU and HD Graphics devices is available seamlessly on the Intel® HD Graphics Drivers . Interoperability with the Intel Media SDK with no memory copy overhead Improved performance for OpenCL applications running on Intel® Xeon® Processors and Intel® Core™ Processors. This CPU support is also available for Linux* OS developers. Intel® SDK for OpenCL* applications development tools includes an offline compiler and a step-by-step OpenCL Kernel debugger (for CPU) integrated in Microsoft Visual Studio* 2010 integrated development environment. 10 OpenCL code samples, three of them new, are now available for independent download. The list above is just a sample of what is available with this new SDK. I recommend you read the product brief or watch the introduction video to get started with this new SDK. Download the SDK for free at  www.intel.com/software/opencl and begin optimizing your applications for the 3 rd Generation Intel® Core™ Processors today. Don’t forget to follow us on Twitter at  @IntelOpenCL

Minimize frustration and maximize tuning effort with Amdahl’s Law

I recently had a question from a customer who had introduced a succesful optimization to a hot function in his application, but did not see as much improvement in the overall application as he expected. This is a fairly common occurence in the iterative process of performance tuning. Usually it happens for one of two reasons. 1. Introducing an improvement in one area resulted in inefficiencies somewhere else. This is par for the course with performance tuning, and part of the reason why the process is iterative . It can be hard to anticipate whether a code change you are making in one function will decrease performance somewhere else down the road, and so landing in this situation from time to time is unavoidable. Although you may not be able to always prevent it, using good documentation practices and a tool like Intel® VTune™ Amplifier XE to quantify performance changes can help you see when it is happening. 2. Not enough of the application was optimized (or the optimization was not enough). Fortunately there is a way to predict whether this situation might occur, using Amdahl’s Law . Amdahl’s Law is used a lot for performance work, especially for projecting the theoretical scalability of parallel applications on a given problem size. But another very helpful way to apply the law is during the tuning process. It gives a formula for seeing how much potential overall improvement you will get from improving a fraction of the application. The formula is: Speedup = 1/ ( (1-P) + P/S) , where P is the portion of code improved, and S is the speedup from that portion To use the formula you need to know what percentage of the application’s overall time is being devoted to the function you are improving. You can determine this using the “Hotspots” or “Lightweight Hotspots” analysis in VTune Amplifier XE. These analysis types give you a breakdown of the functions called in your application, and how much time each took. To figure out the percentage, you would look at the CPU_CLK_UNHALTED (clockticks) value for the function you want to improve, then divide by the overall time (one way to get this is to highlight all the rows from your application, then look at the bottom to see the total CPU_CLK_UNHALTED value). Then you would need to estimate how much you think you can improve the performance of that function. If, for example, you are vectorizing a loop in the function, you can use the speedup you expect from vectorization, which you can compute from the size of your data and the size of the SIMD registers available. So, for example, if you had a function taking 20% of the total application time, and you were planning to improve it by 8x from vectorization , you could use the Amdahl’s formula to compute the maximum potential improvement you could achieve for the overall application: 1 / ((1-.2) + .2/8) = 1.21x theoretical maximum application speedup Using Amdahl’s Law in this way can help you compare options and determine where to spend your tuning effort. It is from Amdahl’s Law that Hotspots-based tuning is derived. The formula would tell us that the payoff is going to be insignificant when you tune parts of your application that are not taking a significant fraction of total CPU time. So – Always tune in your hotspots, and use Amdahl’s formula to maximize your efforts!

Sweet 16?

I just saw the article ” AMD calls end to core growth on server chips ” at Techworld.com. The gist of the article is that AMD has decided to produce server chips with no more than 16 cores. There were some interesting future directions outlined and hinted at by the end of the article, too. What seemed most disturbing to me was the limit on the number of cores being self-inflicted. Surely we can’t have reached the maximum number of cores that are possible to squeeze onto a chip? The whole “right turn” idea to add cores rather than try to cool processors reaching rocket engine temperatures was less than 10 years ago. I’m not sure where the physics starts to overshadow Moore’s Law, but I thought I’d  heard that a few more generations of smaller wire sizes in processor dies were still possible. So why not push more and more cores into the same package? It might be that the average server application (and, perhaps even more so, consumer applications) can’t scale well beyond some fixed number of cores. How many cores does it take to type and post a tweet or update your Facebook status or to watch a streaming video? Would any of those tasks be faster or somehow enhanced if there were twice the number of cores available? If we stop increasing the core counts in the next 5 years, how will new chips keep fulfilling the ever-growing hunger for more performance by consumers? Maybe it won’t be about faster and faster application exeuction, but more about less energy consumption while maintaining a level of performance. I guess at some point we’ll stop being concerned about Gigahertz or core counts because all processors will be able to do many of the same tasks in about the same amount of time. I do know that power consumption is going to be a major driving design force as HPC moves closer toward Exascale platforms.  Thus, if the THX-1138 processor draws power twice as fast as the CFM602 processor, I would be more likely to build my system equipped with the former.

Coarse-grained locks and Transactional Synchronization explained

Coarse-grained locks, and the importance of transactions, are key concepts that motivate why Intel Transactional Synchronization Extensions (TSX) is useful.  I’ll do my best to explain them in this blog. In my blog “ Transactional Synchronization in Haswell ,” I describe new instructions (Intel TSX) that will improve the performance of coarse-grained locks.  Understanding coarse-grained locks and the concept of transactions are both key to understanding why Intel TSX matters. Intel TSX may enhance performance of mutual exclusion other than simple coarse-grained locks, but I will focus on coarse-grained locking because it is common and Intel TSX allows highly concurrent accesses using only a simple locking mechanism. An example To motivate by illustration, let’s consider a simple hash table. Hash tables are used to map a key to a key and value pair in linear time. Two key operations are add (insert) and lookup (retrieve). Resizing and deletion are two additional operations of general interest also, but I will leave them for another time. Designing a highly concurrent hash table is a non-trivial task, and there are many approaches to allow high levels of concurrency.  All these approach add complexity to the program, and often to the data structures themselves. The simplest approach is a single lock approach. In such an approach, every operation on the hash table starts by obtaining the lock for the table and concludes by releasing the lock. While the lock is held for the operation, no other task on the system can obtain the lock and therefore no hash table operation is allowed to proceed. Considering Figure 1, no concurrent operations are allowed, so each of the five operations shown would occur one at a time. Figure 1: Five hash table operations requested Solutions A common solution is to break the hash table into smaller regions, and have locks that apply to regions. While this can reduce contention, it still can create needless delays and it definitely complicates the coding and the data structure. Such an approach is a prime example of taking a coarse-grained lock (a single lock for the entire hash table) and working to make it a finer grained lock (multiple locks for smaller table sections). Coarse-grained locks are easier to use, easier to understand and easier to debug.  The only disadvantage is that they tend to impede performance in a multithreaded environment. Multicore processors are increasing the likelihood of this being a problem, and help motivate new hardware assistance so that programming has a chance to stay simple more often than without assistance. Transactional Synchronization (Intel TSX) as a solution What would be ideal, is to use the single lock (coarse-grained locking) because it is easy and not very error prone, but still have the performance of a fine-grained implementation. In our Figure 1 example, only one operation conflicts with another. This example does have more conflicts that would be expected in a real world example. Considering this example, three of the operations have no collision with the other operation so the use of HLE (part of Intel TSX) on the single lock will completely elide the lock. In other words, the performance is very close to the performance of the code if no locking or unlocking code was present. The key however is that the operations are protected by the Intel TSX hardware, which has silently ensured that the protection intended by the lock is indeed assured. The two operations that map to the same hash table entry will need to be staggered. This will occur even if we are unlucky enough to have them happen at the same time. In such a case, the Intel TSX will detect that the lock was indeed needed and some locking overhead will be incurred. What would actually happen in such a case, is that the colliding tasks will proceed into the protected code until the processor detects the conflict. As such a point, both updates will abort their protected code (also called the transaction). The most common solution then is to have each task proceed but actually enforce the lock on the second try. This means that one task will win, and delay the other, until the operation is complete. The precise decision on how to handle the collision is either up to the processor implementation with HLE, or the programmer with RTM. The processor implementation for HLE will also be fairly simple and conservative, in order to preserve the semantics of the original lock and hence compatibility with processors that lack Intel TSX. Summary For a hash map, Intel TSX allows for the right things to occur without losing the protection that the locks need to give. Intel TSX ensures the same results as the coarse-grained lock guarantees, but allows unrelated operations to proceed without delays that the coarse-grained locks would have caused. For more information on Transactional Synchronization, see my blog on Intel TSX . Please check out the specification and stay tuned for information about supporting tools from Intel and others in the coming months.

Myths about static analysis. The fifth myth – a small test program is enough to evaluate a tool.

While communicating with people on forums, I noticed there are a few lasting misconceptions concerning the static analysis methodology. I decided to write a series of brief articles where I want to show you the real state of things. The fifth myth: “You can easily evaluate capabilities of a static analyzer on a small test code”. This is how this statement looks in discussions on forums (this is a collective image): I’ve written a special program, its size is 100 code lines. But the analyzer doesn’t generate anything although all the warning levels are enabled. This [tool of yours] / [static analysis] in general is just rubbish. It is not the static analysis methodology which is rubbish, but this approach to evaluating the usability of a particular tool. The incorrectness of this kind of tool studying consists of two aspects: 1. Programmers think they don’t make simple mistakes. This phenomenon was discussed in Myth 2 . So they try to feed an analyzer with a tricky sample and feel happy secretly when the analyzer can’t find the error. This game is interesting yet senseless. You should understand that most errors are simple as hell, and static analyzers detect them very well. The paradox is that it’s much more difficult to invent a simple mistake than a complicated one. Here you are an example. Can you ever guess to write a sample like this? int threadcounts[] = { 1, kNumThreads }; for (size_t i = 0; i < sizeof(threadcounts) / sizeof(threadcounts); i++) { I doubt. I cannot imagine one can make such a silly mistake and write “sizeof(threadcounts) / sizeof(threadcounts)”. So, such an example will never be created on purpose. By the way, this fragment is taken not from a student’s lab work, but from the Chromium project. It is diagnosed by the PVS-Studio analyzer very easily, of course. 2. Written samples are of random character, and they are few. So you may get very different results depending on chance. You may invent 5 errors that will be successfully found by one analyzer and not found by another analyzer. Or you may create a program with five errors, and two analyzers will give opposite results for it. The sampling for such an investigation is too small. To be able to compare and study tools with at least somewhat reliable results, you must write a program text with at least 500 different errors. An investigation based on 5-10 errors is not reliable. Moreover, programmers expect to see diagnostic messages on errors of some particular type and forget about the rest. For example, almost all the programmers write one and the same sample with a memory release defect: void Foo() { int *a = (int *)malloc(X); int *b = (int *)malloc(Y); //… free(a); } Some analyzers detect this error, the others don’t. For instance, PVS-Studio does not diagnose memory leaks currently. But it can find the following stuff: static int rr_cmp(uchar *a,uchar *b) { if (a[0] != b[0]) return (int) a[0] – (int) b[0]; if (a[1] != b[1]) return (int) a[1] – (int) b[1]; if (a[2] != b[2]) return (int) a[2] – (int) b[2]; if (a[3] != b[3]) return (int) a[3] – (int) b[3]; if (a[4] != b[4]) return (int) a[4] – (int) b[4]; if (a[5] != b[5]) return (int) a[1] – (int) b[5]; if (a[6] != b[6]) return (int) a[6] – (int) b[6]; return (int) a[7] – (int) b[7]; } There must be “return (int) a[5] – (int) b[5];” instead of “return (int) a[1] – (int) b[5];”. Why does nobody write such examples? Note that PVS-Studio has found this error in the MySQL project. The conclusion is, adequate investigation or comparison of tools can be carried out only with real projects. You take project A, test it with PC-Lint / Visual C++ / PVS-Studio / C++Test, study all the messages attentively, draw up a table of results (how many and which errors each analyzer has found). This is the only real investigation and comparison. For example: ” Comparing the general static analysis in Visual Studio 2010 and PVS-Studio by examples of errors detected in five open source projects “.

Myths about static analysis. The fourth myth – programmers want to add their own rules into a static analyzer.

While communicating with people on forums, I noticed there are a few lasting misconceptions concerning the static analysis methodology. I decided to write a series of brief articles where I want to show you the real state of things. The fourth myth is: “A static analyzer must enable users to add user-made rules. Programmers want to add their own rules.” No, they don’t. They actually want to solve some tasks of searching for particular language constructs. It is not the same thing as creating diagnostic rules. I have always answered that implementation of own rules is not the thing programmers actually want. And I never saw any other alternative than implementing diagnostic rules by the analyzer’s developers at the request of programmers ( an article on the subject ). I have had a fruitful conversation with Dmitry Petunin recently. He is the director of an Intel department of compiler testing and software verification tool development. He enlarged my understanding of this subject and voiced the idea I had been pondering over but failed to give the final formulation of. Dmitry confirmed my belief that programmers wouldn’t write diagnostic rules. The reason is very simple – it is very hard. Some static analysis tools enable users to extend the rule set. But it is done rather as a pure formality or for convenience of the tool’s developers themselves. You need to know the subject very deeply to be able to develop new diagnostic rules. If an enthusiast without skill starts creating them, they will be of little use. My understanding of the issue was over at this point. Dmitry, being more skilled than I, helped me learn more. In brief, this is how the situation looks. Indeed, programmers want to be able to search for some particular patterns/errors in their code. They really need it. For example, someone needs to find all the explicit conversions of the int type to float. This task cannot be solved by such tools as grep, since it is unknown what type the FOO() function will return in a “float(P-> FOO())” -like construct. At this moment the programmer comes to the idea that he/she can implement search of such constructs by adding his/her own check into the static analyzer. This is where the key point lies. The programmer does not need to create his/her analysis rules. He/she needs to solve a particular issue. What he/she wants is a very small task from the viewpoint of static analysis mechanisms. It is like using a car to light cigarettes with its cigarette lighter. That’s why both Dmitry and I don’t support the idea of providing users with API to handle the analyzer. It is an extremely difficult task from the viewpoint of development. Besides, people will hardly use more than 1% of it. So, it’s irrational. It is easier and cheaper for a developer to implement users’ wishes than create a complex API for add-ons or a special language of rule description. The readers may say: “then make only 1% of the functionality in API available, and everyone will be happy”. Yes, right. But look where the emphasis has moved: from developing own rules we have come to the idea that we just need a tool similar to grep but possessing some additional information about program code. There is no such a tool yet. If you want to solve some task, write to me, and we will try to implement it in the PVS-Studio analyzer. For example, we have recently implemented several requests on searching for explicit type conversions: V2003 , V2004 , V2005 . It is much easier for us to implement such wishes than create and maintain an open interface. It’s also easier for users themselves. By the way, such a tool might appear some time later within the scope of Intel C++. Dmitry Petunin said they had discussed a probability of creating a grep-like tool possessing knowledge about code structure and variable types. But it was discussed just in theory. I don’t know whether or not they really intend to create this tool.

Myths about static analysis. The third myth – dynamic analysis is better than static analysis.

While communicating with people on forums, I noticed there are a few lasting misconceptions concerning the static analysis methodology. I decided to write a series of brief articles where I want to show you the real state of things. The third myth is: “Dynamic analysis performed by tools like valgrind for C/C++ is much better than static code analysis”. The statement is rather strange. Dynamic and static analyses are just two different methodologies which supplement each other. Programmers seem to understand it, but I hear it again and again that dynamic analysis is better than static analysis. Let me list advantages of static code analysis. Diagnostics of all the branches in a program Dynamic analysis in practice cannot cover all the branches of a program. After these words, fans of valgrind tell me that one should create appropriate tests. They are right in theory. But anyone who tried to create them understands how complicated and long it is. In practice, even good tests cover not more than 80% of program code. It is especially noticeable in code fragments handling non-standard/emergency situations. If you take an old project and check it with a static analyzer, most errors will be detected in these very places. The reason is that even if the project is old, these fragments stay almost untested. Here is a brief example to show you what I mean (FCE Ultra project): fp = fopen(name,”wb”); int x = 0; if (!fp) int x = 1; The ‘x’ flag will not be equal to one if the file wasn’t opened. It is because of such errors that something goes wrong in programs: they crash or generate meaningless messages instead of adequate error messages. Scalability To be able to check large projects through dynamic methods regularly, you have to create a special infrastructure. You need special tests. You need to launch several instances of an application in parallel with different input data. Static analysis is scaled several times easier. Usually you need only a multi-core computer to run a tool performing static analysis. Analysis at a higher level One of the advantages of dynamic analysis is that it knows what function and with what arguments is being called. Consequently, it can check if the call is correct. Static analysis can’t know it and can’t check arguments’ values in most cases. This is a disadvantage of this method. But static analysis performs analysis at a higher level than dynamic analysis. This feature allows a static analyzer to detect issues which are correct from the viewpoint of dynamic analysis. Here is a simple example (ReactOS project): void Mapdesc::identify( REAL dest[MAXCOORDS][MAXCOORDS] ) { memset( dest, 0, sizeof( dest ) ); for( int i=0; i != hcoords; i++ ) dest[i][i] = 1.0; } Everything is good here from the viewpoint of dynamic analysis, while static analysis gives the alarm because it is very suspicious that the number of bytes being cleared in an array coincides with the number of bytes the pointer consists of. Here you are another example from the Clang project: MapTy PerPtrTopDown; MapTy PerPtrBottomUp; void clearBottomUpPointers() { PerPtrTopDown.clear(); } void clearTopDownPointers() { PerPtrTopDown.clear(); } Is there anything here dynamic analysis may find suspicious? Nothing. But a static analyzer can suspect there is something wrong. The error is this: inside clearBottomUpPointers() there must be this code: “PerPtrBottomUp.clear();”.

Parallel Programming is easier than separating 2 corks

I’ve known Prof. Tom Murphy for a few years now. Whenever we were at a conference or other event together and had dinner, he invariably would ask the wait staff if they had two corks he could have. If the place served wine, it wasn’t too difficult to find two corks that were the same size or close. Upon receiving the corks, Tom would demonstrate his “cork trick” and mystify everyone that had not seen it before. After going through it three or four times he would hand the corks back to our server and have them try to do it. They would go away, sometimes showing others, as they struggled to figure out the trick. If they actually tried to recreate the solution as Tom had been able to do, they always came back before we had paid the check and triumphantly demonstrated their dexterity. At SC11, the Educational Alliance for a Parallel Future ( EAPF ) commissioned some corks with the organization’s logo. Tom wandered around part of the conference urging attendees to try his cork trick. The tagline he used was that “Parallel Programming is easier than the cork trick.” You can see a short video of his efforts to bring a little magic to the SC11 proceedings here . If you meet Tom with some corks in his pocket and he brings them out to show you the cork trick, be aware that he will never show you the solution. (He says he really likes me, but I had to figure it out for myself.) Like most problems you encounter in life, very few are impossible to solve; it is just that you don’t have a solution, yet. Parallel programming is the same. It may seem difficult and impossible to figure out, but that only means you haven’t discovered the key that will allow you to wrap your brain around the concepts.