Analyzing Parallel Programs with Pin

COVER FEATURE

Computer, March 2010. Published by the IEEE Computer Society. 0018-9162/10/$26.00 © 2010 IEEE

Moshe Bach, Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi Devor, Kim Hazelwood, Aamer Jaleel, Chi-Keung Luk, Gail Lyons, Harish Patil, and Ady Tal, Intel

Software instrumentation provides the means to collect information on and efficiently analyze parallel programs. Using Pin, developers can build tools to detect and examine dynamic behavior including data races, memory system behavior, and parallelizable loops.

A decade ago, systems with multiple processors were expensive and relatively rare; only developers with highly specialized skills could successfully parallelize server and scientific applications to exploit the power of multiprocessor systems. In the past few years, multicore systems have become pervasive, and more programmers want to employ parallelism to wring the most performance out of their applications.

Exploiting multiple cores introduces new correctness and performance problems such as data races, deadlocks, load balancing, and false sharing. Old problems such as memory corruption become more difficult because parallel programs can be nondeterministic. Programmers need a deeper understanding of their software's dynamic behavior to successfully make the transition from single to multiple threads and processes.

Instrumentation is one tool for collecting the information needed to understand programs. Instrumentation-based tools typically insert extra code into a program to record events during execution [1-4]. The cost of executing the extra code can be as low as a few cycles, enabling fine-grained observation down to the instruction level.

Pin [2] is a software system that performs runtime binary instrumentation of Linux and Microsoft Windows applications. Pin's aim is to provide an instrumentation platform for building a wide variety of program analysis tools, called pintools. By performing the instrumentation on the binary at runtime, Pin eliminates the need to modify or recompile the application's source and supports the instrumentation of programs that dynamically generate code.

INSTRUMENTATION

Pin provides a platform for building instrumentation tools. A pintool consists of instrumentation, analysis, and callback routines [1]. Instrumentation routines inspect the application's instructions and insert calls to analysis routines. Analysis routines are called when the program executes an instrumented instruction and often perform ancillary tasks. The program invokes callbacks when an event occurs, for example, when it is about to exit.

Figure 1 shows a simple pintool that prints the memory addresses of all data a program reads or writes. Instruction is an instrumentation routine that Pin calls the first time the program executes an instruction, so the routine can specify how it should be instrumented. If the instruction reads or writes memory, this example pintool inserts a call to Address, an analysis routine, and directs Pin to pass it the memory reference's effective address. Immediately before a memory reference executes, the program calls Address, which prints the address to a file. The program invokes a callback routine, Fini, when it exits. Instrumentation and callback routines are registered in the pintool's main function.

    #include <stdio.h>
    #include "pin.H"

    FILE *trace;

    // Analysis routine: runs before every instrumented memory reference.
    VOID Address(VOID *addr) { fprintf(trace, "%p\n", addr); }

    // Instrumentation routine: decides how each instruction is instrumented.
    VOID Instruction(INS ins, VOID *v) {
        if (INS_IsMemoryRead(ins)) {
            INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(Address),
                           IARG_MEMORYREAD_EA, IARG_END);
        }
        if (INS_IsMemoryWrite(ins)) {
            INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(Address),
                           IARG_MEMORYWRITE_EA, IARG_END);
        }
    }

    // Callback routine: runs when the program exits.
    VOID Fini(INT32 code, VOID *v) { fclose(trace); }

    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        trace = fopen("pinatrace.out", "w");
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();  // never returns
        return 0;
    }

Figure 1. Pintool for printing all program memory read and write addresses.

Figure 1 demonstrates only a small part of the Pin API. Whereas the example uses an instrumentation routine that can only see a single instruction at a time, Pin lets instrumentation routines see instruction blocks or whole binaries. The argument to Address is an effective address, but Pin provides much more, including register contents (for example, the value of R9), the instruction pointer (IP or PC), procedure argument values, and constants. The only callback used in the example is for program end, but Pin also provides callbacks to notify a pintool about shared library loads, thread creation, system calls, Unix signals, and Microsoft Windows exceptions.

Although the instrumentation in this example is very simple, it is sufficient for a variety of useful tools. Instead of writing addresses to a file, a tool could feed the addresses to a software model of a cache and compute the cache miss rate for the application. By watching all the references to a specific memory location, it is possible to find an erroneous write through a pointer that overwrites a value with 1/100th the overhead of doing the same analysis in a debugger.

Pin uses a just-in-time (JIT) compiler to insert instrumentation into a running application. The JIT compiler recompiles and instruments small chunks of binary instructions immediately prior to executing them. Pin stores the modified instructions in a software code cache where they execute in lieu of the original application instructions. The code cache allows Pin to generate code regions once and reuse them for the remainder of program execution, amortizing compilation costs. Pin's average base overhead is 30 percent, and user-inserted instrumentation adds to the time.

In Pin's high-performance probe mode option, the base overhead is near zero. The probe mode has a limited set of callbacks available and restricts tools to interposing wrapper routines for global functions. Figure 2 shows a pintool fragment that wraps calls to malloc so it can print the argument and return values. Image is an instrumentation routine that the program invokes every time a binary or shared library loads. It searches the binary for a function called malloc and replaces it with a call to malloc_wrap. When the program calls malloc, malloc_wrap is called instead, which calls the application malloc, then prints the argument and return value. To avoid infinite recursion, the call to malloc from malloc_wrap should not be redirected, so we instead call the function pointer returned by RTN_ReplaceProbed. The data collected from this tool could be used to find a program that incorrectly freed the same memory twice or track down some code that allocated too much memory.

    // Pointer type matching malloc's signature (returns void *).
    typedef void *(*malloc_funptr_t)(size_t size);
    malloc_funptr_t app_malloc;  // the application's original malloc

    VOID *malloc_wrap(size_t size) {
        void *ptr = app_malloc(size);  // call the original, not the probe
        printf("Malloc %zu return %p", size, ptr);
        return ptr;
    }

    VOID Image(IMG img, VOID *v) {
        RTN mallocRtn = RTN_FindByName(img, "malloc");
        if (RTN_Valid(mallocRtn)) {
            app_malloc = (malloc_funptr_t)RTN_ReplaceProbed(mallocRtn,
                                                            AFUNPTR(malloc_wrap));
        }
    }

Figure 2. Pintool fragment for wrapping malloc.

In probe mode, the program binary is modified in memory. Pin overwrites the entry point of procedures with jumps (called probes) to dynamically generated instrumentation. This code can invoke analysis routines or a replacement routine. When the replacement routine needs to invoke the original function, it calls a copy of the entry point (without the probe) and continues executing the original program.

Instrumenting parallel programs

Instrumenting a parallel program is not very different from instrumenting single-threaded programs. Pin provides callbacks when a new thread or new process is created. Analysis routines can be passed a thread ID so it is possible to attribute recorded data (for example, a memory reference address) to the thread that performed the operation.

Instrumenting a multithreaded program does require some special care by the tool writer. When a pintool instruments a parallel program, the application threads execute the calls to analysis functions. If the pintool in Figure 1 is invoked on a multithreaded program, then all the application threads can call the Address function simultaneously. The pintool author is responsible for making the analysis functions thread-safe so they can be applied to a multithreaded program.

Writing a thread-safe analysis routine is similar to writing a thread-safe routine in a multithreaded program. Authors use locks to synchronize references to shared data with other threads. Pin also provides APIs for allocating and addressing thread-local storage.

For example, the Address function in Figure 1 writes the program address to a file. The trace variable points to a FILE descriptor, which all threads share. It is not safe for multiple threads to write to the same FILE simultaneously. To enable this pintool to correctly instrument a multithreaded program, the Address function must either have a lock around the call to fprintf or create a separate output file for each thread and retrieve the file descriptor from thread-local storage.

Performance considerations

Correcting a parallel program by adding locks is usually straightforward. However, a highly contended lock serializes execution and leads to poor CPU utilization. Because application threads execute analysis routines, a highly contended lock in an analysis routine will also serialize the application's execution. The serialization increases the pintool's overhead when compared to the application's uninstrumented execution and might alter the parallel program's behavior drastically.

Pintool authors must employ standard parallel programming techniques to avoid excessive serialization. They should use thread-local storage to avoid the need to lock global storage. Instead of a single monolithic lock for a data structure, they should use fine-grained locks.

False sharing is another pitfall in naïve pintools, occurring when multiple threads access different parts of the same cache line and at least one of them is a write. To maintain memory coherency, the computer must copy the memory from one CPU's cache to another, even though data is not truly shared. False sharing is less costly when CPUs operate out of a shared cache, as is true for the four cores in the Intel Core i7 processor. Developers can eliminate false sharing by padding critical data structures to the size of a cache line or rearranging the structures' data layout.

Multithreaded versus multiprocess instrumentation

Pin allows instrumentation of parallel programs that use multiple threads and multiple cooperating processes. A new thread executes the same instrumented code as the other threads and accesses the same data. When a program spawns a new process or a process exits, Pin notifies the pintool. The pintool can choose to let the new process execute natively or under its control. The new process will have new code that the pintool must reinstrument. The processes do not share pintool data; however, a pintool can use OS-provided mechanisms for communication between the parallel program's instrumented processes.

EXAMPLE TOOLS

Developers can use various Pin-based tools to analyze parallel program performance and correctness.

Intel Parallel Inspector

The Intel Parallel Inspector analyzes the execution of multithreaded programs to find memory and threading errors, such as memory leaks, references to uninitialized data, data races, and deadlocks. Intel Parallel Inspector uses Pin to instrument the running program and collect the information necessary to detect errors.

A data race occurs when two threads access the same data, at least one access is a write, and there is no synchronization (for example, locking) between accesses [5]. Unsynchronized variable writes usually are a programming error and can cause nondeterministic behavior.

To detect data races, Parallel Inspector uses Pin to instrument all machine instructions in the program that reference memory and records the effective addresses (similar to Figure 1). It also instruments calls to thread synchronization APIs. By examining the effective addresses, Intel Parallel Inspector can detect when multiple threads access the same data. The synchronization API's instrumentation lets Intel Parallel Inspector determine if the memory accesses were synchronized. To help the programmer identify the cause of the data race, Intel Parallel Inspector shows the source lines and the call stacks leading to the problematic memory references. Pin provides an API for mapping a machine instruction address to the corresponding source line and file.

A debugger provides a call stack by unwinding stack frames and recovering the procedure call return addresses from the stack. Error-checking tools need to record the call stack for every memory reference because the tools might not determine until later whether that reference caused an error. Unwinding the call stack for every reference is expensive. Instead, tools typically keep a shadow call stack. A pintool instruments all call instructions, saving the stack pointer's current value and the called procedure on the shadow stack. Procedure return instructions are also instrumented, popping off enough shadow stack frames to resynchronize with the stack pointer's current value.

Intel Parallel Amplifier

The Intel Parallel Amplifier performs three types of analysis to help programmers improve program performance: hotspots, concurrency, and locks and waits. Hotspots attribute time to source lines and call stacks, identifying the parts of the programs that would benefit from tuning and parallelism. Concurrency measures the CPUs' utilization, giving whole program and per-function summaries. Locks and waits measures the time multithreaded programs spend waiting on locks, attributing time to synchronization objects and source lines. Identifying locks responsible for wait time and the associated source lines helps programmers improve a parallel program's CPU utilization.

Hotspot and concurrency analysis data comes from sampling. Intel Parallel Amplifier uses Pin to instrument the application to collect data for the locks and waits analysis. Capturing accurate timing data requires low overhead instrumentation. The locks and waits analysis uses Pin's probe mode to replace calls to synchronization APIs with wrapper functions, as Figure 2 demonstrates. The wrapper functions call the original synchronization function and record the wait time, synchronization object, and call stack. Because the Intel Parallel Amplifier only instruments synchronization routines and not every call and return, it cannot maintain a shadow call stack. Instead, the instrumentation unwinds the stack every time it needs to capture a call stack.

Intel Trace Analyzer and Collector

The Intel Trace Analyzer and Collector provides information critical to understanding and optimizing cluster performance by quickly finding performance bottlenecks with Message Passing Interface (MPI) communication. The tool presents a timeline showing when MPI messages are sent and received, and programmers can use this information to improve the CPU utilization.

The Intel Trace Analyzer and Collector uses Pin's probe mode to instrument calls to the MPI library, collecting time stamps, arguments, and other data. If a user requests call stack information, a JIT-mode tool instruments call and return instructions to maintain a shadow stack.

CMP$im

Memory system behavior is critical to parallel program performance. Computational bandwidth increases faster than memory bandwidth, especially for multicore systems. Programmers must utilize as much bandwidth as possible for programs to scale to many processors. Hardware-based monitors can report summary statistics such as memory references and cache misses; however, they are limited to the existing cache hierarchy and are not well suited for collecting more detailed information such as the degree of cache line sharing or the frequency of cache misses because of false sharing.

CMP$im [6] uses Pin to collect the memory addresses of multithreaded and multiprocessor programs, then uses a software model of the memory system to analyze program behavior. It reports miss rates, cache line reuse and sharing, and coherence traffic, and its configurable memory system model can predict application performance on future systems. While CMP$im is not publicly available, the Pin distribution includes the source for a simple cache model, dcache.cpp.

PinPlay

Debugging and analyzing parallel programs is difficult because their execution is not deterministic. The threads' relative progress can change in every run of the program, possibly changing the results. Even single-threaded program execution is not deterministic because of behavior changes in certain system calls (for example, gettimeofday()) and stack and shared library load locations.

PinPlay is a Pin-based system for user-level capture and deterministic replay of multithreaded programs under Pin. The program first runs under the control of a Pin-based logging tool, which captures all the system call side effects [7] and inter-thread shared-memory dependencies [8]. Another Pin-based tool can replay the log, exactly reproducing the recorded execution by loading system call side effects and possibly delaying threads to satisfy recorded shared-memory dependencies.

Replaying a previously captured log by itself is not very useful. A pintool that instruments a program execution can also instrument a PinPlay log replay. The tool running off a PinPlay log sees the same program behavior on multiple runs, making the analysis deterministic. The program can also replay a PinPlay log while connected to a debugger, making multithreaded program debugging deterministic. As long as the PinPlay logger can capture a bug once, the behavior can repeat exactly multiple times with replay under a debugger. Future releases of Pin will include PinPlay.

Prospector

Compilers are ideal tools for exploiting parallelism because they can potentially perform automatic parallelization. However, even state-of-the-art compilers miss many parallelization opportunities in C/C++ programs, and as a result, programmers are forced to manually parallelize applications. The success of manual parallelization relies on execution profiler quality. Unfortunately, popular execution profilers, such as Gprof [9] and DevPartner, profile programs at function or instruction granularity, which is insufficient for parallel programming because many programs are parallelized at the loop level.

To meet this need, developers created the Pin-based Prospector tool [10], which discovers potential parallelism in serial programs by loop and data-dependence profiling. Prospector provides loop execution profiles such as trip counts and the number of instructions executed inside loops. It also dynamically detects loop-carried data dependencies, which must be preserved during the parallelization process. Programmers receive reports on candidate loops for parallelization and can manually parallelize them with systems such as OpenMP and Threading Building Blocks [11].

In addition to the profiler, Prospector provides several tools for visualizing and interpreting the profiling results. Figure 3 shows the results of applying Prospector to the cactusADM program in the SPEC2006 benchmark suite. Figure 3a is the call graph displayed by Prospector. One of the functions, CCTKi_ScheduleTraverseGHExtensions, is highlighted because it contains a parallelizable loop. Figure 3b is this function's loop graph.

Figure 3. Visualization of Prospector's results: (a) call graph and (b) loop graph.

Intel Software Development Emulator

The Intel Software Development Emulator, or Intel SDE, is a user-level functional emulator, built on Pin, for new instructions in the Intel64 instruction set. Intel SDE supports emulation and debugging of multithreaded programs that use the Intel AVX, AES, and SSE4 instruction set extensions.

Whereas most tools use Pin to observe a program's execution, Intel SDE uses Pin to alter the program while it is running. During instrumentation, it deletes all instructions that must be emulated and replaces them with calls to functions that emulate the instruction.
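
The shadow call stack described for error-checking tools (push the stack pointer and callee at every call, pop frames at every return until the shadow stack agrees with the stack pointer) can be sketched in plain C++ without any Pin calls. This is only an illustration under stated assumptions: the names ShadowFrame, ShadowStack, OnCall, and OnReturn are hypothetical and not part of the Pin API, the stack is assumed to grow toward lower addresses, and OnReturn is assumed to receive the stack pointer value after the return has executed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One shadow frame: the stack pointer seen at the call and the callee's name.
struct ShadowFrame {
    std::uintptr_t sp;      // stack pointer just before the call executed
    std::string    callee;  // procedure being called
};

// Hypothetical shadow call stack; assumes a downward-growing machine stack.
class ShadowStack {
public:
    // Would be invoked from the analysis routine attached to call instructions.
    void OnCall(std::uintptr_t sp, const std::string& callee) {
        assert(!callee.empty());
        frames_.push_back(ShadowFrame{sp, callee});
    }

    // Would be invoked from the analysis routine attached to return
    // instructions. Every frame recorded at or below the post-return stack
    // pointer is stale, so one return can discard several frames (for
    // example after a longjmp skipped intermediate returns).
    void OnReturn(std::uintptr_t sp_after_return) {
        while (!frames_.empty() && frames_.back().sp <= sp_after_return) {
            frames_.pop_back();
        }
    }

    std::size_t Depth() const { return frames_.size(); }

    // Current call stack, outermost procedure first.
    std::vector<std::string> CallStack() const {
        std::vector<std::string> names;
        names.reserve(frames_.size());
        for (const ShadowFrame& f : frames_) {
            names.push_back(f.callee);
        }
        return names;
    }

private:
    std::vector<ShadowFrame> frames_;
};
```

Reading the call stack for a memory reference is then a copy of the current frames rather than an unwind of the real stack, which is the cost trade-off the text describes. A multithreaded tool would presumably keep one such stack per thread.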
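
The advice under "Performance considerations" to pad critical data structures to the size of a cache line can be illustrated with a small C++ sketch of per-thread statistics. Everything named here is an assumption made for illustration: a 64-byte cache line, a fixed upper bound kMaxThreads, dense thread indices starting at zero, and the helper names PaddedStats, RecordAccess, and OnSameCacheLine. None of them come from the Pin API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineBytes = 64;  // assumed cache line size
constexpr int kMaxThreads = 8;               // hypothetical thread limit

// Unpadded layout: neighbouring array elements sit 16 bytes apart, so
// counters that belong to different threads can share one cache line.
struct NaiveStats {
    std::uint64_t reads = 0;
    std::uint64_t writes = 0;
};

// Padded layout: alignas rounds the size up to a full line, so each
// thread's counters occupy a cache line of their own.
struct alignas(kCacheLineBytes) PaddedStats {
    std::uint64_t reads = 0;
    std::uint64_t writes = 0;
};
static_assert(sizeof(PaddedStats) == kCacheLineBytes,
              "PaddedStats should fill exactly one assumed cache line");

// True when both addresses fall inside the same assumed cache line.
inline bool OnSameCacheLine(const void* a, const void* b) {
    auto line = [](const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) / kCacheLineBytes;
    };
    return line(a) == line(b);
}

// One slot per thread. Each thread writes only its own slot, so this
// sketch needs no lock as long as thread indices are unique.
PaddedStats g_stats[kMaxThreads];

inline void RecordAccess(int thread_id, bool is_write) {
    assert(thread_id >= 0 && thread_id < kMaxThreads);
    PaddedStats& s = g_stats[thread_id];
    if (is_write) {
        ++s.writes;
    } else {
        ++s.reads;
    }
}
```

The cost of the padded layout is memory: each slot uses a full line for 16 bytes of data. A tool would sum the slots once at exit, which is where the per-thread results get combined.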