%% --------------------------------------------------------------  
%% (C)Copyright 2006,2007,                                         
%% International Business Machines Corporation                     
%% All Rights Reserved.                                            
%%                                                                 
%% Redistribution and use in source and binary forms, with or      
%% without modification, are permitted provided that the           
%% following conditions are met:                                   
%%                                                                 
%% - Redistributions of source code must retain the above copyright
%%   notice, this list of conditions and the following disclaimer. 
%%                                                                 
%% - Redistributions in binary form must reproduce the above       
%%   copyright notice, this list of conditions and the following   
%%   disclaimer in the documentation and/or other materials        
%%   provided with the distribution.                               
%%                                                                 
%% - Neither the name of IBM Corporation nor the names of its      
%%   contributors may be used to endorse or promote products       
%%   derived from this software without specific prior written     
%%   permission.                                                   
%%                                                                 
%% THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND          
%% CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,     
%% INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF        
%% MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE        
%% DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR            
%% CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,    
%% SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT    
%% NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;    
%% LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)        
%% HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN       
%% CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR    
%% OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,  
%% EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.              
%% --------------------------------------------------------------  
%% PROLOG END TAG zYx                                              

Summary: DMA Microbenchmarks

Target: CBE/Linux

Description:

        The DMA Microbenchmarks measure the performance of a representative set
        of DMA operations and report the results.  The intent of these
        microbenchmarks is to guide applications developers in the design,
        development, and performance analysis of applications for systems based
        on the Cell Broadband Engine processor.

        The DMA operations that are currently measured are:

          * Sequential DMAs of various sizes, to memory or to LS of another SPE
          * List-form DMAs for various numbers and sizes of elements, to memory
            or to LS of another SPE
          * Aggregate bandwidth measures for concurrent DMA operations by 
            multiple SPEs

        Command line options allow the user to specify certain parameters for
        the operations performed, such as DMA size.  The benchmark program is
        structured to allow additional operations to be easily added either by
        users or in subsequent SDKs.
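        The inner loop that such a benchmark times can be sketched as
        follows. This is a host-runnable illustration, not dmabench source:
        on a real SPE the transfer would be an mfc_get() tagged DMA from
        <spu_mfcio.h>, completion would be awaited with
        mfc_write_tag_mask()/mfc_read_tag_status_all(), and the timer would
        be the SPU decrementer; here a memcpy and comments stand in so the
        control flow can be shown portably.

        ```c
        #include <stdio.h>
        #include <string.h>

        #define MAX_SIZE 16384

        static unsigned char sysmem[MAX_SIZE];  /* stand-in for system memory   */
        static unsigned char ls_buf[MAX_SIZE];  /* stand-in for SPE local store */

        /* stand-in for mfc_get(ls, ea, size, tag, 0, 0) -- a DMA "read" */
        static void dma_read(void *ls, const void *ea, unsigned size)
        {
            memcpy(ls, ea, size);
        }

        int main(void)
        {
            unsigned numreqs = 4;               /* --numreqs               */
            memset(sysmem, 0xA5, sizeof sysmem);

            /* Sweep --minsize..--maxsize, doubling each step. */
            for (unsigned size = 8; size <= MAX_SIZE; size *= 2) {
                /* t0 = spu_read_decrementer();  (real code)              */
                for (unsigned r = 0; r < numreqs; r++)
                    dma_read(ls_buf, sysmem, size);   /* issue requests   */
                /* Wait for all tags to complete, then
                 * t1 = spu_read_decrementer(); the decrementer counts
                 * down, so ticks per request = (t0 - t1) / numreqs.      */
                printf("dma_size %5u: issued %u requests\n", size, numreqs);
            }
            return 0;
        }
        ```

        Note that with --numreqs greater than one, completion is awaited
        only after the whole batch is issued, so the reported latency is
        the per-request average over the batch.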

How to run:

        The DMA Microbenchmarks are packaged as a single program named 
        dmabench. The usage for the dmabench program is:

            dmabench [options] <benchmark>

        Valid options include:

            --affinity - specifies that logical affinity should be used to
                ensure that threads are scheduled on SPEs in close proximity
                to the SPEs with which they communicate. This option only
                affects benchmarks that target the local store of another SPE.
                The default is no affinity.

            --entrysize n - specifies n bytes as the size of the data
                transferred for each DMA list entry in the dmalist benchmark.
                The default is 128 bytes. Valid values are from 8 bytes to 16K.

            --help - specifies to display a help message and exit

            --maxsize n - specifies n bytes as the largest DMA transfer to be
                performed in the execution of the benchmark.  The default value
                is 16K bytes for the sequential DMA benchmarks and the size of
                2048 list entries for the DMA list benchmarks.  Valid values 
                are from 8 to 16K.

            --minsize n - specifies n bytes as the smallest DMA transfer to be
                performed in the execution of the benchmark.  The default value
                is 8 bytes for the sequential DMA benchmarks and the size of 
                one list entry for the DMA list benchmarks.  Valid values are
                from 8 to 16K.

            --numreqs n - specifies the number of requests issued in sequence
                within the timing window. The SPEs wait for DMA completion only
                after issuing all the requests. The default is a single 
                request. Valid values are from 1 to 32 requests.

            --numspes n - specifies that n SPEs should concurrently execute 
                the benchmark.  The default is to execute the benchmark on a 
                single SPE. When the benchmark is executed on more than one 
                SPE, the SPEs are synchronized so that the benchmark code 
                starts at roughly the same time on all SPEs.

            --offset - specifies that the starting address for system memory
                buffers should be offset by 128 bytes (one cache line) to
                distribute accesses across memory banks.

            benchmark - the string name of the benchmark to be executed. The
                following benchmarks have been implemented:

                seqdma[l]{r|w|rw}: sequential DMAs, where "l" indicates the DMA
                       targets the local store of another SPE, and "{r|w|rw}"
                       indicates the type of access to be performed -- read, 
                       write, or a read and write performed concurrently.

                dmalist[l]{r|w|rw}: list-form DMAs, where "l" indicates the DMA
                       targets the local store of another SPE, and "{r|w|rw}"
                       indicates the type of access to be performed -- read, 
                       write, or a read and write performed concurrently.

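        The option text does not spell out exactly how the --offset
        staggering is applied; one plausible scheme, with concurrent SPEs,
        is to shift each SPE's buffer start by successive cache lines. The
        helper below is hypothetical, for illustration only:

        ```c
        #include <stdint.h>
        #include <stdio.h>

        #define CACHE_LINE 128u   /* one cache line, per the --offset text */

        /* Hypothetical helper: stagger each SPE's system-memory buffer by
         * one cache line so concurrent DMAs start in different memory
         * banks.  Not a dmabench internal. */
        static uintptr_t spe_buffer_start(uintptr_t base, unsigned spe)
        {
            return base + spe * CACHE_LINE;
        }

        int main(void)
        {
            for (unsigned spe = 0; spe < 4; spe++)
                printf("SPE %u buffer starts at base + %u\n",
                       spe, (unsigned)spe_buffer_start(0, spe));
            return 0;
        }
        ```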

Output:

        Normally the benchmark performs the DMA operation for a range of 
        transfer sizes and reports the timing results for all sizes at the 
        end of the run.

        The program reports the following results for each of the DMA sizes:

        ticks     - average number of decrementer (timebase) timer ticks 
                    for the requested number of DMAs.
        pclocks   - average number of processor clocks (assuming a 3.2 GHz 
                    processor clock) for the requested number of DMAs.
        microsecs - average number of microseconds for the requested number
                    of DMAs.
        aggr GB/s - aggregated DMA throughput across all SPEs, expressed in
                    gigabytes per second.

    
        The following illustrates the output from the dmabench program.

        Time base frequency measured as = 14.318 MHz
        All SPEs completed successfully!
        dmabench results: seqdmar numspes=1 numreqs=1 entrysize=128
           dma_size         ticks       pclocks     microsecs     aggr GB/s
        ------------  ------------  ------------   -----------  ------------
                  8           2.8           621          0.19        0.0412
                 16           2.7           610          0.19        0.0839
                 32           2.7           610          0.19        0.1678
                 64           2.7           612          0.19        0.3344
                128           2.7           610          0.19        0.6713
                256           2.8           623          0.19        1.3138
                512           3.0           659          0.21        2.4850
               1024           3.2           726          0.23        4.5113
               2048           3.8           849          0.27        7.7166
               4096           6.3          1410          0.44        9.2942
               8192          11.3          2525          0.79       10.3799
              16384          21.2          4738          1.48       11.0654

        In this output:

            dma_size indicates the amount of data transferred in each request.

            ticks, pclocks, and microsecs all indicate the measured latency of
                the requested DMA operation, just using different units.  Only
                the value of ticks is actually measured -- the other values are
                derived from the ticks value.

            ticks is the number of decrementer ticks that elapsed between the
                initiation of the DMA request and notification that the request
                had completed.  The time base frequency displayed at the top
                of the output indicates the frequency of decrementer ticks. 

            pclocks is the approximate latency of the DMA operation in 3.2 GHz 
                processor clock cycles.

            microsecs is the approximate latency of the DMA operation in
                microseconds.

            aggr GB/s is the aggregate data transfer rate for the DMA operation
                based on the dma_size and measured latency across all SPEs that
                executed the benchmark.
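            Because only ticks is measured, the derived columns follow from
            it arithmetically. The sketch below reconstructs the 16384-byte
            row of the sample table, using the 14.318 MHz timebase and the
            3.2 GHz clock assumption shown above (these constants come from
            the sample output, not from dmabench source):

            ```c
            #include <assert.h>
            #include <math.h>
            #include <stdio.h>

            #define TIMEBASE_HZ 14.318e6  /* measured decrementer frequency */
            #define CORE_HZ     3.2e9     /* assumed processor clock        */

            /* ticks -> processor clock cycles */
            static double pclocks(double ticks)
            {
                return ticks * (CORE_HZ / TIMEBASE_HZ);
            }

            /* ticks -> microseconds */
            static double microsecs(double ticks)
            {
                return ticks / TIMEBASE_HZ * 1e6;
            }

            /* aggregate throughput in (decimal) GB/s across all SPEs */
            static double aggr_gbps(double ticks, unsigned dma_size,
                                    unsigned numspes)
            {
                double secs = ticks / TIMEBASE_HZ;
                return (double)dma_size * numspes / secs / 1e9;
            }

            int main(void)
            {
                /* 16384-byte row of the sample output: ticks = 21.2 */
                printf("pclocks   = %.0f\n", pclocks(21.2));   /* ~4738    */
                printf("microsecs = %.2f\n", microsecs(21.2)); /* ~1.48    */
                printf("aggr GB/s = %.4f\n",
                       aggr_gbps(21.2, 16384, 1));             /* ~11.0654 */
                return 0;
            }
            ```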

Notes:

        In systems with 2 Cell Broadband Engine processors (such as the IBM
        QS20), the performance of DMA operations is heavily dependent on the
        location of memory banks being accessed relative to the process on
        which the test is performed.  This is because such systems have a
        Non-Uniform Memory Access (NUMA) memory subsystem.  Applications can
        use the NUMA library or numactl system command to control the
        placement of processes and memory to achieve the best performance.

        You can use the numactl command in conjunction with the DMA
        Microbenchmarks to measure the performance of DMA operations in
        various NUMA configurations.  For example, to measure the performance
        of sequential DMA reads issued from CPU 0 to memory local to CPU 0,
        issue:

            numactl --cpunodebind=0 --membind=0 dmabench seqdmar

        The affinity option should be used for benchmarks that perform LS to
        LS transfers. Without affinity, the performance may vary significantly
        from run to run depending on the physical SPEs assigned by the SPE
        Runtime Management library.
