Memory Ordering¶
The following section discusses NVSHMEMAPIs that provide mechanisms to ensure ordering and/or delivery of completion on memory store, blocking, and nonblocking NVSHMEM routines. Table [mem-order] lists the operations affected by NVSHMEM memory ordering routines.
| Operations | Fence | Quiet |
| Memory Store | X | X |
| Blocking Put | X | X |
| Blocking Get | ||
| Blocking AMO | X | X |
| Non-blocking Put | X | X |
| Non-blocking Get | X | |
| Non-blocking AMO | X [1] | X |
NVSHMEM_FENCE¶
-
void
nvshmem_fence(void)¶
-
__device__ void
nvshmem_fence(void)
Description
This routine ensures ordering of delivery of operations on symmetric
data objects. Table [mem-order] lists the operations
that are ordered by the nvshmem_fence routine. All operations on
symmetric data objects issued to a particular PE prior to the call to
nvshmem_fence are guaranteed to be delivered before any subsequent
operations on symmetric data objects to the same PE. nvshmem_fence
guarantees order of delivery, not completion. It does not guarantee
order of delivery of nonblocking Get or values fetched by nonblocking
AMO routines.
Fence operations issued on the CPU and the GPU only order communication operations that were issued from the CPU and the GPU, respectively.
Returns
None.
Notes
nvshmem_fence only provides per-PE ordering guarantees and does not
guarantee completion of delivery. nvshmem_fence also does not have
an effect on the ordering between memory accesses issued by the target
PE. nvshmem_wait_until, nvshmem_test, nvshmem_barrier,
nvshmem_barrier_all routines can be called by the target PE to
guarantee ordering of its memory accesses. There is a subtle difference
between nvshmem_fence and nvshmem_quiet, in that,
nvshmem_quiet guarantees completion of all operations on symmetric
data objects which makes the updates visible to all other PEs.
The nvshmem_quiet routine should be called if completion of
operations on symmetric data objects is desired when multiple PEs are
involved.
In an NVSHMEM program with multithreaded PEs, it is the user’s
responsibility to ensure ordering between operations issued by the
threads in a PE that target symmetric memory and calls by threads in
that PE to nvshmem_fence. The nvshmem_fence routine can enforce
memory store ordering only for the calling thread. Thus, to ensure
ordering for memory stores performed by a thread that is not the thread
calling nvshmem_fence, the update must be made visible to the
calling thread according to the rules of the memory model associated
with the threading environment.
nvshmem_fence in a C program:
./example_code/shmem_fence_example.c Put1 will be ordered to be
delivered before put3 and put2 will be ordered to be
delivered before put4.See Ring Broadcast Example for example usage of nvshmem_fence.
NVSHMEM_QUIET¶
-
void
nvshmem_quiet(void)¶
-
__device__ void
nvshmem_quiet(void)
-
void
nvshmemx_quiet_on_stream(cudaStream_t stream)¶
Description
The nvshmem_quiet routine ensures completion of all operations on
symmetric data objects issued by the calling PE.
Table [mem-order] lists the operations for which the
nvshmem_quiet routine ensures completion. On systems with only
NVLink, all operations on symmetric data objects are guaranteed to be
complete and visible to all PEs when nvshmem_quiet returns. On
systems with both NVLink and InfiniBand, visibility is only guaranteed
at the destination PE.
Quiet operations issued on the CPU and the GPU only complete
communication operations that were issued from the CPU and the GPU,
respectively. To ensure completion of GPU-side operations from the CPU,
the developer must perform a GPU-side quiet operation and ensure
completion of the CUDA kernel from which the GPU-side operations were
issued, using operations like cudaStreamSynchronize or
cudaDeviceSynchronize. Alternatively, a stream-based quiet operation
can be used. Stream-based quiet operations have the effect of a quiet
being executed on the GPU in stream order, ensuring completion and
ordering of only GPU-side operations.
Returns
None.
Notes
nvshmem_quiet is most useful as a way of ensuring completion of
several operations on symmetric data objects initiated by the calling
PE. For example, one might use nvshmem_quiet to await delivery of a
block of data before issuing another Put or nonblocking Put routine,
which sets a completion flag on another PE. nvshmem_quiet is not
usually needed if nvshmem_barrier_all or nvshmem_barrier are
called. The barrier routines wait for the completion of outstanding
operations to symmetric data objects on all PEs.
In an NVSHMEM program with multithreaded PEs, it is the user’s
responsibility to ensure ordering between operations issued by the
threads in a PE that target symmetric memory and calls by threads in
that PE to nvshmem_quiet. The nvshmem_quiet routine can enforce
memory store ordering only for the calling thread. Thus, to ensure
ordering for memory stores performed by a thread that is not the thread
calling nvshmem_quiet, the update must be made visible to the
calling thread according to the rules of the memory model associated
with the threading environment.
A call to nvshmem_quiet by a thread completes the operations posted
prior to calling nvshmem_quiet. If the user intends to also complete
operations issued by a thread that is not the thread calling
nvshmem_quiet, the user must ensure that the operations are
performed prior to the call to nvshmem_quiet. This may require the
use of a synchronization operation provided by the threading package.
For example, when using POSIX Threads, the user may call the
pthread_barrier_wait routine to ensure that all threads have issued
operations before a thread calls nvshmem_quiet.
nvshmem_quiet does not have an effect on the ordering between memory
accesses issued by the target PE. nvshmem_wait_until,
nvshmem_test, nvshmem_barrier, nvshmem_barrier_all routines
can be called by the target PE to guarantee ordering of its memory
accesses.
nvshmem_quiet in a C program:
./example_code/shmem_quiet_example.c Put1 and put2 will be
completed and visible before put3 and put4.| [1] | NVSHMEM fence routines does not guarantee order of delivery of values fetched by nonblocking AMO routines. |