Many software engineers become accustomed to working with a single hardware platform, and in this mode it is easy to lose sight of the more general issues.
For example, many software engineers have only worked on systems that have a single CPU, where atomicity is only an issue between interrupt service routines and threads. Thus, if a single instruction is not interruptible, and a single instruction performs many memory operations, then atomicity is guaranteed by the use of a single instruction. Consider the read-modify-write instructions of a CISC CPU such as increment and decrement instructions. If the software engineer subsequently proceeds to a RISC architecture machine, where no such instructions exist (loads,stores, and arithmetic operations are all separate instructions), then the old assumptions no longer apply.
In spite of the preceeding, the remainder of this document will narrow its focus and examine atomicity as it pertains to memory access only, and leave interrupt synchronization for another document.
The unarbitrated-asynchronous nature of the memory is such that either CPU may read or write a location (octet) at any time. The memory makes no attempt to serialize access to individual memory locations. Thus one CPU may be in the middle of writing a location while the other CPU reads the same location.
Under these conditions, the CPU performing the read operation is not even ensured that it will read the previous value of the location or the new value (as being written by the other CPU.) Some of the bits may have changed and others may be in transition when the reading CPU read-cycle completes.
If the memory technology is such that the outputs of the registers for each bit may become indeterminate whether or not the new value being written is the same as the old value, then we cannot reliably use a single octet memory location as a flag with which to synchronize access by the two CPUs.
Symetric Multi Processing is an architecture where two or more processors of the same type share a single memory space. To relieve bus contention, each processor typically has its own cache. In SMP systems, the bus protocol is managed by a bus controller that controls access to the bus. The bus protocol generally includes mechanisms for enforcing coherencey between the caches and RAM. This hardware based cache coherency enforcement uses a priciple called snooping.
Non-Uniform Memory Access (NUMA) is an architecture where SMP nodes (groups of SMP arranged memory and processors) share a single memory space. The memory space is, however, distributed/copied across the physical memory (RAM) of the nodes.
Note that Simultaneous Multi Threading (SMT) implementations share execution units and bus interface and thus do not have independent caches. Hyper-Threading is Intel's name for SMT.
This diagram is slightly misleading in that it implies two complete processors sharing the same cache. Indeed, this is how such systems logically appear to software, but in reality the two processor blocks must share execution units.
A simple SMT system, like the one illustrated here, has no issue with cache coherency between the two processing elements, since they share the same cache. However, there is still a desire (not need) for a coherent bus protocol when the SMT processors must share RAM with a DMA controller as shown below.
Some systems exhibit attributes of traditional SMP NUMA arrangements, and SMT as shown below. The Xenon device used in the Xbox 360 has such an architecture albeit with three SMT elements instead of the two illustrated here.