SIMD units, also known as vector units, have been used in CPUs since the late 1990s and are the foundation of GPUs’ massive processing capability, as they allow large amounts of data to be processed in parallel. Despite that computing power, however, they come with a number of performance restrictions. What are SIMD units’ limitations?
SIMD is an abbreviation for Single Instruction, Multiple Data: a type of unit that executes the same instruction on multiple data elements at the same time. They are also known as vector units because their operands are typically packed into a sequence of data values presented as a vector.
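The idea can be sketched in a few lines of Python. This is only a model of the concept, not real hardware code: the 4-lane width and the function names `simd_add` and `scalar_add_all` are assumptions chosen for illustration.

```python
# Hypothetical 4-lane SIMD unit modeled in plain Python (illustration only).
LANES = 4


def simd_add(a, b):
    """Model ONE vector instruction: the same 'add' is applied to every lane."""
    assert len(a) == len(b) == LANES
    return [x + y for x, y in zip(a, b)]


def scalar_add_all(a, b):
    """Scalar equivalent: one 'add' instruction is issued per element."""
    out = []
    instructions = 0
    for x, y in zip(a, b):
        out.append(x + y)
        instructions += 1
    return out, instructions


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

vector_result = simd_add(a, b)  # one modeled instruction for all four lanes
scalar_result, scalar_instructions = scalar_add_all(a, b)  # four instructions
```

Both paths produce the same values; the difference is that the vector path needs a single modeled instruction where the scalar path needs one per element.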
Because image and sound processing involves vast amounts of data but highly repetitive instructions, SIMD units, along with the accompanying registers and instruction set extensions, began to be implemented in CPUs in the late 1990s. This type of unit has evolved steadily in CPUs since then.
SIMD units are also used in graphics processing units (GPUs). AMD’s “Stream Processors” and NVIDIA’s misleadingly named “CUDA Cores”, for example, are simply ALUs embedded in SIMD units. As a result, even though they operate on distinct data, they all receive the same instruction in unison and execute it.
Limitations or bottlenecks of SIMD units
Because SIMD units can operate on several data elements at the same time, they multiply the computational capacity of a processor. However, they have three associated limitations that keep their performance well below the theoretical ideal, and we describe them below.
Instruction size limitations in SIMD units
The first is that, at the hardware design level, the size of the data with which a SIMD unit works is fixed, so the instruction set must be expanded to let it work with data of varying precision. This complicates the development of new CPUs and GPUs, and it has become a nightmare with the many data formats that have emerged in recent years, especially the low-precision formats introduced for artificial intelligence.
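To make this concrete, the following sketch counts how many elements fit in a fixed-width register for several element sizes. The 128-bit width and the name `lanes_for` are assumptions for illustration; each layout typically needs its own instruction variant, much as x86 SSE2 provides separate PADDB, PADDW, PADDD and PADDQ additions for 8-, 16-, 32- and 64-bit elements.

```python
REGISTER_BITS = 128  # assumed register width, for illustration only


def lanes_for(element_bits, register_bits=REGISTER_BITS):
    """Number of elements of a given size that fit in one vector register."""
    return register_bits // element_bits


# Each element size gives a different lane layout of the same register,
# and each layout generally requires its own instruction encoding.
lane_counts = {bits: lanes_for(bits) for bits in (64, 32, 16, 8)}
```

Every new data format added to the hardware adds another row to this table, and with it another family of instructions.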
The problem is exacerbated in ISAs that use fixed-size instructions, such as ARM, where every instruction has the same number of bits. If SIMD instructions need more bits for their operands in the future, fewer bits will be left for the opcode, and hence fewer distinct instructions can be encoded. Certain RISC designs are being forced to use accelerators or co-processors in order to handle large SIMD instructions.
This is not a disadvantage for x86 processors, where the instruction size is not fixed, but the trade-off is a more complex control unit.
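The bit budget of a fixed-size instruction can be illustrated with simple arithmetic. The layout below is hypothetical and does not reproduce the real encoding of ARM or any other ISA: it assumes 32-bit instructions, 5-bit register fields, and two invented extra fields (a 2-bit element size and a 3-bit mask register).

```python
INSTRUCTION_BITS = 32    # fixed instruction width (assumed)
REGISTER_FIELD_BITS = 5  # 32 addressable registers per operand field (assumed)


def opcode_bits(num_register_fields, extra_field_bits=0):
    """Bits left for the opcode after operand and extra fields are reserved."""
    used = num_register_fields * REGISTER_FIELD_BITS + extra_field_bits
    return INSTRUCTION_BITS - used


# Three register operands and nothing else: 32 - 15 = 17 opcode bits.
baseline = opcode_bits(3)

# Add a hypothetical 2-bit element-size field and a 3-bit mask-register field.
extended = opcode_bits(3, extra_field_bits=2 + 3)

baseline_encodings = 2 ** baseline
extended_encodings = 2 ** extended
```

In this toy layout, five extra operand bits cut the number of distinct opcodes by a factor of 32, which is the pressure that fixed-size encodings face as SIMD instructions grow.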
Jumps and loops affect performance
Another bottleneck of SIMD units arises with loop and jump instructions, where the value of a single operand among the many that a SIMD unit handles can cause execution to take a different path. Moreover, the unit as a whole is only as fast as its slowest lane, and one ALU may enter a considerably longer loop or jump than the others. This is why GPUs, which rely heavily on SIMD units, have dedicated branch units running in parallel, and why their performance suffers when one of these instructions is present.
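One common way SIMD hardware copes with a data-dependent branch is to run each side of the branch over the whole vector with inactive lanes masked off. The sketch below models that behavior in plain Python; the function `run_branch` and the branch it evaluates are invented for illustration and do not describe any specific GPU.

```python
def run_branch(values):
    """Evaluate `x * 2 if x is even else x + 100` per lane, masked-SIMD style.

    Returns the per-lane results and the number of passes over the vector.
    """
    mask = [x % 2 == 0 for x in values]  # True where the lane takes the 'if' side
    passes = 0
    result = list(values)

    if any(mask):  # at least one lane needs the 'if' side
        passes += 1
        result = [x * 2 if m else r for x, m, r in zip(values, mask, result)]

    if not all(mask):  # at least one lane needs the 'else' side
        passes += 1
        result = [r if m else x + 100 for x, m, r in zip(values, mask, result)]

    return result, passes


uniform_result, uniform_passes = run_branch([2, 4, 6, 8])      # all lanes agree
divergent_result, divergent_passes = run_branch([2, 4, 6, 7])  # one lane differs
```

A single odd value is enough to force a second pass in which three of the four modeled lanes do no useful work.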
If one of the SIMD unit’s operations has not completed because of a loop or a jump, the unit cannot move on to the next instruction. This is why a technique known as loop unrolling is used, which replaces looping or jumping code with a straight-line series of instructions that achieves the same result. The compiler can apply this when translating source code to machine code or, depending on the architecture, it must be done by hand; the trade-off is that the program becomes larger.
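Loop unrolling can be sketched by hand as follows. Python is used here only to show the transformation; in practice a compiler performs it on machine code. The function names and the unroll factor of 4 are assumptions, and the `checks` counter stands in for the loop-condition branches the hardware would have to evaluate.

```python
def sum_rolled(data):
    """One element per iteration: one loop-condition check per element."""
    total, checks, i = 0, 0, 0
    while True:
        checks += 1  # models the branch that tests the loop condition
        if i >= len(data):
            break
        total += data[i]
        i += 1
    return total, checks


def sum_unrolled_by_4(data):
    """Four elements per iteration: roughly a quarter of the checks, more code."""
    total, checks, i = 0, 0, 0
    n = len(data)
    main_end = n - n % 4  # largest multiple of 4 that fits in the data
    while True:
        checks += 1
        if i >= main_end:
            break
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3]
        i += 4
    while True:  # remainder loop for the last n % 4 elements
        checks += 1
        if i >= n:
            break
        total += data[i]
        i += 1
    return total, checks


rolled_total, rolled_checks = sum_rolled(list(range(16)))
unrolled_total, unrolled_checks = sum_unrolled_by_4(list(range(16)))
```

The unrolled version is visibly longer, which is the code-size cost mentioned above, but it evaluates far fewer branches for the same result.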
Not all ALUs are used in SIMD instructions
A SIMD instruction does not always use all of the unit’s ALUs, leaving part of the unit idle because it has no operand to work on. Ideally, the unused ALUs would be assigned to the following instruction, but this is not possible, which is why it is critical to keep as many ALUs occupied as possible.
This is commonly seen in GPUs, where the waves of data and instructions that reach the shader units do not always fill all of the slots, which can cause a considerable drop in performance. It should be noted that SIMD units are typically built with a number of ALUs that is a power of two, so if the quantity of data to be handled is not a multiple of the unit’s width, part of its computing capacity is wasted.
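The cost of idle ALUs can be estimated with a short calculation. The sketch below assumes a hypothetical 32-wide unit and an invented helper, `lane_utilization`; real GPUs schedule work in more elaborate ways, so the figures are only indicative.

```python
import math


def lane_utilization(num_elements, simd_width):
    """Return (instructions issued, idle lane slots, fraction of lanes used)."""
    if num_elements <= 0 or simd_width <= 0:
        raise ValueError("num_elements and simd_width must be positive")
    instructions = math.ceil(num_elements / simd_width)
    slots = instructions * simd_width  # lane slots paid for
    idle = slots - num_elements        # lane slots with no operand
    return instructions, idle, num_elements / slots


full = lane_utilization(64, 32)     # data fills the unit exactly
partial = lane_utilization(33, 32)  # one extra element spills into a new issue
```

With 64 elements the modeled unit is fully used, while 33 elements require the same two issues but leave 31 lane slots idle, so barely more than half of the capacity does useful work.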