FMA instruction set

Wikibooks has a book on the topic of: X86 Assembly/AVX, AVX2, FMA3, FMA4

The FMA instruction set is an extension to the 128 and 256-bit Streaming SIMD Extensions instructions in the x86 microprocessor instruction set to perform fused multiply–add (FMA) operations.[1] There are two variants:

New instructions

FMA3 and FMA4 instructions have almost identical functionality but are not compatible. Both contain fused multiply–add (FMA) instructions for floating point scalar and SIMD operations, but FMA3 instructions have three operands while FMA4 ones have four. The FMA operation has the form d = round(a × b + c) where the round function performs a rounding to allow the result to fit within the destination register if there are too many significant bits to fit within the destination.

The 4-operand form (FMA4) allows a, b, c and d to be four different registers, while the 3-operand form (FMA3) requires that d be the same register as a, b or c. The 3-operand form makes the code shorter and the hardware implementation slightly simpler while the 4-operand form provides more programming flexibility.

See XOP instruction set for more discussion of compatibility issues between Intel and AMD.

FMA3 instruction set

CPUs with FMA3

Excerpt from FMA3

Mnemonic (AT&T) Operands Operation
VFMADD132PDy ymm, ymm, ymm/m256 $0 = $0×$2 + $1
VFMADD132PSy
VFMADD132PDx xmm, xmm, xmm/m128
VFMADD132PSx
VFMADD132SD xmm, xmm, xmm/m64
VFMADD132SS xmm, xmm, xmm/m32
VFMADD213PDy ymm, ymm, ymm/m256 $0 = $1×$0 + $2
VFMADD213PSy
VFMADD213PDx xmm, xmm, xmm/m128
VFMADD213PSx
VFMADD213SD xmm, xmm, xmm/m64
VFMADD213SS xmm, xmm, xmm/m32
VFMADD231PDy ymm, ymm, ymm/m256 $0 = $1×$2 + $0
VFMADD231PSy
VFMADD231PDx xmm, xmm, xmm/m128
VFMADD231PSx
VFMADD231SD xmm, xmm, xmm/m64
VFMADD231SS xmm, xmm, xmm/m32

FMA4 instruction set

CPUs with FMA4

Excerpt from FMA4

Mnemonic (AT&T) Operands Operation
VFMADDPDx xmm, xmm, xmm/m128, xmm/m128 $0 = $1×$2 + $3
VFMADDPDy ymm, ymm, ymm/m256, ymm/m256
VFMADDPSx xmm, xmm, xmm/m128, xmm/m128
VFMADDPSy ymm, ymm, ymm/m256, ymm/m256
VFMADDSD xmm, xmm, xmm/m64, xmm/m64
VFMADDSS xmm, xmm, xmm/m32, xmm/m32

History

The incompatibility between Intel's FMA3 and AMD's FMA4 is due to both companies changing plans without coordinating coding details with each other. AMD changed their plans from FMA3 to FMA4 while Intel changed their plans from FMA4 to FMA3 almost at the same time. The history can be summarized as follows:

AMD explicitly revealed that Zen, its 3rd-generation x86-64 architecture in its first iteration (znver1 – Zen, version 1); would drop support for FMA4 in a patch to the GNU Binutils package.[13] There has been initial confusion regarding whether FMA4 was implemented or not due to errata in the initial patch that has since then been rectified.[14]

Compiler and assembler support

Different compilers provide different levels of support for FMA4:

References

  1. "FMA3 and FMA4 are not instruction sets, they are individual instructions -- fused multiply add. They could be quite useful depending on how Intel and AMD implement them" Woltmann, George (Prime95). "Intel AVX and GIMPS". http://www.mersenneforum.org/index.php. Great Internet Mersenne Prime Search (GIMPS) project. Retrieved 27 July 2011. External link in |work= (help)
  2. "Striking a balance". Dave Christie, AMD Developer blogs. May 7, 2009. Retrieved 2009-05-08.
  3. Maffeo, Robin. "AMD and the Visual Studio 11 Beta". AMD. Retrieved 19 April 2012.
  4. "AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions" (PDF). AMD. May 1, 2009.
  5. "New "Bulldozer" and "Piledriver" Instructions A step forward for high performance software development" (PDF). AMD. October 2012.
  6. "128-Bit SSE5 Instruction Set". AMD Developer Central. Archived from the original on 2008-01-15. Retrieved 2008-01-28.
  7. "Intel Advanced Vector Extensions Programming Reference" (PDF). Intel. Retrieved 2008-04-05.
  8. "Intel Advanced Vector Extensions Programming Reference". Intel. Retrieved 2009-05-06.
  9. "Striking a balance". Dave Christie, AMD Developer blogs. May 7, 2009. Retrieved 2009-05-08.
  10. 1 2 "New Bulldozer and Piledriver Instructions" (PDF). AMD. Retrieved 25 July 2013.
  11. "Software Optimization Guide for AMD Family 15h Processors" (PDF). AMD. Retrieved 19 April 2012.
  12. "Intel Architecture Instruction Set Extensions Programming Reference" (PDF). Intel. Retrieved 25 July 2013.
  13. https://sourceware.org/ml/binutils/2015-03/msg00078.html
  14. https://sourceware.org/ml/binutils/2015-08/msg00039.html
  15. 1 2 Latif, Lawrence (Nov 14, 2011). "AMD Bulldozer only FMA4 and XOP instructions are supported by GCC Intel still mute". The Inquirer.
  16. "FMA4 Intrinsics Added for Visual Studio 2010 SP1".
  17. "EKOPath man doc".
  18. "LLVM 3.1 Release Notes".
This article is issued from Wikipedia - version of the Sunday, October 11, 2015. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.