[AArch64][clang][llvm] Add ACLE Armv9.7 matrix multiply-accumulate intrinsics (#193017)
Implement new ACLE matrix multiply-accumulate intrinsics for Armv9.7:
```c
// 16-bit floating-point matrix multiply-accumulate.
// Available only if __ARM_FEATURE_SVE_B16MM is defined.
// An _f16 variant is also available if (__ARM_FEATURE_SVE2p2 && __ARM_FEATURE_F16MM).
svbfloat16_t svmmla[_bf16](svbfloat16_t zda, svbfloat16_t zn, svbfloat16_t zm);
// Half-precision matrix multiply-accumulate, widening to single precision.
// Requires the +f16f32mm architecture extension.
float32x4_t vmmlaq_f32_f16(float32x4_t r, float16x8_t a, float16x8_t b);
// Non-widening half-precision matrix multiply-accumulate.
// Requires the +f16mm architecture extension.
float16x8_t vmmlaq_f16_f16(float16x8_t r, float16x8_t a, float16x8_t b);
```
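
As a rough illustration of how calling code might guard the SVE form, here is a minimal sketch. It uses only the feature macros named above. The helper names (`mmla_bf16`, `mmla_support_mask`, `print_mmla_support`) and the `MMLA_*` flags are hypothetical and not part of this patch. On targets without the features, the wrapper is compiled out and the mask is 0.

```c
#include <stdio.h>

#if defined(__ARM_FEATURE_SVE_B16MM)
#include <arm_sve.h>
/* Hypothetical wrapper (not part of the patch): forwards to the BFloat16
 * intrinsic, which is only declared when __ARM_FEATURE_SVE_B16MM is set. */
static inline svbfloat16_t mmla_bf16(svbfloat16_t zda, svbfloat16_t zn,
                                     svbfloat16_t zm) {
    return svmmla_bf16(zda, zn, zm);
}
#endif

/* Hypothetical flags describing which SVE svmmla forms were enabled. */
enum { MMLA_SVE_BF16 = 1, MMLA_SVE_F16 = 2 };

/* Build a bitmask from the compile-time feature macros. */
static inline unsigned mmla_support_mask(void) {
    unsigned mask = 0;
#if defined(__ARM_FEATURE_SVE_B16MM)
    mask |= MMLA_SVE_BF16;
#endif
#if defined(__ARM_FEATURE_SVE2p2) && defined(__ARM_FEATURE_F16MM)
    mask |= MMLA_SVE_F16;
#endif
    return mask;
}

/* Print a one-line summary of the forms enabled for this translation unit. */
static inline void print_mmla_support(FILE *out) {
    unsigned mask = mmla_support_mask();
    fprintf(out, "svmmla[_bf16]: %s, svmmla[_f16]: %s\n",
            (mask & MMLA_SVE_BF16) ? "yes" : "no",
            (mask & MMLA_SVE_F16) ? "yes" : "no");
}
```

Calling `print_mmla_support(stdout)` from a test driver shows which forms the current `-march` flags enabled. The NEON intrinsics are left out of the sketch because the text above names only their `+f16f32mm` and `+f16mm` extensions, not the macros that guard them.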