AMDGPU previously had no target-specific LSR cost model, so the generic
heuristic would often introduce extra induction variables and base-add
chains that hurt VALU throughput on GFX9+ (observed on gfx942).
Implement a custom cost model:
- isLSRCostLess(): prioritize per-iteration instruction count over setup
costs, penalize IV multiplies, and demote register count. Pre-GFX9 falls
back to the default comparator.
- getScalingFactorCost(): report that base+scale*index addressing
requires an extra ADD instruction.
- isNumRegsMajorCostOfLSR(): return false.
- shouldDropLSRSolutionIfLessProfitable(): return true.
Assisted-by: Claude Opus