Add a fast path for the common case where the total work-group size is a multiple of the maximum sub-group size. The fallback path is ported from amdgpu/workitem/clc_get_sub_group_size.cl. The compiler can generate predicated instructions for the fallback path to avoid branches.
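
For reviewers, here is a rough host-side C sketch of the logic described above. It is not the code in this patch: the function name `sub_group_size_sketch` and its parameters are hypothetical, and it assumes the maximum sub-group size is a power of two and that sub-groups are formed from consecutive linear local IDs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not the patch itself): returns the size of the
 * sub-group that contains the work-item with the given linear local ID.
 * Assumption: max_sg_size is a nonzero power of two. */
static uint32_t sub_group_size_sketch(uint32_t total_wg_size,
                                      uint32_t max_sg_size,
                                      uint32_t linear_local_id)
{
    assert(max_sg_size != 0 && (max_sg_size & (max_sg_size - 1u)) == 0);

    /* Fast path: every sub-group is full when the work-group size is a
     * multiple of the maximum sub-group size. */
    if (total_wg_size % max_sg_size == 0)
        return max_sg_size;

    /* Fallback: only the last sub-group is partial. Round the local ID
     * down to the first work-item of its sub-group, then clamp the
     * number of remaining work-items to the maximum sub-group size. */
    uint32_t sg_start = linear_local_id & ~(max_sg_size - 1u);
    uint32_t remaining = total_wg_size - sg_start;
    return remaining < max_sg_size ? remaining : max_sg_size;
}
```

The fallback is a mask, a subtract, and a min, which is why it can plausibly be emitted as straight-line or predicated code rather than a branch.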