This commit fixes kernel record replay on both AMD and CUDA devices. It also reorganizes the record replay code, moving it into separate files and making it extensible so that other record formats can be supported in the future. The environment variables that control recording have also been modified.