Debugging GPU hangs with RADV¶
UMR (optional)¶
UMR is needed for dumping a lot of useful information. Clone, build and install
UMR. Do not forget to run
chmod +s $(which umr) so RADV can actually access UMR.
UMR needs to access some kernel debug interfaces:
chmod 777 /sys/kernel/debug
chmod -R 777 /sys/kernel/debug/dri
Secure boot has to be disabled as well.
Generating and analyzing hang reports¶
With UMR installed, you can now set RADV_DEBUG=hang which makes RADV insert
trace markers and synchronization and check for hangs. The hang report will be
saved to ~/radv_dumps_<pid>_<time>. Inside the directory of the hang report,
there are a couple of files:
*.spv: SPIR-V binaries of the pipeline that was bound when the hang occurred.app_info.log:VkApplicationInfofields.bo_history.log: A list of every GPU memory allocation and deallocation. If the GPU hang was caused by a page fault, you can use radv_check_va.py to figure out if address is invalid or used after the memory was deallocated.bo_ranges.log: Address ranges that were valid at the time of submission.dmesg.log: Output ofdmesg, if available.gpu_info.log: Fields ofradeon_info.pipeline.log: IR of the shaders that were bound during the hang as well as programm counters of waves executing said shaders and bound descriptors.registers.log: Various GPU state registers.trace.log: An annotated list of the command stream that caused the hang. the commands that hung come after!!!!! This is the last trace point that was reached by the CP !!!!!.umr_ring.log: Similar totrace.log.umr_waves.log: A list of waves that were active at the time of the hang, including register values.vm_fault.log: The page fault address if a page fault occurred.
Debugging Steam games¶
Steam games require a bit more work so RADV can access UMR: In your Steam library,
make sure Tools is checked and search for Steam Linux Runtime.
Under Properties -> Installed Files, click Browse, open
_v2-entry-point and add
shift 2
exec "$@"
at the top of the file. Hang debugging can be enabled by selecting the faulting
game and adding RADV_DEBUG=hang %command% under Properties -> General
-> LAUNCH OPTIONS.
Debugging hangs without RADV_DEBUG=hang¶
In some situations, RADV_DEBUG=hang wouldn’t be able to generate a GPU hang
report, like for synchronization issues (because it enables
RADV_DEBUG=syncshaders behind the scene). An alternative solution is to
disable GPU recovery by adding amdgpu.gpu_recovery=0 to your kernel command
line options. And then invoke UMR manually with
umr --by-pci <pci_id> -O bits,halt_waves -go 0 -wa <ring> -go 1 2>&1 for
dumping the waves and umr --by-pci <pci_id> -RS <ring> 2>&1 for dumping the
rings once the GPU hang occurred.