# run two tinygrad matrix examples in a loop
# amdgpu-6.0.5-1581431.20.04
# NOT fixed in kernel 6.2.14

[ 553.016624] gmc_v11_0_process_interrupt: 30 callbacks suppressed
[ 553.016631] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:9 pasid:32770, for process python3 pid 10001 thread python3 pid 10001)
[ 553.016790] amdgpu 0000:0b:00.0: amdgpu: in page starting at address 0x00007f0000000000 from client 10
[ 553.016892] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00901A30
[ 553.016974] amdgpu 0000:0b:00.0: amdgpu: Faulty UTCL2 client ID: SDMA0 (0xd)
[ 553.017051] amdgpu 0000:0b:00.0: amdgpu: MORE_FAULTS: 0x0
[ 553.017111] amdgpu 0000:0b:00.0: amdgpu: WALKER_ERROR: 0x0
[ 553.017173] amdgpu 0000:0b:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 553.017238] amdgpu 0000:0b:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 553.017300] amdgpu 0000:0b:00.0: amdgpu: RW: 0x0
[ 553.123921] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=2
[ 553.124153] amdgpu: failed to add hardware queue to MES, doorbell=0x1a16
[ 553.124195] amdgpu: MES might be in unrecoverable state, issue a GPU reset
[ 553.124237] amdgpu: Failed to restore queue 2
[ 553.124266] amdgpu: Failed to restore process queues
[ 553.124270] amdgpu: Failed to evict queue 3
[ 553.124297] amdgpu: amdgpu_amdkfd_restore_userptr_worker: Failed to resume KFD
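
# The decoded fields in the log above (MORE_FAULTS, WALKER_ERROR, PERMISSION_FAULTS,
# MAPPING_ERROR, RW, client ID) can be recovered from the raw
# GCVM_L2_PROTECTION_FAULT_STATUS value. The sketch below is a minimal decoder.
# The bit layout in FIELDS is an assumption based on the gfx11 register headers,
# not something stated in this log; it is at least consistent with the values the
# driver printed (PERMISSION_FAULTS 0x3, client ID 0xd, vmid 9).

```python
# Assumed (shift, width) of each field in GCVM_L2_PROTECTION_FAULT_STATUS.
# This layout is an assumption; verify it against the register headers
# for the exact GPU generation before relying on it.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),    # UTCL2 client ID, 0xd is reported as SDMA0 in the log
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status: int) -> dict:
    """Split a raw fault status value into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Raw value taken from the log above.
    for name, value in decode_fault_status(0x00901A30).items():
        print(f"{name}: {value:#x}")
```

# Decoding the value from each crash shows whether the two faults differ in
# anything other than the VMID.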

# alternative crash in kernel 6.2.14

[ 151.097948] gmc_v11_0_process_interrupt: 30 callbacks suppressed
[ 151.097953] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:8 pasid:32771, for process python3 pid 7525 thread python3 pid 7525)
[ 151.097993] amdgpu 0000:0b:00.0: amdgpu: in page starting at address 0x00007f0000000000 from client 10
[ 151.098008] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00801A30
[ 151.098020] amdgpu 0000:0b:00.0: amdgpu: Faulty UTCL2 client ID: SDMA0 (0xd)
[ 151.098032] amdgpu 0000:0b:00.0: amdgpu: MORE_FAULTS: 0x0
[ 151.098042] amdgpu 0000:0b:00.0: amdgpu: WALKER_ERROR: 0x0
[ 151.098052] amdgpu 0000:0b:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 151.098062] amdgpu 0000:0b:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 151.098071] amdgpu 0000:0b:00.0: amdgpu: RW: 0x0
[ 151.209517] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=2
[ 151.209724] amdgpu: failed to add hardware queue to MES, doorbell=0x1002
[ 151.209734] amdgpu: MES might be in unrecoverable state, issue a GPU reset
[ 151.209743] amdgpu: Failed to restore queue 1
[ 151.209751] amdgpu: Failed to restore process queues
[ 151.209759] amdgpu: amdgpu_amdkfd_restore_userptr_worker: Failed to resume KFD
[ 151.209858] amdgpu 0000:0b:00.0: amdgpu: GPU reset begin!
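
# To compare crashes like the two above across many runs, the page fault line
# can be parsed into its fields. This is a rough sketch: the regular expression
# is written against the line format seen in this log only, and the names
# PAGE_FAULT_RE and parse_page_fault are illustrative, not part of any tool.

```python
import re

# Matches the "[<hub>] page fault (...)" line as it appears in this log.
# Other kernel versions may format the line differently.
PAGE_FAULT_RE = re.compile(
    r"\[(?P<hub>\w+)\] page fault "
    r"\(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+), "
    r"for process (?P<process>\S+) pid (?P<pid>\d+)"
)


def parse_page_fault(line: str):
    """Return the page fault fields as a dict, or None if the line does not match."""
    match = PAGE_FAULT_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("src_id", "ring", "vmid", "pasid", "pid"):
        info[key] = int(info[key])
    return info


if __name__ == "__main__":
    sample = (
        "[ 151.097953] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault "
        "(src_id:0 ring:24 vmid:8 pasid:32771, for process python3 pid 7525 "
        "thread python3 pid 7525)"
    )
    print(parse_page_fault(sample))
```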