* bigmodel
* more debug print
* debugging bigmodel
* remove the tanh, debugging
* print images/buffers
* disassemble the command queues
* decompiler
* dump the shaders
* full disasm
* support patching kernel and fixing convolution_horizontal_reduced_reads_1x1
* microbenchmark
* 42 GFLOPS, 1 GB/s
* gemm benchmark
* 75 GFLOPS vs 42 GFLOPS
* 115 GFLOPS
* oops, never mind
* gemm image is slow
* this is pretty hopeless
* gemm image gets 62 GFLOPS
* this is addictive and still a waste of time
* cleanup cleanup
* that hook was dumb
* tabbing
* more tabbing
Co-authored-by: Comma Device <device@comma.ai>
old-commit-hash: 78a352a8ca