kernel driver: AppleH11ANEInterface requires entitlement: com.apple.ane.iokit-user-access compiler is run in ANE_ProgramCreate_gated 2 helper processes: /usr/libexec/aned ANECompilerService Espresso: Contains ANECompilerEngine AppleNeuralEngine: Objective-C interface called by Espresso ANEServices: communication with the device ANECompiler: compile plist into hwx file com.apple.ANECompilerService.allow in AppleNeuralEngine? Called from ANECompilerService.xpc in AppleNeuralEngine.framework == Model Flow == Keras/ONNX model | | 1_build | (coremltools, open source) v CoreML model | | TODO: automate this | Grabbed plist from lldbing ANECompilerService during 1_build | (Espresso) v net.plist | | 2_compile | (AppleNeuralEngine, ANECompiler) v model.hwx | | 3_run | (AppleNeuralEngine, ANEServices, AppleH11ANEInterface) v TODO: Write a nice plist grabber DONE: Write a call to the compiler with plist+weights DONE: Write an hwx runner == Tracing the Compiler == ANECCompileProcedure ZinAneCreateIr ZinParseNeuronUnit ZinAneCoreCompile ZinAneCodeGeneration ZinIrCodegenHandleKernels ZinIrTargetH13::CodegenTds ZinIrCacheHintTable ZinIrCodegenHandleTds_v7 ZinIrCodegenHandleTdsMakeList<7u> ZinAneInstruction ZinAneTd<7u>::HandleEngineLayer ZinAneInstruction::HandleTdHeader HandleNELayer<7u> ZinAneInstruction::HandleCommonConfig ZinAneInstruction::HandleCommonConfigCommonOpcodes ZinIrCodegenHandleTds<7u> 0x1bb93ae00 <-- this is the store of the first byte in the hwx CalculateSizeInBytesFromRegCount (x*4+4) 0xf @ 0x128-0x168 (base 0x1003047b0) 0x1b @ 0x16c-0x1dc 0x11 @ 0x1e0-0x228 0x3 @ 0x22c-0x23c 0x4 @ 0x240-0x254 0x6 @ 0x258-0x274(end) AddReloc (this is gold! x4 goes in the hwx) ZinAneTd<7u>::HandleEngineLayer rbreak ^ZinAneInstruction* weeee ZinIrRegBitPrintOutDebug_7u_ print (void)debugregs(0, 0x0000000100211030+8, 3) == min.plist == Types: GOC, Conv, Broadcast, ScaledElementWise, Reshape, InputView, Neuron, Concat ops have length 0x300, seems like one basic op repeated header 0x0-0x1c u32 0x1c = next op offset u16 0x20 = output address? == section break 0x2c (weights) == reloc 0x2c-0x74 = K2DBE6976FEB616E6867A2E3853FC37D0F101C4C51BA4A80C103359643338C0C1_ne_0 K2DBE6976FEB616E6867A2E3853FC37D0F101C4C51BA4A80C103359643338C0C1_ne_1 16 output channel parallel: u32[16] 0x34-0x74 = 0x80 | 1 if used u32[16] 0x74-0xB4 = u32[16] 0xB4-0xF4 = == section break 0x128 (conv) == u16 0x128 = InputWidth u16 0x12A = InputHeight u16 0x12C = InputDepth u32 0x130 = (OutputType * 0x10) | InputType u32 0x134 = InputChannels u32 0x138 = OutputChannels u16 0x13C = OutputWidth u16 0x13E = OutputHeight u16 0x140 = OutputDepth u16 0x144 = 0xa000 | (KernelHeight * 0x20) | KernelWidth u16 0x146 = 0x5000 | (PadTop * 0x40) | (PadLeft * 2) u16 0x14C = BatchSize u32 0x150 = OutputHeight? == section break 0x16c (input) == reloc 0x16c-0x174 = image u32 0x178 = InputRowStride u32 0x17C = InputPlaneStride u32 0x180 = InputDepthStride u32 0x184 = InputBatchStride u8 0x1A7 = InputInterleave == section break 0x1e0 == u8 0x1E5 = InputInterleave u32 0x1F4 = InputChannels * 0x10 u32 0x1F8 = InputDepth * InputChannels * 0x10 u8 0x211 = OutputInterleave u32 0x220 = OutputChannels * 0x10 u32 0x224 = OutputDepth * OutputChannels * 0x10 == section break 0x22c (scaling) == u16 0x230 = BiasScalar u16 0x232 = ScaleScalar == section break 0x240 == u8 0x240 = 0x80 | KernelType u8 0x241 = 4 * hasbatch u16 0x246 = 0x10 | 2 * neuron? == section break 0x258 (output) == reloc 0x258-0x25c = probs@output/src u32 0x260 = OutputRowStride u32 0x264 = OutputPlaneStride u32 0x268 = OutputDepthStride u32 0x26C = OutputBatchStride u8 0x273 = OutputInterleave == Zin Constants == kZinIrOpCodeConv = 0? kZinIrOpCodePool = 1 kZinIrOpCodeElementWiseOp = 6 kZinIrOpCodeConcat = 8 kZinIrOpCodeFlattenComposite kZinIrOpCodeNEConvOp kZinIrOpCodeTranspose 0: CONV 1: POOL 2: SCALE_BIAS 3: TERNARY_DYNAMIC_GOC 4: BINARY_DYNAMIC_GOC 5: ACTIVATION 6: EW 7: SCALED_EW 8: CONCAT 9: SPLIT 10: COPY 11: FLATTEN 12: UNFLATTEN 13: CROSS_CORRELATION 14: KERNEL_RASTERIZER 15: ARG_MIN_MAX 16: MATRIX_MULT 17: BROADCAST 18: FLATTEN_COMPOSITE 19: UNFLATTEN_COMPOSITE 20: KERNEL_RASTERIZER_COMPOSITE 21: CROSS_CORRELATION_COMPOSITE 22: LIVE_IN 23: CONST_IN 24: LIVE_OUT 25: REDUCTION 26: ALIAS 27: Typecast 28: RESHAPE 29: VIEW 30: TRANSPOSE 31: SPACE_TO_BATCH 32: BATCH_TO_SPACE 33: SOFTMAX 34: INSTANCE_NORM 35: L2_NORM 36: MINMAX_NORM 37: LRN 38: COST_VOLUME 39: PIXEL_SHUFFLE 40: FPS 41: RS 42: PEFUSED_ELEMENTWISE 43: PEFUSED_POOL 44: PEFUSED_GOC 45: NEFUSED_CONV 46: NEFUSED_POOL 47: NEFUSED_EW 48: NEFUSED_BYPASS # guessing from the hwx kZinTensorFormatUInt8 = 0 kZinTensorFormatInt8 = 1 kZinTensorFormatFloat16 = 2 kZinTensorFormatInvalid Zin (plist format) ---(ZinAneCoreCompile)---> Mir (hwx format)? ZinAneCodeGeneration? ZinIrStatus GetKernelFormat<6u>(ZinKernelFormat param_1,ane_ne_kernel_cfg_kernel_fmt *param_2) List of allowed numbers NeuronTypes (changes the LUT): Tanh Log2 Exp2 Sign = ZinMirActivationV7::GetSignLut ...many more in ANECompiler Investigate: ZinMirActivationV7::PrintLut(ZinMirActivationV7 *this,ane_nonlinear_lut_v7up_t *param_1 0: NONE 1: RELU 2: SIGMOID 3: SIGMOID_HIGH_PRECISION 4: TANH 5: CLAMPED_RELU 6: PRELU 7: DIRAC 8: INT 9: FRAC 10: SQRT 11: RSQRT 12: INV 13: SQR 14: LOG2 15: EXP2 16: ELU 17: SIGN 18: EQUAL_ZERO 19: NON_ZERO 20: LESS_THAN_ZERO 21: LESS_EQUAL_ZERO 22: GREATER_EQUAL_ZERO 23: GREATER_THAN_ZERO 24: CUSTOM_LUT 25: C_DIM_CONCAT 26: C_DIM_STRIDED_CONCAT 27: H_DIM_CONCAT 28: W_DIM_CONCAT 29: D_DIM_CONCAT 30: N_DIM_CONCAT 31: H_DIM_STRIDED_CONCAT CacheHint 0: ALLOC 1: NOALLOC 2: DROP 3: DEPRI conv kinds 0: regular 1: channelwise 2: grouped == plist exploration == Float16 -> UInt8 for output works, Float32 doesn't Same for input All weights must be float Index 0: model.espresso.weights @ 192 is weights Index 1: net.additional.weights @ 0 is bias Float16 -> Float32 for bias works It's possible the compiler is Float32 -> Float16 converting, and the engine only supports Float16 + UInt8 == call to the compiler (in dmesg!) == [54476.282258]: H11ANEIn: ANE_ProgramCreate_gated:, ZinComputeProgramMake, get Mcache size: 0x0 [54476.282259]: H11ANEIn: ANE_ProgramCreate_gated:,Program Identifier:ANEC v1 zin_ane_compiler v4.2.1 -t h13 --fdram-allocator=ffreuse --fdram-tensor-priority=sizethenliverange --fl2-allocator=ffreuse --fl3-allocator=ffreuse --fl2-cache-mode=resident --fsignature=ident --memcache-strategy= [54476.282262]: --memcache-size=4194304 --fspatial-split=disabled --fkernel-rewind=enabled --Wl-undefined=fvmlib -i /Library/Caches/com.apple.aned/tmp/Python/DB7E897E7F4D5D27501A998428B6D3863AFD96CEA82DAF2207A75394E6BAC44C/37C083FF396EB5948979EE20FD0457483E4ACE840AD23391A129BB83CFBC9C63/net.plist -o /Library/Caches/com.apple.aned/20A2411/Python/C9981871BC59572E74AFA3014B183EA37567EE9A2A08328446CE4A2B754E109D/37C083FF396EB5948979EE20FD0457483E4ACE840AD23391A129BB83CFBC9C63/model.hwx.tmp == ANECCompile (in ANECompiler framework) == ANECCompile(__CFDictionary *param_1, __CFDictionary *param_2, unsigned long param_3) param_1: { InputNetworks = ( { NetworkPlistName = "net.plist"; NetworkPlistPath = "/Library/Caches/com.apple.aned/tmp/run/A2ACB9D5AA31B301563A4F62885BA379E62B0E1240E95C6902A93900FE0A9B54/37C083FF396EB5948979EE20FD0457483E4ACE840AD23391A129BB83CFBC9C63/"; } ); OutputFileName = "model.hwx.tmp"; OutputFilePath = "/Library/Caches/com.apple.aned/20A2411/run/E68910CD1994681121EEDAFAE1BC524AA8E84CF80C42AFC0C7DE2C082C58BDFD/37C083FF396EB5948979EE20FD0457483E4ACE840AD23391A129BB83CFBC9C63/"; } param_2: { TargetArchitecture = h13; } == Backtrace of device open == * frame #0: 0x00000001a68fac54 ANEServices`H11ANEDeviceOpen frame #1: 0x00000001a78405b8 AppleNeuralEngine`__29-[_ANEDeviceController start]_block_invoke + 436 frame #2: 0x0000000193c84420 libdispatch.dylib`_dispatch_client_callout + 20 frame #3: 0x0000000193c92a98 libdispatch.dylib`_dispatch_lane_barrier_sync_invoke_and_complete + 60 frame #4: 0x00000001a78403e8 AppleNeuralEngine`-[_ANEDeviceController start] + 136 ... frame #23: 0x00000001a64a4f38 Espresso`Espresso::ANERuntimeEngine::compiler::build_segment(std::__1::shared_ptr const&, int, Espresso::net_compiler_segment_based::segment_t const&) + 2080 ... frame #31: 0x000000019ab6099c CoreML`-[MLNeuralNetworkEngine rebuildPlan:] + 1640 == Backtrace of run? == * frame #0: 0x00000001a68f9108 ANEServices`H11ANEProgramProcessRequestDirect frame #1: 0x00000001a7839694 AppleNeuralEngine`-[_ANEProgramForEvaluation processRequest:qos:qIndex:modelStringID:options:error:] + 1904 frame #2: 0x00000001a7843ba4 AppleNeuralEngine`-[_ANEClient doEvaluateDirectWithModel:options:request:qos:error:] + 1236 frame #3: 0x00000001a7842034 AppleNeuralEngine`-[_ANEClient evaluateWithModel:options:request:qos:error:] + 104 frame #4: 0x00000001a64a2988 Espresso`Espresso::ANERuntimeEngine::compiler::__forward_segment(std::__1::shared_ptr const&, int, Espresso::net_compiler_segment_based::segment_t const&) + 2008 frame #5: 0x00000001a6414548 Espresso`Espresso::net_compiler_segment_based::__forward(std::__1::shared_ptr const&) + 992 frame #6: 0x00000001a67e2e3c Espresso`EspressoLight::espresso_plan::dispatch_task_on_compute_batch(std::__1::shared_ptr const&, std::__1::shared_ptr const&) + 612 frame #7: 0x00000001a67ebab0 Espresso`EspressoLight::espresso_plan::execute_sync() + 356 frame #8: 0x00000001a67f26fc Espresso`espresso_plan_execute_sync + 120 frame #9: 0x000000019ab674b8 CoreML`-[MLNeuralNetworkEngine executePlan:error:] + 136 frame #10: 0x000000019ab6799c CoreML`-[MLNeuralNetworkEngine evaluateInputs:bufferIndex:options:error:] + 368