Inside the M4 Apple Neural Engine, Part 1: Reverse Engineering

368 points - last Sunday at 5:11 PM

Comments

LatencyKills last Monday at 3:38 PM
I worked on the Xcode team for years and know the lengths Apple goes to in making this stuff difficult to figure out.

I just wanted to say that you’ve done an excellent job, and I am looking forward to the 3rd installment.

vdivyanshu yesterday at 3:19 AM
I went digging down the rabbit hole over the last 6 hours on what compute around training can be extracted from M4/M5 Neural Engine chips:

- was able to offload @karpathy's NanoGpt training run (partially) onto the Apple Neural Engine
- moved the Classifier & Softmax layers directly onto the ANE
- Classifier is 10x faster, and Softmax is 34x faster
- fixed memory exhaustion: original repo had an ARC memory leak that capped training at ~119 compile loads per process
- patched the C-bridge, allowing continuous, stable training

Repo - https://github.com/vipuldivyanshu92/ANEgpt

eleventyseven last Monday at 3:31 PM
> Throughout this series, “we” refers to maderix (human) and Claude Opus 4.6 (by Anthropic) working as a pair. The reverse engineering, benchmarking, and training code were developed collaboratively

Sure, "collaboratively." Why would I ever trust a vibe coded analysis? How do I, a non expert in this niche, know that Opus isn't pulling a fast one on both of us? LLMs write convincing bullshit that even fools experts. Have you manually verified each fact in this piece? I doubt it. Thanks for the disclaimer, it saved me from having to read it.

Octoth0rpe last Monday at 3:13 PM
Part 2 has benchmarks: https://maderix.substack.com/p/inside-the-m4-apple-neural-en...

6.6 FLOPS/W, plus the ability to completely turn off when not in use, so 0W at idle.

GeekyBear last Monday at 4:22 PM
The recent news is that Apple is supposedly replacing the Core ML framework with an updated version that will make it easier to integrate third party LLMs into your apps.

> the company is also planning a few other software-based AI upgrades, including a new framework called Core AI. The idea is to replace the long-existing Core ML with something a bit more modern.

https://www.bloomberg.com/news/newsletters/2026-03-01/apple-...

zozbot234 last Monday at 7:46 PM
We already knew the basics of much of this from documentation of the M1/M2 ANE as accessed bare-metal from Asahi Linux, but it's nice to see it confirmed and explored in further depth. Note that, according to OP's Parts 1 and 2, CoreML adds little to no overhead over the lower-level interface for very large matmuls, so there seems to be plenty of scope for supporting the ANE for prefill in local AI frameworks. Decode is generally memory-bandwidth limited unless the context is very large, and the ANE requires special handling (converting from matmul to 1x1 convolution as described here is wasteful of memory bandwidth, as is potentially dequantizing to INT8/FP16 in memory), so it's less of a clear win there.
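To make the matmul-to-1x1-convolution conversion mentioned above concrete, here is a minimal NumPy sketch showing that the two compute the same numbers. The sizes and the (1, channels, 1, tokens) layout are assumptions picked for illustration, not the layout the article or the ANE compiler actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d_in, d_out = 4, 8, 3            # hypothetical sizes
x = rng.standard_normal((tokens, d_in))  # activations, one row per token
w = rng.standard_normal((d_in, d_out))   # weight matrix

# Plain matmul: (tokens, d_in) @ (d_in, d_out) -> (tokens, d_out)
y_matmul = x @ w

# Same computation expressed as a 1x1 convolution.
# Input in NCHW form: batch 1, d_in channels, height 1, width = tokens.
x_nchw = x.T.reshape(1, d_in, 1, tokens)
# Kernel in (out_channels, in_channels, kH, kW) form with a 1x1 window.
kernel = w.T.reshape(d_out, d_in, 1, 1)

# A 1x1 conv is a per-position sum over input channels:
# out[n, o, h, w] = sum_c kernel[o, c, 0, 0] * x[n, c, h, w]
y_conv = np.einsum("oc,nchw->nohw", kernel[:, :, 0, 0], x_nchw)

# Undo the layout change to compare against the matmul result.
y_from_conv = y_conv.reshape(d_out, tokens).T

print(np.allclose(y_matmul, y_from_conv))
```

The arithmetic is identical; the cost the comment points at comes from the extra transposes and copies needed to get data into and out of the convolution layout.
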
blobbers last Monday at 9:56 PM
Can someone help me understand when these neural engines kick in in open source software?

I typically use python ML libraries like lightgbm, sklearn, xgboost etc.

I also use numpy for large correlation matrices, covariance etc.

Are these operations accelerated? Is there a simple way to benchmark?

I see a lot of benchmarks on what look like C functions, but today in my job I rely on higher-level libraries. I don't know if they perform any better on Apple HW, and unless they have a flag like use_ane I'm inclined to think they don't.

Of course ChatGPT suggested I benchmark an Intel Mac vs. newer Apple silicon. Thanks ChatGPT, there's a reason people still hate AI.
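On the "simple way to benchmark" question above, a minimal sketch, assuming NumPy is installed: time the operation you actually care about on each machine and compare. The matrix sizes and the `best_time` helper are made up for illustration. As far as I know, NumPy runs on the CPU through whatever BLAS/LAPACK library it was built against and does not dispatch to the ANE, so this measures CPU-side performance only.

```python
import time
import numpy as np

def best_time(fn, repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

rng = np.random.default_rng(0)
# 500 observations of 200 variables (hypothetical sizes; scale up for real tests)
data = rng.standard_normal((500, 200))

# Correlation matrix across the 200 variables (columns).
corr_seconds = best_time(lambda: np.corrcoef(data, rowvar=False))
# A raw matrix product of similar shape, for comparison.
matmul_seconds = best_time(lambda: data.T @ data)

print(f"corrcoef: {corr_seconds * 1e3:.2f} ms")
print(f"matmul:   {matmul_seconds * 1e3:.2f} ms")
```

Calling `np.show_config()` prints which BLAS/LAPACK build NumPy was linked against, which is usually the first thing to check when the same script is much faster or slower on another machine.
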

behnamoh last Monday at 4:26 PM
It's insane that the source code of ANE is not available even to the MLX team, possibly one of the reasons Awni (MLX project head) left Apple.
notepad0x90 last Monday at 7:47 PM
I've been guilty of this myself, but every other comment here is like "What about <insert something unrelated to the topic but related to apple>".
instahotstar yesterday at 7:51 AM
Really impressive reverse engineering work. I’m curious how much of the Neural Engine’s instruction set is undocumented versus inferred experimentally. Also wondering how Apple balances power efficiency vs peak throughput in the M4 compared to previous generations.
nbardy yesterday at 4:18 AM
Why does Apple want to make this hardware hard to access?

What actual benefits do they get?

I guess they can have their own models run faster than the competition on their hardware? But they don't even really have anything that consumers use on the ANE as far as I can tell, and local LLMs are taking off on Macs and could really benefit from this.

love2read last Monday at 2:41 PM
This article was clearly written by a human (and AI) but still has a few "LLMisms" such as:

- The key insight
- [CoreML] doesn't XXX. It YYY.

With that being said, this is a highly informative article that I enjoyed thoroughly! :)

The article links to their own Github repo: https://github.com/maderix/ANE

cedws yesterday at 10:11 AM
I’m surprised that Claude assisted with this reverse engineering work. I used Codex recently for a similar purpose and got an account warning. It initially refused to do it, and then I was able to trick it. Seems I might have to make the jump back.
mattlangston last Monday at 2:49 PM
The future is bright for software engineers.

The big takeaway isn't reverse engineering the ANE per se, but what Manjeet could do with his software engineering skills when accelerated by AI.

This is a good example of the present state of software engineering. Not future state - present state.

Geee yesterday at 12:33 AM
Is it really worth having separate GPU and NE? Seems redundant and weird compared to what Nvidia is doing, i.e. "GPUs are good NEs", or is that not really true?
giancarlostoro last Monday at 5:06 PM
Reverse engineering with AI is only going to get better. I have seen some crazy things friends of mine have done with Claude alone. Let's just say SaaS isn't the only industry that could one day suffer.
kamranjon last Monday at 3:28 PM
I have always wondered if the neural engine could be used for training - pretty excited for part 3 of this to see if the juice is actually worth the squeeze
msie last Monday at 5:14 PM
I remember the good old days when Apple was desperate for developers and produced great documentation and there were a lot of great 3rd-party books too. You can't just give out awards in hopes that someone will make that great app.
daoistmonk last Monday at 4:13 PM
Tangential: Is anyone doing something similar to accelerate the support matrix of Linux on anything higher than M2?
grey-area last Monday at 6:52 PM
If only they could fix the iOS autocomplete, which is getting worse with every iteration.
ericol last Monday at 6:45 PM
> human intuition driving the exploration

This, a thousand times this.

For me, what AI brings is augmented humans. Just as we don't calculate on paper anymore, what is the reason for doing things by hand when a machine is X times better?

Want to code by hand, as artisans of old? Suit yourself.

I, for one, love the smell of burning chrome.

rayiner yesterday at 12:15 AM
Holy crap, 32MB of SRAM on the chip for AI.
FL33TW00D last Monday at 4:50 PM
Unreadable Claude slop
poszlem last Monday at 1:51 PM
Genuine question, not trying to throw shade or anything, but are those cores actually useful with the state of Apple Intelligence being what it is?
mayhemducks last Monday at 5:06 PM
I never realized just how much hardware engineering Apple dedicated to enabling people to type faster with their thumbs!