Accelerating TensorFlow Performance on Mac (tensorflow.org)
165 points by tomduncalf on Nov 18, 2020 | hide | past | favorite | 82 comments


"On my models currently in test, it’s like using a 1080 or 1080 TI." https://twitter.com/spurpura/status/1329168059647488000

That's amazing.


Is it training or inference? Inference only is amazing enough, and I don't expect M1 to beat 1080 for training deep neural networks.


The benchmarks here are for training, it says.


This is funny - it clearly points to Intel & AMD as the culprits for why we didn't have a CUDA alternative for this long. Not Nvidia, as long reviled.

If Apple can work with...Google to get their framework changed for the M1, then there is absolutely no excuse for Intel/AMD. They had a decade to fix this.

They deserve their fate.


Surely this is funny: too many Apple fanboys get on the hype train.

AMD had been working with Google to get their framework changed for AMD's GPUs for more than two years [1], and all that work is upstreamed. Oh, and AMD's ROCm/OpenCL support is really for general computing, i.e. a CUDA alternative, unlike the ML Compute here. ML Compute is something Apple created specifically for running neural networks, nothing more, and roughly equivalent to TensorRT / Android NN if you want to compare with other platforms. And it only exists because the wall of Apple's walled garden is so high that nobody other than Apple can effectively optimize NN inference/training on their chips.

[1] https://pypi.org/project/tensorflow-rocm/#history


How can it be a CUDA alternative when their driver support is still a joke and not available on all OSes?


FYI - I have never used a MacBook in my life. I only use an XPS with Fedora. So this frustration is personal.

I have been party to efforts to get AMD/Intel's CUDA alternative out the door on some of the ML libraries - which one is it now? OpenCL... SYCL... ROCm... PlaidML? I can't remember.

All this time I was pissed at Nvidia - surely they were playing subversive politics to kill all of this. With so many initiatives, surely AMD/Intel had their hearts in the right place.

Apple and Google are cutthroat rivals. And they worked together for a just-released chip to get fully working acceleration support.

Here's where it gets sadder for me - TensorFlow now includes a GPU-accelerated version of NumPy (https://twitter.com/fchollet/status/1292893864986984448?lang...). NumPy itself is only accelerated via BLAS/LAPACK, which can't leverage the GPU all that well.

https://www.tensorflow.org/api_docs/python/tf/experimental/n...
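For anyone curious what that looks like, here's a minimal sketch of the `tf.experimental.numpy` API (assuming TF 2.4+); it mirrors a NumPy subset but dispatches to TF ops, so it can run on whatever accelerator TensorFlow sees:

```python
import tensorflow.experimental.numpy as tnp

# Opt in to NumPy-style behavior (type promotion, ndarray methods).
tnp.experimental_enable_numpy_behavior()

# Familiar NumPy-style calls, but executed as TensorFlow ops, so they
# can run on a GPU (or, with the Mac fork, via ML Compute) if present.
x = tnp.arange(6.0)
y = tnp.reshape(x, (2, 3)) * 2 + 1   # [[1, 3, 5], [7, 9, 11]]
total = tnp.sum(y)                   # 36.0
```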

At this point, it is basically a sealed deal - if you're even remotely dabbling in data science, you better be working on a Mac.

Which makes me hate my XPS all the more :(


Tensorflow is accelerated just fine on AMD thanks to ROCm - everything you're praising Apple for doing, AMD has done for their hardware. You're disparaging AMD for not doing something they have actually done. There was even a tensorflow blog article announcing upstreaming AMD support, in 2018[0]

> At this point, it is basically a sealed deal - if you're even remotely dabbling in data science, you better be working on a Mac.

This is hilarious. You might have missed that the benchmark was missing Nvidia (or even AMD) graphics cards; I can't think of a lower bar for comparing ML performance than against Intel GPUs - perhaps Intel CPUs? While Apple has brilliant engineering, the M1 cannot possibly outperform the obscene number of transistors Nvidia & AMD throw at the task, even in older, mid-range cards. Not to mention power dissipation.

If you're dabbling, you're better off with Google's Colab[1] which has (free) hardware acceleration which is roughly on par with my 3-year-old RX580 for my Tensorflow projects. Colab will work on anything that can run a browser.

0. https://blog.tensorflow.org/2018/08/amd-rocm-gpu-support-for...

1. https://colab.research.google.com/


> You might have missed that the benchmark was missing Nvidia (or even AMD) graphics cards;

The post does include a benchmark for an AMD GPU (Radeon Pro Vega II Duo) on the Mac Pro. Comparing the Mac Pro GPU vs. MBP M1 results, the GPU clearly wins, although in some cases the margin isn't as large as you might expect.


It's hard to reconcile the performance across the 2 tests; perhaps they were set up/tuned differently. I wish they had published their methodology - I'd have loved to benchmark my long-in-the-tooth RX580 and rocm-tensorflow against their numbers.


Cool

Are they getting competitive results?


Which is 100% correct.

NVidia cared to make CUDA into a polyglot GPGPU programming model, with nice debugging tools where you can do everything you'd do in a CPU graphical debugger; then, thanks to PTX, it was just a matter of adding a new backend to your compiler.

Hence why, even though it doesn't get that much press, you can even use flavours of Java and .NET on CUDA.

Meanwhile Khronos kept driving their C-only agenda, and when they realized the mistake, came up with SPIR (then SPIR-V after Vulkan was introduced) and tried to also cater to C++ devs (with printf-like debugging tools).

All this effort was largely ignored by OEMs, with their lousy tools, thus ending with OpenCL 3.0 being effectively OpenCL 1.2 renamed to sound cool, and the C++ efforts (SYCL) are now focusing on compute-agnostic backends.

The problem wasn't NVidia; rather, Intel and AMD did not deliver, and all their alternatives to OpenCL are even worse: half-baked attempts that always lose steam halfway through.


I wish they would compare to a 1080Ti or something along those lines. Nevertheless, this is pretty neat minus the RAM implications because realistically, I'm guessing you can use 5GB of RAM at most for training. I always wanted a computer where I can test the model for a couple batches (that doesn't cost $1200+) and then push to the cloud to train for scalability.


Seems I am not the only one who wants to know whether the M1 neural engine is performant enough for prototyping :)

It will be interesting to see how long the other deep learning frameworks need to support the M1. PyTorch has not yet achieved comparable performance on a TPU compared to TensorFlow.


It would be pretty awesome to write an app, train the network, and deploy it using the same computer! I sometimes try to use Google Colab to train some fun stuff (like style transfer), but sometimes it gives me a headache (grateful it is free though!!). It's hard to predict whether the session will just stop, and persisting data is a pain. Having an affordable (ironic using this word with Apple) machine for ML prototyping would be amazing.


You know, they don’t mention the neural engine. I know very little of ML but maybe the neural engine isn’t helpful on the training side?


The API of the neural engine is closed for some reason.


Really? So who can use it then, if not normal developers? Only Apple?


It is not. See https://developer.apple.com/documentation/coreml and https://github.com/apple/coremltools.

The neural engine on the Apple A11 wasn't exposed to apps at least at launch, but that's no longer a thing on A12 onwards.


My gut instinct (based on nothing but pure speculation and being an ML developer for several years) says no.

For one, the M1 neural engine has 16 cores, vs over 1,500 for a 1660-series NVIDIA card. I know it may not be apples to apples, but I have a very hard time believing it would be able to keep up with even the most marginal card for training.


I don’t know about the neural engine, but the GPU on the M1 has 128 EUs in each of the 8 cores, making for 1024 EUs total. It’s not quite in the same league as a 1660 Ti but it’s close: https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...
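Back-of-the-envelope, using the figures quoted above plus Nvidia's published 1536 CUDA-core count for the GTX 1660 Ti as an assumption (raw unit counts aren't directly comparable across vendors, so this is only a rough sense of scale):

```python
# Figures from the comments above; the 1536 CUDA-core spec for the
# GTX 1660 Ti is Nvidia's published number, used here as an assumption.
m1_gpu_cores = 8
eus_per_core = 128
m1_eus = m1_gpu_cores * eus_per_core        # 1024 execution units

gtx_1660_ti_cuda_cores = 1536

print(m1_eus)                                # 1024
print(m1_eus / gtx_1660_ti_cuda_cores)       # ~0.67 of the 1660 Ti's count
```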


> Nevertheless, this is pretty neat minus the RAM implications because realistically, I'm guessing you can use 5GB of RAM at most for training

This will only be an issue for the first few months of the transition to Apple Silicon. Upcoming updates to the rest of the lineup will undoubtedly have more unified memory


I'm torn because of this. The Mac mini just went on sale at Costco for $649. I'm really tempted to buy it now, but intuition is telling me to wait for the 16GB RAM.


You can already spec the M1 with 16GB, it’s just pretty pricey.

I don’t even want to think what the 32GB and above will cost.


I mean, I got the 15" 2018 MBP with 32GB i9 500GB w/ dedicated graphics for $3,899.00. A MBP 13" with 16GB and 1TB is only $1,899.00. I'd fall over myself to buy a M2/M1Z (whatever they call it) with 32-64GB ram and 1TB ssd for as much as I paid for my last MBP. I honestly couldn't be happier that the new 13 MBP doesn't fit my needs or I'd be very tempted to get one. Thankfully now I can wait for the virtualization story to stabilize, wait for more ram, and wait for more external display support (all of which I need).

I know that Apple charges a premium for their upgrades but from everything I've seen so far from the new 13" MBP I'll be happy to pay that premium this time around. I have always found the comparisons of ram/ssd in recent Apple MBP's to generic parts to be slightly disingenuous. Raw speeds don't really matter as much as how well everything works together in my experience and the new AS computers seem to take that to an even higher level.


It's not quite what you're asking for, but there's a comparison to the 1050Ti: https://www.macrumors.com/2020/11/16/m1-beats-geforce-gtx-10...


Does graphics performance indicate ML training performance (to a certain degree)? Is it fair to say that if it performs better graphically, ML training will be better with the proper software support?


Does anyone doing serious commercial grade ML actually train models on their macbook?


When prototyping or developing new model code, yes. If you’re building production ML systems, there’s plenty code to write and test before deploying to prod infrastructure.

To be clear, you would work with a very small data sample or synthetic data as your objective isn’t to train a model for production use.

Edit: clarity and grammar


Yeah, exactly this. I mostly am using cloud VMs for training, but sometimes just need to be able to mess around on my local machine for my deployment pipeline. I don't really care if it takes ~10 seconds to do inference on a handful of images, I really just need to tinker with the interfaces to ensure everything is hooked up correctly.


Surely not. I'm working on a PC with 64GB of main RAM and 12GB of dedicated GPU RAM, and I already have to do tricks to squeeze state-of-the-art models into my limited memory.

For comparison, NVIDIA just upgraded the A100 from 40GB to 80GB.


Unified memory on Apple silicon means the GPU/neural engine has as much RAM as the Mac. A future MBP with 64 GB of RAM will be able to fit a lot of models. A whole ML laptop for the price of one GPU.


Agree, if Apple ever goes back to offering 64+ GB RAM Notebooks, they will become a viable choice again.


Contrary to the way you and a lot of people have been spinning things recently, Apple has not discontinued the 16" Macbook Pro, and you can buy an Apple laptop today in all their form factors with just as much memory as you could a few weeks ago. The M1 machines are only replacing models that only went up to 16 GB already.


I thought ML models were tiny, small enough to run on mobile devices.


I believe you have confused training and inference.

Training usually uses large amounts of data to get your system to recognize a pattern. It generally uses huge memory and compute and generates a model.

Inference will use that generated model to recognize the pattern in new data. It uses significantly fewer resources and can run either very fast or on a smaller system.

I believe the speed of inference may be affected by the resources available during training, where more speed/memory for training can produce better models.
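A rough way to see the resource gap is a toy memory estimate (not a real profiler; the 4x factor assumes float32 weights plus gradients and two Adam moment buffers, and ignores activations, which often dominate in practice):

```python
def memory_estimate(n_params, bytes_per_param=4):
    """Toy estimate of model memory for inference vs. training."""
    # Inference: just the weights.
    inference = n_params * bytes_per_param
    # Training with Adam: weights + gradients + two moment buffers.
    # (Real training also stores activations on top of this.)
    training = 4 * n_params * bytes_per_param
    return inference, training

# e.g. a 100M-parameter model:
inf, train = memory_estimate(100_000_000)
print(inf // 2**20, "MiB for inference")    # 381 MiB
print(train // 2**20, "MiB for training")   # 1525 MiB
```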


I think it's the parent commenter who is confused; they complain how hard it is to "squeeze state of the art models into my limited memory", which sounds like inference.


Yeah, when you have a trained model you can copy it even to mobile devices.


Very, very much depends on the model.


This would depend on the model


Most ML models can be shrunk at the cost of some accuracy, but some models are extremely large.
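As an illustration of "shrunk at the cost of some accuracy", here's a toy int8 quantization sketch in plain Python (a much-simplified version of what tools like TensorFlow Lite do; the function names are made up for illustration):

```python
def quantize_int8(weights):
    """Map float weights to int8 [-127, 127] with a single scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]   # 1 byte per weight vs 4
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.5, -1.27, 0.003, 1.0]
q, s = quantize_int8(w)
w2 = dequantize(q, s)

# 4x smaller, but each weight can now be off by up to half a scale step.
max_err = max(abs(a - b) for a, b in zip(w, w2))
```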


Not yet, but the fact that this post exists, that Apple has worked to accelerate TensorFlow, that Apple Silicon has dedicated hardware for these workloads, and that they're specifically highlighting the performance on the Mac Pro leads me to think people might be soon.

Let's wait and see what Apple Silicon has in store for the iMac Pro and Mac Pro.


If you think "commercial grade ML" means only big models, that's the wrong way to think already.

You don't have to throw the biggest model you can at any problem you have.


Since this is referencing TensorFlow performance, that implies deep learning models, which are anything but small, so I'm not sure I see your point.


Probably very few if any. I think the main use cases would be convenience for playing around with things locally and students.


I use it to make simple models for apps, because that is what I have.


On device ML is better for privacy.


Wait, so are they using the GPU or the "neural engine"? Because the GPU approach should also work on any other machine, right?


The Apple ML Research group announcement has more details: https://machinelearning.apple.com/updates/ml-compute-trainin...

Key line:

  Until now, TensorFlow has only utilized the CPU for training on Mac. The new tensorflow_macos fork of TensorFlow 2.4 leverages ML Compute to enable machine learning libraries to take full advantage of not only the CPU, but also the GPU in both M1- and Intel-powered Macs for dramatically faster training performance.

So, looks like it's faster on both Intel & M1, but the M1 MBP has a much faster GPU than the Intel MBP


But note that the ANE and the GPU are different facilities.


Correct - I’d guess this is not using the neural engine.

I don’t know enough about that hardware to hazard a guess about how easy it would be to get that part of the chip involved.


Looks like TensorFlow now supports ML Compute, which in turn uses the appropriate hardware.

In the case of the Mac Pro, I'm guessing ML Compute is using the GPU.

In the case of the Intel MacBook Pro 13", which as far as I can tell from Apple's site can't be purchased with a discrete GPU, that will be either the Intel Iris GPU, or the CPU.

In the case of the M1 MacBook Pro 13", I'm assuming ML Compute prioritizes the Neural Engine over the GPU (and CPU), but don't know if there are use cases where the GPU would be preferable.


It looks like the class only has CPU and GPU device types, so, as has been the case, figuring out when the 'neural engine' is invoked remains opaque:

https://developer.apple.com/documentation/mlcompute/mlcdevic...


As the charts compare Intel based Macs vs the new M1 chip, they would be comparing CPU performance.


The charts are comparing “accelerated” performance versus what appears to be a CPU-only baseline. It is not clear which M1 hardware beyond the CPU is used.


The data compared to a normally priced desktop GPU would be interesting, to get a feeling for whether the M1 would be interesting for prototyping.


From: https://github.com/apple/tensorflow_macos#device-selection-o...

"There is an optional mlcompute.set_mlc_device(device_name='any') API for ML Compute device selection. The default value for device_name is 'any', which means ML Compute will select the best available device on your system, including multiple GPUs on multi-GPU configurations."
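Per the README quoted above, forcing a device looks like this (a config fragment for Apple's tensorflow_macos fork only; it won't run on stock TensorFlow):

```python
# Only available in the apple/tensorflow_macos fork.
from tensorflow.python.compiler.mlcompute import mlcompute

# Accepted values per the README: 'any' (default), 'cpu', or 'gpu'.
mlcompute.set_mlc_device(device_name='gpu')
```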

Hope this is not a stupid question: Does this TF version just work with M1 GPU-Cores / Intels iGPU or also with AMD and supported Nvidia GPUs?


It uses Metal, so the AMD GPUs like the ones in the shipping Mac Pro work.


The way I'm reading this is that the M1 in the $700 Mac Mini is only two times slower than the Mac Pro, which is what, 4-5 thousand USD?


Hmm, I'm reading this differently: the first graph is seconds/batch, so a lower number is better. The M1 Mini should perform 4x faster than the Intel Mac Pro in terms of ML training.


I believe you're misreading GP's comment and/or the associated graphs. The graph with 4x performance increase is comparing a MacBook Pro (Intel) with a MacBook Pro (M1). GP is comparing the MacBook Pro (M1) results in said graph with the Mac Pro (Intel) result in the following graph, where the Mac Pro is roughly 2x.


Sounds like even though it runs on all Macs with macOS 11.x, AMD GPUs on Intel Macs are not optimized: https://github.com/apple/tensorflow_macos/issues/7


Will I be able to train models (PyTorch, Keras, TF) and do python dev on the M1 MBP?


It's cool, but is it 16-bit perhaps? It would be fair to compare to a Tensorbook of similar cost.

Desktop-wise, you can buy a 1080 for <$400 on eBay, and the latest-gen Nvidia cards are obviously much faster for still less cost than the Mac Pro.


Question - ML models are hard to debug because it’s all happening on the GPU. If you have shared memory with the CPU could that make it easier to see what’s going on inside your model?


Is that why they are hard to debug? I thought they were hard to debug because deep learning models consist of lots of black-box-ish components. I've worked with GPU software before, and it is a pain to debug, but you generally don't have to care about that level when using things like tensorflow or pytorch.


Not unless you're writing the backend CUDA code for TF honestly. Most of the time you're accessing your data/models/structure through the TF API so you really aren't doing too much that directly interfaces with the GPU.

To put it another way, when you're developing ML models (99.9% of the time), you're not writing if statements in CUDA.
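For example, a typical day-to-day snippet stays entirely at the Keras/TF API level; device placement (CUDA, ML Compute, or plain CPU) happens underneath (a generic sketch, not tied to the Mac fork):

```python
import tensorflow as tf

# You describe the model declaratively through the API...
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# ...and the runtime decides where the kernels actually execute.
# No CUDA (or Metal/ML Compute) code appears anywhere in user code.
print(model.count_params())  # 49
```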


How is the actual acceleration happening here? Is it done via optimizations with the GPU or the usage of the Neural Engine, the Unified Memory or some combination?


> Until now, TensorFlow has only utilized the CPU for training on Mac. The new tensorflow_macos fork of TensorFlow 2.4 leverages ML Compute to enable machine learning libraries to take full advantage of not only the CPU, but also the GPU in both M1- and Intel-powered Macs for dramatically faster training performance.

It sounds like the performance gains are because they are now using GPU and CPU instead of just CPU.

https://machinelearning.apple.com/updates/ml-compute-trainin...


Just to clarify - when they say using the GPU on Intel-based Macs, they mean the integrated GPU, not the discrete AMD GPU that my 16in MBP comes with?


The tests are on 13-inch MBPs (Intel and M1) so I would assume not.


What is better? Macbook with Apple's M1 or (old) Macbook with Intel CPU and external AMD GPU?


Depends if you’re working in a coffee shop


Call me old, but I can't understand how people can sit in the most uncomfortable chairs for hours, heads and necks tilted down towards their laptops.

A workday at Starbucks would give me head and shoulder pain for days.


In my case, I only do that when I'm getting fed up with working in the same place every day. Working in a coffee shop is not ideal, but the environment is different enough to reset my mood. Obviously I'm not going to work in a coffee shop every day though, and I haven't worked in a coffee shop this year due to covid.


Could this mean Jax support is coming via XLA?


Again, the M1 chip shows impressive performance gains. The first graph shows, what, a 3-4x improvement over Intel?


Does it? It shows the accelerated performance vs Metal performance on a generic GPU vs performance on a GPU with dedicated ML units on it. It's not a performance improvement "over Intel", it's a performance improvement "over hardware without dedicated ML acceleration in the new version". There's a similar massive performance gain with the Mac Pro example, showing that the new version being tested is able to leverage GPU power much more effectively.

Comparing the dedicated ML hardware to the generic just-decode-video-and-animate-windows-at-decent-speeds Intel Iris doesn't really prove any performance gain. There's no apples to apples comparison to be made here. If Apple and Nvidia would finally get over themselves, you might be able to make a fair comparison between the M1 and Intel + CUDA. The Intel "pro" laptop they're comparing against only comes with a 1.7GHz Intel CPU and only Intel's mediocre integrated graphics, so I don't see why you would use that version for anything professional related to ML anyway; at best you'd use it to test your pipelines.

All you can conclude from this is that the new M1 chips perform better at TensorFlow than the Intel chips + GPUs in the previous models of Macbook after Apple made a Mac-optimised version that better leverages GPU power.


Why would I use tensorflow when pytorch exists?


Disclaimer, I work for Google, but not on Tensorflow.

I agree that in TF 1.x the ergonomics are/were bad, but things improved considerably in TF 2.x. If you don't like that either, the tf.keras package offers another, friendlier option that integrates well with the other tf packages (e.g. tf.data, tf.estimator, ...).

Finally, I believe the decision also depends on your use case: if you are just experimenting, maybe PyTorch gets you to results faster. However, I also think that being able to use TFX (https://www.tensorflow.org/tfx) seamlessly saves you a lot of time when you need to put your models into production.


As someone who has been trying to set up a simple use case of TFX for six months I respectfully disagree :)


Because you're Google? Or you work with them or their partners?

Or because you like their approach?

Also, in my experience an optimized XLA export is usually faster than Pytorch.



