Accelerating TensorFlow Performance on Mac (tensorflow.org)
165 points by tomduncalf on Nov 18, 2020 | hide | past | favorite | 82 comments


"On my models currently in test, it’s like using a 1080 or 1080 TI." https://twitter.com/spurpura/status/1329168059647488000

That's amazing.


Is it training or inference? Inference only is amazing enough, and I don't expect M1 to beat 1080 for training deep neural networks.


The benchmarks here are for training, it says.


This is funny - it clearly points to Intel & AMD as the culprits for why we didn't have a CUDA alternative for this long. Not Nvidia, as long reviled.

If Apple can work with...Google to get their framework changed for the M1, then there is absolutely no excuse for Intel/AMD. They had a decade to fix this.

They deserve their fate.


Surely this is funny: too many Apple fanboys get on the hype train.

AMD had been working with Google to get their framework changed for AMD's GPUs for more than two years [1], and all that work is upstreamed. Oh, and AMD's ROCm/OpenCL support is really for general computing, i.e. a CUDA alternative, unlike the ML Compute here. ML Compute is something Apple created specifically for running neural networks, nothing more, and roughly equivalent to TensorRT / Android NN if you want to compare with other platforms. And it only exists because the wall of Apple's walled garden is so high that nobody other than Apple can effectively optimize NN inference/training on their chips.

[1] https://pypi.org/project/tensorflow-rocm/#history


How can it be a CUDA alternative when their driver support is still a joke and not available on all OSes?


FYI - I have never used a MacBook in my life. I only use an XPS with Fedora. So this frustration is personal.

I have been party to efforts to get AMD/Intel's CUDA alternative out the door on some of the ML libraries - which one is it now? OpenCL... SYCL... ROCm... PlaidML? I can't remember.

All this time I was pissed at Nvidia - surely they were playing subversive politics to kill all of this. With so many initiatives, surely AMD/Intel had their hearts in the right place.

Apple and Google are cutthroat rivals. And they worked together for a just-released chip to get fully working acceleration support.

Here's where it gets sadder for me - TensorFlow now includes a GPU-accelerated version of NumPy (https://twitter.com/fchollet/status/1292893864986984448?lang...). NumPy itself is only accelerated via BLAS/LAPACK, which can't leverage the GPU all that well.

https://www.tensorflow.org/api_docs/python/tf/experimental/n...
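For anyone curious what that looks like, here's a minimal sketch of the `tf.experimental.numpy` API (assuming TF 2.4+); it mirrors a NumPy subset but dispatches to TF ops, so it can run on whatever accelerator TensorFlow sees:

```python
import tensorflow.experimental.numpy as tnp

# Opt in to NumPy-style behavior (type promotion, ndarray methods).
tnp.experimental_enable_numpy_behavior()

# Familiar NumPy-style calls, but executed as TensorFlow ops, so they
# can run on a GPU (or, with the Mac fork, via ML Compute) if present.
x = tnp.arange(6.0)
y = tnp.reshape(x, (2, 3)) * 2 + 1   # [[1, 3, 5], [7, 9, 11]]
total = tnp.sum(y)                   # 36.0
```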

At this point, it is basically a sealed deal - if you're even remotely dabbling in data science, you better be working on a Mac.

Which makes me hate my XPS all the more :(


Tensorflow is accelerated just fine on AMD thanks to ROCm - everything you're praising Apple for doing, AMD has done for their hardware. You're disparaging AMD for not doing something they have actually done. There was even a tensorflow blog article announcing upstreaming AMD support, in 2018[0]

> At this point, it is basically a sealed deal - if you're even remotely dabbling in data science, you better be working on a Mac.

This is hilarious. You might have missed that the benchmark was missing Nvidia (or even AMD) graphics cards; I can't think of a lower bar for comparing ML performance than against Intel GPUs - perhaps Intel CPUs? While Apple has brilliant engineering, the M1 cannot possibly outperform the obscene number of transistors Nvidia & AMD throw at the task, even in older, mid-range cards. Not to mention power dissipation.

If you're dabbling, you're better off with Google's Colab[1] which has (free) hardware acceleration which is roughly on par with my 3-year-old RX580 for my Tensorflow projects. Colab will work on anything that can run a browser.

0. https://blog.tensorflow.org/2018/08/amd-rocm-gpu-support-for...

1. https://colab.research.google.com/


> You might have missed that the benchmark was missing Nvidia (or even AMD) graphics cards;

The post does include a benchmark for an AMD GPU (Radeon Pro Vega II Duo) on the Mac Pro. Comparing the Mac Pro GPU vs. MBP M1 results, the GPU clearly wins, although in some cases the margin isn't as large as you might expect.


It's hard to reconcile the performance across the 2 tests; perhaps they were set up/tuned differently. I wish they had published their methodology - I'd have loved to benchmark my long-in-the-tooth RX580 and rocm-tensorflow against their numbers.


Cool

Are they getting competitive results?


Which is 100% correct.

NVidia cared to make CUDA into a polyglot GPGPU programming model, with nice debugging tools where you can do everything you'd do in a CPU graphical debugger; then, thanks to PTX, it was just a matter of adding a new backend to your compiler.

Hence why, even though it doesn't get that much press, you can even use flavours of Java and .NET on CUDA.

Meanwhile Khronos kept driving their C-only agenda, and when they realized the mistake, came up with SPIR (then SPIR-V after Vulkan was introduced) and tried to also cater to C++ devs (with printf-like debugging tools).

All this effort was largely ignored by OEMs, with their lousy tools, thus ending with OpenCL 3.0 being effectively OpenCL 1.2 renamed to sound cool, and the C++ efforts (SYCL) are now focusing on compute-agnostic backends.

The problem wasn't NVidia; rather, Intel and AMD did not deliver, and all their alternatives to OpenCL are even worse: half-baked attempts that always lose steam halfway through.


I wish they would compare to a 1080Ti or something along those lines. Nevertheless, this is pretty neat minus the RAM implications because realistically, I'm guessing you can use 5GB of RAM at most for training. I always wanted a computer where I can test the model for a couple batches (that doesn't cost $1200+) and then push to the cloud to train for scalability.


Seems I am not the only one who wants to know whether the M1 neural engine is performant enough for prototyping :)

It will be interesting to see how long the other deep learning frameworks need to support the M1. PyTorch has not yet achieved comparable performance on a TPU compared to TensorFlow.


It would be pretty awesome to write an app, train the network, and deploy it using the same computer! I sometimes try to use Google Colab to train some fun stuff (like style transfer), but sometimes it gives me a headache (grateful it is free though!!). It's hard to predict whether the session will just stop, and persisting data is a pain. Having an affordable (ironic using this word with Apple) machine for ML prototyping would be amazing.


You know, they don’t mention the neural engine. I know very little of ML but maybe the neural engine isn’t helpful on the training side?


The API of the neural engine is closed for some reason.


Really? So who can use it then, if not normal developers? Only Apple?


It is not. See https://developer.apple.com/documentation/coreml and https://github.com/apple/coremltools.

The neural engine on the Apple A11 wasn't exposed to apps at least at launch, but that's no longer a thing on A12 onwards.


My gut instinct (based on nothing but pure speculation and being an ML developer for several years) says no.

For one, the M1 neural engine has 16 cores, vs over 1,500 for a 1660-series NVIDIA card. I know it may not be apples to apples, but I have a very hard time believing it would be able to keep up with even the most marginal card for training.


I don’t know about the neural engine, but the GPU on the M1 has 128 EUs in each of the 8 cores, making for 1024 EUs total. It’s not quite in the same league as a 1660 Ti but it’s close: https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...
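Back-of-the-envelope, using the figures quoted above plus Nvidia's published 1536 CUDA-core count for the GTX 1660 Ti as an assumption (raw unit counts aren't directly comparable across vendors, so this is only a rough sense of scale):

```python
# Figures from the comments above; the 1536 CUDA-core spec for the
# GTX 1660 Ti is Nvidia's published number, used here as an assumption.
m1_gpu_cores = 8
eus_per_core = 128
m1_eus = m1_gpu_cores * eus_per_core        # 1024 execution units

gtx_1660_ti_cuda_cores = 1536

print(m1_eus)                                # 1024
print(m1_eus / gtx_1660_ti_cuda_cores)       # ~0.67 of the 1660 Ti's count
```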


> Nevertheless, this is pretty neat minus the RAM implications because realistically, I'm guessing you can use 5GB of RAM at most for training

This will only be an issue for the first few months of the transition to Apple Silicon. Upcoming updates to the rest of the lineup will undoubtedly have more unified memory


I'm torn because of this. The Mac mini just went on sale at Costco for $649. I'm really tempted to buy it now, but intuition is telling me to wait for the 16GB RAM.


You can already spec the M1 with 16GB, it’s just pretty pricey.

I don’t even want to think what the 32GB and above will cost.


I mean, I got the 15" 2018 MBP with 32GB i9 500GB w/ dedicated graphics for $3,899.00. A MBP 13" with 16GB and 1TB is only $1,899.00. I'd fall over myself to buy a M2/M1Z (whatever they call it) with 32-64GB ram and 1TB ssd for as much as I paid for my last MBP. I honestly couldn't be happier that the new 13 MBP doesn't fit my needs or I'd be very tempted to get one. Thankfully now I can wait for the virtualization story to stabilize, wait for more ram, and wait for more external display support (all of which I need).

I know that Apple charges a premium for their upgrades but from everything I've seen so far from the new 13" MBP I'll be happy to pay that premium this time around. I have always found the comparisons of ram/ssd in recent Apple MBP's to generic parts to be slightly disingenuous. Raw speeds don't really matter as much as how well everything works together in my experience and the new AS computers seem to take that to an even higher level.


It's not quite what you're asking for, but there's a comparison to the 1050Ti: https://www.macrumors.com/2020/11/16/m1-beats-geforce-gtx-10...


Does graphics performance indicate ML training performance (to a certain degree)? Is it fair to say that if it performs better graphically, ML training will be better with the proper software support?


Does anyone doing serious commercial grade ML actually train models on their macbook?


When prototyping or developing new model code, yes. If you’re building production ML systems, there’s plenty code to write and test before deploying to prod infrastructure.

To be clear, you would work with a very small data sample or synthetic data as your objective isn’t to train a model for production use.

Edit: clarity and grammar


Yeah, exactly this. I mostly am using cloud VMs for training, but sometimes just need to be able to mess around on my local machine for my deployment pipeline. I don't really care if it takes ~10 seconds to do inference on a handful of images, I really just need to tinker with the interfaces to ensure everything is hooked up correctly.


Surely not. I'm working on a PC with 64GB of main RAM and 12GB of dedicated GPU RAM, and I already have to do tricks to squeeze state-of-the-art models into my limited memory.

For comparison, NVIDIA just upgraded the A100 from 40GB to 80GB.


Unified memory on Apple silicon means the GPU/neural engine has as much RAM as the Mac. A future MBP with 64 GB of RAM will be able to fit a lot of models. A whole ML laptop for the price of one GPU.


Agree, if Apple ever goes back to offering 64+ GB RAM Notebooks, they will become a viable choice again.


Contrary to the way you and a lot of people have been spinning things recently, Apple has not discontinued the 16" Macbook Pro, and you can buy an Apple laptop today in all their form factors with just as much memory as you could a few weeks ago. The M1 machines are only replacing models that only went up to 16 GB already.


I thought ML models were tiny, small enough to run on mobile devices.


I believe you have confused training and inference.

Training usually uses large amounts of data to get your system to recognize a pattern. It generally uses huge memory and compute and generates a model.

Inference will use that generated model to recognize the pattern in new data. It uses significantly fewer resources and can run either very fast or on a smaller system.

I believe the speed of inference may be affected by the resources available during training, where more speed/memory for training can produce better models.
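A rough way to see the resource gap is a toy memory estimate (not a real profiler; the 4x factor assumes float32 weights plus gradients and two Adam moment buffers, and ignores activations, which often dominate in practice):

```python
def memory_estimate(n_params, bytes_per_param=4):
    """Toy estimate of model memory for inference vs. training."""
    # Inference: just the weights.
    inference = n_params * bytes_per_param
    # Training with Adam: weights + gradients + two moment buffers.
    # (Real training also stores activations on top of this.)
    training = 4 * n_params * bytes_per_param
    return inference, training

# e.g. a 100M-parameter model:
inf, train = memory_estimate(100_000_000)
print(inf // 2**20, "MiB for inference")    # 381 MiB
print(train // 2**20, "MiB for training")   # 1525 MiB
```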


I think it's the parent commenter who is confused; they complain how hard it is to "squeeze state of the art models into my limited memory", which sounds like inference.


Yeah, when you have a trained model you can copy it even to mobile devices.


Very, very much depends on the model.


This would depend on the model


Most ML models can be shrunk at the cost of some accuracy, but some models are extremely large.
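As an illustration of "shrunk at the cost of some accuracy", here's a toy int8 quantization sketch in plain Python (a much-simplified version of what tools like TensorFlow Lite do; the function names are made up for illustration):

```python
def quantize_int8(weights):
    """Map float weights to int8 [-127, 127] with a single scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]   # 1 byte per weight vs 4
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.5, -1.27, 0.003, 1.0]
q, s = quantize_int8(w)
w2 = dequantize(q, s)

# 4x smaller, but each weight can now be off by up to half a scale step.
max_err = max(abs(a - b) for a, b in zip(w, w2))
```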


Not yet, but the fact that this post exists, that Apple has worked to accelerate TensorFlow, that Apple Silicon has dedicated hardware for these workloads, and that they're specifically highlighting the performance on the Mac Pro leads me to think people might be soon.

Let's wait and see what Apple Silicon has in store for the iMac Pro and Mac Pro.


If you think "commercial grade ML" means only big models, that's the wrong way to think already.

You don't have to throw the biggest model you can at any problem you have.


Since this is referencing TensorFlow performance, that implies deep learning models, which are anything but small, so I'm not sure I see your point.


Probably very few if any. I think the main use cases would be convenience for playing around with things locally and students.


I use it to make simple models for apps, because that is what I have.


On device ML is better for privacy.


Wait, so are they using the GPU or the "neural engine"? Because the GPU approach should also work on any other machine, right?


The Apple ML Research group announcement has more details: https://machinelearning.apple.com/updates/ml-compute-trainin...

Key line:

  Until now, TensorFlow has only utilized the CPU for training on Mac. The new tensorflow_macos fork of TensorFlow 2.4 leverages ML Compute to enable machine learning libraries to take full advantage of not only the CPU, but also the GPU in both M1- and Intel-powered Macs for dramatically faster training performance.

So, looks like it's faster on both Intel & M1, but the M1 MBP has a much faster GPU than the Intel MBP


But note that the ANE and the GPU are different facilities.


Correct - I’d guess this is not using the neural engine.

I don’t know enough about that hardware to hazard a guess about how easy it would be to get that part of the chip involved.


Looks like TensorFlow now supports ML Compute, which in turn uses the appropriate hardware.

In the case of the Mac Pro, I'm guessing ML Compute is using the GPU.

In the case of the Intel MacBook Pro 13", which as far as I can tell from Apple's site can't be purchased with a discrete GPU, that will be either the Intel Iris GPU, or the CPU.

In the case of the M1 MacBook Pro 13", I'm assuming ML Compute prioritizes the Neural Engine over the GPU (and CPU), but don't know if there are use cases where the GPU would be preferable.


It looks like the class only has CPU and GPU device types, so, as has been the case, figuring out when the 'neural engine' is invoked remains opaque:

https://developer.apple.com/documentation/mlcompute/mlcdevic...


As the charts compare Intel based Macs vs the new M1 chip, they would be comparing CPU performance.


The charts are comparing “accelerated” performance versus what appears to be a CPU-only baseline. It is not clear which M1 hardware beyond the CPU is used.


The data compared to a normally priced desktop GPU would be interesting, to get a feeling for whether the M1 would be interesting for prototyping.


From: https://github.com/apple/tensorflow_macos#device-selection-o...

"There is an optional mlcompute.set_mlc_device(device_name='any') API for ML Compute device selection. The default value for device_name is 'any', which means ML Compute will select the best available device on your system, including multiple GPUs on multi-GPU configurations."
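Per the README quoted above, forcing a device looks like this (a config fragment for Apple's tensorflow_macos fork only; it won't run on stock TensorFlow):

```python
# Only available in the apple/tensorflow_macos fork.
from tensorflow.python.compiler.mlcompute import mlcompute

# Accepted values per the README: 'any' (default), 'cpu', or 'gpu'.
mlcompute.set_mlc_device(device_name='gpu')
```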

Hope this is not a stupid question: Does this TF version just work with M1 GPU-Cores / Intels iGPU or also with AMD and supported Nvidia GPUs?


It uses Metal, so the AMD GPUs like the ones in the shipping Mac Pro work.


The way I'm reading this is that the M1 in the $700 Mac Mini is only two times slower than the Mac Pro, which is what, 4-5 thousand USD?


Hmm, I'm reading this differently: the first graph is seconds/batch, so a lower number is better. The M1 Mini should perform 4x faster than the Intel Mac Pro in terms of ML training.


I believe you're misreading GP's comment and/or the associated graphs. The graph with 4x performance increase is comparing a MacBook Pro (Intel) with a MacBook Pro (M1). GP is comparing the MacBook Pro (M1) results in said graph with the Mac Pro (Intel) result in the following graph, where the Mac Pro is roughly 2x.


Sounds like even though it runs on all Macs with macOS 11.x, AMD GPUs on Intel Macs are not optimized: https://github.com/apple/tensorflow_macos/issues/7


Will I be able to train models (PyTorch, Keras, TF) and do python dev on the M1 MBP?


It's cool, but is it 16-bit perhaps? It would be fair to compare to a Tensorbook of similar cost.

Desktop-wise, you can buy a 1080 for <$400 on eBay, and the latest-gen Nvidia cards are obviously much faster for still less cost than the Mac Pro.


Question - ML models are hard to debug because it’s all happening on the GPU. If you have shared memory with the CPU could that make it easier to see what’s going on inside your model?


Is that why they are hard to debug? I thought they were hard to debug because deep learning models consist of lots of black-box-ish components. I've worked with GPU software before, and it is a pain to debug, but you generally don't have to care about that level when using things like tensorflow or pytorch.


Not unless you're writing the backend CUDA code for TF honestly. Most of the time you're accessing your data/models/structure through the TF API so you really aren't doing too much that directly interfaces with the GPU.

To put it another way, when you're developing ML models (99.9% of the time), you're not writing if statements in CUDA.
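For example, a typical day-to-day snippet stays entirely at the Keras/TF API level; device placement (CUDA, ML Compute, or plain CPU) happens underneath (a generic sketch, not tied to the Mac fork):

```python
import tensorflow as tf

# You describe the model declaratively through the API...
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# ...and the runtime decides where the kernels actually execute.
# No CUDA (or Metal/ML Compute) code appears anywhere in user code.
print(model.count_params())  # 49
```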


How is the actual acceleration happening here? Is it done via optimizations with the GPU or the usage of the Neural Engine, the Unified Memory or some combination?


> Until now, TensorFlow has only utilized the CPU for training on Mac. The new tensorflow_macos fork of TensorFlow 2.4 leverages ML Compute to enable machine learning libraries to take full advantage of not only the CPU, but also the GPU in both M1- and Intel-powered Macs for dramatically faster training performance.

It sounds like the performance gains are because they are now using GPU and CPU instead of just CPU.

https://machinelearning.apple.com/updates/ml-compute-trainin...


Just to clarify - when they say using the GPU on Intel-based Macs, they mean the integrated GPU, not the discrete AMD GPU that my 16in MBP comes with?


The tests are on 13-inch MBPs (Intel and M1) so I would assume not.


What is better? Macbook with Apple's M1 or (old) Macbook with Intel CPU and external AMD GPU?


Depends if you’re working in a coffee shop


Call me old, but I can't understand how people can sit in the most uncomfortable chairs for hours, heads and necks tilted down towards their laptops.

A workday at Starbucks would give me head and shoulder pain for days.


In my case, I only do that when I'm getting fed up with working in the same place every day. Working in a coffee shop is not ideal, but the environment is different enough to reset my mood. Obviously I'm not going to work in a coffee shop every day though, and I haven't worked in a coffee shop this year due to covid.


Could this mean Jax support is coming via XLA?


Again, the M1 chip shows impressive performance gains. The first graph shows, what, a 3-4x improvement over Intel?


Does it? It shows the accelerated performance vs Metal performance on a generic GPU vs performance on a GPU with dedicated ML units on it. It's not a performance improvement "over Intel", it's a performance improvement "over hardware without dedicated ML acceleration in the new version". There's a similar massive performance gain with the Mac Pro example, showing that the new version being tested is able to leverage GPU power much more effectively.

Comparing the dedicated ML hardware to the generic just-decode-video-and-animate-windows-at-decent-speeds Intel Iris doesn't really prove any performance gain. There's no apples to apples comparison to be made here. If Apple and Nvidia would finally get over themselves, you might be able to make a fair comparison between the M1 and Intel + CUDA. The Intel "pro" laptop they're comparing against only comes with a 1.7GHz Intel CPU and only Intel's mediocre integrated graphics, so I don't see why you would use that version for anything professional related to ML anyway; at best you'd use it to test your pipelines.

All you can conclude from this is that the new M1 chips perform better at TensorFlow than the Intel chips + GPUs in the previous models of Macbook after Apple made a Mac-optimised version that better leverages GPU power.


Why would I use tensorflow when pytorch exists?


Disclaimer, I work for Google, but not on Tensorflow.

I agree that in TF 1.x the ergonomics are/were bad, but things improved considerably in TF 2.x. If you don't like that either, the tf.keras package offers another, friendlier option that integrates well with the other tf packages (e.g. tf.data, tf.estimator, ...).

Finally, I believe the decision also depends on your use case: if you are just experimenting, maybe PyTorch gets you to results faster. However, I also think that being able to use TFX (https://www.tensorflow.org/tfx) seamlessly saves you a lot of time when you need to put your models into production.


As someone who has been trying to set up a simple use case of TFX for six months I respectfully disagree :)


Because you're Google? Or you work with them or their partners?

Or because you like their approach?

Also, in my experience an optimized XLA export is usually faster than Pytorch.



