There is a clear theoretical reason for using 4 layers vs 3. It allows for more ...

Houshalter · on June 16, 2017

>There is a clear theoretical reason for using 4 layers vs 3. It allows for more degrees of freedom which translates to a higher VC dimension.

But then why does using 5 layers work worse than 4? Your theory is no good at predicting what the hyperparameters should be. The only way to find the correct hyperparameters is through empirical search.
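To make "empirical search" concrete, here is a minimal sketch of what that usually amounts to: try each candidate depth, score it on held-out data, keep the best. The names `search_depth` and `toy_train_and_score` are hypothetical, and the accuracy numbers are made up purely for illustration. In practice the scoring function would train a real network at that depth.

```python
from typing import Callable, Dict, Iterable, Tuple


def search_depth(
    depths: Iterable[int],
    train_and_score: Callable[[int], float],
) -> Tuple[int, Dict[int, float]]:
    """Score every candidate depth and return the best one plus all scores."""
    scores = {depth: train_and_score(depth) for depth in depths}
    best_depth = max(scores, key=scores.get)
    return best_depth, scores


# Hypothetical stand-in for "train a network with this many layers and
# return its validation accuracy". The values are invented for illustration.
def toy_train_and_score(depth: int) -> float:
    made_up_accuracy = {2: 0.81, 3: 0.86, 4: 0.90, 5: 0.88}
    return made_up_accuracy[depth]


best, all_scores = search_depth([2, 3, 4, 5], toy_train_and_score)
print(best, all_scores)
```

Nothing in the loop knows or cares why one depth beats another; it just measures.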

>there is much more than simple derivatives in deep learning. For example regularization can yield quadratic programming problems. Different optimization algorithms can have tremendous impact on training time and model performance.

All of these concepts are also fairly simple and can be expressed with little math. Additionally, a casual user doesn't need a deep understanding of them, any more than a programmer needs a deep understanding of how an optimizing compiler works. The library will usually take care of it.
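As a rough illustration of how little math is involved, here is a sketch of gradient descent with an L2 penalty on a one-variable linear fit. The function name `fit_ridge_gd` and the toy data are assumptions for this example, and this is not how any particular library implements it; real libraries hide all of this behind an optimizer and a weight-decay setting.

```python
def fit_ridge_gd(xs, ys, l2=0.1, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by gradient descent on mean squared error + l2 * w**2."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of (1/n) * sum((w*x + b - y)**2) + l2 * w**2
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n + 2 * l2 * w
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data lying exactly on y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w_plain, b_plain = fit_ridge_gd(xs, ys, l2=0.0)  # no regularization
w_reg, b_reg = fit_ridge_gd(xs, ys, l2=0.5)      # penalty shrinks the slope
print(w_plain, b_plain, w_reg, b_reg)
```

The whole thing is one derivative per parameter and an update rule; the regularizer adds a single extra term to the gradient.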

>More ingenious architectures like GAN also require some fairly technical thinking to get right.

The idea of using NNs to trick each other is also fairly simple. It doesn't even involve any math.