Performance improvements for machine learning / CUDA implementation
We are developing a novel deep learning method for 3D data. As in deep neural networks, we employ automatic differentiation and gradient descent to optimise the parameters of a function (see YouTube).
Currently we work with PyTorch, a popular Python framework for deep learning. It provides automatic differentiation, but performance can be improved further by implementing certain functions directly in C++/CUDA. That is the task of this project.
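To give a flavour of what such a function looks like on the CUDA side, here is a hypothetical fused elementwise op (all names are illustrative, not part of the actual project code). The point of a hand-written kernel is to do in one pass what several chained PyTorch ops would do in separate passes over global memory:

```cuda
// Hypothetical example: y = a*x + b as a single CUDA kernel.
// In PyTorch this would be two elementwise ops, each reading and
// writing global memory; fusing them removes the intermediate traffic.
__global__ void scale_shift_kernel(const float* x, float* y,
                                   float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + b;  // one read, one write per element
    }
}

// Launch configuration: one thread per element.
void scale_shift(const float* x, float* y, float a, float b, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_shift_kernel<<<blocks, threads>>>(x, y, a, b, n);
}
```

The real functions in the project are more involved than this sketch, but the structure (a `__global__` kernel plus a host-side launcher) is the same.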
I have already implemented one of the required functions and obtained a speed-up of about 10x, but there is room for further improvement. Other functions would also benefit from being ported from PyTorch to CUDA. Such a port requires deriving and implementing the gradient computation, with which I would help you.
Tasks:
- Set up a CMake / C++ project to build, unit-test, and benchmark the newly written kernels
- Use the CUDA profiling and performance tools (e.g. Nsight Compute / Nsight Systems) to measure and optimise the existing kernel
- Implement further kernels
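A minimal sketch of what such a CMake project could look like, assuming hypothetical file names (`src/kernels.cu`, `test/kernel_tests.cpp`, `bench/kernel_bench.cpp`) and target GPU architectures; the real layout would be decided together:

```cmake
cmake_minimum_required(VERSION 3.24)
project(kernels LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)

# Library containing the hand-written kernels (file names are placeholders).
add_library(kernels STATIC src/kernels.cu)
set_target_properties(kernels PROPERTIES CUDA_ARCHITECTURES "75;86")

# Unit tests and a benchmark executable, both linked against the library.
enable_testing()
add_executable(kernel_tests test/kernel_tests.cpp)
target_link_libraries(kernel_tests PRIVATE kernels)
add_test(NAME kernel_tests COMMAND kernel_tests)

add_executable(kernel_bench bench/kernel_bench.cpp)
target_link_libraries(kernel_bench PRIVATE kernels)
```

Keeping the kernels in a library target lets the tests and benchmarks share one build of the code.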
Required skills:
- C++ and CUDA (essential)
- Profiling, performance optimisation, and CUDA tooling (good to have)
- Python and PyTorch (advantageous, but can be learned by doing)
- Written English
Development environment:
- Windows or Linux
- NVIDIA GPU
You will work with:
- CMake, C++, CUDA, and a bit of Python / PyTorch