In the field of activity Data Analysis and Machine Learning, we address this ongoing development with increased research on the scalability of large ML problems. Our current focus is the distributed parallelization of optimization methods used to train large ML models. Our approaches, such as the Asynchronous Stochastic Gradient Descent (ASGD) solver, build on existing CC-HPC tools like our asynchronous communication framework GPI 2.0 and the distributed file system BeeGFS.
DLPS: Deep Learning in the Cloud
Our pre-configured and optimized Caffe instances make Deep Learning available on demand. We provide custom data layers optimized for shared BeeGFS storage of large training data and models. With DLPS, we introduce scalable and failsafe automatic meta-parameter optimization for Caffe Deep Learning models in the cloud.
Key features are:
- Automatically launching and scaling Caffe in the cloud
- Automatic Meta-Parameter Search
- Optimized data layers for BeeGFS distributed on demand storage
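The source does not specify which search strategy DLPS uses, but an automatic meta-parameter search of this kind can be sketched as a simple random search: sample a configuration, launch a training run, and keep the best result. All names, parameter ranges, and the stand-in objective below are illustrative assumptions, not the DLPS API.

```python
import random

# Hypothetical sketch of an automatic meta-parameter search. In a DLPS-like
# service, train(params) would launch a Caffe training job in the cloud and
# return a validation score; here it is a stand-in objective function.
def random_search(train, space, n_trials=20, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        # sample one configuration from the search space
        params = {name: rng.choice(values) for name, values in space.items()}
        score = train(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Illustrative search space and stand-in objective (minimum at lr=0.01,
# momentum=0.9); a real run would score a trained Caffe model instead.
space = {"lr": [0.1, 0.01, 0.001], "momentum": [0.0, 0.9, 0.99]}
objective = lambda p: (p["lr"] - 0.01) ** 2 + (p["momentum"] - 0.9) ** 2
best, score = random_search(objective, space)
```

Because each sampled configuration is an independent training job, such a search parallelizes naturally across cloud instances, which is what makes it scalable and failsafe: a lost job only costs one trial.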
SGD and ASGD - Scalable Deep Learning built on HPC Technology
Stochastic Gradient Descent (SGD) is the standard numerical method used to solve the core optimization problem for the vast majority of machine learning algorithms. In the context of large scale learning, as utilized by many Big Data applications, the efficient parallelization of SGD on distributed systems is a key performance factor.
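The core idea of SGD can be sketched in a few lines: instead of computing the gradient over the full data set, the weights are updated with the gradient of the loss on one sample at a time. The toy least-squares problem below is an illustration, not part of the Caffe implementation.

```python
import numpy as np

# Minimal serial SGD sketch: each step applies the gradient of the loss
# on a single randomly chosen sample.
def sgd(grad, w0, data, lr=0.1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):  # reshuffle every epoch
            w -= lr * grad(w, data[i])
    return w

# Toy problem: fit y = 3x by least squares; the per-sample gradient of
# (y - w*x)^2 with respect to w is -2x(y - w*x).
def grad(w, sample):
    x, y = sample
    return np.array([-2.0 * x * (y - w[0] * x)])

data = [(x, 3.0 * x) for x in np.linspace(0.1, 1.0, 50)]
w = sgd(grad, [0.0], data)
# w[0] converges toward the true slope 3.0
```

The serial inner loop is exactly what makes distributed parallelization non-trivial: every update depends on the previous one, so naive parallel workers must either synchronize or tolerate stale weights.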
We offer scalable implementations of state-of-the-art synchronous SGD algorithms for the distributed CPU- and GPU-based training of large Caffe models on HPC infrastructure. With our Asynchronous Stochastic Gradient Descent optimization algorithm (ASGD), we introduced a new algorithm that is able to efficiently parallelize SGD on distributed systems. ASGD outperforms current, mostly MapReduce-based, parallel SGD algorithms in solving the optimization task for large scale machine learning problems in distributed memory environments. We were able to show that ASGD is faster, has better convergence and scaling properties, and leads to better error rates than other state-of-the-art methods. With ASGD, non-convex optimization problems in high-dimensional parameter spaces can effectively be parallelized over hundreds or thousands of CPU and GPU nodes.
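The asynchronous principle can be illustrated in shared memory with lock-free worker threads, in the style of Hogwild!-type schemes: each worker reads the (possibly stale) shared weights, applies its gradient step, and writes back without waiting for the others. This is only a sketch of the idea; the actual ASGD implementation communicates updates across nodes via GPI-2's one-sided RDMA, not Python threads.

```python
import threading
import numpy as np

# Illustrative asynchronous SGD: workers update shared weights without
# locks or barriers. NOT the GPI-2 based ASGD, just the core idea.
def async_sgd(grad, w, data, n_workers=4, lr=0.1, epochs=100):
    def worker(shard):
        for _ in range(epochs):
            for sample in shard:
                # read possibly stale shared weights, step, write back
                w[:] = w - lr * grad(w, sample)

    shards = np.array_split(np.asarray(data, dtype=float), n_workers)
    threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w

# Same toy problem as above: fit y = 3x by least squares.
def grad(w, sample):
    x, y = sample
    return np.array([-2.0 * x * (y - w[0] * x)])

data = [(x, 3.0 * x) for x in np.linspace(0.1, 1.0, 50)]
w = async_sgd(grad, np.zeros(1), data)
```

Removing the synchronization barrier means no worker ever waits for a straggler, which is why asynchronous schemes scale to many nodes; the price is that updates may be computed from slightly stale weights.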
Our version of Caffe is built on top of our HPC Core-Technologies:
- asynchronous RDMA-based communication with GPI-2
- automatic parallelization, data and workflow management within GPI-SPACE
- scalable distributed filesystems on demand with BeeGFS
All of our HPC-Tools are Open Source.