PaddlePaddle EDL: Elastic Deep Learning
While many hardware and software manufacturers are working on improving the running time of deep learning jobs, EDL optimizes
the global utilization of the cluster, and
the waiting time of job submitters.
For more about the EDL project, please refer to this invited blog post on the official Kubernetes blog.
EDL includes two parts:
a Kubernetes controller for the elastic scheduling of distributed deep learning jobs, and
changes that make PaddlePaddle a fault-tolerant deep learning framework.
This directory contains the Kubernetes controller. For more information about fault tolerance, please refer to the design document.
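To give a rough feel for what elastic scheduling means, here is a minimal toy sketch in Python. It is not the EDL controller's actual algorithm. The `Job` class, the `min_trainers` and `max_trainers` fields, and the `elastic_allocate` function are all hypothetical names invented for this illustration, and the cluster is reduced to a single count of interchangeable trainer slots.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    min_trainers: int  # hypothetical: fewest trainers the job can run with
    max_trainers: int  # hypothetical: most trainers the job can make use of


def elastic_allocate(jobs, capacity):
    """Toy allocation: admit jobs at their minimum, then spread idle slots."""
    alloc = {}
    free = capacity
    # Pass 1: admit jobs in submission order while their minimum still fits,
    # so a new job can start without waiting for its full allocation.
    for job in jobs:
        if job.min_trainers <= free:
            alloc[job.name] = job.min_trainers
            free -= job.min_trainers
    # Pass 2: hand leftover slots out one at a time to admitted jobs that
    # can still grow, so the cluster does not sit idle.
    grew = True
    while free > 0 and grew:
        grew = False
        for job in jobs:
            if free > 0 and job.name in alloc and alloc[job.name] < job.max_trainers:
                alloc[job.name] += 1
                free -= 1
                grew = True
    return alloc


jobs = [Job("a", 2, 8), Job("b", 1, 4)]
print(elastic_allocate(jobs, capacity=10))

# A third job arrives: recomputing the allocation shrinks the running jobs
# so the newcomer can start right away.
jobs.append(Job("c", 3, 3))
print(elastic_allocate(jobs, capacity=10))
```

In this sketch, recomputing the allocation whenever a job arrives or finishes serves both goals listed above: idle slots are handed to jobs that can use them, and a new job only has to wait until its minimum fits. A real controller would also have to deal with heterogeneous resources, priorities, and the fact that shrinking a job is only safe when the training framework tolerates losing trainers.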
We deployed EDL on a real Kubernetes cluster, dlnel.com, which is open to graduate students of Tsinghua University.