Distributed deep learning is becoming increasingly important due to the growing size of deep neural networks. The sheer volume of the input datasets can significantly prolong the training time of a model, and as a consequence of physical limitations, the resources of a single node cannot be scaled to a level sufficient for efficient training. Various parallel and distributed solutions therefore exist to address this problem. In this paper, we measured the scalability of Horovod, a distributed deep learning framework (available as a reference architecture), with different parameters on the ELKH Cloud research infrastructure. The experiment was conducted to verify the scalability of the framework on a general-purpose infrastructure that was not designed primarily for distributed deep learning applications. Measurements assessing model accuracy were also performed in order to validate the distributed training process.
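For context, the listing below sketches how a Horovod training script is typically structured, here with TensorFlow/Keras: each launched process pins one GPU, the optimizer is wrapped so that gradients are averaged across workers via allreduce, and the initial weights are broadcast from rank 0. The model, dataset, and hyperparameters are illustrative placeholders, not the configuration used in the experiments of this paper.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod; each launched process becomes one worker.
hvd.init()

# Pin each worker to a single local GPU (if any are present).
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Placeholder model and dataset; the paper's actual setup differs.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Common Horovod practice: scale the learning rate by the worker count,
# then wrap the optimizer so gradients are averaged via allreduce.
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    # Broadcast initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Real scripts also shard the dataset across workers; omitted for brevity.
model.fit(x_train, y_train,
          batch_size=64,
          epochs=1,
          callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)

Such a script is launched identically on every node, e.g. with horovodrun -np 4 -H node1:2,node2:2 python train.py; Horovod then handles the underlying MPI/NCCL communication.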
If you use the methodology, dataset, or results presented here, please cite the article:
A. Farkas, K. Póra, S. Szénási, G. Kertész and R. Lovas, "Evaluation of a distributed deep learning framework as a reference architecture for a cloud environment," 2022 IEEE 10th Jubilee International Conference on Computational Cybernetics and Cyber-Medical Systems (ICCC), Reykjavík, Iceland, 2022, pp. 000083-000088, doi: 10.1109/ICCC202255925.2022.9922765.