Performance evaluation of Horovod Cluster in ELKH Cloud

Distributed deep learning is becoming increasingly important due to the size of deep neural networks. The sheer volume of the input datasets used in training can significantly increase the training time of a model, and, as a consequence of physical limitations, the resources of a single node cannot be scaled to a level sufficient for efficient training. Various parallel and distributed solutions therefore exist to address this problem. In this paper, we measured the scalability of Horovod, a distributed deep learning framework (available as a reference architecture), with different parameters on the ELKH Cloud research infrastructure. The experiment was conducted to verify the scalability of the framework on a general-purpose infrastructure that was not designed primarily for distributed deep learning applications. Measurements of model accuracy were also performed to validate the distributed training process.
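
As a rough illustration of the data-parallel training style that Horovod provides, the sketch below shows a minimal Horovod + Keras training script. It is not the configuration measured in the paper: the model, dataset (MNIST), hyperparameters, and file names are placeholders chosen only to keep the example self-contained.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod: one process per worker/GPU.
hvd.init()

# Pin each local process to a single GPU, if any are visible.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Placeholder dataset and model (not the ones evaluated in the paper).
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype('float32') / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the number of workers, as Horovod recommends,
# then wrap the optimizer so gradients are averaged across workers via allreduce.
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    # Broadcast the initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Only rank 0 writes checkpoints, to avoid workers overwriting each other.
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint.h5'))

model.fit(x_train, y_train, batch_size=64, epochs=1,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
```

Such a script is typically launched with the horovodrun wrapper, e.g. `horovodrun -np 4 python train.py` on a single node, or `horovodrun -np 4 -H host1:2,host2:2 python train.py` across two hosts; the host names and process counts here are illustrative.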

DOI: 10.1109/ICCC202255925.2022.9922765

If you use this methodology, dataset, or the results, please cite the article:

A. Farkas, K. Póra, S. Szénási, G. Kertész and R. Lovas, "Evaluation of a distributed deep learning framework as a reference architecture for a cloud environment," 2022 IEEE 10th Jubilee International Conference on Computational Cybernetics and Cyber-Medical Systems (ICCC), Reykjavík, Iceland, 2022, pp. 000083-000088, doi: 10.1109/ICCC202255925.2022.9922765.

Additional Info

Field         Value
Author        Krisztián Póra
Maintainer    Krisztián Póra
Last Updated  April 4, 2023, 10:29 (UTC)
Created       April 4, 2023, 10:29 (UTC)