TY - CONF
T1 - Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization
T2 - 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
AU - Unger, Colin
AU - Jia, Zhihao
AU - Wu, Wei
AU - Lin, Sina
AU - Baines, Mandeep
AU - Narvaez, Carlos Efrain Quintero
AU - Ramakrishnaiah, Vinay
AU - Prajapati, Nirmal
AU - McCormick, Pat
AU - Mohd-Yusof, Jamaludin
AU - Luo, Xi
AU - Mudigere, Dheevatsa
AU - Park, Jongsoo
AU - Smelyanskiy, Misha
AU - Aiken, Alex
PY - 2022
Y1 - 2022
N2 - This paper presents Unity, the first system that jointly optimizes algebraic transformations and parallelization in distributed DNN training. Unity represents both parallelization and algebraic transformations as substitutions on a unified parallel computation graph (PCG), which simultaneously expresses the computation, parallelization, and communication of a distributed DNN training procedure. Optimizations, in the form of graph substitutions, are automatically generated given a list of operator specifications, and are formally verified correct using an automated theorem prover. Unity then uses a novel hierarchical search algorithm to jointly optimize algebraic transformations and parallelization while maintaining scalability. The combination of these techniques provides a generic and extensible approach to optimizing distributed DNN training, capable of integrating new DNN operators, parallelization strategies, and model architectures with minimal manual effort. We evaluate Unity on seven real-world DNNs running on up to 192 GPUs on 32 nodes and show that Unity outperforms existing DNN training frameworks by up to 3.6× while keeping optimization times under 20 minutes. Unity is available to use as part of the open-source DNN training framework FlexFlow at https://github.com/flexflow/flexflow.
UR - http://www.scopus.com/inward/record.url?scp=85140355399&partnerID=8YFLogxK
M3 - Conference contribution
T3 - Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
SP - 267
EP - 284
BT - Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
PB - USENIX Association
Y2 - 11 July 2022 through 13 July 2022
ER -