Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization

Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, Xi Luo, Dheevatsa Mudigere, Jongsoo Park, Misha Smelyanskiy, Alex Aiken

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

29 Scopus citations

Abstract

This paper presents Unity, the first system that jointly optimizes algebraic transformations and parallelization in distributed DNN training. Unity represents both parallelization and algebraic transformations as substitutions on a unified parallel computation graph (PCG), which simultaneously expresses the computation, parallelization, and communication of a distributed DNN training procedure. Optimizations, in the form of graph substitutions, are automatically generated given a list of operator specifications, and are formally verified correct using an automated theorem prover. Unity then uses a novel hierarchical search algorithm to jointly optimize algebraic transformations and parallelization while maintaining scalability. The combination of these techniques provides a generic and extensible approach to optimizing distributed DNN training, capable of integrating new DNN operators, parallelization strategies, and model architectures with minimal manual effort. We evaluate Unity on seven real-world DNNs running on up to 192 GPUs on 32 nodes and show that Unity outperforms existing DNN training frameworks by up to 3.6× while keeping optimization times under 20 minutes. Unity is available to use as part of the open-source DNN training framework FlexFlow at https://github.com/flexflow/flexflow.
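The abstract's central idea, that both parallelization and algebraic transformations can be written as graph substitutions that preserve the computed result, can be illustrated with a toy sketch. The code below is not Unity or FlexFlow code: every name in it (`matmul`, `partition`, `combine`, `linear_data_parallel`, `fused_two_linears`) is hypothetical, the "graph" is reduced to plain function composition on nested lists, and equivalence is only spot-checked on one input rather than formally verified as the paper describes.

```python
# Toy illustration only: all names here are hypothetical, not FlexFlow/Unity APIs.

def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix, both nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def partition(x, parts):
    """Split the batch (row) dimension into `parts` equal shards."""
    size = len(x) // parts
    return [x[i * size:(i + 1) * size] for i in range(parts)]

def combine(shards):
    """Concatenate shards back along the batch dimension."""
    return [row for shard in shards for row in shard]

def linear_original(x, w):
    # Original graph: a single matmul node.
    return matmul(x, w)

def linear_data_parallel(x, w, parts=2):
    # Parallelization as a substitution: partition -> per-shard matmul -> combine.
    return combine([matmul(shard, w) for shard in partition(x, parts)])

def fused_two_linears(x, w1, w2):
    # Algebraic substitution: two matmuls sharing an input become one matmul
    # on column-concatenated weights, followed by a column split.
    fused_w = [r1 + r2 for r1, r2 in zip(w1, w2)]
    y = matmul(x, fused_w)
    n1 = len(w1[0])
    return [row[:n1] for row in y], [row[n1:] for row in y]

# Small integer inputs so equality checks are exact.
x = [[1, 2], [3, 4], [5, 6], [7, 8]]
w1 = [[1, 0], [0, 1]]
w2 = [[2], [3]]

# Both substitutions leave the computed values unchanged on this input.
assert linear_data_parallel(x, w1) == linear_original(x, w1)
assert fused_two_linears(x, w1, w2) == (matmul(x, w1), matmul(x, w2))
```

In this sketch the two rewrites are applied independently; the joint optimization described in the abstract concerns searching over combinations of such substitutions, which the toy example does not attempt.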

Original language: English
Title of host publication: Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
Publisher: Unknown Publisher
Pages: 267-284
Number of pages: 18
ISBN (Electronic): 9781939133281
State: Published - 2022
Event: 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
Duration: Jan 1 2022 → …

