Intro to Distributed Deep Learning Systems

What is distributed machine learning?

What problems is distributed machine learning research trying to solve?

How to use statistics and optimization theory and algorithms*

  • How long does it take for the optimization process to reach convergence? In other words, what is the convergence speed (or rate)?
  • How good is the converged solution?
  • How much training data is needed to guarantee a good solution?
  • Through distributed or parallel training, are our models and parameters guaranteed to converge to the same state as without acceleration?
  • If they don’t converge to the same state, then how far are we from the original solution, and how far are we from the true optimal solution?
  • What other assumptions/conditions are needed to reach a “good” convergence?
  • How much speedup (i.e. scalability) can we get from distributed training compared to non-distributed training, and how can we evaluate it? (A simple way to quantify this is sketched just after this list.)
  • How can we design the training process (e.g. data sampling, parameter updating) to ensure both good scalability and good convergence?
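
On the scalability question above, a standard way to make “how much faster” precise is to compare the wall-clock time needed to reach the same target accuracy with one worker versus n workers. These are the usual parallel-computing definitions, not something tied to any particular paper:

```latex
% T(1): time for a single worker to reach a fixed target loss/accuracy
% T(n): time for n workers to reach the same target
\[
  S(n) = \frac{T(1)}{T(n)} \quad \text{(speedup)}, \qquad
  E(n) = \frac{S(n)}{n} \quad \text{(scaling efficiency)}
\]
% E(n) = 1 means perfect linear scaling; in practice communication,
% synchronization, and any loss in per-step convergence push E(n) below 1.
```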

How to develop ML models or training algorithms that are more suitable for distributed settings**

How to build large-scale DML applications

How to develop parallel or distributed computer systems to scale up ML

  • Consistency: How can we ensure the consensus of multiple nodes if they are simultaneously working toward one goal? What if, for example, they are solving one optimization problem together, but with different partitions of the dataset?
  • Fault tolerance: If we distribute our workload to a cluster of 1000 computational nodes, what happens if one of the 1000 nodes crashes? Is there a way to recover other than just restarting the job from the very beginning? (A toy checkpointing sketch follows this list.)
  • Communication: ML involves a lot of I/O (e.g. disk read and write) and data processing procedures — can we design storage systems to enable faster I/O and non-blocking data processing procedures for different types of environments (e.g. single node local disk, distributed file systems, CPU I/O, GPU I/O, etc.)?
  • Resource management: Building a computer cluster is prohibitively expensive, so a cluster is usually shared by many users. How should we manage the cluster and allocate resources appropriately to meet everyone’s requests while maximizing usage?
  • Programming model: Should we program distributed ML models/algorithms in the same way we do for non-distributed ones? Could we design a new programming model that requires less coding and improves efficiency? Can we program in a single-node fashion while automatically amplifying the program with distributed computing techniques?
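
On the fault-tolerance question above, the usual answer is periodic checkpointing: if the job saves its state every so often, a crash only loses the work done since the last checkpoint instead of everything. Below is a toy single-process sketch; the file name, the checkpoint interval, and the commented-out training step are placeholders, not the API of any particular framework.

```python
import os
import pickle

CKPT_PATH = "model.ckpt"   # illustrative path; a real cluster would use shared storage

def save_checkpoint(step, params):
    """Persist the current training state so a restart can resume from it."""
    with open(CKPT_PATH, "wb") as f:
        pickle.dump({"step": step, "params": params}, f)

def load_checkpoint(init_params):
    """Resume from the latest checkpoint if one exists, otherwise start fresh."""
    if not os.path.exists(CKPT_PATH):
        return 0, init_params
    with open(CKPT_PATH, "rb") as f:
        state = pickle.load(f)
    return state["step"], state["params"]

step0, params = load_checkpoint(init_params=[0.0] * 10)   # placeholder model state
for step in range(step0, 1000):
    # params = training_step(params, next_batch())        # the actual work
    if step % 100 == 0:
        save_checkpoint(step, params)   # a crash now loses at most 100 steps
```

In a real deployment each worker (or the parameter server) would write its own shard of the state, and checkpoints would go to a replicated or distributed file system rather than a local file.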

Understanding Distributed Deep Learning

Data Parallelism
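
In data parallelism, every worker holds a full replica of the model and processes a different shard of each batch; the workers’ gradients are then averaged (via an all-reduce or a parameter server) so that all replicas apply the same update and stay consistent. Here is a minimal single-process sketch with simulated workers and a toy linear model; the data, model, and worker count are all illustrative.

```python
import numpy as np

def grad(w, X, y):
    """Gradient of mean squared error for the linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1024)

n_workers, lr = 4, 0.1
w = np.zeros(10)   # in a real system, each worker holds an identical replica of w
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

for step in range(100):
    # Each "worker" computes a gradient on its own shard of the data ...
    local_grads = [grad(w, X_s, y_s) for X_s, y_s in shards]
    # ... the gradients are averaged (an all-reduce in a real system),
    # and every replica applies the same update, keeping them in sync.
    w -= lr * np.mean(local_grads, axis=0)
```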

Model Parallelism****
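
In model parallelism, it is the model rather than the data that gets partitioned: each device holds only a slice of the parameters, and activations are shipped from device to device during the forward and backward passes. This is what makes it possible to train models too large for any single device’s memory. A minimal single-process sketch, where the “devices” are just entries in a dictionary and the layer placement is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
layer_shapes = [(784, 4096), (4096, 4096), (4096, 10)]

# Hypothetical placement: layer i lives on device i, so no single device
# ever holds all of the model's parameters.
devices = {i: rng.normal(scale=0.01, size=shape)
           for i, shape in enumerate(layer_shapes)}

def forward(x):
    last = max(devices)
    for i in sorted(devices):
        x = x @ devices[i]              # this matmul happens "on" device i
        if i != last:
            x = np.maximum(x, 0.0)      # ReLU between layers
        # Handing x to the next device is the per-example communication
        # cost that model parallelism pays, unlike data parallelism.
    return x

logits = forward(rng.normal(size=(32, 784)))   # a batch of 32 fake inputs
print(logits.shape)                            # -> (32, 10)
```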

Problems in distributed machine learning

*Here’s a good paper to learn more:

**I would recommend the following papers:

***Two notable works on large-scale distributed deep learning were published at NIPS 2012 and ICML 2013.

****If you are particularly interested in model parallelism, here are two notable papers:
