iSwitch Computing

Network communication can be a significant bottleneck for computing systems and data centers, particularly with the rise of artificial intelligence and the ensuing distributed training applications. For reinforcement learning (RL) training in particular, network communication can account for up to 83% of execution time. Existing techniques for handling gradient aggregation suffer from latency tradeoffs (e.g., when executed at a central parameter server) or scalability issues (e.g., for ring-based circular aggregation methods). A new approach is needed that better addresses network communication latency while leveraging existing data center features, such as programmable switches that offer some level of computational capability.

Professor Jian Huang has developed iSwitch, a technology that reduces latency bottlenecks in data center network communication by shifting aggregation processes to programmable switches. iSwitch is particularly relevant to AI applications such as RL-based distributed training, where frequent gradient aggregation typically requires a large number of network hops. In demonstrations, the technology achieved a system-level speedup of more than 3.5x for both synchronous and asynchronous RL distributed training and also improved scalability.
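The core idea of aggregating gradients inside the switch can be sketched as follows. This is purely illustrative and not the iSwitch implementation: the function name is hypothetical, and a real programmable switch would accumulate fixed-point values in on-switch registers as packets arrive, rather than operating on Python lists.

```python
# Illustrative sketch of in-network gradient aggregation (hypothetical,
# not the iSwitch implementation): the "switch" sums each worker's
# gradient as its packet passes through, then broadcasts the result,
# so gradients cross the network only once in each direction instead
# of being routed to and from a central parameter server.

def in_switch_aggregate(worker_gradients):
    """Accumulate per-parameter sums at the switch, then broadcast."""
    n_params = len(worker_gradients[0])
    accumulator = [0.0] * n_params          # per-parameter registers on the switch
    for grad in worker_gradients:           # one incoming packet per worker
        for i, g in enumerate(grad):
            accumulator[i] += g             # aggregate on the data path
    # Broadcast the aggregated gradient back to every worker.
    return [accumulator[:] for _ in worker_gradients]

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
results = in_switch_aggregate(grads)
print(results[0])  # each worker receives the summed gradient [9.0, 12.0]
```

Because aggregation happens on the data path itself, each synchronization round needs only one upstream and one downstream traversal per worker, which is the source of the hop-count reduction described above.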

Application

Data center networking

Benefits

iSwitch results in a system-level speedup of more than 3.5x for both synchronous and asynchronous RL distributed training and also improves scalability.

Publication 

Y. Li, I.-J. Liu, Y. Yuan, D. Chen, A. Schwing and J. Huang, "Accelerating Distributed Reinforcement Learning with In-Switch Computing," 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), Phoenix, AZ, USA, 2019, pp. 279-291.