Abstract: As the scale of distributed training grows, it introduces substantial communication overhead within clusters. Some works attempt to reduce this communication cost through gradient compression or ...