Background: Zeroth-order (ZO) optimization is a powerful alternative to gradient-based learning when backpropagation is impossible or infeasible, for example in black-box settings or when computing the gradient is too expensive. Instead of backpropagating, ZO methods construct an estimator of the gradient using only forward passes, which reduces memory consumption and works even when gradient information is unavailable. With the rise of large language models, ZO optimization can offer a cheaper way to train them.
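To make this concrete, here is a minimal sketch of a common two-point (Gaussian-smoothing) ZO gradient estimator in NumPy; the function names, smoothing parameter, and step size are illustrative choices rather than the exact estimator used in my work.

```python
import numpy as np

def zo_gradient_estimate(f, x, mu=1e-3, num_samples=1):
    """Two-point zeroth-order gradient estimator (Gaussian smoothing).

    Uses only forward evaluations of f -- no backpropagation.
    """
    d = x.shape[0]
    g_hat = np.zeros(d)
    for _ in range(num_samples):
        u = np.random.randn(d)  # random search direction
        g_hat += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g_hat / num_samples

# Example: estimate the gradient of a simple quadratic and take one ZO-SGD step.
f = lambda x: 0.5 * np.dot(x, x)
x = np.ones(10)
x -= 0.1 * zo_gradient_estimate(f, x, num_samples=4)
```

Each estimate costs two forward evaluations per sampled direction; averaging over several directions trades additional forward passes for a lower-variance estimate.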
There is a catch: ZO gradient estimates have variance that scales linearly with the problem's dimensionality, leading to poor convergence when training large models.
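To see where the linear scaling comes from, here is the standard back-of-the-envelope calculation for the Gaussian-smoothing estimator above, under the simplifying assumption that the smoothing parameter is taken to zero:

```latex
% Two-point Gaussian-smoothing estimator and its limit as \mu -> 0:
\[
  \hat{g}_\mu(x) = \frac{f(x + \mu u) - f(x - \mu u)}{2\mu}\, u,
  \qquad u \sim \mathcal{N}(0, I_d),
  \qquad \hat{g}_\mu(x) \;\xrightarrow{\;\mu \to 0\;}\; \big(u^\top \nabla f(x)\big)\, u .
\]
% The limiting estimator is unbiased, but its second moment carries a factor of d:
\[
  \mathbb{E}_u\big[\hat{g}(x)\big] = \nabla f(x),
  \qquad
  \mathbb{E}_u\big[\|\hat{g}(x)\|^2\big] = (d + 2)\,\|\nabla f(x)\|^2 ,
\]
% so the variance (mean squared error) grows linearly with the dimension d:
\[
  \mathbb{E}_u\big[\|\hat{g}(x) - \nabla f(x)\|^2\big] = (d + 1)\,\|\nabla f(x)\|^2 .
\]
```

Averaging over k independent directions divides this variance by k, but matching the quality of a backpropagated gradient would require roughly d forward passes per step, which is prohibitive for models with billions of parameters.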
My research involves scaling up ZO optimization to larger problems than it can generally handle. To do this, I'm using a paradigm called meta-learning, also known as "learning to learn." The core idea is to learn a model, called a meta-optimizer, that reduces the variance of the ZO gradient estimate. We can then use this improved estimate to solve the large problem with better convergence.
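For intuition, here is a minimal sketch of one plausible meta-optimizer design, assuming a tiny coordinatewise MLP that reads the raw ZO estimate together with a momentum-like running average and emits a refined update; the architecture, input features, and hyperparameters are illustrative assumptions rather than the specific designs I'm studying, and the meta-training loop (unrolled ZO optimization on small tasks) is omitted.

```python
import torch
import torch.nn as nn

class CoordinatewiseMetaOptimizer(nn.Module):
    """Illustrative meta-optimizer: a tiny MLP applied per coordinate.

    Per-coordinate input features: the raw ZO gradient estimate and an
    exponential moving average of past estimates (a cheap variance-reduction
    signal). Output: a refined update direction for that coordinate.
    """
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, zo_estimate, ema):
        feats = torch.stack([zo_estimate, ema], dim=-1)  # shape (d, 2)
        return self.net(feats).squeeze(-1)               # shape (d,)

# At deployment the meta-optimizer only post-processes forward-pass estimates;
# its own parameters would be meta-trained beforehand (not shown here).
meta_opt = CoordinatewiseMetaOptimizer()
d = 10
raw_estimate = torch.randn(d)   # stand-in for a ZO gradient estimate
ema = torch.zeros(d)            # stand-in for the running average
refined_update = meta_opt(raw_estimate, ema)
```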
Currently, I'm researching different meta-optimizer architectures and testing them on ZO training of compressed language models.