Publications

Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients

Published in NeurIPS (under review), 2025

Building upon the success of low-rank adaptation (LoRA), low-rank gradient projection (LoRP) has emerged as a promising solution for memory-efficient fine-tuning. However, existing LoRP methods typically treat each row of the gradient matrix as the default projection unit, leaving the role of projection granularity underexplored. In this work, we propose a novel framework, VLoRP, that extends low-rank gradient projection by introducing an additional degree of freedom for controlling the trade-off between memory efficiency and performance, beyond the rank hyper-parameter. Through this framework, we systematically explore the impact of projection granularity, demonstrating that finer-grained projections lead to enhanced stability and efficiency even under a fixed memory budget. For the optimization of VLoRP, we present ProjFactor, an adaptive, memory-efficient optimizer that significantly reduces memory requirements while ensuring competitive performance, even in the presence of gradient accumulation. Additionally, we provide a theoretical analysis of VLoRP, demonstrating the descent and convergence of its optimization trajectory under both SGD and ProjFactor. Extensive experiments are conducted to validate our findings, covering tasks such as commonsense reasoning, MMLU, and GSM8K.
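The granularity trade-off can be illustrated with a toy sketch (plain NumPy, a shared random Gaussian projection, and hypothetical names such as `project_gradient` and `chunk_cols`; this is an illustration of the idea, not the paper's implementation):

```python
import numpy as np

def project_gradient(G, chunk_cols, rank, seed=0):
    """Toy various-grained low-rank projection: reshape the gradient so
    each projection unit spans `chunk_cols` entries, compress each unit
    to `rank` coordinates with a random Gaussian map, then map back for
    the parameter update."""
    m, n = G.shape
    assert n % chunk_cols == 0
    # Smaller chunk_cols = finer granularity: more, shorter projection units.
    G_fine = G.reshape(m * (n // chunk_cols), chunk_cols)
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((chunk_cols, rank)) / np.sqrt(rank)
    G_proj = G_fine @ P      # compressed state: m*n*rank/chunk_cols entries
    G_back = G_proj @ P.T    # approximate reconstruction
    return G_back.reshape(m, n)
```

Since the compressed state holds m·n·rank/chunk_cols entries, halving the chunk size while halving the rank keeps the memory footprint fixed, which mirrors the granularity-versus-rank trade-off described above.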

Recommended citation: Yezhen Wang, Zhouhao Yang, et al. "Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients." arXiv preprint arXiv:2505.01744 (2025).

Statistical Mean Estimation with Coded Relayed Observations

Published in IEEE Transactions on Information Theory (under review), 2025

We consider a problem of statistical mean estimation in which the samples are not observed directly, but are instead observed by a relay (“teacher”) that transmits information through a memoryless channel to the decoder (“student”), who then produces the final estimate. We consider the minimax estimation error in the large deviations regime, and establish achievable error exponents that are tight in broad regimes of the estimation accuracy and channel quality. In contrast, two natural baseline methods are shown to yield strictly suboptimal error exponents. We initially focus on Bernoulli sources and binary symmetric channels, and then generalize to sub-Gaussian and heavy-tailed settings along with arbitrary discrete memoryless channels.

Recommended citation: Yan Hao Ling, Zhouhao Yang, and Jonathan Scarlett. "Statistical Mean Estimation with Coded Relayed Observations." arXiv preprint arXiv:2505.09098 (2025).

Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization

Published in NeurIPS, 2024

In this paper, we introduce Forward Gradient Unrolling with Forward Gradient, abbreviated as $(FG)^2U$, which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization.
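The forward-gradient ingredient can be sketched in isolation (a minimal NumPy illustration of the unbiased tangent-sampling estimator, with a central difference standing in for a true Jacobian-vector product; not the paper's code):

```python
import numpy as np

def forward_gradient(f, x, rng, eps=1e-5):
    """Sample a random tangent v and return f'(x; v) * v, an unbiased
    estimate of grad f(x), since E[v v^T] = I for standard Gaussian v."""
    v = rng.standard_normal(x.shape)
    # Central difference approximates the JVP (directional derivative).
    dd = (f(x + eps * v) - f(x - eps * v)) / (2 * eps)
    return dd * v

f = lambda x: float(np.sum(x ** 2))   # grad f(x) = 2x
x = np.array([1.0, -2.0, 0.5])
rng = np.random.default_rng(0)
# Averaging many single-tangent estimates recovers the gradient.
est = np.mean([forward_gradient(f, x, rng) for _ in range(50000)], axis=0)
```

Each estimate costs only a forward pass, which is what makes the approach attractive when reverse-mode unrolling through the inner problem is too memory-hungry.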

Recommended citation: Qianli Shen, Yezhen Wang, Zhouhao Yang, et al. "Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization." arXiv preprint arXiv:2406.14095 (2024).

Bias-Variance Trade-off in Physics-Informed Neural Networks with Randomized Smoothing for High-Dimensional PDEs

Published in SIAM Journal on Scientific Computing, 2023

In this paper, we present a comprehensive analysis of biases in physics-informed neural networks with randomized smoothing (RS-PINN), attributing them to the nonlinearity of the mean squared error (MSE) loss as well as the intrinsic nonlinearity of the PDE itself. We also propose tailored bias-correction techniques, delineating their application based on the order of PDE nonlinearity. The derivation of an unbiased RS-PINN allows for a detailed examination of its advantages and disadvantages compared to the biased version.
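The MSE-nonlinearity bias admits a one-screen numerical illustration (generic Monte Carlo, not the paper's PDE setting): squaring a sample mean inflates it by Var/N, and multiplying two independent replicas removes that bias, which is the flavor of correction developed for nonlinear losses.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, N, trials = 0.7, 8, 200_000
noisy = lambda shape: mu + rng.standard_normal(shape)  # unit-variance samples of a mean-mu quantity

# Biased: squaring an N-sample Monte Carlo mean overshoots mu^2 by Var/N = 1/N.
batch_means = noisy((trials, N)).mean(axis=1)
biased = (batch_means ** 2).mean()        # concentrates near mu**2 + 1/N

# Correction: multiply two independent sample means instead of squaring one.
a = noisy((trials, N)).mean(axis=1)
b = noisy((trials, N)).mean(axis=1)
unbiased = (a * b).mean()                 # concentrates near mu**2
```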

Recommended citation: Zheyuan Hu, Zhouhao Yang, et al. "Bias-variance trade-off in physics-informed neural networks with randomized smoothing for high-dimensional PDEs." arXiv preprint arXiv:2311.15283 (2023).

Distributionally Robust Policy Gradient for Offline Contextual Bandits

Published in AISTATS 2023, 2023

In this paper, we employ a distributionally robust policy gradient method, DROPO, to account for the distributional shift between the static logging policy and the learning policy in policy gradient. Our approach conservatively estimates the conditional reward distribution and updates the policy accordingly.
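One standard way to make such a conservative estimate concrete is the KL-dual form of a worst-case expectation; the sketch below (plain NumPy, a grid search over the dual variable, hypothetical name `conservative_value`; illustrative of the robust-estimation idea rather than DROPO itself) lower-bounds the expected reward over all distributions within KL radius `delta` of the empirical one.

```python
import numpy as np

def conservative_value(rewards, delta, lams=np.logspace(-2, 2, 200)):
    """KL-dual lower bound:
        inf_{KL(Q||P) <= delta} E_Q[r]
          = sup_{lam > 0} [ -lam * log E_P[exp(-r/lam)] - lam * delta ],
    approximated by maximizing over a grid of lam values.
    Assumes bounded rewards so exp(-r/lam) stays numerically tame."""
    r = np.asarray(rewards, dtype=float)
    vals = [-lam * np.log(np.mean(np.exp(-r / lam))) - lam * delta
            for lam in lams]
    return max(vals)
```

The bound sits between the minimum and the empirical mean reward, and tightens toward the mean as `delta` shrinks to zero.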

Recommended citation: Zhouhao Yang, et al. "Distributionally robust policy gradient for offline contextual bandits." International Conference on Artificial Intelligence and Statistics. PMLR, 2023.

A General Framework for Accurate and Private Mean Estimation

Published in IEEE Signal Processing Letters, 2022

In this letter, we present a differentially private algorithm that accurately estimates the mean of an underlying population with a given cumulative distribution function.
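For intuition, the classical Laplace-mechanism baseline for private mean estimation looks like this (a textbook sketch with hypothetical names, not the algorithm proposed in the letter):

```python
import numpy as np

def private_mean(x, lo, hi, eps, rng):
    """eps-differentially private mean of samples clipped to [lo, hi].
    Changing one sample moves the clipped mean by at most (hi - lo)/n,
    so Laplace noise with scale sensitivity/eps suffices for eps-DP."""
    x = np.clip(np.asarray(x, dtype=float), lo, hi)
    sensitivity = (hi - lo) / len(x)
    return x.mean() + rng.laplace(scale=sensitivity / eps)
```

The noise scale shrinks as 1/n, so at a fixed privacy level the accuracy improves with sample size.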

Recommended citation: Zhouhao Yang, Xingyu Xu, and Yuantao Gu. "A general framework for accurate and private mean estimation." IEEE Signal Processing Letters 29 (2022): 2293-2297.