Penalised regression improves imputation of cell-type specific expression using RNA-seq data from mixed cell populations compared to domain-specific methods

Wei-Yu Lin; Melissa Kartawinata; Bethany R Jebson; Restuadi Restuadi; CLUSTER Consortium; Lucy R Wedderburn; Chris Wallace

doi:10.1101/2023.09.11.556650

Abstract

Differential gene expression (DGE) studies often use bulk RNA sequencing of mixed cell populations because single-cell or sorted-cell sequencing may be prohibitively expensive. However, mixed-cell studies may miss differential expression that is restricted to specific cell populations. Computational deconvolution can be used to estimate cell fractions from bulk expression data and to infer average cell-type expression in a set of samples (e.g. cases or controls). Quantitative traits, however, require imputation of sample-level cell-type expression, which is less commonly addressed.
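
The distinction between group-average and sample-level cell-type expression can be made concrete with the linear mixture model that deconvolution methods commonly assume. The notation below is illustrative and is not taken from the preprint: $x_{ig}$ is the bulk expression of gene $g$ in sample $i$, $f_{ik}$ is the fraction of cell type $k$ in sample $i$, and $s_{ikg}$ is the expression of gene $g$ in cell type $k$ of sample $i$.

```latex
\begin{equation}
  x_{ig} = \sum_{k=1}^{K} f_{ik}\, s_{ikg} + \varepsilon_{ig},
  \qquad f_{ik} \ge 0, \quad \sum_{k=1}^{K} f_{ik} = 1 .
\end{equation}
```

Under this sketch, group-level deconvolution estimates a single profile $\bar{s}_{kg}$ shared by all samples in a group, whereas sample-level imputation targets each $s_{ikg}$ separately.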

Here, we assessed the accuracy of imputing sample-level cell-type expression using a real dataset in which RNA sequencing data from mixed peripheral blood mononuclear cells (PBMC) and from sorted cells (CD4, CD8, CD14, CD19) were generated from the same subjects (N=158). We compared three domain-specific methods (CIBERSORTx, bMIND and debCAM/swCAM) with two cross-domain machine learning methods, multiple-response LASSO and RIDGE, which had not previously been used for this task.
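
As a rough illustration of the multiple-response regression idea, the sketch below fits a closed-form ridge model that predicts all genes of one sorted cell type from bulk expression, then imputes held-out samples. It is not the authors' implementation: the data are synthetic, the function names `fit_multiresponse_ridge` and `predict` are hypothetical, and the penalty `lam` is fixed by hand here, whereas in practice it would normally be tuned by cross-validation.

```python
import numpy as np


def fit_multiresponse_ridge(X, Y, lam):
    """Fit ridge regression with several response columns at once.

    X: (samples, bulk genes), Y: (samples, cell-type genes).
    Returns the coefficient matrix W and an unpenalised intercept b.
    """
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean  # centring keeps the intercept out of the penalty
    Yc = Y - y_mean
    n_features = X.shape[1]
    # Closed-form solution: (Xc'Xc + lam * I) W = Xc'Yc
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b


def predict(X, W, b):
    """Impute cell-type expression for new bulk samples."""
    return X @ W + b


# Synthetic stand-in for paired bulk (X) and sorted-cell (Y) expression.
rng = np.random.default_rng(0)
n_train, n_test, n_genes = 80, 20, 10
W_true = rng.normal(size=(n_genes, n_genes))
X_train = rng.normal(size=(n_train, n_genes))
X_test = rng.normal(size=(n_test, n_genes))
Y_train = X_train @ W_true + 0.1 * rng.normal(size=(n_train, n_genes))
Y_test = X_test @ W_true + 0.1 * rng.normal(size=(n_test, n_genes))

W_hat, b_hat = fit_multiresponse_ridge(X_train, Y_train, lam=1.0)
Y_imputed = predict(X_test, W_hat, b_hat)

# Per-gene correlation between imputed and held-out expression.
per_gene_r = [
    np.corrcoef(Y_test[:, g], Y_imputed[:, g])[0, 1] for g in range(n_genes)
]
```

Fitting all response genes in one solve is what makes the model "multiple response". A LASSO variant would replace the squared penalty with a sparsity-inducing one and has no closed form, so it would need an iterative solver.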

LASSO/RIDGE showed higher sensitivity but lower specificity than the deconvolution methods for recovering DGE signals seen in the observed data. Overall, LASSO/RIDGE had higher areas under the curve (median=0.84-0.87 across cell types) than the deconvolution methods (0.62-0.77). Machine learning methods have the potential to outperform domain-specific methods when suitable training data are available.
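
The metrics above can be illustrated with a small sketch. It assumes that "truth" means a gene was called differentially expressed in the observed sorted-cell data, that each gene gets a score from the imputed data (for example, a transformed p-value), and that the area under the curve refers to the ROC curve. The toy values and function names are hypothetical and are not results from the study.

```python
def sensitivity_specificity(truth, called):
    """Return (sensitivity, specificity) for boolean truth and calls."""
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(truth, scores):
    """ROC AUC as the probability that a true-positive gene outscores
    a true-negative gene, counting ties as one half."""
    pos = [s for t, s in zip(truth, scores) if t]
    neg = [s for t, s in zip(truth, scores) if not t]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 3 genes with a DGE signal in observed data, 5 without.
truth = [True, True, True, False, False, False, False, False]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]  # from imputed data

called = [s >= 0.5 for s in scores]  # one arbitrary calling threshold
sens, spec = sensitivity_specificity(truth, called)
auc = roc_auc(truth, scores)
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarises ranking quality across all thresholds. This is why a method can have lower specificity at one operating point yet a higher AUC overall.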

Competing Interest Statement

The CLUSTER consortium has been provided with generous grants from AbbVie and Sobi. CW receives funding from MSD and GSK and is a part-time employee of GSK. These companies had no involvement in the work presented here.

Footnotes

We have updated the title to better reflect our main findings. We have also improved the quality of the figures and refined the captions for clarity and conciseness. The original Fig4(a) and supFig7 showed the same results, so we have kept supFig7 (now SupFig4) and presented the original Fig4(b) results in SupFig5. The gating strategy for sorting cells is provided as SupFig10. We have also renumbered the supplementary figures to reflect the order in which they are cited in the main text. Additionally, we revised the first paragraph of the Discussion and rounded the numbers in SupTab1 to the nearest integer.

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.