Disentangling associations between complex traits and cell types with seismic - Nature Communications

Through a deep exploration of neurological disease-associated brain cell types, we find that cell type definition from input scRNA-seq data is an important, yet currently underappreciated, factor that influences downstream findings. Previous studies [6,8,11,12] have typically used broad characterizations of cell types, such as “telencephalon projecting excitatory neurons” and “frontal cortex neurons,” without accounting for finer regional or tissue-specific distinctions. This coarse characterization can obscure valuable biological insights, especially when cell diversity is high. While broad cell type characterizations may be adequate when studying relatively homogeneous cell populations (e.g., microglial cells in the brain), this approach falls short for highly diverse cell populations like neurons. For instance, neurodegenerative diseases preferentially target neurons with very distinct regional and cell type identities. In Alzheimer’s disease, neurons in the entorhinal cortex are especially vulnerable, whereas neurons in even neighboring brain regions, such as the dentate gyrus and CA2/CA3 in the hippocampus, are not [13]. We show that using finer granularities for cell type characterization reveals more specific cell type-trait links, which better reflect true biological mechanisms. Notably, seismic consistently outperforms other methods in identifying disease-associated cell types across these different cell type characterizations. Furthermore, we demonstrate the importance of considering different GWAS endpoints to reveal disease mechanisms, reporting, to our knowledge, the first computational identification of a neuronal association with an Alzheimer’s disease biomarker (tau level in cerebrospinal fluid).
Together, our results expand current notions of best practices for cell type-trait association analyses and provide a methodological toolkit to take fuller advantage of both scRNA-seq and GWAS data to unravel the intricate interplay between tissue/cell type and complex traits.

Many cell-type-trait association methods consider the same inputs — variant-trait information from GWAS resolved to gene-trait relevance using MAGMA and single-cell expression data — to find statistically significant associations between cell types and traits (Fig. 1A). However, these methods may rely on arbitrary gene thresholds or cell-type mean expression profiles to identify trait-implicated cell types (Table 1), without accounting for the global relationship of cell type specificity and disease risk. Here, we introduce a novel integration framework, Single-cell Expression Integration System for Mapping genetically Implicated Cell types (seismic), that overcomes the limitations of previous methods to provide a threshold-free, fast, and interpretable method for combining single cell expression data with gene-trait relationships (Fig. 1B).

At the core of seismic is a cell type-specificity score (“Methods”), which calculates the specificity and consistency of expression for each gene in a cell type relative to all other cell types. The seismic specificity score is designed to compare the relative probability of a gene in a cell type with consistently higher expression than background cells among all cell types (“Methods”), thus providing a global view of gene specificity in a cell type. We empirically assess these scores in various pancreatic cell types in the Tabula Muris FACS datasets. Even though these cell types exhibit highly correlated expression patterns, we find that established marker genes are all ranked among the highest in their corresponding cell types (Supplementary Fig. 1A, B). Moreover, marker genes consistently show significantly higher specificity scores in their target cell types compared to those same marker genes in non-pancreatic cell types or housekeeping genes across all cell types (Supplementary Fig. 1C), highlighting the score’s ability to capture biologically meaningful cell-type-specific patterns.
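To make the idea of a specificity score concrete, the sketch below computes a deliberately simplified stand-in. It is not the seismic score, whose definition is given in the paper's Methods and is not reproduced here. The function name `toy_specificity`, the weighting (mean expression multiplied by the fraction of expressing cells), and the toy counts are all assumptions made for illustration.

```python
from collections import defaultdict


def toy_specificity(expr, labels):
    """Toy specificity score (NOT the seismic definition).

    expr:   list of per-cell dicts mapping gene -> expression value
    labels: cell type label for each cell, same order as expr
    Returns {cell_type: {gene: score}}; for each expressed gene the
    scores sum to 1 across cell types.
    """
    genes = sorted({g for cell in expr for g in cell})
    cells_by_type = defaultdict(list)
    for cell, label in zip(expr, labels):
        cells_by_type[label].append(cell)

    # Weight mean expression by how consistently the gene is detected.
    weighted = {}
    for ctype, cells in cells_by_type.items():
        weighted[ctype] = {}
        for g in genes:
            values = [c.get(g, 0.0) for c in cells]
            mean_expr = sum(values) / len(values)
            frac_expressing = sum(v > 0 for v in values) / len(values)
            weighted[ctype][g] = mean_expr * frac_expressing

    # Normalize across cell types so each gene is scored relative to all others.
    scores = {ctype: {} for ctype in weighted}
    for g in genes:
        total = sum(weighted[ctype][g] for ctype in weighted)
        for ctype in weighted:
            scores[ctype][g] = weighted[ctype][g] / total if total > 0 else 0.0
    return scores


# Made-up counts for four cells from two pancreatic cell types.
cells = [
    {"Ins2": 9.0, "Actb": 5.0},
    {"Ins2": 7.0, "Actb": 4.0},
    {"Gcg": 8.0, "Actb": 5.0},
    {"Gcg": 6.0, "Ins2": 1.0, "Actb": 4.0},
]
labels = ["beta", "beta", "alpha", "alpha"]
scores = toy_specificity(cells, labels)
```

In this toy input the marker gene Ins2 scores close to 1 in beta cells, while the housekeeping gene Actb is split evenly between the two cell types, which mirrors the qualitative behavior described above for marker and housekeeping genes.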

The seismic specificity score is also robust to different characterizations of cell types, whether broader groupings or more specific subclusters: it is insensitive to arbitrary re-labeling of homogeneous populations while exploiting genuine substructure to achieve high resolution in identifying trait-associated cell types (Supplementary Note 1, Supplementary Fig. 2). Furthermore, the seismic specificity score demonstrates robustness to noise in cell cluster characterizations. In real-world datasets, cell type annotations derived from unsupervised clustering and subsequent manual curation can inevitably contain some mislabeled cells or mixtures of closely related cells. Through a label permutation simulation, we find that the seismic specificity score shows superior resilience to cell type label noise compared to other common specificity metrics (Supplementary Fig. 3), demonstrating that the cell type specificity profiles are relatively less sensitive to slight inaccuracies in upstream cell type definitions.

After calculating the seismic specificity score for a collection of cell types in a scRNA-seq dataset, the seismic framework then applies a regression model to test for significant associations between the specificity scores and MAGMA gene z-scores, under the assumption that genetically implicated cell types specifically express more of the genes with higher trait relevance (“Methods”). For significantly associated cell types, the seismic framework also introduces influential observation analysis to the corresponding regression model, enabling what we term ‘influential gene analysis.’ To our knowledge, influential gene analysis is the first method to systematically rank and identify genes driving purported cell type-trait associations.
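The two steps described here, a regression of gene-level z-scores on specificity scores followed by an influence analysis, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the seismic implementation: it uses a single-predictor least-squares fit, a normal approximation for the one-sided p-value, and a leave-one-out change in slope as the influence measure. The helper names (`fit_slope`, `association_test`, `influential_genes`) and the synthetic data are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def fit_slope(x, y):
    """Least-squares slope of y on x and its standard error."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r * r for r in resid) / (len(x) - 2)
    return slope, math.sqrt(sigma2 / sxx)


def association_test(spec, z):
    """One-sided test for a positive slope (normal approximation)."""
    slope, se = fit_slope(spec, z)
    return slope, 1.0 - NormalDist().cdf(slope / se)


def influential_genes(genes, spec, z):
    """Rank genes by how much removing each one lowers the fitted slope."""
    full_slope, _ = fit_slope(spec, z)
    impact = []
    for i, gene in enumerate(genes):
        loo_slope, _ = fit_slope(spec[:i] + spec[i + 1:], z[:i] + z[i + 1:])
        impact.append((full_slope - loo_slope, gene))
    return sorted(impact, reverse=True)


# Synthetic data: z-scores rise with specificity, plus one planted driver gene.
rng = random.Random(0)
genes = [f"gene_{i}" for i in range(200)]
spec = [rng.random() for _ in genes]
z = [2.0 * s + rng.gauss(0, 1) for s in spec]
spec[0], z[0] = 1.0, 8.0  # highly specific and highly trait-relevant

slope, p_value = association_test(spec, z)
ranking = influential_genes(genes, spec, z)
top_gene = ranking[0][1]
```

Genes with a large positive leave-one-out change are those pulling the association upward, which is the sense in which a gene is "influential" in this sketch. A full analysis would typically use standardized influence statistics and any covariates specified in the Methods.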

To assess how well seismic and three of the most commonly used cell type-trait identification methods (scDRS, FUMA, and S-MAGMA) are calibrated to false positives, we perform a systematic simulation to detect the frequency of type I errors. We first randomly select 10 sets of MAGMA trait z-scores from GWAS (Supplementary Data 1) and subsample 10 expression datasets, each containing 10,000 cells from the Tabula Muris (TM) FACS scRNA-seq data (Supplementary Data 2). For each subsampled expression dataset, we randomly select 100 cells as a cell type of interest (“Methods”). Next, across 10,000 runs, we randomize the gene labels in the expression data and compare the p-values reported by each method for the association between the randomly assigned target cell type and trait. We find that all methods generally control type I error, with FUMA being markedly conservative, potentially limiting its detection power. seismic is, on average, conservative and has stable performance. In contrast, using the analytically transformed p-values from scDRS, we see slightly inflated p-values at extreme quantiles, and S-MAGMA can also, at times, report inflated p-values (Fig. 2A). The seismic and scDRS implementations enable examination of the effect of randomization of MAGMA trait z-scores, and we observe the same trends, where seismic still has generally well-calibrated p-values, and scDRS has slight inflation at tail quantiles (Supplementary Fig. 4).
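The calibration check described above can be illustrated with a small permutation experiment. The sketch below is an assumption-laden stand-in, not the paper's simulation: it uses a simple one-sided regression test in place of the four methods, synthetic scores in place of Tabula Muris data, and 2,000 shuffles instead of 10,000 runs. The function names and parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def one_sided_p(x, y):
    """One-sided p-value for a positive least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    sigma2 = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)) / (len(x) - 2)
    return 1.0 - NormalDist().cdf(slope / math.sqrt(sigma2 / sxx))


def null_rejection_rate(spec, z, runs, alpha, rng):
    """Shuffle gene labels and count how often the test rejects at level alpha."""
    shuffled = list(spec)
    hits = 0
    for _ in range(runs):
        rng.shuffle(shuffled)  # breaks any real gene-to-score correspondence
        if one_sided_p(shuffled, z) < alpha:
            hits += 1
    return hits / runs


rng = random.Random(1)
spec = [rng.random() for _ in range(200)]    # stand-in specificity scores
z = [rng.gauss(0, 1) for _ in range(200)]    # stand-in gene z-scores
rate = null_rejection_rate(spec, z, runs=2000, alpha=0.05, rng=rng)
print(f"empirical type I error at alpha=0.05: {rate:.3f}")
```

A well-calibrated test rejects in roughly 5% of shuffles at this threshold. Rates well above that indicate inflation, and rates well below it indicate a conservative test, which is the distinction drawn between the methods in Fig. 2A.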

Complex, polygenic traits frequently involve multiple disease-associated cell types and subtle expression perturbations across a large number of genes. In order to evaluate the extent to which seismic can correctly identify trait-associated cell types reflective of this complexity, we simulate several different scenarios: (1) single or multiple associated cell types; (2) distinct or overlapping genes driving the trait association across cell types; (3) strong or subtle expression perturbations linked with the trait. We use scDesign3 to generate realistic synthetic count-level data that mimics trait-specific expression perturbations at different effect sizes (“Methods”). Applying the four methods to this simulated data, we compare their power to correctly identify the perturbed cell types as trait-associated by calculating the proportion of simulations where the perturbed cell types are significantly associated with the trait (FDR 0.98 across all 27 traits, Supplementary Fig. 9).

We then apply S-MAGMA, FUMA, and scDRS to the same Tabula Muris FACS dataset and GWAS traits, checking for consistency with seismic’s results and any differences in associations (Supplementary Data 4, Supplementary Figs. 10-16). Notably, 89% of associations identified by seismic are also detected by at least one of these frameworks (Fig. 3B, Supplementary Figs. 10-16), where seismic captures most of FUMA’s reported associations (95%), followed by scDRS (88%), then S-MAGMA (81%). This high degree of overlap highlights seismic’s robustness and alignment with established methods. The high correlation is further illustrated in a detailed between-method comparison, which reveals that seismic consistently achieves the highest trait-wise concordance among the methods, as measured by Spearman’s correlation. Specifically, for 26 of the 27 traits examined, seismic and one other framework achieve the highest concordance (Fig. 3C, Supplementary Fig. 17). Notably, seismic shows the highest average between-method Spearman’s correlation across all traits (0.69), compared with 0.60 for scDRS, 0.66 for FUMA, and 0.60 for S-MAGMA. For the 330 common association pairs found by all frameworks, seismic exhibits the most significant false discovery rates (FDR) in 80% of these pairs (263 out of the 330 pairs), demonstrating its power (Supplementary Data 4).
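The trait-wise concordance reported here is a Spearman correlation between the association strengths that two methods assign to the same cell types. As a minimal sketch, assuming the comparison is made on −log10 p-values and using made-up p-values for five cell types, it can be computed like this (the helper names are hypothetical):

```python
import math
from statistics import mean


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up p-values from two methods for the same five cell types and one trait.
method_a = [1e-8, 3e-4, 0.02, 0.40, 0.75]
method_b = [5e-6, 1e-3, 0.20, 0.10, 0.90]
score_a = [-math.log10(p) for p in method_a]
score_b = [-math.log10(p) for p in method_b]
rho = spearman(score_a, score_b)
```

Because Spearman's correlation depends only on ranks, it measures whether two methods order cell types similarly even when their p-values sit on very different scales. In this toy example the two methods disagree only on the order of the third and fourth cell types, giving a correlation of 0.9.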

Besides these common findings, seismic also identifies additional association patterns that may better capture the underlying biology. For erythrocyte count, while scDRS ranks several cells from the intestine as most relevant, seismic and S-MAGMA identify several hematopoietic lineage cell types in marrow to be most associated, more accurately reflecting the developmental process of red blood cells. seismic also observes broad associations between neuropsychiatric diseases and various pancreatic islet cell types, which is especially noticeable in depression. These somewhat outlandish associations are also recapitulated by the other methods, and interestingly, previous studies have found potential associations between pancreatic and neuropsychiatric diseases, which has led to increased interest in a potential pancreas-brain axis. In total, seismic only misses 2 cell type-association pairs identified by all other methods (Fig. 3B, Supplementary Data 4), the fewest compared to other methods (56, 29, and 53 pairs for scDRS, FUMA, and S-MAGMA, respectively). The two undetected associations — between microglia and autoimmune disease, as well as pre-activation T cell subtype and ulcerative colitis — are close to the multiple hypothesis threshold (with FDR = 0.066 and 0.096, respectively, Supplementary Data 4).

To assess the degree to which technical factors may affect seismic’s specificity scores, we split the Tabula Muris dataset by individual donor ID and recalculated the scores for each cell type. Specificity scores remained highly consistent across donors for the majority of cell types (Supplementary Fig. 18A), with cell types with larger numbers of cells exhibiting higher correlation. This concordance also extended to the downstream trait-implicated cell type analysis, where a high correlation of statistical significance was observed (mean Pearson’s correlation = 0.88 for cell-type associations between pairs of donors across all traits, Supplementary Fig. 18B), despite some traits having target cell types not captured in several donors (e.g., liver tissue absent in 4 of 6 donors).

To explore generalizability to other large scRNA-seq datasets, we also apply seismic to the Tabula Muris (TM) droplet dataset (a dataset obtained by droplet-based single-cell sequencing rather than profiling individually sorted cells as in TM FACS, Supplementary Fig. 19) and the Tabula Sapiens (TS) human scRNA-seq dataset (Supplementary Fig. 20). Comparing cell types that overlap between TM FACS and these two additional scRNA-seq datasets, one using a different technology and the other using human cells, we find consistent cell type-trait associations (Supplementary Fig. 21). The mean Spearman’s correlation is 0.78 between TM FACS and TM droplet, and 0.67 between TM FACS and TS across all traits, in terms of statistical significance (in −log(p-value)). We note that neither the TM droplet nor the TS dataset contains brain tissue, and the mean Spearman’s correlation is 0.84 and 0.75, respectively, if neuropsychiatric traits are excluded from the comparison. The high concordance of seismic with other methods is also consistent across datasets (Supplementary Figs. 17, 22, 23). Such consistency underscores seismic’s robustness in identifying trait-associated cell types across datasets of larger size and varying coverage, as well as its adaptability to different species.

Having examined seismic’s consistency in identifying a wide variety of trait-associated cell types, we turn our attention to evaluate the accuracy of seismic’s ability to distinguish known vulnerable neuron types for a well-characterized neurological disease, Parkinson’s disease (PD). PD pathophysiology is well-established, with dopaminergic neurons residing in the substantia nigra pars compacta (SNc) and ventral tegmental area (VTA) characterized as being particularly vulnerable to degeneration. Using a large mouse brain dataset encompassing up to 231 distinct cell types from 9 regions of the adult mouse brain, in conjunction with a recent PD GWAS study with over 480,000 participants, we test whether seismic, scDRS, FUMA, and S-MAGMA can recover known PD associations (Fig. 4, Supplementary Data 5).

The rich brain region and cell type annotations in this dataset provide a unique opportunity to test how changes in cell type granularity affect the reported cell type-trait annotations. We examine 5 different granularities of cell types, ranging from 14 broad subclass labels to 231 highly-specific cell annotations (brain region + fine cluster). seismic is the only method to significantly prioritize PD-relevant dopaminergic neurons across all cell type granularities (Fig. 4). FUMA and scDRS also rank relevant cell types in some granularities highly, but mostly fail to reach statistical significance after multiple hypothesis test correction. Notably, S-MAGMA completely misses these vulnerable cell types. We note also that most previous cell type-trait association analyses that use datasets such as this one typically perform their analyses at a broader cell type level (usually what we have termed the ‘brain region + class’ granularity). Though using finer resolution annotations increases the number of hypotheses tested, we demonstrate that it may be a worthwhile trade-off when using a more powerful association detection method, as it can lead to more precise biological insights.

The choice of an endpoint for cell-type trait association can allow for the dissection of cell contribution to various endophenotypes of disease. This is particularly true for diseases with multicellular pathogenesis like Alzheimer’s disease (AD), where genetic studies have mapped clinical and alternative traits, making it also an ideal test case for demonstrating the power of seismic. Furthermore, while selective neuronal vulnerability and pathological lesion formation have been thoroughly described in AD, the precise molecular mechanisms driving neurodegeneration leading to cognitive decline remain poorly understood. Formally, AD is characterized by two pathological hallmarks, extracellular amyloid plaques composed of Aβ peptide and intracellular neurofibrillary tangles (NFTs) formed by aggregated tau protein. NFTs appear according to a stereotypical spatial pattern, first emerging in layer II of the entorhinal cortex (EC), later appearing in deeper layers of EC and CA1 in the hippocampus, before subsequently spreading to other neocortical and subcortical regions. Progression of NFTs is accompanied by neurodegeneration in the affected area. In spite of the strong correlation between clinical symptoms of the disease and neuronal processes (NFT formation, neurodegeneration, synapse loss), many GWAS studies have primarily identified associations with immune cells such as microglia. This leads to questions of whether microglia are the primary drivers of the disease or merely responsible for the clinical symptoms of the disease. If the latter, one would expect to find non-microglia associations for GWAS with non-clinical, pathology-based endpoints, which could open new research avenues for understanding pathogenic mechanisms.

We use seismic to test whether GWAS for different AD-related endpoints might yield divergent cell-type associations. Given that AD pathology typically exhibits selective regional vulnerability, the analysis was conducted at the most fine-grained resolution (‘brain region + fine cluster’ in Fig. 4). We first used an AD GWAS that includes a large cohort of >63,000 patients diagnosed via clinical observations (clinical GWAS). This large study is representative of the AD GWAS typically used in cell-type trait association studies. We also explore seismic results for a GWAS for an alternative AD endophenotype comprised of around 3100 patient samples of cerebrospinal fluid (CSF) tau levels, which serves as a biomarker of AD progression (tau GWAS). Though the tau GWAS has a much smaller patient cohort, we hypothesized that it may deliver clues for pathological mechanisms that have remained elusive with the clinical GWAS.

Applying seismic on the clinical AD study along with the expression data from Saunders et al., we identify microglial cells from various brain regions as the most associated with clinical GWAS (Fig. 5A), demonstrating the pervasive neuroinflammation patterns underlying clinical symptoms in AD patients. This is consistent with previous studies, and indeed, we see that scDRS also identifies significant associations between clinical AD with microglial cells; neither FUMA nor S-MAGMA find statistically significant associations, though FUMA does rank microglial associations as highest among cell types (Supplementary Fig. 24). It is noteworthy that microglial cells from both vulnerable (hippocampus) and resistant (striatum) regions of the brain are similarly associated with the trait, suggesting that they are not the primary driver of regional differences in pathology. Using the tau GWAS with the same Saunders et al. expression dataset, we see that scDRS seems to suggest generic astrocyte associations. Though FUMA and S-MAGMA still struggle with identifying statistically significant associations, they do prioritize diverse neuronal associations, including one of the neuronal populations known to be vulnerable to tau accumulation in AD (“HC neurons entorhinal cortex”). Meanwhile, using seismic, we find significant associations with several of the neuronal populations most vulnerable to tau accumulation in AD (Fig. 5B), including deep (“HC neurons entorhinal cortex”) and superficial (“HC neurons medial entorhinal cortex 1”) layers of the entorhinal cortex, as well as CA1 pyramidal cells. While these populations have been established as vulnerable based on pathology studies, such associations have not previously been found using trait-association studies based on AD-related GWAS. These results suggest that tau pathology in AD may be more intrinsically linked to neuronal susceptibility than inflammation alone.

With seismic’s influential gene analysis, we can more closely inspect the genes corresponding to increased risk driving the top cell type-trait association. For hippocampal microglia, we identify 88 genes as positively influential for clinical AD diagnosis, and find these to be enriched for immune-related GO processes (Supplementary Fig. 25, Supplementary Data 6). Some of these genes are expected, like SPI1, MS4A6A, and TREM2, which are microglia-specific and known to have significantly reported associated SNPs with clinical AD. There are also several other interesting genes identified by seismic, such as LAPTM5, an amyloid plaque responsive gene, and phagocytosis regulator VAV1. Much less expected are the influential genes found for entorhinal cortex neurons and the tau GWAS (218 genes for neurons from deep layers of the entorhinal cortex, 199 genes for neurons from entorhinal cortex layer II) since, as mentioned, previous GWAS have not yielded clear neuronal genes associated with AD-related traits (Fig. 5C, D, Supplementary Data 6).

There is some overlap of influential genes shared by all AD vulnerable neurons (different layers of the entorhinal cortex and hippocampus CA1) associated with the tau GWAS (Supplementary Fig. 26). However, few pathways are consistently enriched across all three neuron types (Supplementary Data 6). Instead, both entorhinal cortex populations are enriched in genes involved in axon guidance, while CA1 and entorhinal cortex layer II (ECII) show enrichment in genes related to long-term potentiation (Fig. 5C, D, Supplementary Data 6), suggesting that distinct cellular processes contribute to the association of these cell types with CSF tau. Notably, the enrichment of both long-term potentiation and axon guidance genes in ECII aligns well with our previous study, which used orthogonal datasets and analysis strategies to demonstrate that regulators of structural and electrophysiological features of the axon underlie ECII vulnerability. This convergence of evidence strongly points to ECII axons as a defining Achilles’ heel for these neurons. Beyond these axonal and synaptic pathways, we find that genes driving associations with CSF tau levels in deep layers of the EC show strong enrichment in several metabolic pathways (e.g., cellular respiration, electron transport chain), while those driving associations with tau levels in layer II neurons are more enriched in proteostasis pathways (e.g., protein destabilization) (Supplementary Fig. 27, Supplementary Data 6). Metabolic and proteostatic contributions to vulnerability of EC are among the very pathways previously suggested to underlie EC vulnerability, and several identified influential genes, such as VPS26A, have connections with EC vulnerability in both cell types. The fact that seismic does not find any association between microglia and CSF tau further suggests that microglia might not be the main drivers of tau pathology, but rather the drivers of the clinical manifestations of AD. 
Additionally, we have found that using the tau GWAS enables seismic to uncover neuronal associations as well as genes and pathways with important mechanistic and therapeutic potential. These results demonstrate the value of more targeted endophenotype GWAS — albeit smaller — for complex diseases.
