FastCCC for Reference-based CCC Analysis
Download our constructed human CCC reference panel
Coming soon
We are currently preparing to upload the fully constructed CCC reference panel, which will be made publicly available soon.
Tips
In the meantime, if you would like to experiment with our reference panel or construct your own reference panels, you can follow the tutorials below on downloading reference datasets from CellxGene and constructing a reference panel from scratch. Following these steps recreates the same reference panel and yields identical results.
How to perform reference-based CCC analysis on a user-collected query dataset
import fastccc.infer_query
## Modify the file path according to the location where you run the code.
database_file_path = 'FastCCC/db/CPDBv5.0.0/'
reference_path = 'your/save/path/reference/lung/'
tissue_query_file = 'your/save/path/user_collected_query.h5ad'
save_path = 'your/save/path/user_collected_query/'
fastccc.infer_query.infer_query_workflow(
    database_file_path = database_file_path,
    reference_path = reference_path,
    query_counts_file_path = tissue_query_file,
    celltype_file_path = None,
    save_path = save_path,
    meta_key = 'cell_type'
)
output
2025-01-26 20:54:32 | INFO | Start inferring by using CCC reference: lung
2025-01-26 20:54:32 | INFO | Reference min_percentile = 0.1
2025-01-26 20:54:32 | INFO | Reference LRI DB = CPDBv5.0.0
2025-01-26 20:54:35 | INFO | Reading query adata, (your data) cells x (your data) genes.
2025-01-26 20:54:51 | SUCCESS | Rank preprocess done.
2025-01-26 20:54:53 | INFO | Loading LRIs database. hgnc_symbol as gene name is requested.
2025-01-26 20:54:55 | SUCCESS | Requested data for fastccc is prepared.
2025-01-26 20:54:55 | INFO | Loading reference data.
2025-01-26 20:54:57 | INFO | Reference cell types label will be used directly.
2025-01-26 20:54:58 | SUCCESS | Reference data is loaded.
2025-01-26 20:54:58 | INFO | Calculating CS score for query data.
2025-01-26 20:55:00 | INFO | Filtering reference data.
2025-01-26 20:55:01 | INFO | Filtering by using reference.
2025-01-26 20:55:01 | INFO | Inferring sig. boundaries.
2025-01-26 20:55:03 | INFO | Saving inference results.
2025-01-26 20:55:04 | SUCCESS | Inference workflow done.
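Because the workflow takes three input locations that must all exist, it can help to verify them before starting a run. The following is a minimal sketch using only the Python standard library; the helper name `find_missing_inputs` is hypothetical and is not part of FastCCC.

```python
from pathlib import Path

def find_missing_inputs(database_file_path, reference_path, query_counts_file_path):
    """Return a list of problems; an empty list means every input path exists.

    Hypothetical helper for illustration only (not a FastCCC function).
    """
    problems = []
    # The LRI database and the reference panel are expected to be directories.
    for label, path in (("LRI database", database_file_path),
                        ("reference panel", reference_path)):
        if not Path(path).is_dir():
            problems.append(f"{label} directory not found: {path}")
    # The query dataset is expected to be a single file.
    if not Path(query_counts_file_path).is_file():
        problems.append(f"query file not found: {query_counts_file_path}")
    return problems
```

Calling `find_missing_inputs(database_file_path, reference_path, tissue_query_file)` and printing the result before `infer_query_workflow` surfaces path typos early.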
Download large-scale normal reference dataset from CellxGene
import cellxgene_census
census = cellxgene_census.open_soma()
filter_condition = "tissue_general == 'lung' "
filter_condition += "and disease == 'normal' "
filter_condition += "and is_primary_data == False "
filter_condition += "and cell_type!='unknown' "
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter=filter_condition
)
adata.write_h5ad('your/save/path/census_download_lung.h5ad')
## Close the Census handle once the data has been written.
census.close()
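To download a different tissue with the same kind of filter, the filter string can be assembled by a small function. This is a sketch under assumptions: the helper name `build_obs_filter` and its parameters are hypothetical, and the defaults simply mirror the lung example above.

```python
def build_obs_filter(tissue_general, disease="normal", is_primary_data=False,
                     exclude_cell_types=("unknown",)):
    """Assemble an obs_value_filter string like the one used in the lung example.

    Hypothetical helper; defaults copy the conditions shown above.
    """
    parts = [
        f"tissue_general == '{tissue_general}'",
        f"disease == '{disease}'",
        f"is_primary_data == {is_primary_data}",
    ]
    # One inequality per excluded cell type label.
    parts += [f"cell_type != '{label}'" for label in exclude_cell_types]
    return " and ".join(parts)
```

The result of `build_obs_filter("lung")` can be passed as `obs_value_filter` in the `get_anndata` call above.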
How to build a human CCC reference panel from scratch using FastCCC
Here, we use a single-cell dataset of normal lung tissue obtained from CellxGene as an example to construct a human reference panel for lung tissue. Users can, of course, use their own curated data to build a personalized CCC reference panel and later compare it with their own query dataset.
First, ensure that the reference dataset is in raw count matrix format. Unlike the conventional FastCCC workflow, we use raw count data here to minimize batch effects across datasets. The query data should also be kept in raw count matrix format. FastCCC performs basic filtering, removing cells with too few detected genes (min_genes=50 for scanpy.pp.filter_cells). If additional QC steps are required, users should perform their own QC and then remove the corresponding cells or genes from the raw data.
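As one illustration of an additional QC step applied to raw counts, the sketch below computes a per-cell keep mask from the mitochondrial count fraction. The helper name, the `MT-` gene prefix, and the 0.2 threshold are assumptions chosen for the example, not FastCCC defaults, and the function expects a dense cells-by-genes array.

```python
import numpy as np

def mito_fraction_mask(counts, gene_names, max_mito_frac=0.2, mito_prefix="MT-"):
    """Return a boolean mask (one entry per cell) that is True for cells to keep.

    Hypothetical example of user-side QC; threshold and prefix are illustrative.
    """
    counts = np.asarray(counts, dtype=float)  # dense cells x genes matrix
    is_mito = np.array([name.startswith(mito_prefix) for name in gene_names])
    total = counts.sum(axis=1)
    mito = counts[:, is_mito].sum(axis=1)
    # Cells with zero total counts are assigned a fraction of 0 to avoid division by zero.
    frac = np.divide(mito, total, out=np.zeros_like(total), where=total > 0)
    return frac <= max_mito_frac

# With an AnnData object, the mask can be applied to the raw data, e.g.:
#   keep = mito_fraction_mask(adata.X.toarray(), adata.var_names)  # if adata.X is sparse
#   adata = adata[keep].copy()
```

For large datasets, converting a sparse matrix to dense may not fit in memory; in that case a sparse-aware QC routine is a better fit than this sketch.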
Additionally, we do not use highly variable genes (HVGs), so it is recommended to retain all genes or transcripts. A reduced gene count may undermine the objectivity and effectiveness of the quantification. Lastly, ensure that the var_names in your AnnData object correspond to standard HGNC gene symbols.
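A quick way to sanity-check the raw count requirement is to test whether every value in the matrix is a non-negative integer. The sketch below is a heuristic, not a FastCCC function, and the name `looks_like_raw_counts` is hypothetical; normalized or log-transformed data will usually fail it, but passing does not prove the data are unprocessed.

```python
import numpy as np

def looks_like_raw_counts(X):
    """Heuristic: True if every stored value is a non-negative integer.

    Hypothetical helper for illustration only.
    """
    if hasattr(X, "tocsr"):
        # Sparse matrix (e.g. scipy.sparse): only explicitly stored values need checking.
        values = np.asarray(X.tocsr().data)
    else:
        values = np.asarray(X).ravel()
    return bool(np.all(values >= 0) and np.all(values == np.floor(values)))

# Example with an AnnData object:
#   looks_like_raw_counts(adata.X)
```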
We processed the lung dataset downloaded from CellxGene using the following code and saved it as the lung reference dataset.
## Reference Dataset Preparation.
import scanpy as sc
tissue_reference_file = 'your/save/path/census_download_lung.h5ad'
reference_adata = sc.read_h5ad(tissue_reference_file)
## Notice:
## Ensure that the var_names are using HGNC gene symbols.
## We need to manually convert the IDs, as CellxGene uses Ensembl ID.
## If the format is already correct, you can ignore the following line of code.
reference_adata.var_names = reference_adata.var.feature_name
## The following line of code is used to ensure no data leakage in our validation.
## Users can ignore this.
observation_joinid = sorted(set(reference_adata.obs.observation_joinid))
import pickle
with open('your/save/path/lung_reference_joinid.pkl', 'wb') as f:
    pickle.dump(observation_joinid, f)
## Notice ends.
reference_adata.write_h5ad('your/save/path/lung_reference.h5ad')
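After replacing Ensembl IDs with gene symbols, it may be worth checking that no symbol appears more than once, since duplicated var_names can cause ambiguous lookups. The sketch below uses only the standard library; the helper name `duplicated_names` is hypothetical, and whether a given CellxGene download contains duplicated symbols is not stated here, so treat this as a sanity check.

```python
from collections import Counter

def duplicated_names(names):
    """Return the sorted list of names that occur more than once."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)

# Example with an AnnData object:
#   dups = duplicated_names(reference_adata.var_names)
#   if dups:
#       print(f"{len(dups)} duplicated gene symbols, e.g. {dups[:5]}")
# AnnData also provides var_names_make_unique(), but it appends suffixes,
# so the renamed entries would no longer be plain HGNC symbols.
```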
Then, build the reference panel with the following code.
## Build human lung reference panel.
import fastccc.build_reference
## Modify the file path according to the location where you run the code.
database_file_path = 'FastCCC/db/CPDBv5.0.0/'
tissue_reference_file = 'your/save/path/lung_reference.h5ad'
save_path = 'your/save/path/reference/'
reference_name = 'lung'
fastccc.build_reference.build_reference_workflow(
    database_file_path = database_file_path,
    reference_counts_file_path = tissue_reference_file,
    celltype_file_path = None,
    reference_name = reference_name,
    save_path = save_path,
    meta_key = 'cell_type'
)
output
2025-01-27 10:33:14 | INFO | Start building CCC reference.
2025-01-27 10:33:14 | INFO | Reference_name = lung
2025-01-27 10:33:14 | INFO | min_percentile = 0.1
2025-01-27 10:33:14 | INFO | LRI database = CPDBv5.0.0
2025-01-27 10:33:14 | SUCCESS | Reference save dir your/save/path/reference/lung is created.
2025-01-27 10:34:03 | INFO | Reading reference adata, 1673947 cells x 60530 genes.
2025-01-27 10:36:55 | SUCCESS | Rank preprocess done.
2025-01-27 10:37:23 | INFO | Loading LRIs database. hgnc_symbol as gene name is requested.
2025-01-27 10:38:08 | SUCCESS | Requested data for fastccc is prepared.
2025-01-27 10:38:08 | INFO | Running FastCCC.
2025-01-27 10:38:36 | INFO | Calculating null distributions.
2025-01-27 10:39:25 | INFO | Saving reference.
2025-01-27 10:39:30 | INFO | Saving reference config.
2025-01-27 10:39:30 | SUCCESS | Reference 'lung' is built.
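The build log above reports the reference directory as save_path joined with reference_name, which matches the reference_path used in the query example ('your/save/path/reference/lung/'). The sketch below derives that path so the two steps stay consistent; the helper name `reference_dir` is hypothetical, and the trailing slash only mirrors the query example (whether it is required is an assumption).

```python
import posixpath

def reference_dir(save_path, reference_name):
    """Join the build step's save_path and reference_name into a reference_path.

    Hypothetical helper; the layout is inferred from the build log shown above.
    """
    return posixpath.join(save_path, reference_name) + "/"

# Example: pass the result as reference_path to infer_query_workflow.
reference_path = reference_dir('your/save/path/reference/', 'lung')
```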
Functions
To be continued.