Curate AnnData based on the CELLxGENE schema

This guide shows how to curate an AnnData object with the help of laminlabs/cellxgene against the CELLxGENE schema v5.1.0.

Load the instance in which you want to register the curated AnnData object:

# !pip install 'lamindb[bionty,jupyter]' cellxgene-lamin
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.1.1
!lamin init --storage ./test-cellxgene-curate --name test-cellxgene-curate --schema bionty
→ connected lamindb: testuser1/test-cellxgene-curate
import lamindb as ln
import cellxgene_lamin as cxg
→ connected lamindb: testuser1/test-cellxgene-curate

Let’s start with an AnnData object that we’d like to inspect and curate. We write it to disk so that we can run CZI’s cellxgene-schema CLI tool, which verifies whether an on-disk h5ad dataset adheres to the CELLxGENE schema.

adata = cxg.datasets.anndata_human_immune_cells()
adata.write_h5ad("anndata_human_immune_cells.h5ad")
adata
AnnData object with n_obs × n_vars = 1626 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay', 'sex_ontology_term_id', 'organism', 'sex'
    var: 'feature_is_filtered'
    uns: 'default_embedding'
    obsm: 'X_umap'

Initially, the dataset does not pass CZI’s cellxgene-schema validator, so we need to curate it.

!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells.h5ad || exit 1
Loading dependencies
Loading validator modules
Starting validation...
WARNING: Validation of raw layer was not performed due to current errors, try again after fixing current errors.
ERROR: Add labels error: Column 'cell_type' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR: Add labels error: Column 'assay' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR: Add labels error: Column 'organism' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR: Add labels error: Column 'sex' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR: Add labels error: Column 'tissue' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR: 'title' in 'uns' is not present.
ERROR: 'ENSG00000269933' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000261737' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000259834' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000256374' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000263464' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000203812' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000272196' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000272880' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000270188' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000287116' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000237133' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000224739' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000227902' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000239467' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000272551' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000280374' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000236886' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000229352' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000286601' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000227021' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000259855' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000273301' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000271870' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000237838' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000286996' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000269028' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000286699' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000273370' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000261490' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000272567' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000270394' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000272370' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000272354' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000251044' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000272040' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000182230' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000204092' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000261068' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000236740' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000236996' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000232295' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000271734' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000236673' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000227220' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000236166' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000112096' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000285162' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000286228' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000237513' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000285106' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000226380' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000270672' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000225932' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000244693' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000268955' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000272267' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000253878' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000259820' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000226403' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000233776' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000269900' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000261534' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000237548' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000239665' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000256892' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000249860' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000271409' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000224745' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000261438' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000231575' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000260461' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000255823' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000254740' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000254561' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000282080' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000256427' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000287388' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000276814' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000280710' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000215271' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000258414' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000258808' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000277050' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000273888' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000258861' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000259444' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000244952' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000273923' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000262668' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000232196' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000256618' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000221995' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000226377' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000273576' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000267637' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000282965' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000273837' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000286949' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000256222' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000280095' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000278927' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000278955' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000277352' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000239446' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000256045' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000228906' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000228139' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000261773' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000278198' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000273496' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000277666' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000278782' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000277761' is not a valid feature ID in 'var'.
ERROR: 'ENSG00000269933' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000261737' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000259834' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000256374' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000263464' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000203812' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000272196' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000272880' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000270188' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000287116' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000237133' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000224739' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000227902' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000239467' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000272551' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000280374' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000236886' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000229352' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000286601' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000227021' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000259855' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000273301' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000271870' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000237838' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000286996' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000269028' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000286699' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000273370' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000261490' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000272567' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000270394' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000272370' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000272354' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000251044' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000272040' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000182230' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000204092' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000261068' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000236740' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000236996' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000232295' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000271734' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000236673' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000227220' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000236166' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000112096' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000285162' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000286228' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000237513' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000285106' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000226380' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000270672' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000225932' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000244693' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000268955' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000272267' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000253878' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000259820' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000226403' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000233776' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000269900' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000261534' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000237548' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000239665' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000256892' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000249860' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000271409' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000224745' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000261438' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000231575' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000260461' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000255823' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000254740' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000254561' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000282080' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000256427' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000287388' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000276814' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000280710' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000215271' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000258414' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000258808' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000277050' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000273888' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000258861' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000259444' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000244952' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000273923' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000262668' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000232196' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000256618' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000221995' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000226377' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000273576' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000267637' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000282965' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000273837' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000286949' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000256222' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000280095' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000278927' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000278955' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000277352' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000239446' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000256045' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000228906' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000228139' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000261773' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000278198' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000273496' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000277666' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000278782' is not a valid feature ID in 'raw.var'.
ERROR: 'ENSG00000277761' is not a valid feature ID in 'raw.var'.
ERROR: Dataframe 'obs' is missing column 'cell_type_ontology_term_id'.
ERROR: Dataframe 'obs' is missing column 'assay_ontology_term_id'.
ERROR: Dataframe 'obs' is missing column 'disease_ontology_term_id'.
ERROR: Dataframe 'obs' is missing column 'organism_ontology_term_id'.
ERROR: Dataframe 'obs' is missing column 'tissue_ontology_term_id'.
ERROR: Dataframe 'obs' is missing column 'self_reported_ethnicity_ontology_term_id'.
ERROR: Dataframe 'obs' is missing column 'development_stage_ontology_term_id'.
ERROR: Dataframe 'obs' is missing column 'is_primary_data'.
ERROR: Dataframe 'obs' is missing column 'donor_id'.
ERROR: Dataframe 'obs' is missing column 'suspension_type'.
ERROR: Dataframe 'obs' is missing column 'tissue_type'.
Validation complete in 0:00:00.282896 with status is_valid=False
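
The validator output above is long, so it can help to group the errors by kind before fixing them. The following is a minimal sketch using only the Python standard library. The helper name `summarize_errors` and the category labels are illustrative assumptions and not part of cellxgene-schema; the sample lines are copied from the output above.

```python
import re
from collections import Counter

# Patterns for the error kinds seen in the output above (labels are our own).
PATTERNS = {
    "reserved_obs_column": re.compile(r"Column '(.+)' is a reserved column name of 'obs'"),
    "invalid_feature_id": re.compile(r"'(.+)' is not a valid feature ID in '(.+)'"),
    "missing_obs_column": re.compile(r"Dataframe 'obs' is missing column '(.+)'"),
    "missing_uns_key": re.compile(r"'(.+)' in 'uns' is not present"),
}


def summarize_errors(lines):
    """Count ERROR lines per category; unmatched errors are counted as 'other'."""
    counts = Counter()
    for line in lines:
        if not line.startswith("ERROR:"):
            continue  # skip WARNING and progress lines
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                break
        else:
            counts["other"] += 1
    return counts


sample = [
    "WARNING: Validation of raw layer was not performed due to current errors, try again after fixing current errors.",
    "ERROR: Add labels error: Column 'cell_type' is a reserved column name of 'obs'. Remove it from h5ad and try again.",
    "ERROR: 'title' in 'uns' is not present.",
    "ERROR: 'ENSG00000269933' is not a valid feature ID in 'var'.",
    "ERROR: 'ENSG00000269933' is not a valid feature ID in 'raw.var'.",
    "ERROR: Dataframe 'obs' is missing column 'donor_id'.",
]
summary = summarize_errors(sample)
print(dict(summary))
```

Applied to the full log, a summary like this makes it easier to see that the errors fall into a few groups: reserved column names, invalid gene IDs, missing obs columns, and a missing title.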

Validate and curate metadata

We create a Curator object that references the AnnData object. During instantiation, any lamindb.Feature records are saved.

curator = cxg.Curator(adata, organism="human", schema_version="5.1.0")
✓ added 1 record with Feature.name for "columns": 'sex_ontology_term_id'
✓ added 4 records from laminlabs/cellxgene with Feature.name for "columns": 'assay', 'cell_type', 'tissue', 'organism'
✓ added 1 record from laminlabs/cellxgene with Feature.name for "columns": 'sex'

The dataset stores donors in a column named “donor”, whereas the schema expects “donor_id”. Let’s rename the column:

adata.obs.rename(columns={"donor": "donor_id"}, inplace=True)
validated = curator.validate()
✗ missing required obs columns development_stage, disease, self_reported_ethnicity, suspension_type, tissue_type
• consider initializing a Curate object like 'Curate(adata, defaults=cxg.CellxGeneFields.OBS_FIELD_DEFAULTS)'to automatically add these columns with default values.

For the missing columns, we can pass the default values suggested by CELLxGENE, which are then added to the AnnData object automatically:

cxg.CellxGeneFields.OBS_FIELD_DEFAULTS
{'organism': 'unknown',
 'assay': 'unknown',
 'cell_type': 'unknown',
 'development_stage': 'unknown',
 'disease': 'normal',
 'donor_id': 'unknown',
 'self_reported_ethnicity': 'unknown',
 'sex': 'unknown',
 'suspension_type': 'cell',
 'tissue_type': 'tissue'}
curator = cxg.Curator(adata, defaults=cxg.CellxGeneFields.OBS_FIELD_DEFAULTS, organism="human", schema_version="5.1.0")
→ added defaults to the AnnData object: {'development_stage': 'unknown', 'disease': 'normal', 'self_reported_ethnicity': 'unknown', 'suspension_type': 'cell', 'tissue_type': 'tissue'}
✓ added 6 records from laminlabs/cellxgene with Feature.name for "columns": 'development_stage', 'disease', 'donor_id', 'self_reported_ethnicity', 'tissue_type', 'suspension_type'
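
The log line above lists five added defaults although the defaults dictionary has ten entries. A plausible reading is that only columns absent from obs receive a default. The sketch below reproduces that selection with plain Python. It is an illustration under that assumption, not the library's actual implementation.

```python
# Default values as printed by cxg.CellxGeneFields.OBS_FIELD_DEFAULTS above.
OBS_FIELD_DEFAULTS = {
    "organism": "unknown",
    "assay": "unknown",
    "cell_type": "unknown",
    "development_stage": "unknown",
    "disease": "normal",
    "donor_id": "unknown",
    "self_reported_ethnicity": "unknown",
    "sex": "unknown",
    "suspension_type": "cell",
    "tissue_type": "tissue",
}

# obs columns of the example dataset after renaming 'donor' to 'donor_id'.
obs_columns = [
    "donor_id", "tissue", "cell_type", "assay",
    "sex_ontology_term_id", "organism", "sex",
]

# Assumed behavior: existing columns are left untouched, missing ones get a default.
added = {
    name: value
    for name, value in OBS_FIELD_DEFAULTS.items()
    if name not in obs_columns
}
print(added)
```

The resulting dictionary contains development_stage, disease, self_reported_ethnicity, suspension_type and tissue_type, which matches the defaults reported in the log.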
validated = curator.validate()
validated
→ validating metadata using registries of instance laminlabs/cellxgene
• saving validated records of 'var_index'
✓ added 36390 records from public with Gene.ensembl_gene_id for "var_index": 'ENSG00000243485', 'ENSG00000237613', 'ENSG00000186092', 'ENSG00000238009', 'ENSG00000239945', 'ENSG00000239906', 'ENSG00000241860', 'ENSG00000241599', 'ENSG00000286448', 'ENSG00000236601', 'ENSG00000284733', 'ENSG00000235146', 'ENSG00000284662', 'ENSG00000229905', 'ENSG00000237491', 'ENSG00000177757', 'ENSG00000228794', 'ENSG00000225880', 'ENSG00000230368', 'ENSG00000272438', ...
• saving validated records of 'assay'
✓ added 3 records from public with ExperimentalFactor.name for "assay": '10x 5' v2', '10x 5' v1', '10x 3' v3'
• saving validated records of 'cell_type'
✓ added 31 records from public with CellType.name for "cell_type": 'CD4-positive helper T cell', 'memory B cell', 'regulatory T cell', 'alveolar macrophage', 'group 3 innate lymphoid cell', 'plasmacytoid dendritic cell', 'plasma cell', 'naive B cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'CD16-negative, CD56-bright natural killer cell, human', 'plasmablast', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'T follicular helper cell', 'germinal center B cell', 'macrophage', 'CD16-positive, CD56-dim natural killer cell, human', 'conventional dendritic cell', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'mucosal invariant T cell', 'lymphocyte', ...
✓ added 1 record from laminlabs/cellxgene with DevelopmentalStage.name for "development_stage": 'unknown'
✓ added 1 record from laminlabs/cellxgene with Disease.name for "disease": 'normal'
✓ added 1 record from laminlabs/cellxgene with Ethnicity.name for "self_reported_ethnicity": 'unknown'
✓ added 1 record from laminlabs/cellxgene with Phenotype.ontology_id for "sex_ontology_term_id": 'PATO:0000384'
✓ added 1 record from laminlabs/cellxgene with ULabel.name for "suspension_type": 'cell'
• saving validated records of 'tissue'
✓ added 16 records from public with Tissue.name for "tissue": 'mesenteric lymph node', 'transverse colon', 'caecum', 'bone marrow', 'skeletal muscle tissue', 'blood', 'liver', 'sigmoid colon', 'thymus', 'thoracic lymph node', 'omentum', 'duodenum', 'ileum', 'lamina propria', 'jejunal epithelium', 'spleen'
✓ added 1 record from laminlabs/cellxgene with ULabel.name for "tissue_type": 'tissue'
• mapping "var_index" on Gene.ensembl_gene_id
!   113 terms are not validated: 'ENSG00000269933', 'ENSG00000261737', 'ENSG00000259834', 'ENSG00000256374', 'ENSG00000263464', 'ENSG00000203812', 'ENSG00000272196', 'ENSG00000272880', 'ENSG00000270188', 'ENSG00000287116', 'ENSG00000237133', 'ENSG00000224739', 'ENSG00000227902', 'ENSG00000239467', 'ENSG00000272551', 'ENSG00000280374', 'ENSG00000236886', 'ENSG00000229352', 'ENSG00000286601', 'ENSG00000227021', ...
    → fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
✓ "assay" is validated against ExperimentalFactor.name
✓ "cell_type" is validated against CellType.name
✓ "development_stage" is validated against DevelopmentalStage.name
✓ "disease" is validated against Disease.name
• mapping "donor_id" on ULabel.name
!   12 terms are not validated: 'D496-1', '621B-1', 'A29-1', 'A36-1', 'A35-1', '637C-1', 'A52-1', 'A37-1', 'D503-1', '640C-1', 'A31-1', '582C-1'
    → fix typos, remove non-existent values, or save terms via .add_new_from("donor_id")
✓ "self_reported_ethnicity" is validated against Ethnicity.name
✓ "sex_ontology_term_id" is validated against Phenotype.ontology_id
✓ "suspension_type" is validated against ULabel.name
• mapping "tissue" on Tissue.name
!   1 term is not validated: 'lungg'
    → fix typos, remove non-existent values, or save terms via .add_new_from("tissue")
✓ "tissue_type" is validated against ULabel.name
✓ "organism" is validated against Organism.name
False
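
Conceptually, each check in the output above compares the unique values of a column against the names held in a registry and reports the values that have no match. The sketch below shows that idea with a hypothetical in-memory registry (a plain Python set). The function name `split_validated` is made up for illustration and does not exist in lamindb or cellxgene-lamin.

```python
def split_validated(values, registry_names):
    """Return (validated, non_validated) unique values in first-seen order."""
    unique_values = dict.fromkeys(values)  # de-duplicate while keeping order
    validated = [v for v in unique_values if v in registry_names]
    non_validated = [v for v in unique_values if v not in registry_names]
    return validated, non_validated


# Hypothetical registry holding a few tissue names that appear in this guide.
tissue_registry = {"blood", "lung", "spleen", "thymus", "liver"}
tissue_column = ["blood", "lungg", "spleen", "lungg", "liver"]

ok, not_ok = split_validated(tissue_column, tissue_registry)
print(ok)      # ['blood', 'spleen', 'liver']
print(not_ok)  # ['lungg']

# The column counts as validated only if nothing is left over.
is_valid = not not_ok
print(is_valid)  # False
```

This mirrors why validate() returns False here: a single unmatched term in any column is enough to fail the overall check.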

Remove unvalidated values

We remove all unvalidated genes. These genes may exist in a different Ensembl release but are not valid for the Ensembl version used by CELLxGENE schema 5.1.0 (Ensembl release 110).

adata = adata[:, ~adata.var.index.isin(curator.non_validated["var_index"])].copy()
if adata.raw is not None:
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var_names.isin(curator.non_validated["var_index"])
    ].copy()
    adata.raw = raw_data
# We must create the Curator object again so that it references the filtered AnnData object
curator = cxg.Curator(adata, organism="human", schema_version="5.1.0")
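
The filtering step above builds a boolean mask over the gene index and keeps only the genes that are not in the unvalidated list. The following sketch reproduces the same logic on a toy list of gene IDs taken from the logs in this guide, without requiring AnnData, and then checks the expected number of remaining genes from the counts reported above.

```python
# Toy stand-ins: five gene IDs, two of which were reported as not validated.
var_index = [
    "ENSG00000243485",
    "ENSG00000269933",
    "ENSG00000237613",
    "ENSG00000261737",
    "ENSG00000186092",
]
non_validated = {"ENSG00000269933", "ENSG00000261737"}

# True for genes to keep; similar in spirit to ~adata.var.index.isin(...).
keep = [gene not in non_validated for gene in var_index]
kept = [gene for gene, flag in zip(var_index, keep) if flag]
print(kept)

# Shape check with the numbers reported in this guide:
# 36503 genes initially, 113 of them not validated.
n_vars_before, n_not_validated = 36503, 113
n_vars_after = n_vars_before - n_not_validated
print(n_vars_after)  # 36390
```

The result of 36390 agrees with the number of gene records added during validation, so comparing these counts is a quick sanity check after filtering.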

Register new metadata labels

Following the suggestions above, we register the labels that aren’t yet present in the current instance.

(Note that our instance is rather empty. Once you have filled up the registries, you will rarely need to register new labels.)

For donors, we register the new labels:

curator.add_new_from("donor_id")
✓ added 12 records with ULabel.name for "donor_id": 'D496-1', 'A52-1', 'A35-1', '621B-1', '637C-1', 'D503-1', '640C-1', 'A29-1', 'A36-1', '582C-1', 'A37-1', 'A31-1'

The tissue label “lungg” was flagged as not validated. It is a typo and should be “lung”. Let’s fix it:

tissues = curator.lookup().tissue
tissues.lung
Tissue(uid='7Tt4iEKc', name='lung', ontology_id='UBERON:0002048', synonyms='pulmo', description='Respiration Organ That Develops As An Outpocketing Of The Esophagus.', created_by_id=1, source_id=47, created_at=2023-11-28 22:50:53 UTC)
adata.obs["tissue"] = adata.obs["tissue"].cat.rename_categories(
    {"lungg": tissues.lung.name}
)
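
When a typo is less obvious than “lungg”, fuzzy matching against the known names can suggest a candidate before you look it up. The sketch below uses difflib from the Python standard library. This is a complementary technique rather than a feature of the Curator, and the list of registry names is a small hand-picked sample from this guide.

```python
import difflib

# A few tissue names that appear in this guide; a real registry holds many more.
registry_names = ["lung", "liver", "blood", "spleen", "thymus", "bone marrow"]

# Ask for the single closest name; the cutoff filters out weak matches.
suggestions = difflib.get_close_matches("lungg", registry_names, n=1, cutoff=0.8)
print(suggestions)  # ['lung']
```

A suggestion found this way should still be confirmed by a person, since a close string match does not guarantee that the term has the intended meaning.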

Let’s validate the object again:

validated = curator.validate()
validated
→ validating metadata using registries of instance laminlabs/cellxgene
• saving validated records of 'tissue'
✓ added 1 record from public with Tissue.name for "tissue": 'lung'
✓ "var_index" is validated against Gene.ensembl_gene_id
✓ "assay" is validated against ExperimentalFactor.name
✓ "cell_type" is validated against CellType.name
✓ "development_stage" is validated against DevelopmentalStage.name
✓ "disease" is validated against Disease.name
✓ "donor_id" is validated against ULabel.name
✓ "self_reported_ethnicity" is validated against Ethnicity.name
✓ "sex_ontology_term_id" is validated against Phenotype.ontology_id
✓ "suspension_type" is validated against ULabel.name
✓ "tissue" is validated against Tissue.name
✓ "tissue_type" is validated against ULabel.name
✓ "organism" is validated against Organism.name
True
adata.obs.head()
| index | donor_id | tissue | cell_type | assay | sex_ontology_term_id | organism | sex | development_stage | disease | self_reported_ethnicity | suspension_type | tissue_type |
| CZINY-0109_CTGGTCTAGTCTGTAC | D496-1 | blood | classical monocyte | 10x 3' v3 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| CZI-IA10244332+CZI-IA10244434_CCTTCGACATACTCTT | 621B-1 | thoracic lymph node | T follicular helper cell | 10x 5' v2 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| Pan_T7935491_CTGGTCTGTACATGTC | A29-1 | spleen | memory B cell | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| Pan_T7980367_GGGCATCCAGGTGGAT | A36-1 | lung | alveolar macrophage | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| Pan_T7935494_ATCATGGTCTACCTGC | A29-1 | mesenteric lymph node | naive thymus-derived CD4-positive, alpha-beta ... | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |

Save artifact

artifact = curator.save_artifact(description=f"dataset curated against cellxgene schema {curator.schema_version}")
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
artifact.describe()
Artifact(uid='WuAUqVtSFfN1ihxH0000', is_latest=True, description='dataset curated against cellxgene schema 5.1.0', suffix='.h5ad', type='dataset', size=54670616, hash='VYhEnkViOhtD-7kN2odUGw', n_observations=1626, _hash_type='sha1-fl', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-11-27 21:54:37 UTC)
  Provenance
    .storage = '/home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene-curate'
    .created_by = 'testuser1'
  Labels
    .organisms = 'human'
    .tissues = 'mesenteric lymph node', 'transverse colon', 'caecum', 'bone marrow', 'skeletal muscle tissue', 'blood', 'liver', 'sigmoid colon', 'thymus', 'thoracic lymph node', ...
    .cell_types = 'CD4-positive helper T cell', 'memory B cell', 'regulatory T cell', 'alveolar macrophage', 'group 3 innate lymphoid cell', 'plasmacytoid dendritic cell', 'plasma cell', 'naive B cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'CD16-negative, CD56-bright natural killer cell, human', ...
    .diseases = 'normal'
    .phenotypes = 'male'
    .experimental_factors = '10x 5' v2', '10x 5' v1', '10x 3' v3'
    .developmental_stages = 'unknown'
    .ethnicities = 'unknown'
    .ulabels = 'cell', 'tissue', 'D496-1', 'A52-1', 'A35-1', '621B-1', '637C-1', 'D503-1', '640C-1', 'A29-1', ...
  Feature sets
    'var' = 'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'OR4F29', 'OR4F16', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C'
    'obs' = 'assay', 'cell_type', 'tissue', 'organism', 'sex_ontology_term_id', 'sex', 'development_stage', 'disease', 'donor_id', 'self_reported_ethnicity', 'tissue_type', 'suspension_type'
  Feature values -- internal
    'assay' = 10x 3' v3, 10x 5' v1, 10x 5' v2
    'cell_type' = CD16-negative, CD56-bright natural killer cell, human, CD16-positive, CD56-dim natural killer cell, human, CD4-positive helper T cell, CD8-positive, alpha-beta memory T cell, CD8-positive, alpha-beta memory T cell, CD45RO-positive, T follicular helper cell, alpha-beta T cell, alveolar macrophage, classical monocyte, conventional dendritic cell, ...
    'development_stage' = unknown
    'disease' = normal
    'donor_id' = 582C-1, 621B-1, 637C-1, 640C-1, A29-1, A31-1, A35-1, A36-1, A37-1, A52-1, ...
    'organism' = human
    'self_reported_ethnicity' = unknown
    'sex_ontology_term_id' = male
    'suspension_type' = cell
    'tissue' = blood, bone marrow, caecum, duodenum, ileum, jejunal epithelium, lamina propria, liver, lung, mesenteric lymph node, ...
    'tissue_type' = tissue

The following step is optional. It mimics the way CELLxGENE creates collections of AnnData objects to link them to studies.

# register a new collection
title = "Cross-tissue immune cell analysis reveals tissue-specific features in humans (for test demo only)"
collection = ln.Collection(
    [artifact],  # registered artifact above, can also pass a list of artifacts
    name=title,  # title of the publication
    description="10.1126/science.abl5197",  # DOI of the publication
    reference="E-MTAB-11536",  # accession number (e.g. GSE#, E-MTAB#, etc.)
    reference_type="ArrayExpress",  # source type (e.g. GEO, ArrayExpress, SRA, etc.)
).save()
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run

Return an input h5ad file for cellxgene-schema

adata_cxg = curator.to_cellxgene_anndata(is_primary_data=True, title=title)
adata_cxg
AnnData object with n_obs × n_vars = 1626 × 36390
    obs: 'donor_id', 'sex_ontology_term_id', 'suspension_type', 'tissue_type', 'tissue_ontology_term_id', 'cell_type_ontology_term_id', 'assay_ontology_term_id', 'organism_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'is_primary_data'
    var: 'feature_is_filtered'
    uns: 'default_embedding', 'title', 'cxg_lamin_schema_reference', 'cxg_lamin_schema_version'
    obsm: 'X_umap'
adata_cxg.write_h5ad("anndata_human_immune_cells_cxg.h5ad")
!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells_cxg.h5ad || exit 1
Loading dependencies
Loading validator modules
Starting validation...
Validation complete in 0:00:03.544500 with status is_valid=True
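
To see why the second validator run passes, it helps to compare the obs columns of the exported object with the complaints from the first run. The sketch below does this with plain lists. The column names are copied from the validator errors and from the adata_cxg summary in this guide; the variable names are illustrative.

```python
# Columns the first validator run reported as missing from 'obs'.
required_obs = [
    "cell_type_ontology_term_id",
    "assay_ontology_term_id",
    "disease_ontology_term_id",
    "organism_ontology_term_id",
    "tissue_ontology_term_id",
    "self_reported_ethnicity_ontology_term_id",
    "development_stage_ontology_term_id",
    "is_primary_data",
    "donor_id",
    "suspension_type",
    "tissue_type",
]

# Columns the first run rejected as reserved names.
reserved_obs = ["cell_type", "assay", "organism", "sex", "tissue"]

# obs columns of adata_cxg as shown in its summary above.
final_obs = [
    "donor_id", "sex_ontology_term_id", "suspension_type", "tissue_type",
    "tissue_ontology_term_id", "cell_type_ontology_term_id",
    "assay_ontology_term_id", "organism_ontology_term_id",
    "development_stage_ontology_term_id", "disease_ontology_term_id",
    "self_reported_ethnicity_ontology_term_id", "is_primary_data",
]

missing = [col for col in required_obs if col not in final_obs]
still_reserved = [col for col in reserved_obs if col in final_obs]
print(missing)         # []
print(still_reserved)  # []
```

Both lists are empty: the exported object carries the ontology term ID columns in place of the human-readable label columns, which resolves the missing-column and reserved-name errors. The gene ID and title errors were resolved separately, by filtering the genes and by passing a title.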

Note

The Curator class is designed to validate all metadata for adherence to ontologies. It does not reimplement all rules of the CELLxGENE schema, so we recommend running the cellxgene-schema validator if full adherence beyond metadata is required.