TCGA Datasets

The TCGA loader downloads, caches, and exposes 24 packaged TCGA miRNA cohorts — each containing the raw expression matrix, three normalizations, and CVAE-synthesized counterparts — through a small Python API. For a runnable end-to-end walkthrough, see TCGA Quick Start.

Overview 

Unlike the small parquet files in Example Datasets, TCGA cohorts are downloaded on first use (~10–35 MB each) and cached locally. Each dataset packages three kinds of data: the raw counts, three normalizations (raw_norm, TC, DESeq), and nine synthetic groups (three CVAE models × three normalizations). Once cached, all subsequent access is local.

Quick Start 

from syng_bts import list_tcga_datasets, load_tcga_dataset

# Browse available cohorts
list_tcga_datasets(short=True)

# Load a cohort (downloads on first call)
ds = load_tcga_dataset("BRCA")

# Real expression data, TC-normalized here (DESeq is the default)
real_df, real_groups = ds.real("TC")
real_df.shape
real_groups.value_counts()

# CVAE-synthesized counterpart — synth(normalization, model)
synth_df, synth_groups = ds.synth("TC", "CVAE1_5")
synth_df.shape

The real() and synth() accessors each return an (expression, groups) tuple: the expression matrix as a pandas.DataFrame and the group labels as a pandas.Series. Per-slice metadata (KL weight, epochs trained, etc.) is held by the underlying Subset objects, reachable via ds.processed[norm] and ds.synthetic[norm][model].

Data layout 

TCGADataset
│   attributes: name, cancer_type, group_labels, n_filtered_samples, ...
│
├── ds.raw                              → Subset
│       unfiltered raw counts, 1881 features
│
├── ds.processed[<norm>]                → Subset
│       filtered, log2-transformed
│       <norm> ∈ {"raw_norm", "TC", "DESeq" (default)}
│
└── ds.synthetic[<norm>][<model>]       → Subset
        CVAE-generated, 1000 samples
        <norm>  ∈ {"raw_norm", "TC", "DESeq" (default)}
        <model> ∈ {"CVAE1_5" (default), "CVAE1_10", "CVAE1_20"}

Each Subset has .expression (DataFrame), .groups (Series), and .metadata (dict).
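
To make the layout concrete, the sketch below enumerates the nine synthetic slices as (normalization, model) pairs. It uses only the standard library. The two tuples are retyped from the layout above rather than imported from syng_bts, so the snippet runs without the package installed.

```python
from itertools import product

# Retyped from the layout above for illustration; real code would read
# the package constants instead of hard-coding these names.
NORMALIZATIONS = ("raw_norm", "TC", "DESeq")
MODELS = ("CVAE1_5", "CVAE1_10", "CVAE1_20")

# Every synthetic slice is addressed as ds.synthetic[norm][model].
synthetic_keys = list(product(NORMALIZATIONS, MODELS))

for norm, model in synthetic_keys:
    print(f'ds.synthetic["{norm}"]["{model}"]')

print(len(synthetic_keys), "synthetic slices per cohort")
```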

Available Datasets 

TCGA cohorts in data-v1.0
Code	Cancer name	Samples
BLCA	Bladder Urothelial Carcinoma	432
BRCA	Breast Invasive Carcinoma	1144
COAD	Colon Adenocarcinoma	414
ESCA	Esophageal Carcinoma	200
HNSC	Head and Neck Squamous Cell Carcinoma	569
KICH	Kidney Chromophobe	91
KIRC	Kidney Renal Clear Cell Carcinoma	613
KIRP	Kidney Renal Papillary Cell Carcinoma	295
LAML	Acute Myeloid Leukemia	187
LIHC	Liver Hepatocellular Carcinoma	352
LUAD	Lung Adenocarcinoma	557
LUSC	Lung Squamous Cell Carcinoma	516
OV	Ovarian Serous Cystadenocarcinoma	502
PAAD	Pancreatic Adenocarcinoma	183
PCPG	Pheochromocytoma and Paraganglioma	187
PRAD	Prostate Adenocarcinoma	536
READ	Rectum Adenocarcinoma	144
SARC	Sarcoma	263
SKCM	Skin Cutaneous Melanoma	449
STAD	Stomach Adenocarcinoma	490
TGCT	Testicular Germ Cell Tumors	139
THCA	Thyroid Carcinoma	573
THYM	Thymoma	123
UCS	Uterine Carcinosarcoma	57

Sample counts reflect TC-normalized real expression. Richer per-cohort metadata is available at runtime via the TCGADataset attributes (ds.cancer_type, ds.group_labels, ds.schema_version, etc.) and per-slice on the underlying Subset objects (ds.processed[norm].metadata, ds.synthetic[norm][model].metadata).

Working with TCGADataset 

real() and synth() return tuples 

Both real() and synth() return a tuple[pandas.DataFrame, pandas.Series] — the expression matrix and the aligned group labels:

expression — pandas.DataFrame of shape (n_samples, n_features). Columns are miRNA features.
groups — pandas.Series aligned to expression.index with group labels (e.g. tumor vs. normal).

This is the most common usage pattern: pass expression as the feature matrix and groups as the labels to a downstream model or classifier.

Reaching the underlying Subset for metadata 

For per-slice metadata (HDF5 attributes captured at dataset assembly time — normalization method, transform, KL weight, epochs trained, etc.), reach the underlying Subset directly:

real_subset = ds.processed["TC"]                    # Subset for real / TC
synth_subset = ds.synthetic["TC"]["CVAE1_5"]        # Subset for synthetic / TC / CVAE1_5

real_subset.expression       # same DataFrame returned by ds.real("TC")[0]
real_subset.groups           # same Series returned by ds.real("TC")[1]
real_subset.metadata         # dict of HDF5 attributes

Choosing a normalization 

Three normalizations are available (constants in syng_bts.tcga.VALID_NORMALIZATIONS):

"raw_norm" — raw counts after preprocessing
"TC" — total-count normalized; common starting point for miRNA differential analyses
"DESeq" (default) — DESeq2 size-factor normalized; preferred when downstream analysis assumes DESeq2 conventions

Choosing a synthetic model 

Three CVAE variants (constants in syng_bts.tcga.VALID_MODELS) trade off reconstruction fidelity vs. latent-space diversity via the KL weight:

"CVAE1_5" (default) — KL weight 5; balanced
"CVAE1_10" — KL weight 10; higher diversity, lower reconstruction
"CVAE1_20" — KL weight 20; highest diversity

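The trade-off between the three variants can be read off a KL-weighted CVAE objective. The form below is the standard one and is an assumption about how the packaged models were trained, not something taken from the SyNG-BTS source. Here x is an expression profile, y its group label, z the latent variable, and β the KL weight (5, 10, or 20).

```latex
\mathcal{L}(x, y) =
  \mathbb{E}_{q_\phi(z \mid x, y)}\bigl[-\log p_\theta(x \mid z, y)\bigr]
  + \beta \, \mathrm{KL}\bigl(q_\phi(z \mid x, y) \,\|\, p(z)\bigr)
```

Under this assumption, a larger β penalizes posteriors that stray from the prior more heavily, which costs reconstruction accuracy. That is consistent with the "higher diversity, lower reconstruction" description of the higher-weight variants.
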
Caching and Offline Use 

Cache location 

By default, datasets are cached under ~/.cache/syng-bts/tcga/<version>/, where <version> is the manifest version (e.g. data-v1.0). Inspect the active cache root at runtime:

from syng_bts import tcga_cache_dir
tcga_cache_dir()
# PosixPath('/home/alice/.cache/syng-bts/tcga')

Custom cache directory 

Set the SYNG_BTS_CACHE_DIR environment variable to override the cache root. Useful for shared filesystems, CI runners with restricted homes, or when you want all SyNG-BTS caches in one place:

export SYNG_BTS_CACHE_DIR=/data/shared/syng-bts-cache
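
The same override can be applied from Python, for example in a test fixture. The sketch below sets the variable through os.environ and shows one way a cache root could be resolved from it. resolve_cache_root is a hypothetical helper written for illustration, not the function syng_bts uses internally, and the ~/.cache/syng-bts fallback is inferred from the default path above. Whether the library reads the variable at import time or on each call is not stated here, so setting it before importing syng_bts is the safer order.

```python
import os
from pathlib import Path


def resolve_cache_root(env=None):
    """Hypothetical helper: pick a cache root from the environment.

    Falls back to ~/.cache/syng-bts when SYNG_BTS_CACHE_DIR is unset or empty.
    """
    env = os.environ if env is None else env
    override = env.get("SYNG_BTS_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "syng-bts"


# Set the override for the current process only.
os.environ["SYNG_BTS_CACHE_DIR"] = "/data/shared/syng-bts-cache"
print(resolve_cache_root())
```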

First-call download semantics 

The first call to load_tcga_dataset() for a given dataset fetches the manifest, downloads the corresponding HDF5 file (~10–35 MB), verifies its SHA-256 checksum, and atomically renames the temporary file into the versioned cache directory. The download is retried once on a checksum mismatch or transient network error before an exception is raised.
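
The verify-then-rename pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the loader's actual code: sha256_of and install_verified are hypothetical names, and fetch is a caller-supplied stand-in for the HTTP download.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_verified(fetch, expected_sha256, dest, retries=1):
    """Fetch to a temp file, verify it, then atomically move it to dest.

    fetch(path) must write the payload to path. retries=1 mirrors the
    single retry described above (an assumption about the real loader).
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    last_error = None
    for _attempt in range(retries + 1):
        # Keep the temp file in the destination directory so os.replace
        # stays on one filesystem and the swap is atomic.
        fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
        os.close(fd)
        try:
            fetch(tmp_name)
            actual = sha256_of(tmp_name)
            if actual != expected_sha256:
                raise OSError(
                    f"sha256 mismatch: expected {expected_sha256}, got {actual}"
                )
            os.replace(tmp_name, dest)
            return dest
        except OSError as exc:
            last_error = exc
        finally:
            # Remove the temp file if it was not moved into place.
            if os.path.exists(tmp_name):
                os.remove(tmp_name)
    raise last_error
```

Because the file only appears at its final path after the checksum passes, a reader of the cache never sees a partially written or unverified download.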

Force redownload and cleanup 

To redownload a single cohort (e.g. after a corrupt download or to pick up an updated file from a custom mirror), pass force=True:

ds = load_tcga_dataset("BRCA", force=True)

To remove the entire TCGA cache (all versions, all cohorts):

from syng_bts import clear_tcga_cache
clear_tcga_cache()

Advanced 

Custom manifest URL 

For staging or private mirrors, pass manifest_url= to load_tcga_dataset() or list_tcga_datasets():

manifest = "https://my-mirror.example/tcga/manifest.json"
ds = load_tcga_dataset("BRCA", manifest_url=manifest)

The mirror must serve a manifest with the same schema as the published data-v1.0 manifest.

Dataset versioning 

The cache is keyed by manifest version. Releasing a new manifest (e.g. data-v1.1) creates a new versioned subdirectory under tcga_cache_dir() and does not invalidate or replace existing data-v1.0 files. Pinning to an older manifest with manifest_url= will still read from its versioned cache subdirectory.
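
Because each manifest version gets its own subdirectory, checking what is cached comes down to listing directories. cached_versions below is a hypothetical helper, and it assumes only the <cache root>/<version>/ layout described above; it makes no assumption about file names inside each version directory.

```python
from pathlib import Path


def cached_versions(cache_root):
    """Return the sorted names of version subdirectories under cache_root."""
    root = Path(cache_root)
    if not root.is_dir():
        # Nothing has been downloaded yet, or the root was cleared.
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

With the package installed, cached_versions(tcga_cache_dir()) would list the versions currently on disk.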

Errors users may see 

ValueError("Corrupt HDF5 at {path}; pass force=True to redownload.") — the cached file is unreadable. Pass force=True to redownload.
OSError — manifest unreachable or sha256 mismatch persists after one retry. Check connectivity and the manifest_url argument.
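
One way to handle the corrupt-cache case automatically is a thin wrapper that retries once with force=True. The sketch below takes the loader as an argument so it can be exercised without network access. load_with_recovery is a hypothetical helper, and it assumes that a ValueError from the loader signals an unreadable cached file; if the loader also raises ValueError for other reasons (for example an unknown cohort code), the second call will simply raise again.

```python
def load_with_recovery(loader, code, **kwargs):
    """Call loader(code); on ValueError, retry once with force=True.

    loader is expected to behave like load_tcga_dataset. OSError is left
    uncaught because a forced redownload cannot fix an unreachable manifest.
    """
    try:
        return loader(code, **kwargs)
    except ValueError:
        # Cached file unreadable: replace it with a fresh download.
        return loader(code, **{**kwargs, "force": True})
```

Typical use would be ds = load_with_recovery(load_tcga_dataset, "BRCA").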