TCGA Datasets
The TCGA loader downloads, caches, and exposes 24 packaged TCGA miRNA cohorts — each containing the raw expression matrix, three normalizations, and CVAE-synthesized counterparts — through a small Python API. For a runnable end-to-end walkthrough, see TCGA Quick Start.
Overview
Unlike the small parquet files in Example Datasets, TCGA cohorts are
downloaded on first use (~10–35 MB each) and cached
locally. Each dataset packages the raw counts, three normalizations of
the real data (raw_norm, TC, DESeq), and nine synthetic
groups (three CVAE models × three normalizations). Once cached, all
subsequent access is local.
Quick Start
from syng_bts import list_tcga_datasets, load_tcga_dataset
# Browse available cohorts
list_tcga_datasets(short=True)
# Load a cohort (downloads on first call)
ds = load_tcga_dataset("BRCA")
# Real expression data, TC-normalized here (DESeq is the default)
real_df, real_groups = ds.real("TC")
real_df.shape
real_groups.value_counts()
# CVAE-synthesized counterpart — synth(normalization, model)
synth_df, synth_groups = ds.synth("TC", "CVAE1_5")
synth_df.shape
The real() and synth() accessors return a
(expression, groups) tuple — the expression matrix as a
pandas.DataFrame and the group labels as a
pandas.Series. Per-slice metadata (KL weight, epochs trained,
etc.) is held by the underlying Subset objects,
reachable via ds.processed[norm] and ds.synthetic[norm][model].
Data layout
TCGADataset
│   attributes: name, cancer_type, group_labels, n_filtered_samples, ...
│
├── ds.raw → Subset
│       unfiltered raw counts, 1881 features
│
├── ds.processed[<norm>] → Subset
│       filtered, log2-transformed
│       <norm> ∈ {"raw_norm", "TC", "DESeq" (default)}
│
└── ds.synthetic[<norm>][<model>] → Subset
        CVAE-generated, 1000 samples
        <norm> ∈ {"raw_norm", "TC", "DESeq" (default)}
        <model> ∈ {"CVAE1_5" (default), "CVAE1_10", "CVAE1_20"}
Each Subset has .expression (DataFrame),
.groups (Series), and .metadata (dict).
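The layout above can be pictured with a pair of small dataclasses. This is only an illustrative sketch: `DatasetSketch`, the toy values, and the metadata keys (`normalization`, `kl_weight`) are hypothetical stand-ins, and plain lists take the place of the pandas objects that the real `Subset` holds.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Subset:
    """Hypothetical mirror of the documented Subset fields."""
    expression: Any                      # pandas.DataFrame in the real library
    groups: Any                          # pandas.Series in the real library
    metadata: dict = field(default_factory=dict)


@dataclass
class DatasetSketch:
    """Hypothetical stand-in for TCGADataset; not the library's class."""
    raw: Subset
    processed: dict                      # norm -> Subset
    synthetic: dict                      # norm -> model -> Subset

    def real(self, norm="DESeq"):
        subset = self.processed[norm]
        return subset.expression, subset.groups

    def synth(self, norm="DESeq", model="CVAE1_5"):
        subset = self.synthetic[norm][model]
        return subset.expression, subset.groups


# Toy data: lists stand in for the DataFrame and Series.
tc_real = Subset([[1.0, 2.0]], ["tumor"], {"normalization": "TC"})
tc_synth = Subset([[1.1, 1.9]], ["tumor"], {"kl_weight": 5})
toy = DatasetSketch(
    raw=Subset([[3, 7]], ["tumor"]),
    processed={"TC": tc_real},
    synthetic={"TC": {"CVAE1_5": tc_synth}},
)

expr, groups = toy.real("TC")
# The accessor returns the same objects that the Subset holds.
same_object = expr is toy.processed["TC"].expression
```

The accessors are thin conveniences over the nested dictionaries, which is why metadata is only reachable by indexing into `processed` or `synthetic` directly.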
Available Datasets
| Code | Cancer name | Samples |
|---|---|---|
| BLCA | Bladder Urothelial Carcinoma | 432 |
| BRCA | Breast Invasive Carcinoma | 1144 |
| COAD | Colon Adenocarcinoma | 414 |
| ESCA | Esophageal Carcinoma | 200 |
| HNSC | Head and Neck Squamous Cell Carcinoma | 569 |
| KICH | Kidney Chromophobe | 91 |
| KIRC | Kidney Renal Clear Cell Carcinoma | 613 |
| KIRP | Kidney Renal Papillary Cell Carcinoma | 295 |
| LAML | Acute Myeloid Leukemia | 187 |
| LIHC | Liver Hepatocellular Carcinoma | 352 |
| LUAD | Lung Adenocarcinoma | 557 |
| LUSC | Lung Squamous Cell Carcinoma | 516 |
| OV | Ovarian Serous Cystadenocarcinoma | 502 |
| PAAD | Pancreatic Adenocarcinoma | 183 |
| PCPG | Pheochromocytoma and Paraganglioma | 187 |
| PRAD | Prostate Adenocarcinoma | 536 |
| READ | Rectum Adenocarcinoma | 144 |
| SARC | Sarcoma | 263 |
| SKCM | Skin Cutaneous Melanoma | 449 |
| STAD | Stomach Adenocarcinoma | 490 |
| TGCT | Testicular Germ Cell Tumors | 139 |
| THCA | Thyroid Carcinoma | 573 |
| THYM | Thymoma | 123 |
| UCS | Uterine Carcinosarcoma | 57 |
Sample counts reflect TC-normalized real expression. Richer per-cohort
metadata is available at runtime via the TCGADataset
attributes (ds.cancer_type, ds.group_labels, ds.schema_version,
etc.) and per-slice on the underlying Subset objects
(ds.processed[norm].metadata, ds.synthetic[norm][model].metadata).
Working with TCGADataset
real() and synth() return tuples
Both real() and
synth() return a
tuple[pandas.DataFrame, pandas.Series] — the expression matrix and
the aligned group labels:
- expression: pandas.DataFrame of shape (n_samples, n_features). Columns are miRNA features.
- groups: pandas.Series aligned to expression.index with group labels (e.g. tumor vs. normal).
This is the most common usage pattern: pass expression to a model
and groups to a classifier.
Reaching the underlying Subset for metadata
For per-slice metadata (HDF5 attributes captured at dataset assembly
time — normalization method, transform, KL weight, epochs trained,
etc.), reach the underlying Subset directly:
real_subset = ds.processed["TC"] # Subset for real / TC
synth_subset = ds.synthetic["TC"]["CVAE1_5"] # Subset for synthetic / TC / CVAE1_5
real_subset.expression # same DataFrame returned by ds.real("TC")[0]
real_subset.groups # same Series returned by ds.real("TC")[1]
real_subset.metadata # dict of HDF5 attributes
Choosing a normalization
Three normalizations are available (constants in
syng_bts.tcga.VALID_NORMALIZATIONS):
- "raw_norm": raw counts after preprocessing
- "TC": total-count normalized; common starting point for miRNA differential analyses
- "DESeq" (default): DESeq2 size-factor normalized; preferred when downstream analysis assumes DESeq2 conventions
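To make the difference between the two scalings concrete, here is a minimal sketch of total-count scaling and of median-of-ratios size factors (the method DESeq2 uses) on a toy count matrix. It assumes rows are samples and columns are features, rescales total counts to the mean library size (one common convention), and omits the filtering and log2 steps mentioned in the data layout. The function names are hypothetical, and the packaged files may have been produced with different details.

```python
import math
from statistics import median


def total_count_normalize(counts):
    """Scale each sample (row) so its total equals the mean library size."""
    totals = [sum(row) for row in counts]
    target = sum(totals) / len(totals)
    return [[v * target / t for v in row] for row, t in zip(counts, totals)]


def deseq_size_factors(counts):
    """Median-of-ratios size factors, one per sample (row)."""
    n_features = len(counts[0])
    # Keep only features that are positive in every sample so logs are defined.
    usable = [j for j in range(n_features) if all(row[j] > 0 for row in counts)]
    # Log of the per-feature geometric mean across samples.
    log_geo_mean = {
        j: sum(math.log(row[j]) for row in counts) / len(counts) for j in usable
    }
    return [
        math.exp(median(math.log(row[j]) - log_geo_mean[j] for j in usable))
        for row in counts
    ]


# Toy matrix: the second sample was sequenced twice as deeply as the first.
counts = [[10, 20, 30], [20, 40, 60]]
tc = total_count_normalize(counts)
size_factors = deseq_size_factors(counts)
deseq_scaled = [[v / s for v in row] for row, s in zip(counts, size_factors)]
```

In this toy case both methods remove the twofold depth difference. They diverge on real data when a few highly expressed features dominate the library size, because the median of ratios is less sensitive to those features than the total count.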
Choosing a synthetic model
Three CVAE variants (constants in syng_bts.tcga.VALID_MODELS)
trade off reconstruction fidelity vs. latent-space diversity via the KL
weight:
- "CVAE1_5" (default): KL weight 5; balanced
- "CVAE1_10": KL weight 10; higher diversity, lower reconstruction
- "CVAE1_20": KL weight 20; highest diversity
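As a rough illustration of what a KL weight does, the sketch below computes a weighted VAE-style objective for the three weights listed above. It assumes a diagonal Gaussian posterior, a standard-normal prior, and a weight that multiplies the KL term. Whether the packaged models were trained with exactly this form is not stated here, and the function names and toy numbers are hypothetical.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def weighted_loss(recon_loss, mu, logvar, kl_weight):
    """Reconstruction loss plus the KL term scaled by kl_weight."""
    return recon_loss + kl_weight * gaussian_kl(mu, logvar)


# Toy latent code: one dimension shifted away from the prior mean.
mu, logvar = [1.0, 0.0], [0.0, 0.0]
kl = gaussian_kl(mu, logvar)
losses = {w: weighted_loss(2.0, mu, logvar, w) for w in (5, 10, 20)}
```

With the reconstruction term held fixed, a larger weight makes the same deviation from the prior more expensive, so training pulls the latent codes closer to the prior at the cost of reconstruction accuracy.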
Caching and Offline Use
Cache location
By default, datasets are cached under
~/.cache/syng-bts/tcga/<version>/, where <version> is the
manifest version (e.g. data-v1.0). Inspect the active cache root at
runtime:
from syng_bts import tcga_cache_dir
tcga_cache_dir()
# PosixPath('/home/alice/.cache/syng-bts/tcga')
Custom cache directory
Set the SYNG_BTS_CACHE_DIR environment variable to override the
cache root. Useful for shared filesystems, CI runners with restricted
homes, or when you want all SyNG-BTS caches in one place:
export SYNG_BTS_CACHE_DIR=/data/shared/syng-bts-cache
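The override logic can be sketched in a few lines. `resolve_cache_root` is a hypothetical helper, not part of the library. It assumes the variable's value is used unchanged as the cache root, and that the default is the `~/.cache/syng-bts/tcga` location shown above. Whether the library appends a subdirectory under the override is not specified here.

```python
import os
from pathlib import Path


def resolve_cache_root(env=None):
    """Hypothetical helper: choose the cache root from the environment."""
    env = os.environ if env is None else env
    override = env.get("SYNG_BTS_CACHE_DIR")
    if override:
        # Assumption: the override is used as-is (after expanding "~").
        return Path(override).expanduser()
    return Path.home() / ".cache" / "syng-bts" / "tcga"


# A mapping is passed in so the example does not depend on the real environment.
shared = resolve_cache_root({"SYNG_BTS_CACHE_DIR": "/data/shared/syng-bts-cache"})
default = resolve_cache_root({})
```

To confirm what the library actually resolved in a given session, call `tcga_cache_dir()` after setting the variable.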
First-call download semantics
The first call to load_tcga_dataset() for a given
dataset fetches the manifest, downloads the corresponding HDF5
(~10–35 MB), verifies its sha256, and atomically renames the temporary
file into the versioned cache directory. The download will retry once
on sha256 mismatch or transient network error before raising.
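The verify-then-rename pattern described above can be sketched with the standard library alone. This is an illustration of the technique, not the loader's actual code: `fetch_verified` and its `fetch` callable are hypothetical, and the `BRCA.h5` file name is an assumption made for the demo.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def fetch_verified(fetch, expected_sha256, dest, retries=1):
    """Fetch bytes, check their sha256, then atomically move them into place.

    `fetch` is a hypothetical zero-argument callable returning the file bytes.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    last_error = None
    for _ in range(retries + 1):
        try:
            payload = fetch()
        except OSError as exc:               # e.g. a transient network error
            last_error = exc
            continue
        if hashlib.sha256(payload).hexdigest() != expected_sha256:
            last_error = OSError(f"sha256 mismatch for {dest.name}")
            continue
        # Write next to the destination so the rename stays on one filesystem.
        fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
        try:
            with os.fdopen(fd, "wb") as handle:
                handle.write(payload)
            os.replace(tmp_name, dest)       # atomic rename into the cache
        except BaseException:
            if os.path.exists(tmp_name):
                os.unlink(tmp_name)
            raise
        return dest
    raise last_error


# Demo with an in-memory "server": the first response fails the checksum.
good = b"toy hdf5 bytes"
digest = hashlib.sha256(good).hexdigest()
responses = iter([b"corrupted", good])
with tempfile.TemporaryDirectory() as cache:
    path = fetch_verified(
        lambda: next(responses), digest, Path(cache) / "data-v1.0" / "BRCA.h5"
    )
    recovered = path.read_bytes()
```

Because the file only appears under its final name after the checksum passes, a reader of the cache directory never sees a partially written or unverified file.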
Force redownload and cleanup
To redownload a single cohort (e.g. after a corrupt download or to pick
up an updated file from a custom mirror), pass force=True:
ds = load_tcga_dataset("BRCA", force=True)
To remove the entire TCGA cache (all versions, all cohorts):
from syng_bts import clear_tcga_cache
clear_tcga_cache()
Advanced
Custom manifest URL
For staging or private mirrors, pass manifest_url= to
load_tcga_dataset() or
list_tcga_datasets():
manifest = "https://my-mirror.example/tcga/manifest.json"
ds = load_tcga_dataset("BRCA", manifest_url=manifest)
The mirror must serve a manifest with the same schema as the published
data-v1.0 manifest.
Dataset versioning
The cache is keyed by manifest version. Releasing a new manifest (e.g.
data-v1.1) creates a new versioned subdirectory under
tcga_cache_dir() and does not invalidate or replace existing
data-v1.0 files. Pinning to an older manifest with manifest_url=
will still read from its versioned cache subdirectory.
Errors users may see
- ValueError("Corrupt HDF5 at {path}; pass force=True to redownload."): the cached file is unreadable. Pass force=True to redownload.
- OSError: the manifest is unreachable, or a sha256 mismatch persists after one retry. Check connectivity and the manifest_url argument.
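One way to act on the first error automatically is a small wrapper that retries once with `force=True`. The sketch below is hedged: `load_with_recovery` and `fake_loader` are hypothetical names, and the fake loader only imitates the documented error message so the example runs without network access.

```python
def load_with_recovery(loader, code, **kwargs):
    """Call loader(code); if the cached file is corrupt, retry with force=True."""
    try:
        return loader(code, **kwargs)
    except ValueError:
        # A second ValueError (for any reason) propagates to the caller.
        return loader(code, **{**kwargs, "force": True})


# Stand-in loader that fails until asked to redownload.
calls = []


def fake_loader(code, force=False):
    calls.append(force)
    if not force:
        raise ValueError(
            f"Corrupt HDF5 at /tmp/{code}.h5; pass force=True to redownload."
        )
    return f"dataset:{code}"


result = load_with_recovery(fake_loader, "BRCA")
```

With the real library, `load_tcga_dataset` would be passed as `loader`. An `OSError` is deliberately left uncaught, since a forced redownload cannot fix an unreachable manifest.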
See also
- Synthetic Data Generation: using TCGA data with generate()/pilot_study()
- Evaluation Functions: visual evaluation of real vs. synthetic
- API Reference: full API reference for the TCGA symbols