Example Datasets

SyNG-BTS includes bundled datasets for testing and experimentation.

Note: For full TCGA cohorts (24 cancer types, real + synthetic, downloaded on demand), see TCGA Datasets. The bundled datasets on this page are small Parquet files shipped inside the package for quick examples and case studies.

Overview 

The bundled datasets come from TCGA (The Cancer Genome Atlas) studies:

BRCA - Breast Invasive Carcinoma
PRAD - Prostate Adenocarcinoma
SKCM - Skin Cutaneous Melanoma
LIHC - Liver Hepatocellular Carcinoma

Loading Datasets 

Use the data utility functions to access bundled datasets:

from syng_bts import list_bundled_datasets, resolve_data

# List all available datasets
datasets = list_bundled_datasets()
print(datasets)

# Load a specific dataset as a DataFrame
data, groups = resolve_data("SKCMPositive_4")
print(f"Shape: {data.shape}")
print(f"Columns: {data.columns.tolist()[:5]}...")
print(f"Groups: {groups}")  # None for datasets without group labels
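Dataset names are plain strings, so a typo is easy to make. The sketch below shows one way to check a name against the available ones before loading it. The helper `suggest_dataset` is hypothetical and not part of SyNG-BTS. The name list is copied from the tables on this page; in practice it would come from `list_bundled_datasets()`, which is assumed here to yield strings.

```python
from difflib import get_close_matches

# Names copied from the tables on this page. In practice this list would come
# from list_bundled_datasets() (assumption: it yields plain strings).
KNOWN_NAMES = [
    "SKCMPositive_4",
    "BRCA",
    "PRAD",
    "BRCASubtypeSel",
    "BRCASubtypeSel_train",
    "BRCASubtypeSel_test",
]


def suggest_dataset(name, known_names):
    """Return name if it is known, else the closest known name, else None.

    Hypothetical helper for illustration; not a SyNG-BTS function.
    """
    if name in known_names:
        return name
    matches = get_close_matches(name, known_names, n=1)
    return matches[0] if matches else None


print(suggest_dataset("SKCMPositive4", KNOWN_NAMES))  # SKCMPositive_4
```

The returned name can then be passed to `resolve_data`. A result of `None` means nothing similar was found.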

Available Datasets 

Example Datasets 

Dataset Name	Description
`SKCMPositive_4`	SKCM miRNA-seq data with mean threshold filtering (log scale > 4)

Transfer Learning Datasets 

Dataset Name	Description
`BRCA`	Breast Invasive Carcinoma miRNA-seq data
`PRAD`	Prostate Adenocarcinoma miRNA-seq data

BRCA Subtype Case Study 

Dataset Name	Description
`BRCASubtypeSel`	BRCA with cancer subtypes (ILC, IDC), marker-filtered
`BRCASubtypeSel_train`	Training split of BRCASubtypeSel
`BRCASubtypeSel_test`	Test split of BRCASubtypeSel

SyntheSize Evaluation Datasets 

These datasets are used for classifier-based sample-size evaluation with evaluate_sample_sizes() (see Sample-Size Evaluation (SyntheSize)).

Dataset Name	Description
`BRCASubtypeSel_test`	Real BRCA subtype test data (200 rows × 47 features, ILC/IDC groups)
`BRCASubtypeSel_train_epoch285_CVAE1-20_generated`	CVAE-generated BRCA data (1000 rows × 47 features, string group labels, count scale)

These can be used together to compare real vs. generated learning curves:

from syng_bts import evaluate_sample_sizes, plot_sample_sizes, resolve_data

real, groups_real = resolve_data("BRCASubtypeSel_test")
gen, groups_gen = resolve_data("BRCASubtypeSel_train_epoch285_CVAE1-20_generated")

sample_sizes = [50, 100, 150, 200]
metrics_real = evaluate_sample_sizes(real, sample_sizes, groups=groups_real)
metrics_gen = evaluate_sample_sizes(gen, sample_sizes, groups=groups_gen)

fig = plot_sample_sizes(metrics_real, metric_generated=metrics_gen)
fig.savefig("brca_learning_curves.png")
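The sample sizes in the example are hard-coded to match the 200-row test set. As a minimal sketch, assuming you want an evenly spaced grid that stays within the number of rows available, the list can be derived from the row count instead. The helper `sample_size_grid` is hypothetical and not part of SyNG-BTS.

```python
def sample_size_grid(n_rows, step):
    """Return evenly spaced sample sizes from step up to n_rows, inclusive.

    Hypothetical helper for illustration; not a SyNG-BTS function.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(step, n_rows + 1, step))


# 200 is the row count listed for BRCASubtypeSel_test in the table above.
sample_sizes = sample_size_grid(200, 50)
print(sample_sizes)  # [50, 100, 150, 200]
```

With a loaded DataFrame, `len(real)` would supply the row count in place of the literal 200.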

LIHC Subtype Case Study 

Dataset Name	Description
`LIHCSubtypeFamInd`	Liver Hepatocellular Carcinoma subtype data
`LIHCSubtypeFamInd_DESeq`	LIHC with DESeq2 normalization
`LIHCSubtypeFamInd_test74`	Test split (74 samples)
`LIHCSubtypeFamInd_test74_DESeq`	Test split with DESeq2 normalization
`LIHCSubtypeFamInd_train294`	Training split (294 samples)
`LIHCSubtypeFamInd_train294_DESeq`	Training split with DESeq2 normalization

Usage Examples 

Case Study with BRCA Subtype 

from syng_bts import generate

result = generate(
    data="BRCASubtypeSel_train",
    new_size=1000,
    model="CVAE1-20",
    apply_log=True,
    batch_frac=0.1,
    learning_rate=0.0005,
    epoch=10,
)
print(result.generated_data.shape)

Using Custom Datasets 

You can use your own datasets as DataFrames or CSV file paths:

import pandas as pd
from syng_bts import generate

# From a DataFrame
my_data = pd.read_csv("my_data.csv", index_col=0)
result = generate(data=my_data, name="my_data", model="VAE1-10", epoch=10)

# From a CSV path
result = generate(data="./custom_data/my_data.csv", model="VAE1-10", epoch=10)

# Save results to disk
result.save("./results/")

Your CSV or Parquet file should follow this layout:

Samples as rows
Features (genes/miRNAs) as columns
The first column may hold sample IDs or an index
No groups or samples columns; pass group labels via the groups parameter instead
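The layout rules above can be checked before a file is handed to `generate`. The sketch below reads only the header row of a CSV and reports any columns named `groups` or `samples`. It assumes the rule refers to those literal, lowercase column names, which this page does not spell out. The helper name and the example values are made up for illustration.

```python
import csv
import io

# Column names this page says must not appear in the input file.
# Assumption: the match is on these exact, lowercase names.
RESERVED_COLUMNS = {"groups", "samples"}


def reserved_columns_in_header(csv_file):
    """Return the reserved column names found in a CSV file's header row.

    Hypothetical helper for illustration; not a SyNG-BTS function.
    """
    header = next(csv.reader(csv_file), [])
    return sorted(RESERVED_COLUMNS.intersection(header))


# Tiny in-memory CSV standing in for a real file (made-up values).
example = io.StringIO(
    "sample_id,mir-21,mir-155,groups\n"
    "s1,120,35,ILC\n"
    "s2,98,41,IDC\n"
)
found = reserved_columns_in_header(example)
print(found)  # ['groups']
```

If a reserved column is found, one option is to load the file with pandas, remove the label column with `DataFrame.pop`, and pass the resulting Series through the `groups` parameter.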

Data Source 

The example datasets are derived from TCGA (The Cancer Genome Atlas) miRNA-seq data.

For more information about the data processing and marker selection, see the research paper:

Qin, L.-X., et al. (2025). Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach. Briefings in Bioinformatics, 26. https://doi.org/10.1093/bib/bbaf097