Example Datasets

SyNG-BTS includes bundled datasets for testing and experimentation.

Note

For full TCGA cohorts (24 cancer types, real + synthetic, downloaded on demand), see TCGA Datasets. The bundled datasets on this page are small parquet files shipped inside the package for quick examples and case studies.

Overview

The bundled datasets come from TCGA (The Cancer Genome Atlas) studies:

  • BRCA - Breast Invasive Carcinoma

  • LIHC - Liver Hepatocellular Carcinoma

  • PRAD - Prostate Adenocarcinoma

  • SKCM - Skin Cutaneous Melanoma

Loading Datasets

Use the data utility functions to access bundled datasets:

from syng_bts import list_bundled_datasets, resolve_data

# List all available datasets
datasets = list_bundled_datasets()
print(datasets)

# Load a specific dataset as a DataFrame
data, groups = resolve_data("SKCMPositive_4")
print(f"Shape: {data.shape}")
print(f"Columns: {data.columns.tolist()[:5]}...")
print(f"Groups: {groups}")  # None for datasets without group labels
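
Because the second return value may be None, code that counts or tabulates labels needs a guard. The following is a minimal sketch, assuming the labels (when present) are any iterable of hashable values such as a list or a pandas Series; summarize_groups is a hypothetical helper for illustration, not part of syng_bts:

```python
from collections import Counter


def summarize_groups(groups):
    """Return {label: count}, or an empty dict when there are no group labels.

    `groups` is assumed to be None or an iterable of hashable labels.
    This helper is hypothetical and not provided by syng_bts.
    """
    if groups is None:
        return {}
    return dict(Counter(groups))


# Toy inputs standing in for the second value returned by resolve_data().
print(summarize_groups(None))
print(summarize_groups(["ILC", "IDC", "IDC"]))
```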

Available Datasets

Example Datasets

Dataset Name      Description

SKCMPositive_4    SKCM miRNA-seq data with mean threshold filtering (log scale > 4)
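
One plausible reading of "mean threshold filtering (log scale > 4)" is that a feature is kept only if its mean log-transformed expression across samples exceeds 4. The sketch below illustrates that reading on toy counts. The log2(count + 1) transform, the strict inequality, and the helper name filter_by_mean_log are assumptions of this example; the exact procedure used to build SKCMPositive_4 is not stated on this page.

```python
import math


def filter_by_mean_log(rows, feature_names, threshold=4.0):
    """Keep features whose mean log2(count + 1) across samples exceeds threshold.

    rows: list of samples, each a list of raw counts aligned with feature_names.
    The log2(count + 1) transform is an assumption of this sketch.
    """
    n_samples = len(rows)
    kept = []
    for j, name in enumerate(feature_names):
        mean_log = sum(math.log2(row[j] + 1) for row in rows) / n_samples
        if mean_log > threshold:
            kept.append(name)
    return kept


# Toy counts: 3 samples x 3 features (hypothetical miRNA names).
features = ["mir-a", "mir-b", "mir-c"]
counts = [
    [100, 3, 15],
    [250, 0, 15],
    [80, 7, 15],
]
print(filter_by_mean_log(counts, features))
```

In this toy input only mir-a passes: mir-c has a mean of exactly 4 on the log scale and is dropped by the strict comparison.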

Transfer Learning Datasets

Dataset Name    Description

BRCA            Breast Invasive Carcinoma miRNA-seq data

PRAD            Prostate Adenocarcinoma miRNA-seq data

BRCA Subtype Case Study

Dataset Name            Description

BRCASubtypeSel          BRCA with cancer subtypes (ILC, IDC), marker-filtered

BRCASubtypeSel_train    Training split of BRCASubtypeSel

BRCASubtypeSel_test     Test split of BRCASubtypeSel

SyntheSize Evaluation Datasets

These datasets are used for classifier-based sample-size evaluation with evaluate_sample_sizes() (see Sample-Size Evaluation (SyntheSize)).

Dataset Name                                        Description

BRCASubtypeSel_test                                 Real BRCA subtype test data (200 rows × 47 features, ILC/IDC groups)

BRCASubtypeSel_train_epoch285_CVAE1-20_generated    CVAE-generated BRCA data (1000 rows × 47 features, string group labels, count scale)

These can be used together to compare real vs. generated learning curves:

from syng_bts import evaluate_sample_sizes, plot_sample_sizes, resolve_data

real, groups_real = resolve_data("BRCASubtypeSel_test")
gen, groups_gen = resolve_data("BRCASubtypeSel_train_epoch285_CVAE1-20_generated")

metrics_real = evaluate_sample_sizes(real, [50, 100], groups=groups_real)
metrics_gen = evaluate_sample_sizes(gen, [50, 100], groups=groups_gen)

fig = plot_sample_sizes(metrics_real, n_target=200, metric_generated=metrics_gen)
fig.savefig("brca_learning_curves.png")
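
For readers unfamiliar with learning curves, the idea can be sketched without the library: train a classifier on subsets of increasing size and record held-out accuracy at each size. The example below is only a conceptual illustration on simulated data. It uses a simple nearest-centroid classifier and is not how evaluate_sample_sizes() is implemented; the function names, the classifier, and the resampling scheme are all assumptions of this sketch.

```python
import random


def nearest_centroid_accuracy(train, test):
    """Fit per-class centroids on train and return accuracy on test.

    train/test: lists of (feature_vector, label) pairs.
    """
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x):
        # Assign the label of the closest centroid (squared Euclidean distance).
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return sum(predict(x) == y for x, y in test) / len(test)


def learning_curve(data, sizes, n_draws=20, seed=0):
    """Mean held-out accuracy for each training size in `sizes`."""
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        scores = []
        for _ in range(n_draws):
            shuffled = data[:]
            rng.shuffle(shuffled)
            # First n samples train the classifier; the rest are held out.
            train, test = shuffled[:n], shuffled[n:]
            scores.append(nearest_centroid_accuracy(train, test))
        curve[n] = sum(scores) / n_draws
    return curve


# Toy two-class data: labels borrowed from the BRCA subtypes, values simulated.
rng = random.Random(1)
toy = [([rng.gauss(0, 1), rng.gauss(0, 1)], "IDC") for _ in range(100)]
toy += [([rng.gauss(2, 1), rng.gauss(2, 1)], "ILC") for _ in range(100)]
print(learning_curve(toy, sizes=[10, 50, 100]))
```

Running the same procedure on real and on generated data, then overlaying the two curves, is the kind of comparison the plot above is meant to show.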

LIHC Subtype Case Study

Dataset Name                        Description

LIHCSubtypeFamInd                   Liver Hepatocellular Carcinoma subtype data

LIHCSubtypeFamInd_DESeq             LIHC with DESeq2 normalization

LIHCSubtypeFamInd_test74            Test split (74 samples)

LIHCSubtypeFamInd_test74_DESeq      Test split with DESeq2 normalization

LIHCSubtypeFamInd_train294          Training split (294 samples)

LIHCSubtypeFamInd_train294_DESeq    Training split with DESeq2 normalization
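
This page does not spell out which DESeq2 step produced the _DESeq variants. Assuming it refers to DESeq2's median-of-ratios size factors, the idea can be sketched in plain Python as follows: each sample's size factor is the median, over features with no zero counts, of the ratio between the sample's count and the feature's geometric mean across samples. The function names are hypothetical, and samples are laid out as rows to match the convention used on this page.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts: list of samples (rows), each a list of raw counts per feature.
    Features with a zero count in any sample are skipped when forming ratios.
    """
    n_features = len(counts[0])
    # Log geometric mean of each feature across samples (None if any zero).
    log_geo_means = []
    for j in range(n_features):
        column = [row[j] for row in counts]
        if all(c > 0 for c in column):
            log_geo_means.append(sum(math.log(c) for c in column) / len(column))
        else:
            log_geo_means.append(None)
    factors = []
    for row in counts:
        log_ratios = [
            math.log(row[j]) - g
            for j, g in enumerate(log_geo_means)
            if g is not None
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide each sample's counts by its size factor."""
    return [[c / s for c in row] for row, s in zip(counts, size_factors(counts))]


# Toy counts: the second sample is the first sequenced at twice the depth.
toy = [
    [10, 20, 30, 0],
    [20, 40, 60, 5],
]
print(size_factors(toy))
print(normalize(toy))
```

After normalization the first three features agree between the two toy samples, which is the depth correction the size factors are meant to provide.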

Usage Examples

Case Study with BRCA Subtype

from syng_bts import generate

result = generate(
    data="BRCASubtypeSel_train",
    new_size=1000,
    model="CVAE1-20",
    apply_log=True,
    batch_frac=0.1,
    learning_rate=0.0005,
    epoch=10,
)
print(result.generated_data.shape)

Using Custom Datasets

You can use your own datasets as DataFrames or CSV file paths:

import pandas as pd
from syng_bts import generate

# From a DataFrame
my_data = pd.read_csv("my_data.csv", index_col=0)
result = generate(data=my_data, name="my_data", model="VAE1-10", epoch=10)

# From a CSV path
result = generate(data="./custom_data/my_data.csv", model="VAE1-10", epoch=10)

# Save results to disk
result.save("./results/")

Your CSV or Parquet file should have:

  • Samples as rows

  • Features (genes/miRNAs) as columns

  • The first column may hold sample IDs (or an index)

  • Do not include groups or samples columns; pass group labels via the groups parameter instead
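
If an existing file mixes labels and features, the labels need to be separated before the feature matrix is handed over. The sketch below shows one way to do that with the standard library only. It assumes the first column holds sample IDs, that the label column is literally named "groups", and that all remaining columns are numeric; split_features_and_groups is a hypothetical helper, not part of syng_bts.

```python
import csv
import io


def split_features_and_groups(csv_text, group_column="groups"):
    """Parse CSV text into (sample_ids, feature_names, rows, groups).

    Assumes the first column holds sample IDs and every other column except
    `group_column` is a numeric feature. `groups` is None when that column
    is absent.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    records = [r for r in reader if r]  # skip blank lines
    group_idx = header.index(group_column) if group_column in header else None
    feature_idx = [i for i in range(1, len(header)) if i != group_idx]

    sample_ids = [r[0] for r in records]
    feature_names = [header[i] for i in feature_idx]
    rows = [[float(r[i]) for i in feature_idx] for r in records]
    groups = None if group_idx is None else [r[group_idx] for r in records]
    return sample_ids, feature_names, rows, groups


# Toy file: samples as rows, features as columns, plus a label column to strip.
text = """sample_id,mir-a,mir-b,groups
s1,12,0,ILC
s2,30,4,IDC
"""
ids, names, rows, groups = split_features_and_groups(text)
print(names, rows, groups)
```

The feature-only rows can then be assembled into a DataFrame indexed by the sample IDs, with the extracted labels passed separately through the groups parameter.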

Data Source

The example datasets are derived from TCGA (The Cancer Genome Atlas) miRNA-seq data.

For more information about the data processing and marker selection, see the research paper:

Qin, L.-X., et al. (2025). Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach. Briefings in Bioinformatics, 26. https://doi.org/10.1093/bib/bbaf097