Example Datasets
SyNG-BTS includes bundled datasets for testing and experimentation.
Note: For full TCGA cohorts (24 cancer types, real + synthetic, downloaded on demand), see TCGA Datasets. The bundled datasets on this page are small Parquet files shipped inside the package for quick examples and case studies.
Overview
The bundled datasets come from TCGA (The Cancer Genome Atlas) studies:
BRCA - Breast Invasive Carcinoma
PRAD - Prostate Adenocarcinoma
SKCM - Skin Cutaneous Melanoma
LIHC - Liver Hepatocellular Carcinoma
Loading Datasets
Use the data utility functions to access bundled datasets:
```python
from syng_bts import list_bundled_datasets, resolve_data

# List all available datasets
datasets = list_bundled_datasets()
print(datasets)

# Load a specific dataset as a DataFrame
data, groups = resolve_data("SKCMPositive_4")
print(f"Shape: {data.shape}")
print(f"Columns: {data.columns.tolist()[:5]}...")
print(f"Groups: {groups}")  # None for datasets without group labels
```
Available Datasets
Example Datasets
| Dataset Name | Description |
|---|---|
| `SKCMPositive_4` | SKCM miRNA-seq data with mean threshold filtering (log scale > 4) |
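To make "mean threshold filtering (log scale > 4)" concrete, here is a minimal sketch of one way such a filter could work. The `log2(count + 1)` transform, the helper name `filter_by_mean_log`, and the toy miRNA names are assumptions for illustration; the source does not state the exact transform used to build the bundled file.

```python
import math


def filter_by_mean_log(counts, threshold=4.0):
    """Keep features whose mean log2(count + 1) across samples exceeds threshold.

    counts: dict mapping feature name -> list of raw counts (one per sample).
    The log base and the +1 pseudocount are assumptions, not the package's spec.
    """
    kept = {}
    for feature, values in counts.items():
        mean_log = sum(math.log2(v + 1) for v in values) / len(values)
        if mean_log > threshold:
            kept[feature] = values
    return kept


# Toy counts for three hypothetical miRNAs across four samples.
toy = {
    "mir_a": [100, 250, 80, 400],  # high expression, kept
    "mir_b": [2, 0, 5, 1],         # near zero, dropped
    "mir_c": [7, 7, 7, 7],         # mean log2(8) = 3, dropped
}
filtered = filter_by_mean_log(toy)
print(sorted(filtered))  # ['mir_a']
```

Filtering on the log-scale mean removes features that are close to zero in most samples, which tends to leave fewer, better-behaved columns for a generative model to fit.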
Transfer Learning Datasets
| Dataset Name | Description |
|---|---|
| | Breast Invasive Carcinoma miRNA-seq data |
| | Prostate Adenocarcinoma miRNA-seq data |
BRCA Subtype Case Study
| Dataset Name | Description |
|---|---|
| `BRCASubtypeSel` | BRCA with cancer subtypes (ILC, IDC), marker-filtered |
| `BRCASubtypeSel_train` | Training split of BRCASubtypeSel |
| `BRCASubtypeSel_test` | Test split of BRCASubtypeSel |
SyntheSize Evaluation Datasets
These datasets are used for classifier-based sample-size evaluation with
evaluate_sample_sizes() (see Sample-Size Evaluation (SyntheSize)).
| Dataset Name | Description |
|---|---|
| `BRCASubtypeSel_test` | Real BRCA subtype test data (200 rows × 47 features, ILC/IDC groups) |
| `BRCASubtypeSel_train_epoch285_CVAE1-20_generated` | CVAE-generated BRCA data (1000 rows × 47 features, string group labels, count scale) |
These can be used together to compare real vs. generated learning curves:
```python
from syng_bts import evaluate_sample_sizes, plot_sample_sizes, resolve_data

# Load the real test split and the CVAE-generated data
real, groups_real = resolve_data("BRCASubtypeSel_test")
gen, groups_gen = resolve_data("BRCASubtypeSel_train_epoch285_CVAE1-20_generated")

# Evaluate classifiers at each candidate sample size
metrics_real = evaluate_sample_sizes(real, [50, 100], groups=groups_real)
metrics_gen = evaluate_sample_sizes(gen, [50, 100], groups=groups_gen)

# Plot both learning curves and save the figure
fig = plot_sample_sizes(metrics_real, n_target=200, metric_generated=metrics_gen)
fig.savefig("brca_learning_curves.png")
```
LIHC Subtype Case Study
| Dataset Name | Description |
|---|---|
| | Liver Hepatocellular Carcinoma subtype data |
| | LIHC with DESeq2 normalization |
| | Test split (74 samples) |
| | Test split with DESeq2 normalization |
| | Training split (294 samples) |
| | Training split with DESeq2 normalization |
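For readers unfamiliar with DESeq2 normalization, the sketch below illustrates the median-of-ratios idea behind DESeq2 size factors in plain Python. It is an illustration only: the helper names `size_factors` and `normalize` and the toy counts are hypothetical, and nothing here claims to reproduce how the bundled LIHC files were actually processed.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts: list of samples, each a list of raw counts over the same genes.
    Genes with a zero count in any sample are excluded from the reference.
    """
    n_genes = len(counts[0])
    log_geo_means = []
    for g in range(n_genes):
        column = [sample[g] for sample in counts]
        if all(c > 0 for c in column):
            # Log of the geometric mean of this gene across samples
            log_geo_means.append(sum(math.log(c) for c in column) / len(column))
        else:
            log_geo_means.append(None)

    factors = []
    for sample in counts:
        log_ratios = [
            math.log(sample[g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide each sample's counts by its size factor."""
    factors = size_factors(counts)
    return [[c / f for c in sample] for sample, f in zip(counts, factors)]


# Two toy samples: the second was sequenced roughly twice as deeply.
toy = [
    [10, 20, 30, 0],
    [20, 40, 60, 5],
]
factors = size_factors(toy)
normalized = normalize(toy)
print(factors)
print(normalized)
```

After normalization, the first three genes have matching values in both toy samples, which is the effect the size factors are meant to achieve. The sketch raises an error if every gene contains a zero in some sample, because no reference gene remains.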
Usage Examples
Case Study with BRCA Subtype
```python
from syng_bts import generate

# Train a CVAE on the bundled BRCA training split and generate 1000 samples
result = generate(
    data="BRCASubtypeSel_train",
    new_size=1000,
    model="CVAE1-20",
    apply_log=True,
    batch_frac=0.1,
    learning_rate=0.0005,
    epoch=10,
)
print(result.generated_data.shape)
```
Using Custom Datasets
You can use your own datasets as DataFrames or CSV file paths:
```python
import pandas as pd
from syng_bts import generate

# From a DataFrame
my_data = pd.read_csv("my_data.csv", index_col=0)
result = generate(data=my_data, name="my_data", model="VAE1-10", epoch=10)

# From a CSV path
result = generate(data="./custom_data/my_data.csv", model="VAE1-10", epoch=10)

# Save results to disk
result.save("./results/")
```
Your CSV or Parquet file should have:
- Samples as rows
- Features (genes/miRNAs) as columns
- Optionally, sample IDs or an index in the first column

Do not include `groups` or `samples` columns; pass group labels via the `groups` parameter instead.
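If an existing file already carries a `groups` or `samples` column, it needs to be split out before the data is passed on. The sketch below shows one way to do that with only the standard library; the helper `split_reserved_columns` and the toy CSV are hypothetical and not part of SyNG-BTS.

```python
import csv
import io

# Column names that should not appear among the features.
RESERVED = ("groups", "samples")


def split_reserved_columns(csv_text):
    """Separate feature columns from reserved columns in CSV text.

    Returns (feature_rows, groups):
      feature_rows: list of dicts with the reserved columns removed
                    (values stay as strings, as read by the csv module).
      groups: list of group labels, or None if there is no `groups` column.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    has_groups = bool(rows) and "groups" in rows[0]
    groups = [row["groups"] for row in rows] if has_groups else None
    features = [
        {key: value for key, value in row.items() if key not in RESERVED}
        for row in rows
    ]
    return features, groups


raw = """sample_id,mir_a,mir_b,groups
s1,10,3,ILC
s2,7,12,IDC
"""
features, groups = split_reserved_columns(raw)
print(groups)             # ['ILC', 'IDC']
print(list(features[0]))  # ['sample_id', 'mir_a', 'mir_b']
```

The extracted labels are what would then be supplied through the `groups` parameter. With pandas, `df.pop("groups")` achieves the same separation on a DataFrame.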
Data Source
The example datasets are derived from TCGA (The Cancer Genome Atlas) miRNA-seq data.
For more information about the data processing and marker selection, see the research paper:
Qin, L.-X., et al. (2025). Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach. Briefings in Bioinformatics, 26. https://doi.org/10.1093/bib/bbaf097