Usage Guide
This guide covers installation and basic usage of SyNG-BTS.
Installation
Requirements: Python 3.10 or later.
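To confirm that an interpreter meets this requirement before installing, a small check along these lines can help. This is a minimal sketch using only the standard library; meets_requirement is a hypothetical helper name, not part of SyNG-BTS.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the requirements above


def meets_requirement(version_info=None):
    """Return True if the (major, minor, ...) tuple is at least MIN_PYTHON."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    status = "OK" if meets_requirement() else "too old for SyNG-BTS"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```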
From PyPI (Recommended)
Install SyNG-BTS using pip:
$ pip install syng-bts
From Source
For development or the latest features:
$ git clone https://github.com/Omics-Data-Synthesis/SyNG-BTS
$ cd SyNG-BTS
$ pip install -e .
Optional Dependencies
Install documentation dependencies:
$ pip install syng-bts[docs]
Install development dependencies (testing, linting):
$ pip install syng-bts[dev]
Install all optional dependencies:
$ pip install syng-bts[all]
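After any of the install commands above, the installation can be verified from Python. The sketch below uses the standard-library importlib.metadata module and assumes the distribution is registered under the name "syng-bts", as in the pip commands; installed_version is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "syng-bts" is the distribution name used in the pip commands above.
found = installed_version("syng-bts")
print(f"syng-bts version: {found}" if found else "syng-bts is not installed")
```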
Quick Start
Basic Import
After installation, import SyNG-BTS in your Python code:
from syng_bts import (
    generate,
    pilot_study,
    transfer,
    list_bundled_datasets,
    resolve_data,
    SyngResult,
    PilotResult,
)
Browse Bundled Datasets
Use list_bundled_datasets() to browse available bundled datasets,
and resolve_data() to load them:
from syng_bts import list_bundled_datasets, resolve_data
# See available datasets
print(list_bundled_datasets())
# ['SKCMPositive_4', 'BRCA', 'PRAD', 'BRCASubtypeSel', ...]
# Load a bundled dataset (returns a tuple of DataFrame and optional groups)
data, groups = resolve_data("SKCMPositive_4")
print(f"Dataset shape: {data.shape}")
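Because the second element of the returned tuple is optional, downstream code may need to handle a missing value. The following is a minimal sketch, assuming groups is either None or an iterable with one label per sample (the exact return type is not specified here). summarize_groups is a hypothetical helper, not part of SyNG-BTS, and the labels are stand-in values.

```python
from collections import Counter


def summarize_groups(groups):
    """Count samples per group label; return {} when no labels are available.

    Assumption: `groups` is None for unlabeled data, otherwise an iterable
    with one label per sample.
    """
    if groups is None:
        return {}
    return dict(Counter(groups))


# Stand-in labels; in practice these would come from resolve_data(...).
print(summarize_groups(["tumor", "normal", "tumor"]))
print(summarize_groups(None))
```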
Load TCGA Cohorts
For the full set of 24 TCGA miRNA cohorts (real and CVAE-synthesized data,
downloaded on demand), use list_tcga_datasets() and
load_tcga_dataset():
from syng_bts import list_tcga_datasets, load_tcga_dataset
# Browse available cohorts
print(list_tcga_datasets(short=True))
# ['BLCA', 'BRCA', 'COAD', ..., 'UCS']
# Load a cohort (downloads on first call, then cached locally)
ds = load_tcga_dataset("BRCA")
# Real expression data — DESeq-normalized by default
real_df, real_groups = ds.real()
print(f"Real shape: {real_df.shape}")
# CVAE-synthesized counterpart
synth_df, synth_groups = ds.synth()
print(f"Synthetic shape: {synth_df.shape}")
See TCGA Datasets for the full guide (catalog, normalizations, caching, custom mirrors) and TCGA Quick Start for a runnable end-to-end example.
Generate Synthetic Data
Use generate() to train a generative model and produce
synthetic samples (see generate in Synthetic Data Generation):
from syng_bts import generate
result = generate(
    data="SKCMPositive_4",  # bundled dataset name, CSV path, or DataFrame
    model="VAE1-10",
    new_size=500,
    batch_frac=0.1,
    learning_rate=0.0005,
)
# Access results in memory
print(result.generated_data.shape) # (500, n_features)
print(result.loss.columns.tolist()) # ['kl', 'recons']
print(result.summary())
# Plot training loss (one figure per loss column)
figs = result.plot_loss()
figs["kl"].savefig("kl_loss.png")
# Optionally save to disk
result.save("./my_output/")
# Load a previously saved result
from syng_bts import SyngResult
loaded = SyngResult.load("./my_output/")
For grouped datasets, new_size supports two forms:
int: exact total generated sample count. Group counts follow the input group ratio (rounded).
list[int]: explicit per-group counts [n_group_0, n_group_1].
Here, group_0 is the first group value encountered in the input
group labels, and group_1 is the other group.
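The two forms can be sketched in plain Python. This is an illustration only: split_new_size is a hypothetical helper, and the rounding rule (round the group_0 share, give the remainder to group_1) is an assumption, since the text above says only that counts are rounded.

```python
def split_new_size(new_size, group_labels):
    """Return [n_group_0, n_group_1] for a two-group dataset.

    group_0 is the first label encountered in `group_labels`; group_1 is the
    other label. The rounding rule is an assumption for illustration.
    """
    if isinstance(new_size, (list, tuple)):
        # Explicit form: counts are used as given.
        n0, n1 = new_size
        return [int(n0), int(n1)]

    labels = list(group_labels)
    group_0 = labels[0]
    count_0 = sum(1 for g in labels if g == group_0)
    n0 = round(new_size * count_0 / len(labels))
    return [n0, new_size - n0]  # remainder keeps the total exact


labels = ["B", "A", "A", "B", "B"]  # group_0 is "B" (seen first), 3 of 5 samples
print(split_new_size(500, labels))        # [300, 200]
print(split_new_size([120, 80], labels))  # [120, 80]
```

Assigning the remainder to the second group guarantees that the two counts always sum to the requested total, whatever the rounding does.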
Run a Pilot Study
Use pilot_study() to sweep over multiple pilot sizes with
replicated random draws (see pilot_study in Synthetic Data Generation):
from syng_bts import pilot_study
pilot = pilot_study(
    data="SKCMPositive_4",
    pilot_size=[50, 100],
    model="VAE1-10",
    batch_frac=0.1,
    learning_rate=0.0005,
)
# Access individual runs
run = pilot.runs[(50, 1)] # (pilot_size, draw_index)
print(run.generated_data.head())
# Save all runs
pilot.save("./pilot_output/")
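The keys of pilot.runs are (pilot_size, draw_index) tuples. The sketch below builds the same kind of key set with a plain dict. The number of draws per pilot size (n_draws = 5) and the 1-based draw index are assumptions inferred from the (50, 1) key above, not documented behavior.

```python
from itertools import product

pilot_sizes = [50, 100]
n_draws = 5  # assumed number of replicated draws per pilot size

# Stand-in for pilot.runs: map each (pilot_size, draw_index) key to a label.
runs = {
    (size, draw): f"run for n={size}, draw {draw}"
    for size, draw in product(pilot_sizes, range(1, n_draws + 1))
}

# Collect all draws belonging to one pilot size.
draws_for_50 = [key for key in runs if key[0] == 50]
print(len(runs))
print(draws_for_50)
```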
Use DataFrame Input
Pass your own data as a pandas DataFrame:
import pandas as pd
from syng_bts import generate
my_data = pd.read_csv("my_dataset.csv")
result = generate(
    data=my_data,
    name="my_dataset",  # used in output filenames
    model="WGANGP",
    new_size=1000,
    epoch=50,
)
Evaluate Results
Visualize generated data using plot_heatmap()
(on the result object) or the standalone heatmap_eval() and
UMAP_eval() functions (see Evaluation Functions):
from syng_bts import generate, heatmap_eval, UMAP_eval, resolve_data
result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5)
# Built-in heatmap on the result object
fig = result.plot_heatmap()
# Standalone evaluation comparing real and generated data
real_data, _groups = resolve_data("SKCMPositive_4")
heatmap_eval(real_data=real_data, generated_data=result.generated_data)
UMAP_eval(real_data=real_data, generated_data=result.generated_data)
Sample-Size Evaluation (SyntheSize)
Evaluate how classifier performance scales with sample size using
evaluate_sample_sizes() and
plot_sample_sizes() (see Sample-Size Evaluation (SyntheSize) for full details).
By default, evaluate_sample_sizes() applies
log2(x + 1) to the input (apply_log=True). Set apply_log=False when the input
data is already log-transformed:
from syng_bts import evaluate_sample_sizes, plot_sample_sizes, resolve_data
# Load real data with group labels
data, groups = resolve_data("BRCASubtypeSel_test")
# Evaluate classifiers at different sample sizes
metrics = evaluate_sample_sizes(
    data=data,
    sample_sizes=[50, 100, 150],
    groups=groups,
    n_draws=5,
)
# Plot inverse power-law learning curves
fig = plot_sample_sizes(metrics, n_target=200)
fig.savefig("learning_curves.png")
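The log2(x + 1) transform that evaluate_sample_sizes() applies by default can be illustrated on raw counts with the standard library alone. This is a sketch of the transform itself, not the library's code; log2_plus_one is a hypothetical name.

```python
import math


def log2_plus_one(counts):
    """Apply log2(x + 1) element-wise; the +1 keeps zero counts finite."""
    return [math.log2(x + 1) for x in counts]


raw = [0, 1, 3, 1023]
print(log2_plus_one(raw))  # [0.0, 1.0, 2.0, 10.0]
```

Applying the transform a second time would compress values that are already on a log scale, which is why apply_log=False is the appropriate setting for log-transformed input.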
You can also pass a SyngResult directly; groups are
resolved automatically from the result object:
from syng_bts import generate, evaluate_sample_sizes
result = generate(data="BRCASubtypeSel_train", model="CVAE1-20", epoch=10)
metrics = evaluate_sample_sizes(result, sample_sizes=[50, 100], which="generated")
Next Steps
See Synthetic Data Generation for all synthetic data generation methods.
See Sample-Size Evaluation (SyntheSize) for details on sample-size evaluation.
See Configuration Reference for all available parameters.
See API Reference for the complete API documentation.
See Example Datasets for information about bundled datasets.
See Migration Guide / Changelog for upgrading from v2.x.