API Reference
This page documents the complete public API of SyNG-BTS.
Experiment Functions
These are the main entry points for training generative models and producing synthetic data. All three functions accept data as a pandas DataFrame, a CSV file path, or a bundled dataset name, and return rich result objects.
generate
- syng_bts.generate(data: DataFrame | str | Path, *, name: str | None = None, groups: Series | ndarray | None = None, new_size: int | list[int] = 500, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, val_ratio: float = 0.2, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, use_scheduler: bool = False, step_size: int = 10, gamma: float = 0.5, cap: bool = False, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') SyngResult[source]
Train a deep generative model and generate synthetic data.
This is the primary entry point for training a single model and generating synthetic samples. It replaces the legacy ApplyExperiment function.
- Parameters:
data (DataFrame, str, or Path) – Input data: a pandas DataFrame, a path to a CSV file, or the name of a bundled dataset (e.g. "SKCMPositive_4").
name (str or None) – Short name for output filenames. Derived automatically when None.
groups (pd.Series, np.ndarray, or None) – Optional binary group labels. When provided, these labels take precedence over bundled dataset groups.
new_size (int or list[int]) – Generation size.
- If int: generate exactly new_size samples. For grouped data, counts are split by the input group ratio and rounded to integers.
- If list[int]: explicit grouped counts [n_group_0, n_group_1].
For grouped data, group_0 is the base group used by create_labels() (first encountered group value) and group_1 is the other group.
model (str) – Model specification, e.g. "VAE1-10" (parsed into model type and kl_weight).
apply_log (bool) – Apply log2(x + 1) preprocessing.
batch_frac (float) – Batch size as a fraction of sample count.
learning_rate (float) – Optimizer learning rate.
epoch (int or None) – Fixed epoch count, or None for early stopping.
The interaction between epoch and early_stop_patience:
epoch | early_stop_patience | Behaviour
None | None | Early stopping ON, patience=30, max 1 000 epochs
None | 30 | Early stopping ON, patience=30, max 1 000 epochs
500 | None | Early stopping OFF, run exactly 500 epochs
500 | 30 | Early stopping ON, patience=30, max 500 epochs
val_ratio (float) – Validation split ratio (AE family only).
early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30.
off_aug (str or None) – Offline augmentation: "AE_head", "Gaussian_head", or None.
AE_head_num (int) – Fold multiplier for AE-head augmentation.
Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.
use_scheduler (bool) – Enable learning-rate scheduler (AE family).
step_size (int) – Scheduler step size.
gamma (float) – Scheduler gamma.
cap (bool) – Cap generated values to observed range.
random_seed (int) – Random seed for reproducibility.
CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE (512→256→128→64 instead of 256→128→64). Suitable for high-dimensional data like RNA. Ignored for non-CVAE models (default: False).
output_dir (str, Path, or None) – If set, automatically save results to this directory.
verbose (int or str) – Verbosity level for training output.
- "silent" or 0: no output during training.
- "minimal" or 1 (default): print only training summaries and early-stopping messages.
- "detailed" or 2: print per-epoch progress (epoch number, loss values, elapsed time, learning rate).
- Returns:
Rich result object containing generated data, loss log, reconstructed data (AE/VAE/CVAE), model state, and metadata.
- Return type:
SyngResult
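The interaction between epoch and early_stop_patience documented for generate() can be sketched as a small resolver. This is an illustration only: resolve_schedule and TrainingSchedule are hypothetical names, not part of SyNG-BTS, and passing a patience other than 30 through unchanged is an assumption, since the table only shows 30.

```python
from typing import NamedTuple, Optional


class TrainingSchedule(NamedTuple):
    """Hypothetical container for the resolved training schedule."""
    early_stopping: bool
    patience: Optional[int]
    max_epochs: int


def resolve_schedule(epoch: Optional[int] = None,
                     early_stop_patience: Optional[int] = None) -> TrainingSchedule:
    """Mirror the documented epoch / early_stop_patience table (sketch only)."""
    if epoch is None:
        # No fixed epoch count: early stopping is on, capped at 1 000 epochs.
        # Patience falls back to 30 when it is not given.
        patience = 30 if early_stop_patience is None else early_stop_patience
        return TrainingSchedule(True, patience, 1000)
    if early_stop_patience is None:
        # Fixed epoch count and no patience: run exactly `epoch` epochs.
        return TrainingSchedule(False, None, epoch)
    # Both given: early stopping is on and `epoch` acts as the upper bound.
    return TrainingSchedule(True, early_stop_patience, epoch)
```

Calling resolve_schedule(500) corresponds to the third table row (early stopping off, exactly 500 epochs), while resolve_schedule() corresponds to the first.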
pilot_study
- syng_bts.pilot_study(data: DataFrame | str | Path, pilot_size: list[int], *, name: str | None = None, groups: Series | ndarray | None = None, n_draws: int = 5, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') PilotResult[source]
Sweep over pilot sizes with replicated random draws.
For each pilot size, n_draws random sub-samples are drawn from the original data. A model is trained on each sub-sample and synthetic data equal to n_draws times the sub-sample size is generated.
This replaces the legacy PilotExperiment function.
- Parameters:
data (DataFrame, str, or Path) – Input data.
name (str or None) – Short name for output filenames.
groups (pd.Series, np.ndarray, or None) – Optional binary group labels. When provided, these labels take precedence over bundled dataset groups.
n_draws (int) – Number of replicated random draws per pilot size (default: 5). Must be a positive integer.
model (str) – Model specification (e.g. "VAE1-10").
apply_log (bool) – Apply log2(x + 1) preprocessing.
batch_frac (float) – Batch size as a fraction of sample count.
learning_rate (float) – Optimizer learning rate.
epoch (int or None) – Fixed epoch count, or None for early stopping. See generate() for the full interaction table.
early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30. See generate() for the full interaction table.
off_aug (str or None) – Offline augmentation mode.
AE_head_num (int) – Fold multiplier for AE-head augmentation.
Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.
random_seed (int) – Base random seed for reproducibility.
CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE; see generate().
output_dir (str, Path, or None) – If set, automatically save results to this directory.
verbose (int or str) – Verbosity level; see generate() for details.
- Returns:
Wrapper containing one SyngResult per (pilot_size, draw).
- Return type:
PilotResult
transfer
- syng_bts.transfer(source_data: DataFrame | str | Path, target_data: DataFrame | str | Path, *, source_name: str | None = None, target_name: str | None = None, source_groups: Series | ndarray | None = None, target_groups: Series | ndarray | None = None, new_size: int | list[int] = 500, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') SyngResult[source]
Train on source data, then fine-tune and generate on target data.
The model is first trained on source_data and its learned state is kept in memory, then fine-tuned on target_data. This is a single-run operation returning a SyngResult.
This replaces the legacy TransferExperiment function.
- Parameters:
source_data (DataFrame, str, or Path) – Pre-training dataset.
target_data (DataFrame, str, or Path) – Fine-tuning / target dataset.
source_name (str or None) – Short name for the source dataset.
target_name (str or None) – Short name for the target dataset.
source_groups (pd.Series, np.ndarray, or None) – Optional binary groups for the source dataset.
target_groups (pd.Series, np.ndarray, or None) – Optional binary groups for the target dataset.
new_size (int or list[int]) – Generation size for the fine-tuned target model.
- If int: generate exactly new_size samples. For grouped data, counts are split by the target input group ratio and rounded to integers.
- If list[int]: explicit grouped counts [n_group_0, n_group_1].
For grouped data, group_0 is the base group used by create_labels() (first encountered group value) and group_1 is the other group.
model (str) – Model specification.
apply_log (bool) – Apply log2 preprocessing.
batch_frac (float) – Batch fraction.
learning_rate (float) – Learning rate.
epoch (int or None) – Fixed epoch count, or None for early stopping. See generate() for the full interaction table.
early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30. See generate() for the full interaction table.
off_aug (str or None) – Offline augmentation mode.
AE_head_num (int) – Fold multiplier for AE-head augmentation.
Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.
random_seed (int) – Random seed.
CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE; see generate().
output_dir (str, Path, or None) – If set, save results here.
verbose (int or str) – Verbosity level; see generate() for details.
- Returns:
Result from the fine-tuned target-phase model.
- Return type:
SyngResult
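The new_size rule for grouped data (an int is split by the input group ratio and rounded, a list is taken as explicit counts) can be sketched as follows. split_new_size is a hypothetical helper, not a SyNG-BTS function, and giving the rounding remainder to group_1 so the total stays exact is an assumption; the documentation only says the counts are rounded to integers.

```python
def split_new_size(new_size, n_group_0, n_group_1):
    """Turn `new_size` into [n_group_0, n_group_1] generation counts (sketch).

    n_group_0 / n_group_1 are the group sizes observed in the input data.
    """
    if isinstance(new_size, (list, tuple)):
        # Explicit grouped counts are used as given.
        if len(new_size) != 2:
            raise ValueError("expected [n_group_0, n_group_1]")
        return [int(new_size[0]), int(new_size[1])]

    total = n_group_0 + n_group_1
    if total <= 0:
        raise ValueError("group sizes must sum to a positive number")

    # Assumption: round group 0's share and assign the remainder to group 1,
    # so that the two counts always add up to exactly `new_size`.
    count_0 = round(new_size * n_group_0 / total)
    return [count_0, new_size - count_0]
```

For example, an input with 30 samples in group_0 and 70 in group_1 and new_size=500 yields [150, 350] under these assumptions.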
Result Objects
Experiment functions return result objects that carry generated data, loss logs, reconstructed data, and model state as attributes.
SyngResult
- class syng_bts.SyngResult(generated_data: DataFrame, loss: DataFrame, reconstructed_data: DataFrame | None = None, original_data: DataFrame | None = None, model_state: dict[str, ~typing.Any] | None=None, metadata: dict[str, ~typing.Any]=<factory>, original_groups: Series | None = None, generated_groups: Series | None = None, reconstructed_groups: Series | None = None)[source]
Bases: object
Result of a single SyNG-BTS model training and generation run.
- generated_data
Synthetic samples with the original column names preserved.
- Type:
pd.DataFrame
- loss
Training loss log (columns depend on the model family).
- Type:
pd.DataFrame
- reconstructed_data
Reconstructions of the input data (AE/VAE/CVAE only).
- Type:
pd.DataFrame or None
- original_data
The full original input data.
- Type:
pd.DataFrame or None
- model_state
The state_dict() of the trained model, suitable for torch.save() / torch.load().
- Type:
dict or None
- metadata
Run parameters and summary statistics, e.g. model name, kl_weight, seed, epoch count, input data dimensions.
- Type:
dict
- original_groups
Group labels for the original input data. Populated when groups were provided or bundled with the dataset.
- Type:
pd.Series or None
- generated_groups
Group labels for the generated data, derived from the label column produced during generation and mapped back to the original group values.
- Type:
pd.Series or None
- reconstructed_groups
Group labels for the reconstructed data (AE/VAE/CVAE only), derived from the label column and mapped back to original group values.
- Type:
pd.Series or None
Examples
>>> result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5)
>>> result.generated_data.head()
>>> result.save("./my_output/")
>>> figs = result.plot_loss()  # dict[str, Figure]
- generate_new_samples(n: int, *, mode: str = 'new') SyngResult[source]
Generate new synthetic samples from the trained model.
This method reuses the same generation and post-processing path as generate(), applying the same inverse-log transform and column naming.
- Parameters:
n (int) – Number of new samples to generate.
mode (str) – How to incorporate the new samples:
- "new" (default): return a new SyngResult whose generated_data contains only the newly generated samples. All other fields (loss, metadata, model_state, etc.) are copied from self.
- "overwrite": replace self.generated_data with the new samples and return self.
- "append": append the new samples to self.generated_data and return self.
- Returns:
The result containing the new samples (see mode).
- Return type:
SyngResult
- Raises:
ValueError – If model_state is None, arch_params is missing from metadata, or mode is not one of the accepted values.
Examples
>>> result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5)
>>> new_result = result.generate_new_samples(200)
>>> new_result.generated_data.shape[0]
200
>>> # After save/load round-trip:
>>> loaded = SyngResult.load("output/")
>>> more = loaded.generate_new_samples(100, mode="append")
>>> more.generated_data.shape[0]  # original + 100
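The difference between the three mode values of generate_new_samples() is mainly about object identity: "new" returns a fresh object, while "overwrite" and "append" mutate and return the same one. The toy sketch below illustrates this with a plain list standing in for generated_data; ToyResult and add_samples are hypothetical names and do not reproduce the library's implementation.

```python
from dataclasses import dataclass, field, replace


@dataclass
class ToyResult:
    """Stand-in for SyngResult; `generated` plays the role of generated_data."""
    generated: list
    metadata: dict = field(default_factory=dict)


def add_samples(result: ToyResult, new_samples: list, mode: str = "new") -> ToyResult:
    """Apply the documented 'new' / 'overwrite' / 'append' semantics (sketch)."""
    if mode == "new":
        # Fresh object holding only the new samples; the original is untouched.
        return replace(result, generated=list(new_samples))
    if mode == "overwrite":
        # Same object, old samples discarded.
        result.generated = list(new_samples)
        return result
    if mode == "append":
        # Same object, new samples added after the existing ones.
        result.generated = result.generated + list(new_samples)
        return result
    raise ValueError(f"mode must be 'new', 'overwrite' or 'append', got {mode!r}")
```

With mode="new" the original result keeps its samples, which is the safer choice when the first batch is still needed for comparison.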
- save(output_dir: str | Path, prefix: str | None = None) dict[str, Path][source]
Save all non-None results to output_dir.
Files are written into a single flat directory. CSVs include column headers. Model state is saved as a .pt file. Metadata is written as a human-readable JSON file.
- Parameters:
- Returns:
Mapping of output type ("generated", "loss", "reconstructed", "model", "metadata") to the written file path.
- Return type:
dict[str, Path]
- plot_loss(running_average_window: int = 25, x_axis: str = 'epochs') dict[str, matplotlib.pyplot.Figure][source]
Plot the training loss curve(s), one figure per loss column.
Each returned figure shows the raw loss series (alpha=0.4) and a running-average overlay.
- Parameters:
- Returns:
{loss_column_name: figure} for every column in self.loss.
- Return type:
dict[str, Figure]
- Raises:
ValueError – If running_average_window ≤ 0, if x_axis is not "iterations" or "epochs", if x_axis="epochs" but metadata["epochs_trained"] is missing or ≤ 0, or if the window is larger than a loss series.
- plot_heatmap(which: str = 'generated', log_scale: bool = True) matplotlib.pyplot.Figure[source]
Render a seaborn heatmap of generated or reconstructed data.
- Parameters:
- Returns:
The heatmap figure (not shown; caller decides when to display).
- Return type:
matplotlib.figure.Figure
- Raises:
ValueError – If which is "reconstructed" but no reconstructed data exists, or if which is not a recognised value.
- summary() str[source]
Return a short textual summary of this result.
- Returns:
A paragraph describing the run dimensions, epoch count, and final loss values.
- Return type:
str
- classmethod load(directory: str | Path, prefix: str | None = None) SyngResult[source]
Load a previously saved SyngResult from disk.
- Parameters:
- Returns:
Reconstructed result with all available artifacts.
- Return type:
SyngResult
- Raises:
FileNotFoundError – If the required *_generated.csv or *_loss.csv file is missing.
ValueError – If prefix is None and zero or more than one *_generated.csv file is found (ambiguous).
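When prefix is None, load() has to infer it from the directory contents, and the ValueError above covers the case where that inference is ambiguous. A minimal sketch of such a lookup, assuming files are named <prefix>_generated.csv as the error description suggests; infer_prefix is a hypothetical helper, not the library's code.

```python
from pathlib import Path


def infer_prefix(directory):
    """Return the unique prefix of '<prefix>_generated.csv' in `directory` (sketch)."""
    suffix = "_generated.csv"
    matches = sorted(Path(directory).glob(f"*{suffix}"))
    if len(matches) != 1:
        # Zero files or several candidates: the prefix cannot be inferred.
        names = [m.name for m in matches]
        raise ValueError(
            f"expected exactly one *{suffix} file, found {len(matches)}: {names}"
        )
    # Strip the fixed suffix to recover the prefix.
    return matches[0].name[: -len(suffix)]
```

Passing prefix explicitly avoids the lookup altogether, which is the practical workaround when several runs share one output directory.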
PilotResult
- class syng_bts.PilotResult(runs: dict[tuple[int, int], ~syng_bts.result.SyngResult] = <factory>, original_data: ~pandas.DataFrame | None = None, metadata: dict[str, ~typing.Any] = <factory>)[source]
Bases: object
Result of a pilot study run across multiple pilot sizes and draws.
- runs
Mapping of (pilot_size, draw_index) → individual run result. draw_index is 1-based (1 through n_draws, which is 5 by default).
- Type:
dict[tuple[int, int], SyngResult]
- original_data
The full original input data (before subsetting).
- Type:
pd.DataFrame or None
- metadata
Shared metadata across all runs (model, data dimensions, etc.).
- Type:
dict
Examples
>>> result = pilot_study(data="SKCMPositive_4", pilot_size=[50, 100], ...)
>>> result.runs[(50, 1)].generated_data.head()
>>> result.save("./pilot_output/")
- save(output_dir: str | Path, prefix: str | None = None) dict[tuple[int, int], dict[str, Path]][source]
Save all individual run results to output_dir.
Each run is saved with a filename that encodes the pilot size and draw index.
- Parameters:
- Returns:
Nested mapping: (pilot_size, draw) → {output_type → path}.
- Return type:
dict[tuple[int, int], dict[str, Path]]
- plot_loss(style: str = 'overlay_runs', running_average_window: int = 25, x_axis: str = 'epochs', truncate: bool = True) dict[tuple[int, int], dict[str, matplotlib.pyplot.Figure]] | dict[str, matplotlib.pyplot.Figure][source]
Plot loss curves for every run in the pilot study.
- Parameters:
style (str) – Plotting style for loss trajectories.
- "per_run": one figure per run per loss column, delegating to SyngResult.plot_loss().
- "overlay_runs" (default): overlay all runs on the same plot for each loss column. Only the running-average line is drawn per run (no raw trace) to keep the plot readable.
- "mean_band": plot the mean loss trajectory across all runs for each loss column, with a shaded ±1 std band. Mean and std are computed on raw loss values; the mean line is then optionally smoothed with a running average.
For all styles, y-axis scaling is applied to reduce the effect of large initial spikes (analogous to SyngResult.plot_loss()).
running_average_window (int) – Window size for the running-average overlay. Must be > 0. Default: 25.
x_axis (str) – "epochs" (default) maps the x-axis to epoch space using each run's metadata["epochs_trained"]. "iterations" numbers data points 0…N-1.
truncate (bool) – Only relevant for style="mean_band" and style="overlay_runs". When True (default), only epochs/iterations common to all runs are plotted (truncated to the shortest run). When False, all epochs/iterations are plotted; statistics are computed from whichever runs still have data at each point.
- Returns:
- style="per_run": nested dict keyed by (pilot_size, draw) → {column: Figure}.
- style="overlay_runs" or style="mean_band": flat dict {column: Figure}.
- Return type:
dict[tuple[int, int], dict[str, Figure]] or dict[str, Figure]
- Raises:
ValueError – If style is not one of the accepted values, if running_average_window ≤ 0, or if x_axis is invalid.
Examples
>>> figs = pilot_result.plot_loss(style="overlay_runs")
>>> figs = pilot_result.plot_loss(style="mean_band", truncate=False)
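The "mean_band" statistics and the truncate flag can be illustrated with plain lists of loss values. The sketch below is not the library's implementation: running_average and mean_band are hypothetical helpers, the running average is assumed to be a simple trailing window, and the population standard deviation is assumed for the band.

```python
from statistics import fmean, pstdev


def running_average(values, window=25):
    """Trailing-window mean; yields len(values) - window + 1 points (assumed form)."""
    values = list(values)
    if window <= 0:
        raise ValueError("window must be > 0")
    if window > len(values):
        raise ValueError("window is larger than the loss series")
    total = sum(values[:window])
    averages = [total / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest.
        total += values[i] - values[i - window]
        averages.append(total / window)
    return averages


def mean_band(runs, truncate=True):
    """Return (means, stds) across runs that may have different lengths."""
    if not runs:
        raise ValueError("need at least one run")
    lengths = [len(run) for run in runs]
    # truncate=True keeps only steps common to all runs (the shortest run).
    n_points = min(lengths) if truncate else max(lengths)
    means, stds = [], []
    for i in range(n_points):
        # With truncate=False, only runs that still have data at step i count.
        column = [run[i] for run in runs if i < len(run)]
        means.append(fmean(column))
        stds.append(pstdev(column))
    return means, stds
```

With truncate=False the band narrows to zero wherever only one run is still training, which is worth keeping in mind when reading the tail of such a plot.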
Evaluation Functions
Functions for evaluating and visualizing generated data.
heatmap_eval
- syng_bts.heatmap_eval(real_data: DataFrame, generated_data: DataFrame | None = None, *, apply_log: bool = True, cmap: str = 'YlGnBu') matplotlib.figure.Figure[source]
Create a heatmap visualization comparing real and generated data.
If only one dataset is provided, displays a single heatmap. If both real and generated data are provided, displays them side by side.
- Parameters:
real_data (pd.DataFrame) – The original/real data.
generated_data (pd.DataFrame or None, optional) – The generated/synthetic data. If None, only real_data is plotted.
apply_log (bool, default True) – Whether to apply log2(x + 1) transformation to both real and generated data before visualization.
cmap (str, default "YlGnBu") – Colormap passed to seaborn.heatmap().
- Returns:
The matplotlib Figure containing the heatmap(s).
- Return type:
Figure
UMAP_eval
- syng_bts.UMAP_eval(real_data: DataFrame, generated_data: DataFrame | None = None, *, apply_log: bool = True, groups_real: Series | None = None, groups_generated: Series | None = None, random_seed: int = 42, legend_pos: str = 'best') matplotlib.figure.Figure[source]
Create a UMAP visualization comparing real and generated data.
Uses UMAP dimensionality reduction to visualize high-dimensional data in 2D, with optional group colouring.
- Parameters:
real_data (pd.DataFrame) – The original/real data.
generated_data (pd.DataFrame or None, optional) – The generated/synthetic data. If None, only real_data is visualised.
apply_log (bool, default True) – Whether to apply log2(x + 1) transformation to both real and generated data before dimensionality reduction.
groups_real (pd.Series or None, optional) – Group labels for real samples. Used for styling.
groups_generated (pd.Series or None, optional) – Group labels for generated samples. Used for styling.
random_seed (int, default 42) – Random seed for UMAP reproducibility.
legend_pos (str, default "best") – Legend position ("best", "upper right", "lower left", …).
- Returns:
The matplotlib Figure containing the UMAP scatter plot.
- Return type:
Figure
evaluation
- syng_bts.evaluation(real_data: DataFrame | str | Path, generated_data: DataFrame | str | Path, *, real_groups: Series | ndarray | list | tuple | Index | None = None, generated_groups: Series | ndarray | list | tuple | Index | None = None, n_samples: int | None = 200, apply_log: bool = True, random_seed: int = 42) dict[str, matplotlib.figure.Figure][source]
Preprocessing and visualization of generated vs real data.
Loads and preprocesses the input data, then creates heatmap and UMAP visualizations comparing generated and real datasets.
- Parameters:
real_data (pd.DataFrame, str, or Path) – The original/real dataset. Accepts a DataFrame, a file path, or a bundled dataset name (resolved via resolve_data()).
generated_data (pd.DataFrame, str, or Path) – The generated/synthetic dataset. Same input types as real_data.
real_groups (pd.Series, np.ndarray, list, tuple, pd.Index, or None, optional) – Group labels for the real samples. When provided, takes precedence over any bundled groups resolved from real_data. Values are used as-is for plot labels (converted to str).
generated_groups (pd.Series, np.ndarray, list, tuple, pd.Index, or None, optional) – Group labels for the generated samples. When provided, takes precedence over any bundled groups resolved from generated_data. Values are used as-is for plot labels (converted to str).
n_samples (int or None, default 200) – Number of samples from each end of the dataset to use for visualization (to keep UMAP fast). If None, all samples are used.
apply_log (bool, default True) – Whether to apply log2(x + 1) transformation to both real and generated data before comparison.
random_seed (int, default 42) – Random seed for UMAP reproducibility.
- Returns:
{"heatmap": <Figure>, "umap": <Figure>}: the two evaluation figures. Neither figure has been displayed; the caller decides when to call plt.show() or fig.savefig().
- Return type:
dict[str, Figure]
Sample-Size Evaluation (SyntheSize)
Classifier-based sample-size evaluation using inverse power-law learning curves. See Sample-Size Evaluation (SyntheSize) for the full usage guide.
evaluate_sample_sizes
- syng_bts.evaluate_sample_sizes(data: pd.DataFrame | SyngResult, sample_sizes: list[int] | np.ndarray | pd.Series | int, groups: np.ndarray | pd.Series | list | None = None, which: str = 'generated', n_draws: int = 5, apply_log: bool = True, methods: list[str] | None = None, verbose: int | str = 'minimal') pd.DataFrame[source]
Evaluate classifiers across candidate sample sizes.
For each classifier and each candidate sample size, performs n_draws rounds of stratified sampling (proportional to class distribution), applies 5-fold cross-validation, and averages metrics across folds.
- Parameters:
data (pd.DataFrame or SyngResult) – The dataset to evaluate. When a SyngResult is provided, the which parameter selects the data attribute and groups are auto-resolved from the corresponding *_groups field.
sample_sizes (list[int], np.ndarray, pd.Series, or int) – Candidate sample sizes to evaluate. Accepts a list, numpy array, or pandas Series of positive integers. When a single int is provided it is interpreted as the number of equidistant sizes to create; the maximum equals the number of data rows. For example, sample_sizes=3 with 15-row data produces [5, 10, 15].
groups (array-like or None) – Class labels corresponding to the rows of data. Required when data is a pd.DataFrame. When provided alongside a SyngResult, overrides the auto-resolved groups.
which (str, default "generated") – Selector when data is a SyngResult: "generated", "original", or "reconstructed".
n_draws (int, default 5) – Number of resampling repetitions for each sample size.
apply_log (bool, default True) – When True, a log2(x + 1) transform is applied to the data before evaluation.
methods (list[str] or None) – Classifier names to evaluate. Accepts canonical names ('LOGIS', 'SVM', 'KNN', 'RF', 'XGB') and common aliases ('LOGISTIC', 'LR', 'RANDOM_FOREST', 'XGBOOST'). Defaults to all five classifiers.
verbose (int or str, default "minimal") – Controls output verbosity. Accepts 0 / "silent" (no output), 1 / "minimal" (one dynamic overall progress bar across all sample sizes, draws, and methods), or 2 / "detailed" (per-draw/method metric lines).
- Returns:
Columns: total_size, draw, method, f1_score, accuracy, auc.
- Return type:
pd.DataFrame
- Raises:
TypeError – If data is not a pd.DataFrame or SyngResult.
ValueError – If groups is missing when required, which is invalid, methods contains unknown names, sample_sizes is empty or contains non-positive values, or any sample size exceeds the number of available rows.
Examples
Using a DataFrame:
>>> df = pd.read_csv("mydata.csv")
>>> groups = df.pop("group")
>>> result = evaluate_sample_sizes(df, sample_sizes=[50, 100], groups=groups)
Using a SyngResult:
>>> from syng_bts import generate
>>> sr = generate(data="BRCASubtypeSel_test", model="CVAE1-20", epoch=10)
>>> result = evaluate_sample_sizes(sr, sample_sizes=[50], which="generated")
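The integer form of sample_sizes (a count of equidistant sizes whose maximum equals the number of rows) can be sketched as below. equidistant_sizes is a hypothetical helper; it reproduces the documented example, but how the library rounds when the row count is not evenly divisible is an assumption.

```python
def equidistant_sizes(n_sizes, n_rows):
    """Expand an integer `sample_sizes` into a list of candidate sizes (sketch)."""
    if n_sizes <= 0:
        raise ValueError("n_sizes must be a positive integer")
    if n_sizes > n_rows:
        raise ValueError("cannot create more sizes than there are rows")
    # Evenly spaced steps ending at n_rows; rounding is an assumption.
    return [round(n_rows * i / n_sizes) for i in range(1, n_sizes + 1)]
```

With 15 rows and n_sizes=3 this yields [5, 10, 15], matching the example in the parameter description.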
plot_sample_sizes
- syng_bts.plot_sample_sizes(metric_real: DataFrame, n_target: int | list, metric_generated: DataFrame | None = None, metric_name: str = 'f1_score') matplotlib.pyplot.Figure[source]
Visualize inverse power-law function (IPLF) learning curves fitted from evaluation metrics.
Fits inverse power-law curves to the evaluation metrics produced by evaluate_sample_sizes() and plots observed values, fitted curves, and 95% confidence intervals.
The returned figure is never displayed automatically; call fig.savefig(...) or plt.show() explicitly to display or save.
- Parameters:
metric_real (pd.DataFrame) – Metrics from evaluate_sample_sizes() on real data.
n_target (int or list) – Target sample sizes for extrapolation reference.
metric_generated (pd.DataFrame or None) – Metrics from evaluate_sample_sizes() on generated data. When provided, a second column of panels is added.
metric_name (str, default "f1_score") – Metric to visualize ("f1_score", "accuracy", or "auc").
- Returns:
The figure containing the learning-curve panels.
- Return type:
matplotlib.figure.Figure
Examples
>>> metrics = evaluate_sample_sizes(df, [50, 100, 200], groups=g)
>>> fig = plot_sample_sizes(metrics, n_target=300)
>>> fig.savefig("learning_curves.png")
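For intuition about what an inverse power-law learning curve looks like, the sketch below evaluates one common parameterization, f(n) = (1 - a) - b * n^c with c < 0. This form is an assumption made for illustration; the exact function and fitting procedure used by plot_sample_sizes() are not stated on this page, and inverse_power_law is a hypothetical helper.

```python
def inverse_power_law(n, a, b, c):
    """Evaluate f(n) = (1 - a) - b * n**c (assumed form, with c < 0).

    a : gap between the asymptotic score and 1.0
    b : scale of the sample-size-dependent penalty
    c : decay rate; negative, so the penalty shrinks as n grows
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    return (1.0 - a) - b * n ** c


# Predicted scores rise toward the asymptote 1 - a as n increases.
curve = [inverse_power_law(n, a=0.1, b=0.5, c=-0.5) for n in (25, 100, 400)]
```

Under this form, extrapolating to a target size such as n_target=300 simply means evaluating the fitted curve at n=300.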
Data Utilities
Functions for loading and managing datasets.
resolve_data
- syng_bts.resolve_data(data: DataFrame | str | Path) tuple[DataFrame, Series | None][source]
Resolve a flexible data input to a pandas DataFrame and optional groups.
Accepts a DataFrame (returned as-is with None groups), a file path (loaded via pd.read_csv / pd.read_parquet), or the name of a bundled dataset.
- Parameters:
data (pd.DataFrame, str, or Path) – One of:
- A pd.DataFrame, returned directly with groups None.
- A str or Path pointing to an existing CSV or Parquet file (must include an extension such as .csv or .parquet).
- A plain name (no extension, no path separators) of a bundled dataset, e.g. "SKCMPositive_4".
- Returns:
(features_df, groups_or_none). Groups are a pd.Series only when the input is a bundled dataset that ships with a groups sidecar. For user-provided files and DataFrames, groups are always None.
- Return type:
tuple[pd.DataFrame, pd.Series | None]
- Raises:
ValueError – If data looks like a bundled-dataset name but is not found in the registry. The error message lists all available bundled datasets.
FileNotFoundError – If data looks like a file path but the file does not exist.
TypeError – If data is not a DataFrame, str, or Path.
Examples
>>> from syng_bts.data_utils import resolve_data
>>> df, groups = resolve_data("SKCMPositive_4")  # bundled
>>> df, groups = resolve_data("./my_data/custom.csv")  # file path
>>> df, groups = resolve_data(existing_dataframe)  # pass-through
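The rule that separates file paths from bundled-dataset names (a bundled name has no extension and no path separators) can be sketched for str and Path inputs as follows. classify_data_input is a hypothetical helper, not part of SyNG-BTS; the DataFrame pass-through case is left out so the sketch needs only the standard library.

```python
from pathlib import Path


def classify_data_input(data):
    """Label a str/Path input as 'file' or 'bundled' (sketch of the documented rule)."""
    if not isinstance(data, (str, Path)):
        # The real function also accepts DataFrames; omitted in this sketch.
        raise TypeError(f"expected str or Path, got {type(data).__name__}")
    text = str(data)
    has_separator = "/" in text or "\\" in text
    has_extension = Path(text).suffix != ""
    # Anything that looks like a path is treated as a file; the rest as a name.
    return "file" if (has_separator or has_extension) else "bundled"
```

This also explains the two error types listed above: a name-like input that is not in the registry raises ValueError, while a path-like input that does not exist raises FileNotFoundError.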
list_bundled_datasets
- syng_bts.list_bundled_datasets() list[source]
List all available bundled datasets.
- Returns:
List of dataset names that can be loaded with resolve_data().
- Return type:
list
TCGA Datasets
The TCGA loader downloads, caches, and exposes 24 packaged TCGA miRNA cohorts (raw + normalized + CVAE-synthesized). See TCGA Datasets for the narrative guide and TCGA Quick Start for a runnable example.
load_tcga_dataset
- syng_bts.load_tcga_dataset(name: str, *, force: bool = False, manifest_url: str | None = None) TCGADataset[source]
Download (if needed) and return a TCGA cohort as a TCGADataset.
On first call for a given dataset, the loader fetches the manifest, downloads the corresponding HDF5 file, verifies its sha256, and caches the file under tcga_cache_dir() / <version>. Subsequent calls reuse the cached file.
- Parameters:
name – Cohort code (e.g. "BRCA") or full dataset name from the manifest. Cancer-type prefixes resolve to the canonical entry when unambiguous.
force – If True, redownload even when a cached file exists.
manifest_url – Override the published manifest URL (useful for staging mirrors). Defaults to the data-v1.0 release manifest.
- Returns:
A TCGADataset exposing TCGADataset.real() and TCGADataset.synth() accessors.
- Raises:
ValueError – If name does not match any cohort in the manifest, if the cached HDF5 file is corrupt (pass force=True to redownload), or if the sha256 checksum fails twice after retry.
OSError – If the manifest or HDF5 file cannot be downloaded due to a network failure.
Example
>>> from syng_bts import load_tcga_dataset
>>> ds = load_tcga_dataset("BRCA")
>>> real_df, real_groups = ds.real()
>>> real_df.shape
(1144, 570)
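The sha256 check mentioned in the download step can be reproduced with the standard library, for instance to validate a cached file by hand. This is a generic sketch using hashlib, not the loader's own code; sha256_of_file and verify_download are hypothetical names, and the expected digest would come from the manifest.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large HDF5 files are never fully loaded in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True when the file's digest matches the expected hex string."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

If a manual check like this fails, passing force=True to load_tcga_dataset() is the documented way to trigger a fresh download.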
list_tcga_datasets
- syng_bts.list_tcga_datasets(*, short: bool = False, manifest_url: str | None = None) list[str][source]
Return the names of all TCGA cohorts in the published manifest.
- Parameters:
short – If True, return short cohort codes (e.g. "BRCA"). Otherwise return the full manifest dataset names.
manifest_url – Override the published manifest URL. Defaults to the data-v1.0 release manifest.
- Returns:
A list of dataset names. With short=False (default), names are the full manifest entries; with short=True, the leading cohort code.
Example
>>> from syng_bts import list_tcga_datasets
>>> list_tcga_datasets(short=True)[:3]
['ACC', 'BLCA', 'BRCA']
tcga_cache_dir
- syng_bts.tcga_cache_dir() Path[source]
Return the active TCGA cache directory (without the version subdir).
Honors the SYNG_BTS_CACHE_DIR environment variable if set; otherwise returns ~/.cache/syng-bts/tcga. The directory is not created by this call.
- Returns:
The cache root for TCGA datasets. Versioned dataset files live under tcga_cache_dir() / <manifest-version>.
Example
>>> from syng_bts import tcga_cache_dir
>>> tcga_cache_dir()
PosixPath('/home/alice/.cache/syng-bts/tcga')
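The lookup order described above (environment variable first, then the default under the home directory) can be sketched as follows. resolve_cache_dir is a hypothetical helper, and using the SYNG_BTS_CACHE_DIR value verbatim, without appending a subdirectory, is an assumption about how the library interprets it.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=None):
    """Return the cache root without creating it (sketch of the documented rule)."""
    # Accept a mapping for testing; fall back to the real environment.
    env = os.environ if env is None else env
    override = env.get("SYNG_BTS_CACHE_DIR")
    if override:
        # Assumption: the override is used as-is as the cache root.
        return Path(override).expanduser()
    return Path.home() / ".cache" / "syng-bts" / "tcga"
```

Setting the variable is useful on shared machines where the home directory has a small quota and the HDF5 files should live on scratch storage.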
clear_tcga_cache
- syng_bts.clear_tcga_cache() None[source]
Remove the entire TCGA cache directory.
Deletes tcga_cache_dir() recursively. The next call to load_tcga_dataset() will redownload from the manifest. Use this for full cleanup; for per-dataset redownload, prefer load_tcga_dataset() with force=True.
Example
>>> from syng_bts import clear_tcga_cache
>>> clear_tcga_cache()
TCGADataset
- class syng_bts.TCGADataset(*, name: str, cancer_type: str, clinical_variable: str, group_labels: list[str], n_raw_samples: int, n_filtered_samples: int, n_raw_features: int, n_filtered_features: int, schema_version: str, creation_date: str, syng_bts_version: str, raw: Subset, processed: dict[str, Subset], synthetic: dict[str, dict[str, Subset]])[source]
Bases: object
A loaded TCGA cohort with real and synthetic accessors.
Returned by load_tcga_dataset(). Wraps a single HDF5 file containing the raw expression matrix, three normalizations (raw_norm, TC, DESeq), and nine synthetic groups (three CVAE models × three normalizations).
Use real() to access real expression data and synth() to access a synthetic counterpart.
- real(normalization: str = 'DESeq') tuple[DataFrame, Series][source]
Return (expression, groups) for one processed normalization.
- Parameters:
normalization – One of "raw_norm", "TC", or "DESeq" (default). See syng_bts.tcga.VALID_NORMALIZATIONS.
- Returns:
An (expression, groups) tuple where expression is a pandas.DataFrame of shape (n_samples, n_features) and groups is a pandas.Series of group labels aligned to expression.index. To access the per-slice metadata dict, use the underlying Subset directly via dataset.processed[normalization].
- Raises:
ValueError – If normalization is not in syng_bts.tcga.VALID_NORMALIZATIONS.
Example
>>> ds = load_tcga_dataset("BRCA")
>>> real_df, real_groups = ds.real("TC")
>>> real_df.shape
(1144, 570)
- synth(normalization: str = 'DESeq', model: str = 'CVAE1_5') tuple[DataFrame, Series][source]
Return (expression, groups) for one synthetic configuration.
- Parameters:
normalization – One of "raw_norm", "TC", or "DESeq" (default). See syng_bts.tcga.VALID_NORMALIZATIONS.
model – One of "CVAE1_5" (default), "CVAE1_10", or "CVAE1_20". See syng_bts.tcga.VALID_MODELS.
- Returns:
An (expression, groups) tuple where expression is a pandas.DataFrame of shape (n_samples, n_features) and groups is a pandas.Series of group labels aligned to expression.index. To access the per-slice metadata dict (KL weight, epochs trained, etc.), use the underlying Subset directly via dataset.synthetic[normalization][model].
- Raises:
ValueError – If normalization is not in syng_bts.tcga.VALID_NORMALIZATIONS or model is not in syng_bts.tcga.VALID_MODELS.
Example
>>> ds = load_tcga_dataset("BRCA")
>>> synth_df, synth_groups = ds.synth("TC", "CVAE1_5")
>>> synth_df.shape
(1000, 570)
Subset
- class syng_bts.Subset(expression: DataFrame, groups: Series, metadata: dict[str, Any])[source]
Bases: object
An expression subset returned by TCGADataset accessors.
- expression
A pandas.DataFrame of shape (n_samples, n_features). Rows are TCGA samples (indexed by sample barcode); columns are miRNA features.
- Type:
DataFrame
- groups
A pandas.Series aligned to expression.index with categorical group labels (e.g. tumor vs. normal).
- Type:
Series
- metadata
Dict of HDF5 attributes captured at dataset assembly time (e.g. version, normalization, source).
- expression: DataFrame
- groups: Series
Model Classes
Advanced users can access the model classes directly.
Note
These classes are for advanced usage. Most users should use the
experiment functions (generate, pilot_study, transfer)
which handle model creation and training automatically.
AE (Autoencoder)
VAE (Variational Autoencoder)
CVAE (Conditional VAE)
GAN (Generative Adversarial Network)
Package Information
Version and metadata information.
- syng_bts.__version__
The current version of SyNG-BTS.
- syng_bts.__author__
The package authors.
- syng_bts.__license__
The package license (AGPL-3.0).