API Reference

This page documents the complete public API of SyNG-BTS.

Experiment Functions 

These are the main entry points for training generative models and producing synthetic data. All three functions accept data as a pandas DataFrame, a CSV file path, or a bundled dataset name, and return rich result objects.

generate 

syng_bts.generate(data: DataFrame | str | Path, *, name: str | None = None, groups: Series | ndarray | None = None, new_size: int | list[int] = 500, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, val_ratio: float = 0.2, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, use_scheduler: bool = False, step_size: int = 10, gamma: float = 0.5, cap: bool = False, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') → SyngResult

Train a deep generative model and generate synthetic data.

This is the primary entry point for training a single model and generating synthetic samples. It replaces the legacy ApplyExperiment function.

Parameters:

data (DataFrame, str, or Path) – Input data — a pandas DataFrame, a path to a CSV file, or the name of a bundled dataset (e.g. "SKCMPositive_4").
name (str or None) – Short name for output filenames. Derived automatically when None.
groups (pd.Series, np.ndarray, or None) – Optional binary group labels. When provided, these labels take precedence over bundled dataset groups.
new_size (int or list[int]) –
Generation size.
- If int: generate exactly new_size samples.
  For grouped data, counts are split by the input group ratio and rounded to integers.
- If list[int]: explicit grouped counts
  [n_group_0, n_group_1].
For grouped data, group_0 is the base group used by create_labels() (first encountered group value) and group_1 is the other group.
model (str) – Model specification, e.g. "VAE1-10" (parsed into model type and kl_weight).
apply_log (bool) – Apply log2(x + 1) preprocessing.
batch_frac (float) – Batch size as a fraction of sample count.
learning_rate (float) – Optimizer learning rate.

epoch (int or None) –

Fixed epoch count, or None for early stopping.

The interaction between epoch and early_stop_patience:

epoch    early_stop_patience    Behaviour
None     None                   Early stopping ON, patience=30, max 1000 epochs
None     30                     Early stopping ON, patience=30, max 1000 epochs
500      None                   Early stopping OFF, run exactly 500 epochs
500      30                     Early stopping ON, patience=30, max 500 epochs

val_ratio (float) – Validation split ratio (AE family only).
early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30.
off_aug (str or None) – Offline augmentation: "AE_head", "Gaussian_head", or None.
AE_head_num (int) – Fold multiplier for AE-head augmentation.
Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.
use_scheduler (bool) – Enable learning-rate scheduler (AE family).
step_size (int) – Scheduler step size.
gamma (float) – Scheduler gamma.
cap (bool) – Cap generated values to observed range.
random_seed (int) – Random seed for reproducibility.
CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE (512→256→128→64 instead of 256→128→64). Suitable for high-dimensional data like RNA. Ignored for non-CVAE models (default: False).
output_dir (str, Path, or None) – If set, automatically save results to this directory.
verbose (int or str) –
Verbosity level for training output.
- "silent" or 0 — no output during training.
- "minimal" or 1 (default) — print only training summaries and early-stopping messages.
- "detailed" or 2 — print per-epoch progress (epoch number, loss values, elapsed time, learning rate).

Returns:

Rich result object containing generated data, loss log, reconstructed data (AE/VAE/CVAE), model state, and metadata.

Return type:

SyngResult
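The interaction between epoch and early_stop_patience documented in the parameter list above can be restated as a small decision function. This is a minimal sketch for illustration only: resolve_training_plan and TrainingPlan are hypothetical names, not part of syng_bts, and the handling of patience values other than 30 is an assumption extrapolated from the table.

```python
from typing import NamedTuple, Optional


class TrainingPlan(NamedTuple):
    early_stopping: bool
    patience: Optional[int]
    max_epochs: int


def resolve_training_plan(
    epoch: Optional[int], early_stop_patience: Optional[int]
) -> TrainingPlan:
    """Hypothetical helper mirroring the documented table; not library code."""
    if epoch is None:
        # No fixed epoch count: early stopping is on, capped at 1000 epochs.
        patience = 30 if early_stop_patience is None else early_stop_patience
        return TrainingPlan(True, patience, 1000)
    if early_stop_patience is None:
        # Fixed epoch count without patience: run exactly `epoch` epochs.
        return TrainingPlan(False, None, epoch)
    # Both given: early stopping with `epoch` as the upper bound.
    return TrainingPlan(True, early_stop_patience, epoch)


for args in [(None, None), (None, 30), (500, None), (500, 30)]:
    print(args, "->", resolve_training_plan(*args))
```

The four printed rows correspond, in order, to the four rows of the table.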

pilot_study 

syng_bts.pilot_study(data: DataFrame | str | Path, pilot_size: list[int], *, name: str | None = None, groups: Series | ndarray | None = None, n_draws: int = 5, model: str = 'VAE1-10', apply_log: bool = True, batch_frac: float = 0.1, learning_rate: float = 0.0005, epoch: int | None = None, early_stop_patience: int | None = None, off_aug: str | None = None, AE_head_num: int = 2, Gaussian_head_num: int = 9, random_seed: int = 123, CVAE_wide_network: bool = False, output_dir: str | Path | None = None, verbose: int | str = 'minimal') → PilotResult

Sweep over pilot sizes with replicated random draws.

For each pilot size, n_draws random sub-samples are drawn from the original data. A model is trained on each sub-sample and synthetic data equal to n_draws times the sub-sample size is generated.

This replaces the legacy PilotExperiment function.

Parameters:

data (DataFrame, str, or Path) – Input data.
pilot_size (list[int]) – List of pilot sizes to evaluate.
name (str or None) – Short name for output filenames.
groups (pd.Series, np.ndarray, or None) – Optional binary group labels. When provided, these labels take precedence over bundled dataset groups.
n_draws (int) – Number of replicated random draws per pilot size (default: 5). Must be a positive integer.
model (str) – Model specification (e.g. "VAE1-10").
apply_log (bool) – Apply log2(x + 1) preprocessing.
batch_frac (float) – Batch size as a fraction of sample count.
learning_rate (float) – Optimizer learning rate.
epoch (int or None) – Fixed epoch count or None for early stopping. See generate() for the full interaction table.
early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30. See generate() for the full interaction table.
off_aug (str or None) – Offline augmentation mode.
AE_head_num (int) – Fold multiplier for AE-head augmentation.
Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.
random_seed (int) – Base random seed for reproducibility.
CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE — see generate().
output_dir (str, Path, or None) – If set, automatically save results to this directory.
verbose (int or str) – Verbosity level — see generate() for details.

Returns:

Wrapper containing one SyngResult per (pilot_size, draw).

Return type:

PilotResult
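The sweep structure described above (several pilot sizes, n_draws sub-samples each, results keyed by (pilot_size, draw)) can be pictured with the sketch below. It only selects row indices and does not train anything. pilot_draw_indices is a hypothetical helper, and sampling without replacement from a single seeded generator is an assumption; the library may stratify by group or derive per-draw seeds differently.

```python
import random


def pilot_draw_indices(n_rows, pilot_sizes, n_draws=5, random_seed=123):
    """Hypothetical sketch: map (pilot_size, draw) to the rows of one sub-sample."""
    if n_draws <= 0:
        raise ValueError("n_draws must be a positive integer")
    rng = random.Random(random_seed)
    draws = {}
    for size in pilot_sizes:
        for draw in range(1, n_draws + 1):  # draw index is 1-based
            # Assumption: rows are drawn without replacement.
            draws[(size, draw)] = sorted(rng.sample(range(n_rows), size))
    return draws


draws = pilot_draw_indices(n_rows=200, pilot_sizes=[50, 100], n_draws=5)
print(sorted(draws)[:3])     # first few (pilot_size, draw) keys
print(len(draws[(50, 1)]))   # rows selected for pilot size 50, draw 1
```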

transfer 

Train on source data, then fine-tune and generate on target data.

The model is first trained on source_data; its learned state is kept in memory and then fine-tuned on target_data. This is a single-run operation that returns a SyngResult.

This replaces the legacy TransferExperiment function.

Parameters:

source_data (DataFrame, str, or Path) – Pre-training dataset.
target_data (DataFrame, str, or Path) – Fine-tuning / target dataset.
source_name (str or None) – Short name for the source dataset.
target_name (str or None) – Short name for the target dataset.
source_groups (pd.Series, np.ndarray, or None) – Optional binary groups for the source dataset.
target_groups (pd.Series, np.ndarray, or None) – Optional binary groups for the target dataset.
new_size (int or list[int]) –
Generation size for the fine-tuned target model.
- If int: generate exactly new_size samples.
  For grouped data, counts are split by the target input group ratio and rounded to integers.
- If list[int]: explicit grouped counts
  [n_group_0, n_group_1].
For grouped data, group_0 is the base group used by create_labels() (first encountered group value) and group_1 is the other group.
model (str) – Model specification, e.g. "VAE1-10". See generate().
apply_log (bool) – Apply log2(x + 1) preprocessing.
batch_frac (float) – Batch size as a fraction of sample count.
learning_rate (float) – Optimizer learning rate.
epoch (int or None) – Fixed epoch count, or None for early stopping. See generate() for the full interaction table.
early_stop_patience (int or None) – Stop if loss does not improve for this many epochs. When None and epoch is also None, defaults to 30. See generate() for the full interaction table.
off_aug (str or None) – Offline augmentation mode.
AE_head_num (int) – Fold multiplier for AE-head augmentation.
Gaussian_head_num (int) – Fold multiplier for Gaussian-head augmentation.
random_seed (int) – Random seed.
CVAE_wide_network (bool) – Use wider encoder/decoder for CVAE — see generate().
output_dir (str, Path, or None) – If set, save results here.
verbose (int or str) – Verbosity level — see generate() for details.

Returns:

Result from the fine-tuned target-phase model.

Return type:

SyngResult
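The two forms of new_size for grouped data can be illustrated as follows. This is a sketch under stated assumptions, not the library's code: split_new_size is a hypothetical helper, and assigning the rounding remainder to group_1 so that the counts sum to new_size is an assumption (the documentation only says the counts are split by the group ratio and rounded).

```python
def split_new_size(new_size, group_counts):
    """Hypothetical helper: return [n_group_0, n_group_1] for a binary grouping.

    group_counts is [count_group_0, count_group_1] in the input data.
    """
    if isinstance(new_size, int):
        total = sum(group_counts)
        n_group_0 = round(new_size * group_counts[0] / total)
        # Assumption: the remainder goes to group_1 so the total is preserved.
        return [n_group_0, new_size - n_group_0]
    if len(new_size) != 2:
        raise ValueError("expected [n_group_0, n_group_1]")
    return list(new_size)  # explicit grouped counts are used as given


print(split_new_size(500, [30, 70]))         # split by the 30:70 input ratio
print(split_new_size([200, 300], [30, 70]))  # explicit counts
```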

Result Objects 

Experiment functions return result objects that carry generated data, loss logs, reconstructed data, and model state as attributes.

SyngResult 

class syng_bts.SyngResult(generated_data: DataFrame, loss: DataFrame, reconstructed_data: DataFrame | None = None, original_data: DataFrame | None = None, model_state: dict[str, Any] | None = None, metadata: dict[str, Any] = <factory>, original_groups: Series | None = None, generated_groups: Series | None = None, reconstructed_groups: Series | None = None)

Bases: object

Result of a single SyNG-BTS model training and generation run.

generated_data

Synthetic samples with the original column names preserved.

Type: pd.DataFrame

loss

Training loss log (columns depend on the model family).

Type: pd.DataFrame

reconstructed_data

Reconstructions of the input data (AE/VAE/CVAE only).

Type: pd.DataFrame or None

original_data

The full original input data.

Type: pd.DataFrame or None

model_state

The state_dict() of the trained model, suitable for torch.save() / torch.load().

Type: dict or None

metadata

Run parameters and summary statistics, e.g. model name, kl_weight, seed, epoch count, input data dimensions.

Type: dict

original_groups

Group labels for the original input data. Populated when groups were provided or bundled with the dataset.

Type: pd.Series or None

generated_groups

Group labels for the generated data, derived from the label column produced during generation and mapped back to the original group values.

Type: pd.Series or None

reconstructed_groups

Group labels for the reconstructed data (AE/VAE/CVAE only), derived from the label column and mapped back to original group values.

Type: pd.Series or None

Examples

>>> result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5)
>>> result.generated_data.head()
>>> result.save("./my_output/")
>>> figs = result.plot_loss()  # dict[str, Figure]

generate_new_samples(n: int, *, mode: str = 'new') → SyngResult

Generate new synthetic samples from the trained model.

This method reuses the same generation and post-processing path as generate(), applying the same inverse-log transform and column naming.

Parameters:

n (int) – Number of new samples to generate.
mode (str) –
How to incorporate the new samples:
- "new" (default): return a new SyngResult whose generated_data contains only the newly generated samples. All other fields (loss, metadata, model_state, etc.) are copied from self.
- "overwrite": replace self.generated_data with the new samples and return self.
- "append": append the new samples to self.generated_data and return self.

Returns:

The result containing the new samples (see mode).

Return type:

SyngResult

Raises:

ValueError – If model_state is None, arch_params is missing from metadata, or mode is not one of the accepted values.

Examples

>>> result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5)
>>> new_result = result.generate_new_samples(200)
>>> new_result.generated_data.shape[0]
200

>>> # After save/load round-trip:
>>> loaded = SyngResult.load("output/")
>>> more = loaded.generate_new_samples(100, mode="append")
>>> more.generated_data.shape[0]  # original + 100
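The difference between the three mode values is mainly about object identity: "new" returns a separate result, while "overwrite" and "append" modify the existing one and return it. The toy class below sketches that contract with a plain list in place of a DataFrame. ToyResult is a hypothetical stand-in, not the real SyngResult, and the placeholder strings stand in for model output.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyResult:
    """Stand-in for SyngResult holding a plain list instead of a DataFrame."""

    generated_data: list
    metadata: dict = field(default_factory=dict)

    def generate_new_samples(self, n, *, mode="new"):
        if mode not in ("new", "overwrite", "append"):
            raise ValueError(f"unknown mode: {mode!r}")
        samples = [f"sample_{i}" for i in range(n)]  # placeholder for model output
        if mode == "new":
            fresh = copy.copy(self)  # other fields are carried over
            fresh.generated_data = samples
            return fresh
        if mode == "overwrite":
            self.generated_data = samples
        else:  # "append"
            self.generated_data = self.generated_data + samples
        return self


result = ToyResult(generated_data=["a", "b"])
new = result.generate_new_samples(3)                  # separate object
same = result.generate_new_samples(3, mode="append")  # modifies and returns result
print(new is result, same is result, len(result.generated_data))
```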

save(output_dir: str | Path, prefix: str | None = None) → dict[str, Path]

Save all non-None results to output_dir.

Files are written into a single flat directory. CSVs include column headers. Model state is saved as a .pt file. Metadata is written as a human-readable JSON file.

Parameters:

output_dir (str or Path) – Directory to write files into (created if it does not exist).
prefix (str or None) – Optional filename prefix. When None, uses metadata["dataname"] if available, otherwise "syng".

Returns:

Mapping of output type ("generated", "loss", "reconstructed", "model", "metadata") to the written file path.

Return type:

dict[str, Path]

plot_loss(running_average_window: int = 25, x_axis: str = 'epochs') → dict[str, matplotlib.pyplot.Figure]

Plot the training loss curve(s), one figure per loss column.

Each returned figure shows the raw loss series (alpha=0.4) and a running-average overlay.

Parameters:

running_average_window (int) – Window size for the running-average overlay. Must be > 0. Default: 25.
x_axis (str) – "epochs" (default) maps the x-axis to epoch space using metadata["epochs_trained"] (must be present and > 0). "iterations" numbers data points 0…N-1.

Returns:

{loss_column_name: figure} for every column in self.loss.

Return type:

dict[str, matplotlib.figure.Figure]

Raises:

ValueError – If running_average_window ≤ 0, if x_axis is not "iterations" or "epochs", if x_axis="epochs" but metadata["epochs_trained"] is missing or ≤ 0, or if the window is larger than a loss series.
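The running-average overlay and two of the documented error conditions (a non-positive window, or a window longer than the loss series) can be sketched in plain Python. running_average is a hypothetical helper; returning only fully covered windows (no padding at the edges) is an assumption, and the library may align or pad the smoothed line differently.

```python
def running_average(values, window=25):
    """Hypothetical helper: mean over a sliding window of fixed size."""
    if window <= 0:
        raise ValueError("running_average_window must be > 0")
    if window > len(values):
        raise ValueError("window is larger than the loss series")
    window_sum = sum(values[:window])
    averaged = [window_sum / window]
    for i in range(window, len(values)):
        # Slide the window one step: add the new value, drop the oldest.
        window_sum += values[i] - values[i - window]
        averaged.append(window_sum / window)
    return averaged


loss = [8.0, 6.0, 4.0, 2.0, 1.0, 1.0]
print(running_average(loss, window=2))
```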

plot_heatmap(which: str = 'generated', log_scale: bool = True) → matplotlib.pyplot.Figure

Render a seaborn heatmap of the generated, reconstructed, or original data.

Parameters:

which (str) – "generated", "reconstructed", or "original".
log_scale (bool) – If True (default), apply log2(x + 1) scaling to the data before plotting. This compresses wide-ranging values and often produces more readable heatmaps.

Returns:

The heatmap figure (not shown; caller decides when to display).

Return type:

matplotlib.figure.Figure

Raises:

ValueError – If which is "reconstructed" but no reconstructed data exists, or if which is not a recognised value.

summary() → str

Return a short textual summary of this result.

Returns:

A paragraph describing the run dimensions, epoch count, and final loss values.

Return type:

str

classmethod load(directory: str | Path, prefix: str | None = None) → SyngResult

Load a previously saved SyngResult from disk.

Parameters:

directory (str or Path) – Directory that contains the saved files.
prefix (str or None) – The filename stem (everything before _generated.csv). When None, auto-detected from *_generated.csv files in the directory; exactly one match is required.

Returns:

Reconstructed result with all available artifacts.

Return type:

SyngResult

Raises:

FileNotFoundError – If the required *_generated.csv or *_loss.csv file is missing.
ValueError – If prefix is None and zero or more than one *_generated.csv file is found (ambiguous).
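The prefix auto-detection rule (exactly one *_generated.csv file must be present) can be sketched with pathlib. detect_prefix is a hypothetical helper written for illustration, and the file names in the demo are made up; only the _generated.csv and _loss.csv suffixes come from the description above.

```python
import tempfile
from pathlib import Path


def detect_prefix(directory):
    """Hypothetical helper: infer the filename prefix from *_generated.csv."""
    matches = sorted(Path(directory).glob("*_generated.csv"))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one *_generated.csv file, found {len(matches)}"
        )
    # The prefix is everything before the "_generated.csv" suffix.
    return matches[0].name[: -len("_generated.csv")]


with tempfile.TemporaryDirectory() as tmp:
    # Illustrative files only; real output is written by SyngResult.save().
    (Path(tmp) / "demo_generated.csv").write_text("a,b\n1,2\n")
    (Path(tmp) / "demo_loss.csv").write_text("loss\n0.5\n")
    print(detect_prefix(tmp))
```

When a directory holds several saved runs, passing prefix explicitly avoids the ambiguity error.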

PilotResult 

class syng_bts.PilotResult(runs: dict[tuple[int, int], SyngResult] = <factory>, original_data: DataFrame | None = None, metadata: dict[str, Any] = <factory>)

Bases: object

Result of a pilot study run across multiple pilot sizes and draws.

runs

Mapping of (pilot_size, draw_index) → individual run result. draw_index is 1-based and runs from 1 through n_draws (1 through 5 with the default n_draws=5).

Type: dict[tuple[int, int], SyngResult]

original_data

The full original input data (before subsetting).

Type: pd.DataFrame or None

metadata

Shared metadata across all runs (model, data dimensions, etc.).

Type: dict

Examples

>>> result = pilot_study(data="SKCMPositive_4", pilot_size=[50, 100], ...)
>>> result.runs[(50, 1)].generated_data.head()
>>> result.save("./pilot_output/")

save(output_dir: str | Path, prefix: str | None = None) → dict[tuple[int, int], dict[str, Path]]

Save all individual run results to output_dir.

Each run is saved with a filename that encodes the pilot size and draw index.

Parameters:

output_dir (str or Path) – Directory to write files into (created if it does not exist).
prefix (str or None) – Optional filename prefix. Falls back to metadata["dataname"] or "syng".

Returns:

Nested mapping: (pilot_size, draw) → {output_type → path}.

Return type:

dict[tuple[int, int], dict[str, Path]]

plot_loss(style: str = 'overlay_runs', running_average_window: int = 25, x_axis: str = 'epochs', truncate: bool = True) → dict[tuple[int, int], dict[str, matplotlib.pyplot.Figure]] | dict[str, matplotlib.pyplot.Figure]

Plot loss curves for every run in the pilot study.

Parameters:

style (str) –
Plotting style for loss trajectories.
- "per_run": one figure per run per loss column, delegating to SyngResult.plot_loss().
- "overlay_runs" (default, per the signature): overlay all runs on the same plot for each loss column. Only the running-average line is drawn per run (no raw trace) to keep the plot readable.
- "mean_band": plot the mean loss trajectory across all runs for each loss column, with a shaded ±1 std band. Mean and std are computed on raw loss values; the mean line is then optionally smoothed with a running average.
For all styles, y-axis scaling is applied to reduce the effect of large initial spikes (analogous to SyngResult.plot_loss()).
running_average_window (int) – Window size for the running-average overlay. Must be > 0. Default: 25.
x_axis (str) – "epochs" (default) maps the x-axis to epoch space using each run’s metadata["epochs_trained"]. "iterations" numbers data points 0…N-1.
truncate (bool) – Only relevant for style="mean_band" and style="overlay_runs". When True (default), only epochs/iterations common to all runs are plotted (truncated to the shortest run). When False, all epochs/iterations are plotted; statistics are computed from whichever runs still have data at each point.

Returns:

style="per_run": nested dict keyed by (pilot_size, draw) → {column: Figure}. style="overlay_runs" or style="mean_band": flat dict {column: Figure}.

Return type:

dict[tuple[int, int], dict[str, Figure]] or dict[str, Figure]

Raises:

ValueError – If style is not one of the accepted values, if running_average_window ≤ 0, or if x_axis is invalid.

Examples

>>> figs = pilot_result.plot_loss(style="overlay_runs")
>>> figs = pilot_result.plot_loss(style="mean_band", truncate=False)

summary() → str

Return an aggregate summary of all pilot runs.

Returns:

Multi-line summary with one line per run.

Return type:

str

Evaluation Functions 

Functions for evaluating and visualizing generated data.

heatmap_eval 

syng_bts.heatmap_eval(real_data: DataFrame, generated_data: DataFrame | None = None, *, apply_log: bool = True, cmap: str = 'YlGnBu') → matplotlib.figure.Figure

Create a heatmap visualization comparing real and generated data.

If only one dataset is provided, displays a single heatmap. If both real and generated data are provided, displays them side by side.

Parameters:

real_data (pd.DataFrame) – The original/real data.
generated_data (pd.DataFrame or None, optional) – The generated/synthetic data. If None, only real_data is plotted.
apply_log (bool, default True) – Whether to apply log2(x + 1) transformation to both real and generated data before visualization.
cmap (str, default "YlGnBu") – Colormap passed to seaborn.heatmap().

Returns:

The matplotlib Figure containing the heatmap(s).

Return type:

Figure
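The log2(x + 1) transform referred to by apply_log throughout this page maps a count of zero to zero and compresses large values. A scalar sketch of the transform and its inverse is shown below; the two function names are hypothetical, and the library applies the transform to whole data frames rather than single numbers.

```python
import math


def log2_plus_one(x):
    """Forward transform: log2(x + 1), defined for x >= 0."""
    return math.log2(x + 1)


def inverse_log2_plus_one(y):
    """Inverse transform: 2**y - 1."""
    return 2 ** y - 1


for count in [0, 1, 1023]:
    transformed = log2_plus_one(count)
    print(count, transformed, inverse_log2_plus_one(transformed))
```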

UMAP_eval 

syng_bts.UMAP_eval(real_data: DataFrame, generated_data: DataFrame | None = None, *, apply_log: bool = True, groups_real: Series | None = None, groups_generated: Series | None = None, random_seed: int = 42, legend_pos: str = 'best') → matplotlib.figure.Figure

Create a UMAP visualization comparing real and generated data.

Uses UMAP dimensionality reduction to visualize high-dimensional data in 2D, with optional group colouring.

Parameters:

real_data (pd.DataFrame) – The original/real data.
generated_data (pd.DataFrame or None, optional) – The generated/synthetic data. If None, only real_data is visualised.
apply_log (bool, default True) – Whether to apply log2(x + 1) transformation to both real and generated data before dimensionality reduction.
groups_real (pd.Series or None, optional) – Group labels for real samples. Used for styling.
groups_generated (pd.Series or None, optional) – Group labels for generated samples. Used for styling.
random_seed (int, default 42) – Random seed for UMAP reproducibility.
legend_pos (str, default "best") – Legend position ("best", "upper right", "lower left", …).

Returns:

The matplotlib Figure containing the UMAP scatter plot.

Return type:

Figure

evaluation 

Preprocessing and visualization of generated vs real data.

Loads and preprocesses the input data, then creates heatmap and UMAP visualizations comparing generated and real datasets.

Parameters:

real_data (pd.DataFrame, str, or Path) – The original/real dataset. Accepts a DataFrame, a file path, or a bundled dataset name (resolved via resolve_data()).
generated_data (pd.DataFrame, str, or Path) – The generated/synthetic dataset. Same input types as real_data.
real_groups (pd.Series, np.ndarray, list, tuple, pd.Index, or None, optional) – Group labels for the real samples. When provided, takes precedence over any bundled groups resolved from real_data. Values are used as-is for plot labels (converted to str).
generated_groups (pd.Series, np.ndarray, list, tuple, pd.Index, or None, optional) – Group labels for the generated samples. When provided, takes precedence over any bundled groups resolved from generated_data. Values are used as-is for plot labels (converted to str).
n_samples (int or None, default 200) – Number of samples from each end of the dataset to use for visualization (to keep UMAP fast). If None, all samples are used.
apply_log (bool, default True) – Whether to apply log2(x + 1) transformation to both real and generated data before comparison.
random_seed (int, default 42) – Random seed for UMAP reproducibility.

Returns:

{"heatmap": <Figure>, "umap": <Figure>} — the two evaluation figures. Neither figure has been displayed; the caller decides when to call plt.show() or fig.savefig().

Return type:

dict[str, Figure]

Sample-Size Evaluation (SyntheSize)

Classifier-based sample-size evaluation using inverse power-law learning curves. See Sample-Size Evaluation (SyntheSize) for the full usage guide.

evaluate_sample_sizes 

Evaluate classifiers across candidate sample sizes.

For each classifier and candidate sample size, performs n_draws rounds of stratified sampling proportional to the input class distribution. When no external test set is supplied, metrics are averaged over 5-fold stratified cross-validation. When test_data and test_groups are supplied, each classifier is trained on the complete candidate subset and evaluated once on those fixed external rows.

The returned total_size is the candidate subset size. Internal cross-validation trains each fold on about 80% of that subset; external evaluation trains on the complete subset.

Parameters:

data (pd.DataFrame or SyngResult) – The dataset to evaluate. When a SyngResult is provided, the which parameter selects the data attribute and groups are auto-resolved from the corresponding *_groups field.
sample_sizes (list[int], np.ndarray, pd.Series, or int) – Candidate sample sizes to evaluate. Accepts a list, numpy array, or pandas Series of positive integers. When a single int is provided it is interpreted as the number of equidistant sizes to create — the maximum equals the number of data rows. For example, sample_sizes=3 with 15-row data produces [5, 10, 15]. The grid count cannot exceed the number of data rows.
groups (array-like or None) – Class labels corresponding to the rows of data. Required when data is a pd.DataFrame. When provided alongside a SyngResult, overrides the auto-resolved groups.
which (str, default "generated") – Selector when data is a SyngResult: "generated", "original", or "reconstructed".
n_draws (int, default 5) – Number of resampling repetitions for each sample size.
apply_log (bool, default True) – When True, a log2(x + 1) transform is applied to the candidate and external data before evaluation.
methods (list[str] or None) – Classifier names to evaluate. Accepts canonical names ('LOGIS', 'SVM', 'KNN', 'RF', 'XGB') and common aliases ('LOGISTIC', 'LR', 'RANDOM_FOREST', 'XGBOOST'). Defaults to all five classifiers.
verbose (int or str, default "minimal") – Controls output verbosity. Accepts 0 / "silent" (no output), 1 / "minimal" (one dynamic overall progress bar across all sample sizes, draws, and methods), or 2 / "detailed" (per-draw/method metric lines).
test_data (pd.DataFrame or None) – Fixed external evaluation data. Must have the same feature columns as data. When supplied, test_groups is also required. External rows are transformed using preprocessing fitted on each candidate subset.
test_groups (array-like or None) – Class labels corresponding to the rows of test_data. Must be supplied together with test_data and use labels present in groups.
random_seed (int or None) – Seed for candidate sampling, shuffled cross-validation, and stochastic classifiers.

Returns:

Columns: total_size, draw, method, f1_score, accuracy, auc.

Return type:

pd.DataFrame

Raises:

TypeError – If data is not a pd.DataFrame or SyngResult, or supplied test_data is not a pd.DataFrame.
ValueError – If groups is missing when required, which is invalid, methods contains unknown names, sample_sizes is empty or contains non-positive values, or any sample size exceeds the number of available rows. Also raised when only one external argument is supplied or the external rows, labels, or feature columns are incompatible, or when numerical values are invalid.
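The canonical classifier names and aliases accepted by methods suggest a simple lookup before evaluation. The sketch below is one way to picture it. normalize_methods is a hypothetical helper, the alias-to-canonical mapping is inferred from the names listed above, and exact (case-sensitive) matching is an assumption.

```python
CANONICAL = ("LOGIS", "SVM", "KNN", "RF", "XGB")

# Assumed mapping of the documented aliases onto canonical names.
ALIASES = {
    "LOGISTIC": "LOGIS",
    "LR": "LOGIS",
    "RANDOM_FOREST": "RF",
    "XGBOOST": "XGB",
}


def normalize_methods(methods=None):
    """Hypothetical helper: resolve aliases and reject unknown names."""
    if methods is None:
        return list(CANONICAL)  # default: all five classifiers
    resolved = []
    for name in methods:
        key = ALIASES.get(name, name)
        if key not in CANONICAL:
            raise ValueError(f"unknown classifier name: {name!r}")
        resolved.append(key)
    return resolved


print(normalize_methods(["LR", "RANDOM_FOREST", "SVM"]))
```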

Examples

Using a DataFrame:

>>> df = pd.read_csv("mydata.csv")
>>> groups = df.pop("group")
>>> result = evaluate_sample_sizes(df, sample_sizes=[50, 100], groups=groups)

Using a SyngResult:

>>> from syng_bts import generate
>>> sr = generate(data="BRCASubtypeSel_test", model="CVAE1-20", epoch=10)
>>> result = evaluate_sample_sizes(sr, sample_sizes=[50], which="generated")

Evaluating candidate data on a fixed empirical test set:

>>> result = evaluate_sample_sizes(
...     df,
...     sample_sizes=[50, 100],
...     groups=groups,
...     test_data=empirical_test,
...     test_groups=empirical_test_groups,
... )
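When sample_sizes is a single int, it is expanded into an equidistant grid whose maximum equals the number of rows. The sketch below reproduces the documented case (3 sizes over 15 rows). size_grid is a hypothetical helper, and the rounding used when the row count is not divisible by the grid count is an assumption.

```python
def size_grid(sample_sizes, n_rows):
    """Hypothetical helper: expand an int into equidistant candidate sizes."""
    if isinstance(sample_sizes, int):
        if not 0 < sample_sizes <= n_rows:
            raise ValueError("grid count must be between 1 and the number of rows")
        # Assumption: sizes are rounded to the nearest integer.
        return [round(n_rows * i / sample_sizes) for i in range(1, sample_sizes + 1)]
    sizes = [int(s) for s in sample_sizes]
    if not sizes or min(sizes) <= 0 or max(sizes) > n_rows:
        raise ValueError("sample sizes must be positive and at most n_rows")
    return sizes


print(size_grid(3, 15))           # the documented case: 3 sizes over 15 rows
print(size_grid([50, 100], 200))  # explicit sizes pass through unchanged
```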

plot_sample_sizes 

syng_bts.plot_sample_sizes(metric_real: DataFrame, metric_generated: DataFrame | None = None, metric_name: str = 'f1_score', y_limits: tuple[float, float] | None = (0.4, 1)) → matplotlib.pyplot.Figure

Visualize inverse power-law function (IPLF) learning curves fitted from evaluation metrics.

Fits weighted inverse power-law curves to the evaluation metrics produced by evaluate_sample_sizes() and plots observed values, fitted curves, and approximate pointwise 95% confidence intervals for the fitted mean curves. These bands are not prediction intervals. Three distinct sample sizes are sufficient to fit the curve, but at least four fitted points are required to estimate parameter covariance and display a confidence band.

The returned figure is never displayed automatically — call fig.savefig(...) or plt.show() explicitly to display or save.

Parameters:

metric_real (pd.DataFrame) – Metrics from evaluate_sample_sizes() on real data.
metric_generated (pd.DataFrame or None) – Metrics from evaluate_sample_sizes() on generated data. When provided, a second column of panels is added.
metric_name (str, default "f1_score") – Metric to visualize ("f1_score", "accuracy", or "auc").
y_limits (tuple of float or None, default (0.4, 1)) – Limits applied to the y-axis of every panel. Set to None to use Matplotlib’s automatic scaling.

Returns:

The figure containing the learning-curve panels.

Return type:

matplotlib.figure.Figure

Examples

>>> metrics = evaluate_sample_sizes(df, [50, 100, 150, 200], groups=g)
>>> fig = plot_sample_sizes(metrics)
>>> fig.savefig("learning_curves.png")
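For intuition about what an inverse power-law learning curve looks like, the sketch below evaluates one common parameterization. The exact functional form and fitting weights used by plot_sample_sizes() are not stated on this page, so both the formula and the parameter values here are assumptions chosen purely for illustration.

```python
def inverse_power_law(n, a, b, c):
    """Assumed form: metric(n) = (1 - a) - b * n**c, with c < 0.

    (1 - a) is the plateau approached as n grows; b and c control how
    quickly the curve rises toward it.
    """
    return (1.0 - a) - b * n ** c


# Hypothetical parameters, not fitted to any data.
a, b, c = 0.05, 1.0, -0.5
for n in [25, 100, 400]:
    print(n, round(inverse_power_law(n, a, b, c), 3))
```

Under this form the predicted metric increases with sample size and levels off below 1 - a, which matches the diminishing-returns shape of a typical learning curve.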

Data Utilities 

Functions for loading and managing datasets.

resolve_data 

syng_bts.resolve_data(data: DataFrame | str | Path) → tuple[DataFrame, Series | None]

Resolve a flexible data input to a pandas DataFrame and optional groups.

Accepts a DataFrame (returned as-is with None groups), a file path (loaded via pd.read_csv / pd.read_parquet), or the name of a bundled dataset.

Parameters:

data (pd.DataFrame, str, or Path) –

One of:

A pd.DataFrame — returned directly with groups None.
A str or Path pointing to an existing CSV or Parquet file (must include an extension such as .csv or .parquet).
A plain name (no extension, no path separators) of a bundled dataset, e.g. "SKCMPositive_4".

Returns:

(features_df, groups_or_none). Groups are a pd.Series only when the input is a bundled dataset that ships with a groups sidecar. For user-provided files and DataFrames, groups are always None.

Return type:

tuple[pd.DataFrame, pd.Series | None]

Raises:

ValueError – If data looks like a bundled-dataset name but is not found in the registry. The error message lists all available bundled datasets.
FileNotFoundError – If data looks like a file path but the file does not exist.
TypeError – If data is not a DataFrame, str, or Path.

Examples

>>> from syng_bts.data_utils import resolve_data
>>> df, groups = resolve_data("SKCMPositive_4")          # bundled
>>> df, groups = resolve_data("./my_data/custom.csv")    # file path
>>> df, groups = resolve_data(existing_dataframe)         # pass-through
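The rule that separates a file path from a bundled-dataset name (a bundled name has no extension and no path separators) can be sketched as below. classify_input is a hypothetical helper covering only str and Path inputs; it does not touch the file system or the dataset registry, and the real function may apply additional checks.

```python
from pathlib import Path


def classify_input(data):
    """Hypothetical helper: return 'path' or 'bundled' for a str/Path input."""
    if not isinstance(data, (str, Path)):
        raise TypeError("expected a str or Path")
    text = str(data)
    has_separator = "/" in text or "\\" in text
    has_extension = Path(text).suffix != ""
    # A plain name with neither a separator nor an extension is a bundled name.
    return "path" if (has_separator or has_extension) else "bundled"


for candidate in ["SKCMPositive_4", "./my_data/custom.csv", "custom.parquet"]:
    print(candidate, "->", classify_input(candidate))
```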

list_bundled_datasets 

syng_bts.list_bundled_datasets() → list

List all available bundled datasets.

Returns:

List of dataset names that can be loaded with resolve_data().

Return type:

list

TCGA Datasets 

The TCGA loader downloads, caches, and exposes 24 packaged TCGA miRNA cohorts (raw + normalized + CVAE-synthesized). See TCGA Datasets for the narrative guide and TCGA Quick Start for a runnable example.

load_tcga_dataset 

syng_bts.load_tcga_dataset(name: str, *, force: bool = False, manifest_url: str | None = None) → TCGADataset

Download (if needed) and return a TCGA cohort as a TCGADataset.

On first call for a given dataset, the loader fetches the manifest, downloads the corresponding HDF5 file, verifies its sha256, and caches the file under tcga_cache_dir() / <version>. Subsequent calls reuse the cached file.

Parameters:

name – Cohort code (e.g. "BRCA") or full dataset name from the manifest. Cancer-type prefixes resolve to the canonical entry when unambiguous.
force – If True, redownload even when a cached file exists.
manifest_url – Override the published manifest URL (useful for staging mirrors). Defaults to the data-v1.0 release manifest.

Returns:

A TCGADataset exposing TCGADataset.real() and TCGADataset.synth() accessors.

Raises:

ValueError – If name does not match any cohort in the manifest, if the cached HDF5 file is corrupt (pass force=True to redownload), or if the sha256 checksum fails twice after retry.
OSError – If the manifest or HDF5 file cannot be downloaded due to a network failure.

Example

>>> from syng_bts import load_tcga_dataset
>>> ds = load_tcga_dataset("BRCA")
>>> real_df, real_groups = ds.real()
>>> real_df.shape
(1144, 570)
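The sha256 verification step mentioned above amounts to hashing the downloaded file and comparing the digest with the value published in the manifest. A minimal sketch using hashlib follows. sha256_of and verify_download are hypothetical helpers, the file written in the demo is a placeholder rather than a real HDF5 file, and the loader's retry logic is not shown.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Hypothetical helper: True when the file matches the expected digest."""
    return sha256_of(path) == expected_sha256.lower()


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "cohort.h5"  # placeholder bytes, not real HDF5 content
    target.write_bytes(b"example payload")
    expected = hashlib.sha256(b"example payload").hexdigest()
    print(verify_download(target, expected))  # matching digest
    print(verify_download(target, "0" * 64))  # mismatching digest
```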

list_tcga_datasets 

syng_bts.list_tcga_datasets(*, short: bool = False, manifest_url: str | None = None) → list[str]

Return the names of all TCGA cohorts in the published manifest.

Parameters:

short – If True, return short cohort codes (e.g. "BRCA"). Otherwise return the full manifest dataset names.
manifest_url – Override the published manifest URL. Defaults to the data-v1.0 release manifest.

Returns:

A list of dataset names. With short=False (default), names are the full manifest entries; with short=True, the leading cohort code.

Example

>>> from syng_bts import list_tcga_datasets
>>> list_tcga_datasets(short=True)[:3]
['ACC', 'BLCA', 'BRCA']

tcga_cache_dir 

syng_bts.tcga_cache_dir() → Path

Return the active TCGA cache directory (without the version subdir).

Honors the SYNG_BTS_CACHE_DIR environment variable if set; otherwise returns ~/.cache/syng-bts/tcga. The directory is not created by this call.

Returns:

The cache root for TCGA datasets. Versioned dataset files live under tcga_cache_dir() / <manifest-version>.

Example

>>> from syng_bts import tcga_cache_dir
>>> tcga_cache_dir()
PosixPath('/home/alice/.cache/syng-bts/tcga')
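The lookup order described above can be sketched as a standalone function. The name resolve_cache_dir is hypothetical, and the sketch assumes the environment variable's value is used as the cache root as-is; the source does not say whether anything is appended to it.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_cache_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Pick the cache root without creating it on disk."""
    override = env.get("SYNG_BTS_CACHE_DIR")
    if override:
        # Assumption: the override is taken as the cache root unchanged.
        return Path(override).expanduser()
    # Documented default location.
    return Path.home() / ".cache" / "syng-bts" / "tcga"
```

Passing the environment as a mapping keeps the function easy to exercise without mutating os.environ.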

clear_tcga_cache 

syng_bts.clear_tcga_cache() → None

Remove the entire TCGA cache directory.

Deletes tcga_cache_dir() recursively. The next call to load_tcga_dataset() will redownload from the manifest. Use this for full cleanup; for per-dataset redownload, prefer load_tcga_dataset() with force=True.

Example

>>> from syng_bts import clear_tcga_cache
>>> clear_tcga_cache()
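A recursive cleanup of this kind might look roughly like the following. The helper name remove_cache is hypothetical, and treating a missing directory as a no-op is an assumption rather than documented behavior.

```python
import shutil
from pathlib import Path


def remove_cache(cache_root: Path) -> None:
    """Delete cache_root and everything under it; do nothing if it is absent."""
    if cache_root.exists():
        # Removes every versioned subdirectory and cached file in one pass.
        shutil.rmtree(cache_root)
```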

TCGADataset 

class syng_bts.TCGADataset(*, name: str, cancer_type: str, clinical_variable: str, group_labels: list[str], n_raw_samples: int, n_filtered_samples: int, n_raw_features: int, n_filtered_features: int, schema_version: str, creation_date: str, syng_bts_version: str, raw: Subset, processed: dict[str, Subset], synthetic: dict[str, dict[str, Subset]])

Bases: object

A loaded TCGA cohort with real and synthetic accessors.

Returned by load_tcga_dataset(). Wraps a single HDF5 file containing the raw expression matrix, three normalizations (raw_norm, TC, DESeq), and nine synthetic groups (three CVAE models × three normalizations).

Use real() to access real expression data and synth() to access a synthetic counterpart.

real(normalization: str = 'DESeq') → tuple[DataFrame, Series]

Return (expression, groups) for one processed normalization.

Parameters:

normalization – One of "raw_norm", "TC", or "DESeq" (default). See syng_bts.tcga.VALID_NORMALIZATIONS.

Returns:

A (expression, groups) tuple where expression is a pandas.DataFrame of shape (n_samples, n_features) and groups is a pandas.Series of group labels aligned to expression.index. To access the per-slice metadata dict, use the underlying Subset directly via dataset.processed[normalization].

Raises:

ValueError – If normalization is not in syng_bts.tcga.VALID_NORMALIZATIONS.

Example

>>> ds = load_tcga_dataset("BRCA")
>>> real_df, real_groups = ds.real("TC")
>>> real_df.shape
(1144, 570)

synth(normalization: str = 'DESeq', model: str = 'CVAE1_5') → tuple[DataFrame, Series]

Return (expression, groups) for one synthetic configuration.

Parameters:

normalization – One of "raw_norm", "TC", or "DESeq" (default). See syng_bts.tcga.VALID_NORMALIZATIONS.
model – One of "CVAE1_5" (default), "CVAE1_10", or "CVAE1_20". See syng_bts.tcga.VALID_MODELS.

Returns:

A (expression, groups) tuple where expression is a pandas.DataFrame of shape (n_samples, n_features) and groups is a pandas.Series of group labels aligned to expression.index. To access the per-slice metadata dict (KL weight, epochs trained, etc.), use the underlying Subset directly via dataset.synthetic[normalization][model].

Raises:

ValueError – If normalization is not in syng_bts.tcga.VALID_NORMALIZATIONS or model is not in syng_bts.tcga.VALID_MODELS.

Example

>>> ds = load_tcga_dataset("BRCA")
>>> synth_df, synth_groups = ds.synth("TC", "CVAE1_5")
>>> synth_df.shape
(1000, 570)
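The validate-then-index behavior of synth() can be illustrated without the library. In this sketch select_synthetic is a hypothetical helper, the two tuples mirror the documented option lists (their real container type in syng_bts.tcga is an assumption), and placeholder strings stand in for Subset objects.

```python
from itertools import product
from typing import Any

# Values copied from the documented options; the tuple type is an assumption.
VALID_NORMALIZATIONS = ("raw_norm", "TC", "DESeq")
VALID_MODELS = ("CVAE1_5", "CVAE1_10", "CVAE1_20")


def select_synthetic(
    synthetic: dict[str, dict[str, Any]],
    normalization: str = "DESeq",
    model: str = "CVAE1_5",
) -> Any:
    """Validate both keys, then index synthetic[normalization][model]."""
    if normalization not in VALID_NORMALIZATIONS:
        raise ValueError(f"normalization must be one of {VALID_NORMALIZATIONS}")
    if model not in VALID_MODELS:
        raise ValueError(f"model must be one of {VALID_MODELS}")
    return synthetic[normalization][model]


# Placeholder strings stand in for Subset objects in this sketch.
synthetic = {
    norm: {model: f"{norm}/{model}" for model in VALID_MODELS}
    for norm in VALID_NORMALIZATIONS
}

# Three normalizations x three models gives the nine synthetic configurations.
all_configs = list(product(VALID_NORMALIZATIONS, VALID_MODELS))
```

Iterating over all_configs is one way to compare every synthetic configuration against its real counterpart for the same normalization.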

Subset 

class syng_bts.Subset(expression: DataFrame, groups: Series, metadata: dict[str, Any])

Bases: object

An expression subset returned by TCGADataset accessors.

expression

A pandas.DataFrame of shape (n_samples, n_features). Rows are TCGA samples (indexed by sample barcode); columns are miRNA features.

Type: pandas.DataFrame

groups

A pandas.Series aligned to expression.index with categorical group labels (e.g. tumor vs. normal).

Type: pandas.Series

metadata

Dict of HDF5 attributes captured at dataset assembly time (e.g. version, normalization, source).

Type: dict[str, Any]

expression: DataFrame

groups: Series

metadata: dict[str, Any]

__init__(expression: DataFrame, groups: Series, metadata: dict[str, Any]) → None

Model Classes 

Advanced users can access the model classes directly.

Note

These classes are for advanced usage. Most users should use the experiment functions (generate, pilot_study, transfer) which handle model creation and training automatically.

AE (Autoencoder)

class syng_bts.AE(*args: Any, **kwargs: Any)

Bases: Module

__init__(num_features)

forward(x)

VAE (Variational Autoencoder)

class syng_bts.VAE(*args: Any, **kwargs: Any)

Bases: Module

__init__(num_features)

reparameterize(z_mu, z_log_var, deterministic=False)

encoding_fn(x, deterministic=False)

forward(x, deterministic=False)
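The reparameterize signature suggests the standard VAE reparameterization trick. The sketch below shows that trick on plain Python lists; it is not the class's tensor-based implementation, and both the rng parameter and the reading of deterministic=True as "return the mean without sampling" are assumptions.

```python
import math
import random


def reparameterize(z_mu, z_log_var, deterministic=False, rng=random):
    """Sample z = mu + eps * exp(log_var / 2) elementwise over plain lists."""
    if deterministic:
        # Assumption: deterministic mode skips sampling and returns the mean.
        return list(z_mu)
    return [
        # eps ~ N(0, 1); exp(0.5 * log_var) is the standard deviation.
        mu + rng.gauss(0.0, 1.0) * math.exp(0.5 * log_var)
        for mu, log_var in zip(z_mu, z_log_var)
    ]
```

Writing the sample as a deterministic function of the mean, the log-variance, and independent noise is what lets gradients flow through the sampling step during training.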

CVAE (Conditional VAE)

class syng_bts.CVAE(*args: Any, **kwargs: Any)

Bases: Module

__init__(num_features, num_classes, wide_network=False)

reparameterize(z_mu, z_log_var, deterministic=False)

encoding_fn(x, y, deterministic=False)

decoding_fn(encoded, y)

forward(x, y, deterministic=False)

GAN (Generative Adversarial Network)

class syng_bts.GAN(*args: Any, **kwargs: Any)

Bases: Module

__init__(num_features, latent_dim=32)

generator_forward(z)

discriminator_forward(img)

Package Information 

Version and metadata information.

syng_bts.__version__: The current version of SyNG-BTS.

syng_bts.__author__: The package authors.

syng_bts.__license__: The package license (AGPL-3.0).