Parallel Execution with Multiple Seeds
kmeanssa-ng provides utilities for running multiple simulated
annealing instances in parallel with different random seeds. This is
useful for:
- Finding better solutions: Running multiple times increases the chance of finding the global optimum
- Statistical analysis: Analyzing the variability of clustering results across different initializations
- Reproducible experiments: Using fixed seeds for consistent results across runs
Principle
The parallel execution uses Python’s ProcessPoolExecutor to achieve
true parallelism across multiple CPU cores. Each run:
- Executes in a separate process with its own random seed
- Runs the complete simulated annealing algorithm independently
- Returns the final centers and energy
Results are collected as they complete, sorted by energy (ascending), and the best result is returned by default.
Basic Usage
The simplest way to run multiple executions in parallel:
from kmeanssa_ng import run_parallel, generate_sbm
from kmeanssa_ng.quantum_graph.sampling import UniformNodeSampling
from kmeanssa_ng.core.strategies.initialization import KMeansPlusPlus
from kmeanssa_ng.core.strategies.robustification import MinimizeEnergy
# Generate a graph with two communities
graph = generate_sbm(sizes=[50, 50], p=[[0.7, 0.1], [0.1, 0.7]])
# Run 10 times in parallel (uses all available CPU cores)
# Each run samples its own 100 points with a different seed
# Note: mp_context='fork' is needed for Jupyter/Quarto compatibility
best_centers = run_parallel(
graph,
n_points=100,
k=2,
sampling_strategy=UniformNodeSampling(),
initialization_strategy=KMeansPlusPlus(),
robustification_strategy=MinimizeEnergy(),
n_runs=10,
mp_context='fork'
)
print(f"Best result has {len(best_centers)} centers")
Best result has 2 centers
Parameters:
space: The metric space (e.g., QuantumGraph) to sample points fromn_points: Number of points to sample for each runk: Number of clustersn_runs: Number of parallel executions (default: 10)n_jobs: Number of worker processes (default: -1 = all cores)algorithm:"interleaved"(default) or"sequential"- Other parameters are passed to
SimulatedAnnealing(lambda0, beta0, step_size, robust_prop)
Note: Each run samples its own observations, ensuring complete independence between runs.
Retrieving All Results
To analyze the distribution of results across different seeds:
best_centers, all_results = run_parallel(
graph,
n_points=100,
k=2,
sampling_strategy=UniformNodeSampling(),
initialization_strategy=KMeansPlusPlus(),
robustification_strategy=MinimizeEnergy(),
n_runs=20,
return_all=True,
mp_context='fork'
)
# all_results is a list of (centers, energy, seed) sorted by energy
print(f"Best energy: {all_results[0][1]:.4f}")
print(f"Worst energy: {all_results[-1][1]:.4f}")
# Analyze energy distribution
import numpy as np
energies = [energy for _, energy, _ in all_results]
print(f"Mean energy: {np.mean(energies):.4f}")
print(f"Std energy: {np.std(energies):.4f}")
Best energy: 0.9953
Worst energy: 2.3519
Mean energy: 1.7327
Std energy: 0.3083
Progress Monitoring
For long-running experiments, use run_parallel_with_callback to
monitor progress:
from kmeanssa_ng import run_parallel_with_callback
def show_progress(run_idx, seed, energy):
print(f"✓ Run {run_idx+1}/20 completed - Energy: {energy:.4f} (seed={seed})")
centers = run_parallel_with_callback(
graph,
n_points=100,
k=2,
sampling_strategy=UniformNodeSampling(),
initialization_strategy=KMeansPlusPlus(),
robustification_strategy=MinimizeEnergy(),
n_runs=20,
callback=show_progress,
mp_context='fork'
)
✓ Run 1/20 completed - Energy: 2.1318 (seed=1254015729)
✓ Run 3/20 completed - Energy: 1.6800 (seed=72255520)
✓ Run 8/20 completed - Energy: 2.1800 (seed=1598036577)
✓ Run 2/20 completed - Energy: 1.8600 (seed=136511159)
✓ Run 11/20 completed - Energy: 1.8900 (seed=1276180632)
✓ Run 12/20 completed - Energy: 1.7300 (seed=549712174)
✓ Run 7/20 completed - Energy: 1.9604 (seed=104885592)
✓ Run 13/20 completed - Energy: 0.9723 (seed=880479386)
✓ Run 10/20 completed - Energy: 1.9779 (seed=716260182)
✓ Run 14/20 completed - Energy: 2.4033 (seed=1712815882)
✓ Run 4/20 completed - Energy: 1.7400 (seed=1147572173)
✓ Run 15/20 completed - Energy: 1.7900 (seed=973758642)
✓ Run 6/20 completed - Energy: 1.5500 (seed=1392537398)
✓ Run 16/20 completed - Energy: 1.7700 (seed=1709002607)
✓ Run 17/20 completed - Energy: 2.0923 (seed=684444004)
✓ Run 18/20 completed - Energy: 1.8186 (seed=510707922)
✓ Run 19/20 completed - Energy: 1.2153 (seed=1665302718)
✓ Run 20/20 completed - Energy: 1.9956 (seed=1850121183)
✓ Run 5/20 completed - Energy: 2.2300 (seed=414769320)
✓ Run 9/20 completed - Energy: 1.6500 (seed=2060771684)
The callback function receives three arguments:
run_idx: Index of the completed run (0 to n_runs-1)seed: Random seed used for this runenergy: Final energy (k-means objective) for this run
Reproducible Experiments
To ensure reproducibility, provide a fixed list of seeds:
# Define specific seeds
seeds = [42, 123, 456, 789, 101112]
# These runs will produce consistent results
result1 = run_parallel(graph, n_points=100, k=2, sampling_strategy=UniformNodeSampling(), initialization_strategy=KMeansPlusPlus(), robustification_strategy=MinimizeEnergy(), n_runs=5, seeds=seeds, mp_context='fork')
result2 = run_parallel(graph, n_points=100, k=2, sampling_strategy=UniformNodeSampling(), initialization_strategy=KMeansPlusPlus(), robustification_strategy=MinimizeEnergy(), n_runs=5, seeds=seeds, mp_context='fork')
# Results will be highly similar (same data, same initialization)
Note: While individual runs are deterministic given a seed, the final result may vary slightly due to the order in which parallel processes complete. However, the energy distribution should be consistent.
Advanced Parameters
You can pass all SimulatedAnnealing parameters:
centers = run_parallel(
graph,
n_points=150,
k=3,
sampling_strategy=UniformNodeSampling(),
initialization_strategy=KMeansPlusPlus(),
robustification_strategy=MinimizeEnergy(),
n_runs=50,
# SimulatedAnnealing parameters
lambda0=2, # Poisson intensity
beta0=0.5, # Inverse temperature
step_size=0.05, # Time step
robust_prop=0.1, # Use 10% of observations for robustification
# Parallel execution parameters
algorithm="sequential",
n_jobs=4, # Use only 4 cores
seeds=list(range(100, 150)), # Specific seeds
mp_context='fork' # For Jupyter/Quarto compatibility
)
Multiprocessing Context
The mp_context parameter controls how Python creates worker processes:
None(default): Uses the system default (spawnon macOS/Windows,forkon Linux)'fork': Fast process creation, shares memory (Unix only). Required for Jupyter/Quarto.'spawn': Safer isolation, slower startup. Recommended for production code.'forkserver': Hybrid approach (Unix only)
For production scripts, omit mp_context to use the safer default.
For documentation/notebooks, use mp_context='fork' to avoid
serialization issues.
Performance Considerations
- Speedup: With
ncores andn_runs >= n, expect near-linear speedup (e.g., 4x faster with 4 cores) - Memory: Each process has its own copy of the data. For large graphs, this may consume significant memory
- Overhead: Process creation has overhead. For very short runs, parallel execution may not be faster
- Optimal n_runs: Use at least as many runs as you have cores for good CPU utilization
Example: Statistical Analysis
Here’s a complete example analyzing clustering stability:
from kmeanssa_ng import run_parallel, generate_sbm
import numpy as np
# Generate graph
graph = generate_sbm(sizes=[50, 50], p=[[0.8, 0.1], [0.1, 0.8]])
# Run 100 times, each sampling 150 points
best, all_results = run_parallel(
graph,
n_points=150,
k=2,
sampling_strategy=UniformNodeSampling(),
initialization_strategy=KMeansPlusPlus(),
robustification_strategy=MinimizeEnergy(),
n_runs=100,
return_all=True,
mp_context='fork'
)
# Analyze results
energies = [energy for _, energy, _ in all_results]
print(f"Energy statistics:")
print(f" Min: {np.min(energies):.4f}")
print(f" Median: {np.median(energies):.4f}")
print(f" Max: {np.max(energies):.4f}")
print(f" Std: {np.std(energies):.4f}")
# How often do we get the best result?
best_energy = energies[0]
n_optimal = sum(1 for e in energies if abs(e - best_energy) < 0.01)
print(f"\nFound optimal solution in {n_optimal}/100 runs ({n_optimal}%)")
Energy statistics:
Min: 0.6311
Median: 1.5400
Max: 2.4133
Std: 0.2908
Found optimal solution in 1/100 runs (1%)
API Reference
For detailed API documentation, see: