SWE-smith Dataset - SWE-MiniSandbox

Introduction

This document explains how to:

Cache minisandbox environments for SWE-smith instances (following their image sharing pipeline).
Validate the full SWE-smith dataset using the environments cache.
Filter passed instances.
Collect SFT trajectories and optionally balance them.

SWE-smith reuses images across instances to reduce storage. We follow the same pattern for environment preparation: venvs can be shared across instances with the same image_name.

1. Identifying Unique Environments

First, extract instances with unique images from the SWE-smith dataset using the image_name field. This avoids redundant environment setup.

basedir=/home/zeta/SWE  # change to your own basedir
cd $basedir/SWE-MiniSandbox

python data/unique_images.py \
  --datap SWE-bench/SWE-smith-py \
  --save_path $basedir/SWE-MiniSandbox/dataset/SWE-smith-unique_images

Note: The original dataset path we use in our paper is SWE-bench/SWE-smith, but it includes non-Python tasks that are not supported. Here we recommend using SWE-smith-py instead.

2. Preparing Environment Cache for Unique Images

Each unique image will have a corresponding environment cache. The cache includes:

A venv (shared across instances with the same image_name)
An optional git repo cache (not necessary for SWE-smith; by default, we cache git repos here)

Setup and Dependencies

basedir=/home/zeta/SWE  # change to your own basedir

unset PIP_CONSTRAINT
apt-get install -y graphviz

database=$basedir/SWE-MiniSandbox/dataset/SWE-smith-unique_images

conda_env=$basedir/miniconda3       # Conda with our minisandbox installed
env_dir=/home/zeta/SWE/conda        # Conda env used only for venv creation inside sandboxes
cache_dir=/home/smith               # Base directory for environment cache
output_dir=$cache_dir/out           # Directory for run results
sandbox_dir=$cache_dir/sandbox      # Root directory for sandbox environments
cached_git=$cache_dir/cached_git    # Git repo cache directory (optional)
shared_venv_dir=$cache_dir/shared_venv  # Directory for shared venvs

source $conda_env/bin/activate rl

pip install tomli tomli-w
pip install httpbin

Run Environment Preparation

sweagent run-batch --config $basedir/SWE-MiniSandbox/config/swesmith_infer.yaml \
  --instances.type swesmith \
  --env_type sandbox \
  --instances.deployment.conda_env=$env_dir \
  --instances.deployment.delete_after_create=False \
  --agent.model.api_base http://0.0.0.0:8000/v1 \
  --random_delay_multiplier=1 \
  --instances.deployment.root_base=$sandbox_dir \
  --instances.deployment.tool_path=$basedir/SWE-MiniSandbox/SWE-agent/tools \
  --instances.deployment.git_base_path=$cached_git \
  --instances.deployment.shared_venv=$shared_venv_dir \
  --output_dir $output_dir \
  --instances.path $database \
  --instances.load_from_disk True \
  --instances.start 0 \
  --instances.end -1 \
  --instances.num_rollouts_per_instance -1 \
  --num_workers 60   # adjust parallelism based on your hardware

Only instances marked as passed in SWE-smith will successfully generate environment caches.

3. Validating the Full SWE-smith Dataset

After caching environments for all unique images, you can validate the full SWE-smith dataset and reuse the cached environments.

The venv cache will automatically be reused for instances sharing the same image_name. Git repo caching is also available but not required for SWE-smith.

You can optionally pre-filter instances to only passed images, but for simplicity, this section uses the full dataset.

rm -rf $output_dir  # Clear previous output

database=SWE-bench/SWE-smith-py  # Full SWE-smith dataset
output_dir=$cache_dir/full_out   # Output directory for full run

sweagent run-batch --config $basedir/SWE-MiniSandbox/config/swesmith_infer.yaml \
  --instances.type swesmith \
  --env_type sandbox \
  --instances.deployment.conda_env=$env_dir \
  --instances.deployment.delete_after_create=False \
  --agent.model.api_base http://0.0.0.0:8000/v1 \
  --random_delay_multiplier=1 \
  --instances.deployment.root_base=$sandbox_dir \
  --instances.deployment.tool_path=$basedir/SWE-MiniSandbox/SWE-agent/tools \
  --instances.deployment.git_base_path=$cached_git \
  --instances.deployment.shared_venv=$shared_venv_dir \
  --output_dir $output_dir \
  --instances.path $database \
  --instances.load_from_disk False \  # Load from Hugging Face dataset
  --instances.start 0 \
  --instances.end -1 \
  --instances.num_rollouts_per_instance -1 \
  --num_workers 60

Again, only instances marked passed in SWE-smith will generate environment caches successfully. Instances that failed or other may do so due to environment setup issues.

To debug failed instances:

The install commands are mapped in

get_install_commands_wrapper.
The test commands are mapped in

get_test_commands_wrapper.

You can inspect logs and fix the environments manually or via an LLM-based assistant.

On subsequent runs of the same instance, the prepared environment cache will be reused.

3.5. Rebuild Some Data Items (Optional)

Some instances may raise exceptions during environment setup (e.g., due to transient network issues, venv creation errors, Git clone failures, or package problems). In such cases, you can selectively rebuild only the failed instances (see Rebuilding Failed Instances). This is necessary when validating the RL environment cache in our image. However, you do not need to care about this for the RL process, as we have implemented automatic rebuild logic.

4. Filtering `passed` Instances

Once the full dataset run is complete, you can filter out passed instances for training.

output_dir=/home/smith  # Directory containing run_batch_exit_statuses.yaml

cd $basedir/SWE-MiniSandbox

python data/filter_smith.py \
  --res_dir $output_dir \
  --dataset_path SWE-bench/SWE-smith-py \
  --output_path $basedir/SWE-MiniSandbox/dataset/smith-passed

You may delete git caches for failed instances to save storage.

5. SFT Trajectory Collection

With passed instances filtered, you can now collect SFT trajectories as golden data for training.

5.1. Model API Deployment

Serve an LLM via API or use an existing endpoint. Update agent.model.api_base and agent.model.name accordingly in your config or commands.

Example: Serving SWE-bench/SWE-agent-LM-32B with vLLM:

bashmodelp=SWE-bench/SWE-agent-LM-32B
conda activate vllm  # create/use a vLLM environment

vllm serve $modelp \
  --tensor-parallel-size 8 \
  --async-scheduling \
  --served-model-name custom

This exposes an API at http://0.0.0.0:8000/v1 with model name custom.

5.2. Golden Trajectory Collection

basedir=/home/zeta/SWE  # change to your own basedir

unset PIP_CONSTRAINT
database=$basedir/SWE-MiniSandbox/dataset/smith-passed

pip install flask

conda_env=$basedir/miniconda3
env_dir=/home/zeta/SWE/conda/
base_dir=/home/zeta/SWE
output_dir=$base_dir/out
rm -rf $output_dir  # remove the old output dir if it exists
sandbox_dir=$base_dir/sandbox
cached_git=$base_dir/cached_git
shared_venv_dir=$base_dir/shared_venv

source $conda_env/bin/activate
pip install tomli tomli-w
pip install httpbin

sweagent run-batch --config $basedir/SWE-MiniSandbox/config/swesmith_infer_default.yaml \
  --env_type sandbox \
  --instances.deployment.type sandbox \
  --instances.deployment.conda_env=$env_dir \
  --instances.deployment.delete_after_create False \
  --instances.deployment.tool_path $basedir/SWE-MiniSandbox/SWE-agent/tools \
  --agent.type sandbox \
  --agent.model.api_base http://0.0.0.0:8000/v1 \
  --agent.model.temperature 0.8 \
  --agent.step_limit 100 \   # Step limit to avoid infinite loops
  --random_delay_multiplier=1 \
  --instances.deployment.root_base=$sandbox_dir \
  --instances.deployment.git_base_path=$cached_git \
  --instances.deployment.shared_venv=$shared_venv_dir \
  --output_dir $output_dir \
  --instances.path $database \
  --instances.start 0 \
  --instances.end -1 \
  --instances.load_from_disk True \
  --num_workers 60 \
  --instances.num_rollouts_per_instance 4  # Collect 4 trajectories per instance

After the run, aggregate trajectories:

python -m swesmith.train.traj_mgr.collect_trajs \
  --traj_dir $output_dir \
  --out_path $basedir/SWE-MiniSandbox/dataset/smith-sft-trajs/dataset.jsonl

Only trajectories for submitted instances with reward 1 are kept as golden trajectories.

6. (Optional) SFT Data Balancing

The collected trajectories are often imbalanced across instance IDs: easy instances tend to have more trajectories than hard ones. You can balance the SFT dataset using the provided script.

output_yaml=$output_dir/run_batch_exit_statuses.yaml 
json_path=$basedir/SWE-MiniSandbox/dataset/smith-sft-trajs/dataset.jsonl

python /home/zeta/SWE/SWE/data/balance_data.py \
  --json_path $json_path \
  --yaml_dir $output_yaml \
  --output_path $basedir/SWE-MiniSandbox/dataset/smith-sft-trajs/balanced_sft_data.jsonl

The final balanced SFT dataset is:

$basedir/SWE-MiniSandbox/dataset/smith-sft-trajs/balanced_sft_data.jsonl