Introduction
SWE-bench Verified uses a one-to-one mapping between each instance and its environment image: every instance has its own unique environment. This makes environment cache preparation simpler than in SWE-smith.
Validating the Full Dataset
You can run the full SWE-bench Verified dataset to generate the environment cache. A separate virtual environment and GitHub repository cache will be created for each instance.
Because this setup assumes a restricted network environment (where only whitelisted URLs are accessible on our GPU machines), some instances that require external network access must be handled specially:
- A local HTTP server (httpbin) is started to serve requests that would otherwise go to public URLs (see the quick check sketched after this list).
- Some required resources (e.g., the qhull source code) are pre-downloaded and copied locally.
- The original SWE-bench Verified environment installation commands are modified to be compatible with this restricted environment.
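For example, once the local httpbin server started in the script below is running, you can confirm that it is reachable (a quick sanity check; port 5000 matches the --port used in the script):
# Quick check that the local httpbin replacement is serving requests (port 5000, as started in the script below)
curl -s http://127.0.0.1:5000/get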
Environment Cache Preparation
Use the following command to prepare the cache:
unset PIP_CONSTRAINT
database=/upfs/xxx/SWE/SWE/dataset/SWE-bench/SWE-bench_Verified/data # path to your SWE-bench Verified dataset
basedir=/home/zeta/SWE # change to your own base dir
apt-get update
apt-get install -y inkscape
apt-get install -y graphviz
mkdir -p /root/.cache/pip/wheels
zip_to_copy=$basedir/SWE-MiniSandbox/zip/qhull-2020-src-8.0.2.tgz
if [ ! -f /tmp/qhull-2020-src-8.0.2.tgz ]; then
cp "$zip_to_copy" /tmp/qhull-2020-src-8.0.2.tgz
fi
conda_env=$basedir/miniconda3
env_dir=/home/zeta/SWE/conda/
base_dir=/home/swebench
output_dir=/upfs/xxx/SWE/swebench_cache/out/swebench_cacheout
rm -rf "$output_dir"
sandbox_dir=$base_dir/sandbox
cached_git=$base_dir/cached_git
shared_venv_dir=$base_dir/shared_venv
source "$conda_env/bin/activate" rl
pip install tomli tomli-w
pip install httpbin
python -m httpbin.core --port=5000 &
HTTPBIN_PID=$!
sleep 2
# Ensure httpbin is killed when the script exits
trap "kill $HTTPBIN_PID" EXIT
sweagent run-batch --config "$basedir/SWE-MiniSandbox/config/sweagent_infer.yaml" \
--env_type sandbox \
--instances.deployment.conda_env="$env_dir" \
--agent.model.api_base http://0.0.0.0:8000/v1 \
--random_delay_multiplier=1 \
--instances.deployment.root_base="$sandbox_dir" \
--instances.deployment.tool_path="$basedir/SWE-MiniSandbox/SWE-agent/tools" \
--instances.deployment.git_base_path="$cached_git" \
--instances.deployment.shared_venv="$shared_venv_dir" \
--output_dir "$output_dir" \
--instances.database "$database" \
--instances.start 0 \
--instances.end 500 \
--num_workers 10
kill $HTTPBIN_PID
The environment cache is considered successfully prepared only if all instances are marked as "passed" in the output logs.
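To find instances that are not marked as passed, you can scan the per-instance output. The following is a minimal sketch that assumes each instance writes its logs under a subdirectory of $output_dir named after its instance ID and that successful setups log the literal string "passed"; adjust it to your actual log layout:
# Hypothetical log layout: one subdirectory per instance under $output_dir.
# List instances whose logs do not contain the string "passed".
for d in "$output_dir"/*/; do
  inst=$(basename "$d")
  if ! grep -rqs "passed" "$d"; then
    echo "NOT PASSED: $inst"
  fi
done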
Rebuilding Failed Instances
Some instances may fail during environment setup (e.g., due to transient network issues, venv creation errors, Git clone failures, or package problems). In that case, you can selectively rebuild only the failed instances.
Assume sphinx-doc__sphinx-7757 and django__django-12125 failed in a previous run. You can rebuild them with:
# Same setup (env vars, httpbin, etc.) as above.
# eval_timeout is raised to 500 because some cases fail with the default 300 s evaluation time limit.
# force_rebuild=True rebuilds only the instances selected by --instances.filter below;
# if you want to reuse the existing cache, do not set it to True.
sweagent run-batch --config "$basedir/SWE-MiniSandbox/config/sweagent_infer.yaml" \
--env_type sandbox \
--instances.deployment.conda_env="$env_dir" \
--agent.model.api_base http://0.0.0.0:8000/v1 \
--random_delay_multiplier=1 \
--instances.deployment.root_base="$sandbox_dir" \
--instances.deployment.tool_path="$basedir/SWE-MiniSandbox/SWE-agent/tools" \
--instances.deployment.git_base_path="$cached_git" \
--instances.deployment.shared_venv="$shared_venv_dir" \
--instances.deployment.eval_timeout=500 \
--output_dir "$output_dir" \
--instances.database "$database" \
--instances.deployment.force_rebuild=True \
--instances.filter "^(sphinx-doc__sphinx-7757|django__django-12125)$"
kill $HTTPBIN_PID # do not forget to kill the httpbin server
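If you have a longer list of failed instances, you can build the filter regex programmatically. A small convenience sketch (failed_ids is whatever list you collected from the logs):
# Build an --instances.filter regex from a space-separated list of failed instance IDs
failed_ids="sphinx-doc__sphinx-7757 django__django-12125"
filter_regex="^($(echo "$failed_ids" | tr ' ' '|'))$"
echo "$filter_regex"   # ^(sphinx-doc__sphinx-7757|django__django-12125)$
# Then pass it as: --instances.filter "$filter_regex"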
In our image, all instances in SWE-bench Verified are guaranteed to cache successfully (we still recommend verifying this yourself; to reuse the cache, --instances.deployment.force_rebuild must be set to False). In your own image, however, not every instance is guaranteed to succeed. Some may still fail due to:
- Network limitations
- Incompatible or broken pip packages
- Project-specific build or test issues
You can inspect the logs to debug these failures manually or with LLM assistance.
The mapping from an instance ID to its environment setup commands is defined by the SWE-bench harness's make_test_spec function. This function returns a TestSpec object, which contains all the information necessary for environment setup (install commands, test commands, etc.). The detailed per-repository installation and test commands are specified in the harness's constants (MAP_REPO_VERSION_TO_SPECS).
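If you want to see exactly what will run for a given instance, the following is a minimal sketch; the import path, dataset loading, and TestSpec attributes shown here are assumptions based on the standard SWE-bench harness, so adjust them to your swebench version and local dataset copy:
# Sketch: print the generated setup/test commands for one instance.
# Assumes the standard SWE-bench harness API and Hub access (or point load_dataset at your local copy).
python - <<'PY'
from datasets import load_dataset

try:
    from swebench.harness.test_spec.test_spec import make_test_spec  # newer swebench layouts
except ImportError:
    from swebench.harness.test_spec import make_test_spec            # older layouts

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
instance = next(x for x in ds if x["instance_id"] == "django__django-12125")
spec = make_test_spec(instance)
print(spec.instance_id)
print(spec.setup_env_script)   # environment installation commands
print(spec.eval_script)        # test commands
PY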
SWE-bench Verified Evaluation with MiniSandbox
Once the environment cache has been prepared, you can run evaluation with SWE-MiniSandbox.
Direct Model Evaluation
First, serve your model. For example:
modelp=SWE-bench/SWE-agent-LM-32B
conda activate vllm # use a conda env with vllm installed
vllm serve "$modelp" \
--tensor-parallel-size 8 \
--async-scheduling \
--served-model-name custom
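Before launching the batch run, you can verify that the OpenAI-compatible endpoint is up (vLLM defaults to port 8000, matching the --agent.model.api_base used below):
# Sanity-check the vLLM server; the response should list the served model name "custom"
curl -s http://0.0.0.0:8000/v1/models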
Then, run the evaluation. This command runs SWE-agent over the dataset (with evaluation enabled via --agent.eval True) and produces a results.json file in the output directory; eval.py then computes the resolved rate from it:
# Same setup (env vars, httpbin, etc.) as in the cache preparation section
sweagent run-batch --config "$basedir/SWE-MiniSandbox/config/sweagent_infer_default.yaml" \
--env_type sandbox \
--instances.deployment.conda_env="$env_dir" \
--agent.model.api_base http://0.0.0.0:8000/v1 \
--agent.model.name custom \
--agent.eval True \
--random_delay_multiplier=1 \
--instances.deployment.root_base="$sandbox_dir" \
--instances.deployment.tool_path="$basedir/SWE-MiniSandbox/SWE-agent/tools" \
--instances.deployment.git_base_path="$cached_git" \
--instances.deployment.shared_venv="$shared_venv_dir" \
--output_dir "$output_dir" \
--instances.database "$database" \
--instances.start 0 \
--instances.end 500 \
--num_workers 10
kill $HTTPBIN_PID # do not forget to kill the httpbin server
python "$basedir/SWE-MiniSandbox/sh/eval.py" \
--pred_file_path "$output_dir/results.json"
# The resolved rate will be saved in $output_dir/results.json
Evaluation from Patch Submissions
If you already have a set of patch submissions (e.g., from a previous run or a different model), you can evaluate them directly. Suppose your patch file is saved as $output_dir/preds.json (this is usually produced by sweagent run-batch).
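The exact schema of preds.json depends on your sweagent version; the check below is a sketch that assumes the usual SWE-agent layout, a JSON object keyed by instance ID whose records contain a model_patch field:
# Hypothetical sanity check: list instances with an empty or missing model_patch
# (assumes preds.json is a dict keyed by instance ID; adjust keys to your schema)
jq -r 'to_entries[] | select((.value.model_patch // "") == "") | .key' "$output_dir/preds.json"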
Run:
# Same setup (env vars, httpbin, etc.) as before
sweagent run-batch --config "$basedir/SWE-MiniSandbox/config/sweagent_infer.yaml" \
--env_type sandbox \
--instances.deployment.conda_env="$env_dir" \
--agent.model.api_base http://0.0.0.0:8000/v1 \
--random_delay_multiplier=1 \
--instances.deployment.root_base="$sandbox_dir" \
--instances.deployment.tool_path="$basedir/SWE-MiniSandbox/SWE-agent/tools" \
--instances.deployment.git_base_path="$cached_git" \
--instances.deployment.shared_venv="$shared_venv_dir" \
--output_dir "$output_dir" \
--instances.database "$database" \
--instances.model_patch_file "$output_dir/preds.json"
kill $HTTPBIN_PID # do not forget to kill the httpbin server
python "$basedir/SWE-MiniSandbox/sh/eval.py" \
--pred_file_path "$output_dir/results.json"
This will apply the patches from preds.json to the cached environments, run the tests, and produce evaluation results in the output directory.