676 Vulnerabilities · 120 FP Traps · 26 Repositories · 20,062 Python LOC · 18 Scanners Tested
Scanner Leaderboard ranked by F3 Score (strict)

F3 Score (0–100) measures how well a scanner finds vulnerabilities. It rewards finding real issues (recall) 9× more than avoiding false alarms (precision) — because in high-risk industries, missing a real vulnerability is far worse than a false positive. Strict mode penalizes scanners that fail or time out.

How scores work
Recall — What percentage of real vulnerabilities did the scanner find? Higher is better. A scanner that misses nothing has 100% recall.

Precision — Of everything the scanner flagged, what percentage were actual vulnerabilities? Higher is better. A scanner with no false alarms has 100% precision.
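
As a minimal sketch (not the benchmark's own scoring code), both percentages can be derived from the TP / FP / FN counts reported for each scanner. The function names are illustrative assumptions. Returning 0.0 for an empty denominator is also an assumed convention, chosen because rows on this page with TP 0 and FP 0 are shown with 0.0% precision.

```python
def recall_pct(tp: int, fn: int) -> float:
    """Percentage of real vulnerabilities that the scanner found."""
    total = tp + fn
    # Assumed convention: no ground-truth vulnerabilities means 0.0, not an error.
    return 100.0 * tp / total if total else 0.0


def precision_pct(tp: int, fp: int) -> float:
    """Percentage of the scanner's findings that were real vulnerabilities."""
    total = tp + fp
    # Assumed convention: a scanner that flagged nothing gets 0.0.
    return 100.0 * tp / total if total else 0.0


# Counts taken from one row of this page: TP 28, FP 28, FN 0.
print(recall_pct(28, 0))      # 100.0
print(precision_pct(28, 28))  # 50.0
```

Percentages shown elsewhere on this page may not always equal these ratios exactly, for example if the displayed counts are rounded or averaged over several runs; that explanation is an assumption, not something the page states.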

F2 Score — Combines recall and precision with beta=2 (recall weighted 4x). Range 0–100.

F3 Score — Combines recall and precision with beta=3 (recall weighted 9x). Our primary metric, designed for high-risk industries where missing a vulnerability is unacceptable. Range 0–100.
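
The F2 and F3 definitions above correspond to the standard F-beta formula, (1 + beta²) · P · R / (beta² · P + R). The sketch below assumes precision and recall are given as percentages and that a zero denominator yields a score of 0; the function name is an illustrative assumption, not the benchmark's code.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score on a 0-100 scale from precision and recall percentages."""
    p, r = precision / 100.0, recall / 100.0
    b2 = beta * beta  # recall is weighted beta^2 times as much as precision
    denom = b2 * p + r
    if denom == 0:
        return 0.0  # assumed convention when both inputs are zero
    return 100.0 * (1 + b2) * p * r / denom


# Top leaderboard entry: 80.9% recall, 38.8% precision.
print(round(f_beta(38.8, 80.9, beta=3), 1))   # 73.0
# A row with 50.0% precision and 100.0% recall:
print(round(f_beta(50.0, 100.0, beta=2), 1))  # 83.3
print(round(f_beta(50.0, 100.0, beta=3), 1))  # 90.9
```

With beta=3 the low precision of the first example costs relatively little, which is why a high-recall scanner can rank first despite many false positives.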

Optimistic vs Strict: Optimistic scores only the repos where the scanner produced results. Strict penalizes failed or timed-out repos by counting all of their vulnerabilities as missed (FN).
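
The difference between the two modes can be sketched as follows. This is a hedged illustration under stated assumptions: counts are pooled across repos (the page does not say whether scores are pooled or averaged per repo), a failed or timed-out repo is represented as `None`, and the names `pooled_counts`, `results`, and `ground_truth` are hypothetical.

```python
from typing import Dict, Optional, Tuple

Counts = Tuple[int, int, int]  # (TP, FP, FN)


def pooled_counts(
    results: Dict[str, Optional[Counts]],
    ground_truth: Dict[str, int],
    strict: bool,
) -> Counts:
    """Sum TP/FP/FN over repos; None marks a repo where the scanner failed."""
    tp = fp = fn = 0
    for repo, counts in results.items():
        if counts is None:
            if strict:
                # Strict: every known vulnerability in the failed repo is a miss.
                fn += ground_truth[repo]
            # Optimistic: the failed repo is skipped entirely.
            continue
        tp += counts[0]
        fp += counts[1]
        fn += counts[2]
    return tp, fp, fn


# Hypothetical data: the scanner succeeded on repo-a and timed out on repo-b.
results = {"repo-a": (8, 2, 2), "repo-b": None}
ground_truth = {"repo-a": 10, "repo-b": 10}

optimistic_counts = pooled_counts(results, ground_truth, strict=False)
strict_counts = pooled_counts(results, ground_truth, strict=True)
print(optimistic_counts)  # (8, 2, 2)  -> recall 8 / 10 = 80%
print(strict_counts)      # (8, 2, 12) -> recall 8 / 20 = 40%
```

In this sketch precision is unchanged between the two modes, because a failed repo contributes no findings; only recall, and therefore the F-scores, drop under Strict.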
1. kolega-v0.0.1 · 26 repos · F3 73.0 · 80.9% recall · 38.8% prec
2. glm-5.1-agentic-v1 · 25/26 repos · F3 58.2 · stdev 15.2 · 56.9% recall · 73.4% prec · $0.16/repo · ~$53/100k LOC · 438s avg
3. kimi-k2.6-agentic-v1 · 25/26 repos · F3 54.6 · stdev 14.2 · 52.9% recall · 77.1% prec · $0.10/repo · ~$32/100k LOC · 603s avg
4. claude-sonnet-4-6-agentic-v1 · 23/26 repos · F3 51.7 · stdev 13.1 · 49.9% recall · 78.5% prec · $0.29/repo · ~$83/100k LOC · 367s avg
5. gemini-3.1-pro-agentic-v1 · 24/26 repos · F3 51.0 · stdev 13.2 · 49.1% recall · 77.4% prec · $0.38/repo · ~$136/100k LOC · 170s avg
6. claude-opus-4-7-agentic-v1 · 25/26 repos · F3 48.6 · stdev 16.8 · 46.9% recall · 71.2% prec · $0.49/repo · ~$184/100k LOC · 76s avg
7. claude-opus-4-6-agentic-v1 · 19/26 repos · F3 47.7 · stdev 12.7 · 45.6% recall · 79.0% prec · $0.49/repo · ~$123/100k LOC · 763s avg
8. kimi-k2.5-agentic-v1 · 24/26 repos · F3 46.6 · stdev 12.8 · 45.0% recall · 68.3% prec · $0.03/repo · ~$11/100k LOC · 140s avg
9. glm-5-agentic-v1 · 22/26 repos · F3 45.8 · stdev 13.7 · 43.9% recall · 75.3% prec · $0.11/repo · ~$34/100k LOC · 409s avg
10. minimax-m2.7-agentic-v1 · 22/26 repos · F3 39.0 · stdev 10.8 · 37.1% recall · 70.7% prec · $0.02/repo · ~$8/100k LOC · 119s avg
11. qwen-3.5-397b-agentic-v1 · 24/26 repos · F3 38.1 · stdev 12.5 · 36.6% recall · 61.8% prec · $0.05/repo · ~$16/100k LOC · 77s avg
12. claude-haiku-4-5-agentic-v1 · 24/26 repos · F3 37.2 · stdev 12.4 · 35.2% recall · 74.8% prec · $0.07/repo · ~$26/100k LOC · 56s avg
13. grok-4.20-reasoning-agentic-v1 · 24/26 repos · F3 28.4 · stdev 16.0 · 26.3% recall · 92.7% prec · $0.23/repo · ~$84/100k LOC · 34s avg
14. claude-haiku-4-5-v1 · 23/26 repos · F3 25.7 · stdev 17.6 · 24.4% recall · 47.7% prec · $0.07/repo · ~$25/100k LOC · 19s avg
15. grok-3-agentic-v1 · 21/26 repos · F3 21.3 · stdev 21.1 · 19.7% recall · 83.7% prec · $0.08/repo · ~$28/100k LOC · 34s avg
16. semgrep · 25/26 repos · F3 17.7 · 17.5% recall · 20.5% prec
17. snyk · 25/26 repos · F3 17.4 · 16.7% recall · 28.2% prec
18. sonarqube · 26 repos · F3 7.1 · 6.5% recall · 61.1% prec
[Chart] Precision vs Recall: each dot is a scanner · dashed lines are F2 iso-curves
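
An F2 iso-curve is the set of (recall, precision) points that share the same F2 score. Rearranging the standard F-beta formula for precision gives P = F · R / ((1 + beta²) · R − beta² · F), which is one way such a dashed line could be traced. The sketch below is an illustration under that assumption, not the page's plotting code, and the function name is hypothetical.

```python
from typing import Optional


def precision_on_iso_curve(
    f_score: float, recall: float, beta: float = 2.0
) -> Optional[float]:
    """Precision (%) needed to reach f_score (%) at a given recall (%).

    Returns None when no precision in (0, 100] can reach that score.
    """
    f, r = f_score / 100.0, recall / 100.0
    b2 = beta * beta
    denom = (1 + b2) * r - b2 * f
    if denom <= 0:
        return None  # recall is too low for this score at any precision
    p = f * r / denom
    return 100.0 * p if p <= 1.0 else None


# On the F2 = 50 curve, a scanner with 50% recall needs 50% precision.
print(precision_on_iso_curve(50.0, 50.0))  # 50.0
# An F2 of 80 cannot be reached with only 60% recall.
print(precision_on_iso_curve(80.0, 60.0))  # None
```

Sampling recall over a grid and keeping the non-None results yields the points of one curve.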
[Chart] Finding Breakdown: TP / FP / FN per scanner
CWE Detection Coverage across all scanners
Insecure Deserialization: 18/18 scanners · 80% avg recall
SQL Injection: 18/18 scanners · 79% avg recall
Code Injection / RFI: 18/18 scanners · 73% avg recall
Command / OS Injection: 18/18 scanners · 72% avg recall
XML External Entities: 17/18 scanners · 72% avg recall
XPath Injection: 15/18 scanners · 71% avg recall
Open Redirect: 18/18 scanners · 70% avg recall
HTTP Header Injection: 16/18 scanners · 66% avg recall
Path Traversal: 18/18 scanners · 63% avg recall
Server-Side Request Forgery: 18/18 scanners · 51% avg recall
Hardcoded Credentials: 18/18 scanners · 50% avg recall
Broken Access Control / IDOR: 15/18 scanners · 44% avg recall
Cross-Site Scripting: 18/18 scanners · 43% avg recall
Security Misconfiguration: 17/18 scanners · 31% avg recall
Missing Authentication / Authorization: 15/18 scanners · 30% avg recall
Other: 18/18 scanners · 27% avg recall
Denial of Service: 13/18 scanners · 25% avg recall
Sensitive Data Exposure: 15/18 scanners · 20% avg recall
Per-Repository Heatmap F2 Score
Repository claude-haiku-4-5-agentic-v1 claude-haiku-4-5-v1 claude-opus-4-6-agentic-v1 claude-opus-4-7-agentic-v1 claude-sonnet-4-6-agentic-v1 gemini-3.1-pro-agentic-v1 glm-5-agentic-v1 glm-5.1-agentic-v1 grok-3-agentic-v1 grok-4.20-reasoning-agentic-v1 kimi-k2.5-agentic-v1 kimi-k2.6-agentic-v1 kolega-v0.0.1 minimax-m2.7-agentic-v1 qwen-3.5-397b-agentic-v1 semgrep snyk sonarqube
damn-vulnerable-flask-application
  claude-haiku-4-5-agentic-v1: F2 48.4 · F3 47.5 · recall 46.7% · prec 57.1% · TP 7 · FP 5 · FN 8
  claude-haiku-4-5-v1: F2 43.6 · F3 42.9 · recall 42.2% · prec 51.0% · TP 6 · FP 6 · FN 9
  claude-opus-4-6-agentic-v1: F2 77.0 · F3 77.4 · recall 77.8% · prec 74.3% · TP 12 · FP 4 · FN 3
  claude-opus-4-7-agentic-v1: F2 62.5 · F3 62.4 · recall 62.2% · prec 64.6% · TP 9 · FP 5 · FN 6
  claude-sonnet-4-6-agentic-v1: F2 70.3 · F3 69.6 · recall 68.9% · prec 77.3% · TP 10 · FP 3 · FN 5
  gemini-3.1-pro-agentic-v1: F2 59.9 · F3 58.8 · recall 57.8% · prec 75.9% · TP 9 · FP 2 · FN 6
  glm-5-agentic-v1: F2 57.0 · F3 55.1 · recall 53.3% · prec 79.8% · TP 8 · FP 2 · FN 7
  glm-5.1-agentic-v1: F2 87.7 · F3 88.8 · recall 90.0% · prec 79.9% · TP 14 · FP 4 · FN 2
  grok-3-agentic-v1: F2 35.5 · F3 33.2 · recall 31.1% · prec 82.2% · TP 5 · FP 1 · FN 10
  grok-4.20-reasoning-agentic-v1: F2 40.6 · F3 37.9 · recall 35.6% · prec 93.3% · TP 5 · FP 0 · FN 10
  kimi-k2.5-agentic-v1: F2 50.7 · F3 48.6 · recall 46.7% · prec 77.8% · TP 7 · FP 2 · FN 8
  kimi-k2.6-agentic-v1: F2 76.4 · F3 76.0 · recall 75.6% · prec 82.6% · TP 11 · FP 2 · FN 4
  kolega-v0.0.1: F2 57.7 · F3 67.0 · recall 80.0% · prec 27.3% · TP 12 · FP 32 · FN 3
  minimax-m2.7-agentic-v1: F2 45.4 · F3 44.3 · recall 43.3% · prec 70.0% · TP 6 · FP 4 · FN 8
  qwen-3.5-397b-agentic-v1: F2 47.4 · F3 45.9 · recall 44.4% · prec 65.8% · TP 7 · FP 3 · FN 8
  semgrep: F2 30.9 · F3 32.1 · recall 33.3% · prec 23.8% · TP 5 · FP 16 · FN 10
  snyk: F2 30.3 · F3 28.4 · recall 26.7% · prec 66.7% · TP 4 · FP 2 · FN 11
  sonarqube: F2 0.0 · F3 0.0 · recall 0.0% · prec 0.0% · TP 0 · FP 0 · FN 15
damn-vulnerable-graphql-application
  claude-haiku-4-5-agentic-v1: F2 27.9 · F3 26.8 · recall 25.7% · prec 44.7% · TP 9 · FP 11 · FN 26
  claude-haiku-4-5-v1: F2 9.4 · F3 9.0 · recall 8.6% · prec 16.1% · TP 3 · FP 15 · FN 32
  claude-opus-4-6-agentic-v1: F2 46.8 · F3 45.5 · recall 44.3% · prec 60.8% · TP 16 · FP 10 · FN 20
  claude-opus-4-7-agentic-v1: F2 40.8 · F3 38.9 · recall 37.1% · prec 67.1% · TP 13 · FP 6 · FN 22
  claude-sonnet-4-6-agentic-v1: F2 42.7 · F3 40.8 · recall 39.1% · prec 67.4% · TP 14 · FP 7 · FN 21
  gemini-3.1-pro-agentic-v1: F2 42.4 · F3 40.7 · recall 39.1% · prec 65.1% · TP 14 · FP 7 · FN 21
  glm-5-agentic-v1: F2 42.1 · F3 42.5 · recall 42.9% · prec 39.5% · TP 15 · FP 23 · FN 20
  glm-5.1-agentic-v1: F2 46.4 · F3 46.0 · recall 45.7% · prec 52.5% · TP 16 · FP 16 · FN 19
  grok-4.20-reasoning-agentic-v1: F2 22.6 · F3 20.7 · recall 19.1% · prec 91.1% · TP 7 · FP 1 · FN 28
  kimi-k2.5-agentic-v1: F2 39.7 · F3 38.4 · recall 37.1% · prec 55.1% · TP 13 · FP 11 · FN 22
  kimi-k2.6-agentic-v1: F2 46.4 · F3 44.5 · recall 42.9% · prec 78.3% · TP 15 · FP 6 · FN 20
  kolega-v0.0.1: F2 54.8 · F3 59.7 · recall 65.7% · prec 32.9% · TP 23 · FP 47 · FN 12
  minimax-m2.7-agentic-v1: F2 39.0 · F3 38.0 · recall 37.1% · prec 49.9% · TP 13 · FP 14 · FN 22
  qwen-3.5-397b-agentic-v1: F2 35.7 · F3 35.8 · recall 36.2% · prec 46.4% · TP 13 · FP 23 · FN 22
  semgrep: F2 6.7 · F3 6.2 · recall 5.7% · prec 22.2% · TP 2 · FP 7 · FN 33
  snyk: F2 0.0 · F3 0.0 · recall 0.0% · prec 0.0% · TP 0 · FP 5 · FN 35
  sonarqube: F2 0.0 · F3 0.0 · recall 0.0% · prec 0.0% · TP 0 · FP 0 · FN 35
djangoat
  claude-haiku-4-5-agentic-v1: F2 28.2 · F3 26.3 · recall 24.7% · prec 66.4% · TP 12 · FP 6 · FN 38
  claude-haiku-4-5-v1: F2 9.1 · F3 8.5 · recall 8.0% · prec 21.7% · TP 4 · FP 14 · FN 46
  claude-opus-4-6-agentic-v1: F2 46.7 · F3 44.2 · recall 42.0% · prec 84.0% · TP 21 · FP 4 · FN 29
  claude-opus-4-7-agentic-v1: F2 42.3 · F3 40.7 · recall 39.3% · prec 62.6% · TP 20 · FP 13 · FN 30
  claude-sonnet-4-6-agentic-v1: F2 37.7 · F3 35.7 · recall 34.0% · prec 66.3% · TP 17 · FP 9 · FN 33
  gemini-3.1-pro-agentic-v1: F2 32.8 · F3 31.0 · recall 29.3% · prec 65.8% · TP 15 · FP 9 · FN 35
  glm-5-agentic-v1: F2 35.2 · F3 33.2 · recall 31.3% · prec 69.1% · TP 16 · FP 7 · FN 34
  glm-5.1-agentic-v1: F2 40.5 · F3 38.7 · recall 37.0% · prec 66.6% · TP 18 · FP 10 · FN 32
  grok-3-agentic-v1: F2 11.3 · F3 10.2 · recall 9.3% · prec 73.8% · TP 5 · FP 2 · FN 45
  grok-4.20-reasoning-agentic-v1: F2 19.2 · F3 17.5 · recall 16.0% · prec 100.0% · TP 8 · FP 0 · FN 42
  kimi-k2.5-agentic-v1: F2 35.1 · F3 33.5 · recall 32.0% · prec 60.9% · TP 16 · FP 12 · FN 34
  kimi-k2.6-agentic-v1: F2 44.0 · F3 42.4 · recall 41.0% · prec 62.1% · TP 20 · FP 12 · FN 30
  kolega-v0.0.1: F2 54.5 · F3 62.1 · recall 72.0% · prec 27.7% · TP 36 · FP 94 · FN 14
  qwen-3.5-397b-agentic-v1: F2 28.0 · F3 26.2 · recall 24.7% · prec 62.5% · TP 12 · FP 8 · FN 38
  semgrep: F2 20.1 · F3 20.0 · recall 20.0% · prec 20.4% · TP 10 · FP 39 · FN 40
  snyk: F2 19.9 · F3 18.9 · recall 18.0% · prec 34.6% · TP 9 · FP 17 · FN 41
  sonarqube: F2 0.0 · F3 0.0 · recall 0.0% · prec 0.0% · TP 0 · FP 0 · FN 50
dsvpwa
  claude-haiku-4-5-agentic-v1: F2 42.6 · F3 39.9 · recall 37.5% · prec 92.7% · TP 12 · FP 1 · FN 20
  claude-haiku-4-5-v1: F2 29.1 · F3 27.5 · recall 26.0% · prec 55.0% · TP 8 · FP 8 · FN 24
  claude-opus-4-7-agentic-v1: F2 63.9 · F3 63.2 · recall 62.5% · prec 70.7% · TP 20 · FP 8 · FN 12
  claude-sonnet-4-6-agentic-v1: F2 58.5 · F3 56.8 · recall 55.2% · prec 77.2% · TP 18 · FP 5 · FN 14
  gemini-3.1-pro-agentic-v1: F2 58.8 · F3 56.9 · recall 55.2% · prec 79.6% · TP 18 · FP 4 · FN 14
  glm-5-agentic-v1: F2 69.2 · F3 67.3 · recall 65.6% · prec 88.8% · TP 21 · FP 3 · FN 11
  glm-5.1-agentic-v1: F2 77.7 · F3 77.4 · recall 77.1% · prec 80.5% · TP 25 · FP 6 · FN 7
  grok-3-agentic-v1: F2 69.1 · F3 67.3 · recall 65.6% · prec 87.5% · TP 21 · FP 3 · FN 11
  grok-4.20-reasoning-agentic-v1: F2 69.1 · F3 67.3 · recall 65.6% · prec 87.5% · TP 21 · FP 3 · FN 11
  kimi-k2.5-agentic-v1: F2 56.1 · F3 54.0 · recall 52.1% · prec 81.1% · TP 17 · FP 4 · FN 15
  kolega-v0.0.1: F2 73.7 · F3 80.0 · recall 87.5% · prec 45.2% · TP 28 · FP 34 · FN 4
  minimax-m2.7-agentic-v1: F2 55.7 · F3 53.5 · recall 51.6% · prec 86.6% · TP 16 · FP 2 · FN 16
  qwen-3.5-397b-agentic-v1: F2 39.6 · F3 37.4 · recall 35.4% · prec 78.3% · TP 11 · FP 4 · FN 21
  semgrep: F2 21.3 · F3 19.9 · recall 18.8% · prec 46.2% · TP 6 · FP 7 · FN 26
  snyk: F2 11.1 · F3 10.2 · recall 9.4% · prec 42.9% · TP 3 · FP 4 · FN 29
  sonarqube: F2 0.0 · F3 0.0 · recall 0.0% · prec 0.0% · TP 0 · FP 0 · FN 32
dsvw
  claude-haiku-4-5-agentic-v1: F2 54.8 · F3 51.9 · recall 49.4% · prec 97.4% · TP 13 · FP 0 · FN 14
  claude-haiku-4-5-v1: F2 26.8 · F3 25.7 · recall 24.7% · prec 41.2% · TP 7 · FP 10 · FN 20
  claude-opus-4-7-agentic-v1: F2 73.3 · F3 71.8 · recall 70.4% · prec 88.3% · TP 19 · FP 3 · FN 8
  claude-sonnet-4-6-agentic-v1: F2 77.7 · F3 75.8 · recall 74.1% · prec 96.9% · TP 20 · FP 1 · FN 7
  gemini-3.1-pro-agentic-v1: F2 62.0 · F3 61.2 · recall 60.5% · prec 69.2% · TP 16 · FP 8 · FN 11
  glm-5-agentic-v1: F2 62.8 · F3 60.3 · recall 58.0% · prec 94.0% · TP 16 · FP 1 · FN 11
  glm-5.1-agentic-v1: F2 73.6 · F3 72.9 · recall 72.2% · prec 79.5% · TP 20 · FP 5 · FN 8
  grok-3-agentic-v1: F2 52.5 · F3 49.6 · recall 46.9% · prec 100.0% · TP 13 · FP 0 · FN 14
  grok-4.20-reasoning-agentic-v1: F2 46.2 · F3 43.3 · recall 40.7% · prec 100.0% · TP 11 · FP 0 · FN 16
  kimi-k2.5-agentic-v1: F2 65.5 · F3 63.5 · recall 61.7% · prec 86.7% · TP 17 · FP 3 · FN 10
  kimi-k2.6-agentic-v1: F2 74.2 · F3 72.2 · recall 70.4% · prec 95.0% · TP 19 · FP 1 · FN 8
  kolega-v0.0.1: F2 83.3 · F3 87.7 · recall 92.6% · prec 59.5% · TP 25 · FP 17 · FN 2
  minimax-m2.7-agentic-v1: F2 58.6 · F3 57.0 · recall 55.6% · prec 75.1% · TP 15 · FP 5 · FN 12
  qwen-3.5-397b-agentic-v1: F2 60.1 · F3 59.0 · recall 58.0% · prec 77.9% · TP 16 · FP 6 · FN 11
  semgrep: F2 0.0 · F3 0.0 · recall 0.0% · prec 0.0% · TP 0 · FP 0 · FN 27
  snyk: F2 28.9 · F3 27.3 · recall 25.9% · prec 53.8% · TP 7 · FP 6 · FN 20
  sonarqube: F2 0.0 · F3 0.0 · recall 0.0% · prec 0.0% · TP 0 · FP 0 · FN 27
dvblab
  claude-haiku-4-5-agentic-v1: F2 55.3 · F3 54.1 · recall 53.0% · prec 67.5% · TP 12 · FP 6 · FN 10
  claude-haiku-4-5-v1: F2 37.6 · F3 37.0 · recall 36.4% · prec 44.3% · TP 8 · FP 10 · FN 14
  claude-opus-4-7-agentic-v1: F2 65.0 · F3 63.5 · recall 62.1% · prec 80.7% · TP 14 · FP 3 · FN 8
  claude-sonnet-4-6-agentic-v1: F2 70.1 · F3 69.1 · recall 68.2% · prec 79.3% · TP 15 · FP 4 · FN 7
  gemini-3.1-pro-agentic-v1: F2 61.7 · F3 60.4 · recall 59.1% · prec 75.0% · TP 13 · FP 4 · FN 9
  glm-5-agentic-v1: F2 65.2 · F3 63.2 · recall 61.4% · prec 87.1% · TP 14 · FP 2 · FN 8
  glm-5.1-agentic-v1: F2 69.6 · F3 69.6 · recall 69.7% · prec 69.8% · TP 15 · FP 7 · FN 7
  grok-3-agentic-v1: F2 49.0 · F3 46.3 · recall 43.9% · prec 93.0% · TP 10 · FP 1 · FN 12
  grok-4.20-reasoning-agentic-v1: F2 46.1 · F3 43.3 · recall 40.9% · prec 97.2% · TP 9 · FP 0 · FN 13
  kimi-k2.5-agentic-v1: F2 60.1 · F3 58.8 · recall 57.6% · prec 75.2% · TP 13 · FP 5 · FN 9
  kimi-k2.6-agentic-v1: F2 49.5 · F3 47.4 · recall 45.5% · prec 86.6% · TP 10 · FP 2 · FN 12
  kolega-v0.0.1: F2 64.2 · F3 73.6 · recall 86.4% · prec 31.7% · TP 19 · FP 41 · FN 3
  minimax-m2.7-agentic-v1: F2 53.0 · F3 52.6 · recall 52.3% · prec 56.2% · TP 12 · FP 9 · FN 10
  qwen-3.5-397b-agentic-v1: F2 63.7 · F3 61.3 · recall 59.1% · prec 93.5% · TP 13 · FP 1 · FN 9
  semgrep: F2 33.9 · F3 35.1 · recall 36.4% · prec 26.7% · TP 8 · FP 22 · FN 14
  snyk: F2 37.7 · F3 37.0 · recall 36.4% · prec 44.4% · TP 8 · FP 10 · FN 14
  sonarqube: F2 0.0 · F3 0.0 · recall 0.0% · prec 0.0% · TP 0 · FP 0 · FN 22
dvpwa
  claude-haiku-4-5-agentic-v1: F2 36.3 · F3 33.9 · recall 31.8% · prec 84.1% · TP 7 · FP 1 · FN 15
  claude-haiku-4-5-v1: F2 21.7 · F3 20.6 · recall 19.7% · prec 37.7% · TP 4 · FP 8 · FN 18
  claude-opus-4-7-agentic-v1: F2 39.4 · F3 37.0 · recall 34.8% · prec 82.5% · TP 8 · FP 2 · FN 14
  claude-sonnet-4-6-agentic-v1: F2 53.4 · F3 51.6 · recall 50.0% · prec 73.3% · TP 11 · FP 4 · FN 11
  gemini-3.1-pro-agentic-v1: F2 46.7 · F3 45.3 · recall 43.9% · prec 65.0% · TP 10 · FP 5 · FN 12
  glm-5-agentic-v1: F2 37.2 · F3 35.6 · recall 34.1% · prec 61.9% · TP 8 · FP 6 · FN 14
  glm-5.1-agentic-v1: F2 59.6 · F3 58.6 · recall 57.6% · prec 70.2% · TP 13 · FP 6 · FN 9
  grok-3-agentic-v1: F2 12.9 · F3 11.6 · recall 10.6% · prec 100.0% · TP 2 · FP 0 · FN 20
  grok-4.20-reasoning-agentic-v1: F2 26.9 · F3 24.6 · recall 22.7% · prec 100.0% · TP 5 · FP 0 · FN 17
  kimi-k2.5-agentic-v1: F2 51.3 · F3 49.8 · recall 48.5% · prec 70.2% · TP 11 · FP 5 · FN 11
  kimi-k2.6-agentic-v1: F2 58.2 · F3 57.5 · recall 56.8% · prec 65.8% · TP 12 · FP 7 · FN 10
  kolega-v0.0.1: F2 74.8 · F3 80.2 · recall 86.4% · prec 48.7% · TP 19 · FP 20 · FN 3
  minimax-m2.7-agentic-v1: F2 26.5 · F3 25.3 · recall 24.2% · prec 52.4% · TP 5 · FP 7 · FN 17
  qwen-3.5-397b-agentic-v1: F2 32.2 · F3 30.4 · recall 28.8% · prec 61.6% · TP 6 · FP 4 · FN 16
  semgrep: F2 10.2 · F3 9.6 · recall 9.1% · prec 20.0% · TP 2 · FP 8 · FN 20
  snyk: F2 5.6 · F3 5.0 · recall 4.5% · prec 100.0% · TP 1 · FP 0 · FN 21
  sonarqube: F2 0.0 · F3 0.0 · recall 0.0% · prec 0.0% · TP 0 · FP 0 · FN 22
extremely-vulnerable-flask-app
  claude-haiku-4-5-agentic-v1: F2 36.4 · F3 34.1 · recall 32.1% · prec 78.8% · TP 9 · FP 3 · FN 19
  claude-haiku-4-5-v1: F2 29.4 · F3 27.7 · recall 26.2% · prec 60.8% · TP 7 · FP 5 · FN 21
  claude-opus-4-6-agentic-v1: F2 57.3 · F3 55.4 · recall 53.6% · prec 79.0% · TP 15 · FP 4 · FN 13
  claude-opus-4-7-agentic-v1: F2 53.8 · F3 51.9 · recall 50.0% · prec 77.8% · TP 14 · FP 4 · FN 14
  claude-sonnet-4-6-agentic-v1: F2 62.2 · F3 59.6 · recall 57.1% · prec 97.1% · TP 16 · FP 0 · FN 12
  gemini-3.1-pro-agentic-v1: F2 53.9 · F3 51.8 · recall 50.0% · prec 79.5% · TP 14 · FP 4 · FN 14
  glm-5.1-agentic-v1: F2 55.4 · F3 53.5 · recall 51.8% · prec 76.9% · TP 14 · FP 4 · FN 14
  grok-4.20-reasoning-agentic-v1: F2 42.9 · F3 40.4 · recall 38.1% · prec 90.1% · TP 11 · FP 1 · FN 17
  kimi-k2.5-agentic-v1: F2 47.0 · F3 44.9 · recall 42.9% · prec 81.6% · TP 12 · FP 3 · FN 16
  kimi-k2.6-agentic-v1: F2 58.5 · F3 56.6 · recall 54.8% · prec 81.4% · TP 15 · FP 4 · FN 13
  kolega-v0.0.1: F2 83.3 · F3 90.9 · recall 100.0% · prec 50.0% · TP 28 · FP 28 · FN 0
  minimax-m2.7-agentic-v1: F2 40.7 · F3 38.0 · recall 35.7% · prec 90.9% · TP 10 · FP 1 · FN 18
  qwen-3.5-397b-agentic-v1: F2 30.1 · F3 28.0 · recall 26.2% · prec 75.9% · TP 7 · FP 2 · FN 21
  semgrep: F2 11.9 · F3 11.3 · recall 10.7% · prec 21.4% · TP 3 · FP 11 · FN 25
  snyk: F2 20.8 · F3 19.2 · recall 17.9% · prec 62.5% · TP 5 · FP 3 · FN 23
  sonarqube: F2 16.9 · F3 15.5 · recall 14.3% · prec 66.7% · TP 4 · FP 2 · FN 24
flask-xss
  claude-haiku-4-5-agentic-v1: F2 32.8 · F3 30.5 · recall 28.6% · prec 81.8% · TP 8 · FP 2 · FN 20
  claude-haiku-4-5-v1: F2 45.4 · F3 43.4 · recall 41.7% · prec 70.9% · TP 12 · FP 5 · FN 16
  claude-opus-4-6-agentic-v1: F2 47.6 · F3 45.1 · recall 42.9% · prec 85.7% · TP 12 · FP 2 · FN 16
  claude-opus-4-7-agentic-v1: F2 27.0 · F3 26.0 · recall 25.0% · prec 84.2% · TP 7 · FP 3 · FN 21
  claude-sonnet-4-6-agentic-v1: F2 44.0 · F3 41.5 · recall 39.3% · prec 84.6% · TP 11 · FP 2 · FN 17
  gemini-3.1-pro-agentic-v1: F2 46.8 · F3 44.7 · recall 42.9% · prec 75.7% · TP 12 · FP 4 · FN 16
  glm-5-agentic-v1: F2 53.0 · F3 50.8 · recall 48.8% · prec 80.3% · TP 14 · FP 3 · FN 14
  glm-5.1-agentic-v1: F2 47.9 · F3 45.2 · recall 42.9% · prec 90.8% · TP 12 · FP 1 · FN 16
  grok-3-agentic-v1: F2 17.0 · F3 15.6 · recall 14.3% · prec 74.4% · TP 4 · FP 1 · FN 24
  grok-4.20-reasoning-agentic-v1: F2 21.3 · F3 19.5 · recall 17.9% · prec 100.0% · TP 5 · FP 0 · FN 23
  kimi-k2.5-agentic-v1: F2 44.2 · F3 42.2 · recall 40.5% · prec 74.5% · TP 11 · FP 5 · FN 17
  kimi-k2.6-agentic-v1: F2 40.3 · F3 37.9 · recall 35.7% · prec 83.3% · TP 10 · FP 2 · FN 18
  kolega-v0.0.1: F2 68.2 · F3 75.9 · recall 85.7% · prec 37.5% · TP 24 · FP 40 · FN 4
  minimax-m2.7-agentic-v1: F2 41.1 · F3 38.9 · recall 36.9% · prec 75.5% · TP 10 · FP 3 · FN 18
  qwen-3.5-397b-agentic-v1: F2 32.8 · F3 30.5 · recall 28.6% · prec 80.9% · TP 8 · FP 2 · FN 20
  semgrep: F2 12.9 · F3 11.7 · recall 10.7% · prec 75.0% · TP 3 · FP 1 · FN 25
  snyk: F2 4.4 · F3 3.9 · recall 3.6% · prec 50.0% · TP 1 · FP 1 · FN 27
  sonarqube: F2 0.0 · F3 0.0 · recall 0.0% · prec 0.0% · TP 0 · FP 0 · FN 28
insecure-web
  claude-haiku-4-5-agentic-v1: F2 70.9 · F3 68.7 · recall 66.7% · prec 95.2% · TP 6 · FP 0 · FN 3
  claude-haiku-4-5-v1: F2 63.4 · F3 63.2 · recall 63.0% · prec 65.3% · TP 6 · FP 3 · FN 3
  claude-opus-4-6-agentic-v1: F2 74.5 · F3 76.1 · recall 77.8% · prec 63.6% · TP 7 · FP 4 · FN 2
  claude-opus-4-7-agentic-v1: F2 73.0 · F3 73.5 · recall 74.1% · prec 69.4% · TP 7 · FP 3 · FN 2
  claude-sonnet-4-6-agentic-v1: F2 70.9 · F3 70.6 · recall 70.4% · prec 73.2% · TP 6 · FP 2 · FN 3
  gemini-3.1-pro-agentic-v1: F2 64.0 · F3 63.5 · recall 63.0% · prec 72.8% · TP 6 · FP 2 · FN 3
  glm-5-agentic-v1: F2 70.9 · F3 72.4 · recall 74.1% · prec 62.9% · TP 7 · FP 4 · FN 2
  glm-5.1-agentic-v1: F2 76.7 · F3 77.2 · recall 77.8% · prec 72.6% · TP 7 · FP 3 · FN 2
  grok-3-agentic-v1: F2 70.9 · F3 68.7 · recall 66.7% · prec 95.2% · TP 6 · FP 0 · FN 3
  grok-4.20-reasoning-agentic-v1: F2 61.0 · F3 58.1 · recall 55.6% · prec 100.0% · TP 5 · FP 0 · FN 4
  kimi-k2.5-agentic-v1: F2 65.1 · F3 65.8 · recall 66.7% · prec 64.1% · TP 6 · FP 4 · FN 3
  kimi-k2.6-agentic-v1: F2 71.5 · F3 70.9 · recall 70.4% · prec 76.4% · TP 6 · FP 2 · FN 3
  kolega-v0.0.1: F2 66.2 · F3 79.6 · recall 100.0% · prec 28.1% · TP 9 · FP 23 · FN 0
  qwen-3.5-397b-agentic-v1: F2 61.0 · F3 60.1 · recall 59.3% · prec 70.8% · TP 5 · FP 2 · FN 4
  semgrep: F2 48.1 · F3 51.5 · recall 55.6% · prec 31.2% · TP 5 · FP 11 · FN 4
  snyk: F2 59.5 · F3 57.5 · recall 55.6% · prec 83.3% · TP 5 · FP 1 · FN 4
  sonarqube: F2 37.5 · F3 35.3 · recall 33.3% · prec 75.0% · TP 3 · FP 1 · FN 6
intentionally-vulnerable-python-application
  claude-haiku-4-5-agentic-v1: F2 64.2 · F3 63.0 · recall 61.9% · prec 77.1% · TP 4 · FP 1 · FN 3
  claude-haiku-4-5-v1: F2 65.7 · F3 63.7 · recall 61.9% · prec 86.7% · TP 4 · FP 1 · FN 3
  claude-opus-4-6-agentic-v1: F2 73.7 · F3 72.5 · recall 71.4% · prec 87.5% · TP 5 · FP 1 · FN 2
  claude-opus-4-7-agentic-v1: F2 81.6 · F3 81.3 · recall 81.0% · prec 84.9% · TP 6 · FP 1 · FN 1
  claude-sonnet-4-6-agentic-v1: F2 60.6 · F3 58.8 · recall 57.1% · prec 80.0% · TP 4 · FP 1 · FN 3
  gemini-3.1-pro-agentic-v1: F2 73.5 · F3 72.5 · recall 71.4% · prec 83.3% · TP 5 · FP 1 · FN 2
  glm-5-agentic-v1: F2 61.6 · F3 62.9 · recall 64.3% · prec 52.8% · TP 4 · FP 4 · FN 2
  glm-5.1-agentic-v1: F2 67.8 · F3 69.5 · recall 71.4% · prec 58.4% · TP 5 · FP 4 · FN 2
  grok-4.20-reasoning-agentic-v1: F2 56.0 · F3 54.1 · recall 52.4% · prec 78.3% · TP 4 · FP 1 · FN 3
  kimi-k2.5-agentic-v1: F2 73.2 · F3 72.3 · recall 71.4% · prec 85.7% · TP 5 · FP 1 · FN 2
  kimi-k2.6-agentic-v1: F2 72.5 · F3 71.9 · recall 71.4% · prec 79.4% · TP 5 · FP 1 · FN 2
  kolega-v0.0.1: F2 58.8 · F3 69.8 · recall 85.7% · prec 26.1% · TP 6 · FP 17 · FN 1
  minimax-m2.7-agentic-v1: F2 60.6 · F3 58.8 · recall 57.1% · prec 82.2% · TP 4 · FP 1 · FN 3
  qwen-3.5-397b-agentic-v1: F2 60.0 · F3 58.5 · recall 57.1% · prec 75.6% · TP 4 · FP 1 · FN 3
  semgrep: F2 30.3 · F3 29.4 · recall 28.6% · prec 40.0% · TP 2 · FP 3 · FN 5
  snyk: F2 57.1 · F3 57.1 · recall 57.1% · prec 57.1% · TP 4 · FP 3 · FN 3
  sonarqube: F2 0.0 · F3 0.0 · recall 0.0% · prec 0.0% · TP 0 · FP 0 · FN 7
lets-be-bad-guys
  claude-haiku-4-5-agentic-v1: F2 57.2 · F3 54.9 · recall 52.8% · prec 86.7% · TP 13 · FP 2 · FN 11
  claude-haiku-4-5-v1: F2 32.5 · F3 31.5 · recall 30.6% · prec 44.7% · TP 7 · FP 10 · FN 17
  claude-opus-4-6-agentic-v1: F2 78.7 · F3 76.8 · recall 75.0% · prec 98.2% · TP 18 · FP 0 · FN 6
  claude-opus-4-7-agentic-v1: F2 52.1 · F3 50.3 · recall 48.6% · prec 74.0% · TP 12 · FP 4 · FN 12
  claude-sonnet-4-6-agentic-v1: F2 67.0 · F3 64.7 · recall 62.5% · prec 93.7% · TP 15 · FP 1 · FN 9
  gemini-3.1-pro-agentic-v1: F2 65.8 · F3 63.4 · recall 61.1% · prec 95.7% · TP 15 · FP 1 · FN 9
  glm-5-agentic-v1: F2 55.6 · F3 53.4 · recall 51.4% · prec 83.5% · TP 12 · FP 3 · FN 12
  glm-5.1-agentic-v1: F2 64.0 · F3 61.8 · recall 59.7% · prec 90.1% · TP 14 · FP 2 · FN 10
  grok-3-agentic-v1: F2 43.8 · F3 41.6 · recall 39.6% · prec 76.0% · TP 10 · FP 3 · FN 14
  grok-4.20-reasoning-agentic-v1: F2 27.9 · F3 25.5 · recall 23.6% · prec 100.0% · TP 6 · FP 0 · FN 18
  kimi-k2.5-agentic-v1: F2 59.3 · F3 58.1 · recall 57.0% · prec 71.9% · TP 14 · FP 6 · FN 10
  kimi-k2.6-agentic-v1: F2 67.1 · F3 64.7 · recall 62.5% · prec 95.7% · TP 15 · FP 1 · FN 9
  kolega-v0.0.1: F2 82.7 · F3 88.8 · recall 95.8% · prec 53.5% · TP 23 · FP 20 · FN 1
  minimax-m2.7-agentic-v1: F2 47.3 · F3 45.5 · recall 43.8% · prec 70.1% · TP 10 · FP 4 · FN 14
  qwen-3.5-397b-agentic-v1: F2 46.5 · F3 45.5 · recall 44.4% · prec 57.5% · TP 11 · FP 8 · FN 13
  semgrep: F2 39.8 · F3 38.6 · recall 37.5% · prec 52.9% · TP 9 · FP 8 · FN 15
  snyk: F2 37.7 · F3 35.4 · recall 33.3% · prec 80.0% · TP 8 · FP 2 · FN 16
  sonarqube: F2 38.1 · F3 35.6 · recall 33.3% · prec 88.9% · TP 8 · FP 1 · FN 16
owasp-web-playground
  claude-opus-4-7-agentic-v1: F2 60.1 · F3 59.3 · recall 58.6% · prec 70.1% · TP 17 · FP 8 · FN 12
  glm-5.1-agentic-v1: F2 60.9 · F3 59.8 · recall 58.6% · prec 72.5% · TP 17 · FP 6 · FN 12
  kimi-k2.6-agentic-v1: F2 68.8 · F3 67.1 · recall 65.5% · prec 86.2% · TP 19 · FP 3 · FN 10
  kolega-v0.0.1: F2 75.4 · F3 83.3 · recall 93.1% · prec 42.9% · TP 27 · FP 36 · FN 2
  semgrep: F2 13.3 · F3 16.2 · recall 20.7% · prec 5.5% · TP 6 · FP 104 · FN 23
  snyk: F2 8.5 · F3 10.6 · recall 13.8% · prec 3.4% · TP 4 · FP 114 · FN 25
  sonarqube: F2 0.0 · F3 0.0 · recall 0.0% · prec 0.0% · TP 0 · FP 0 · FN 29
pygoat
  claude-haiku-4-5-agentic-v1: F2 34.7 · F3 32.7 · recall 30.9% · prec 67.5% · TP 22 · FP 10 · FN 48
  claude-haiku-4-5-v1: F2 7.8 · F3 7.2 · recall 6.7% · prec 24.1% · TP 5 · FP 15 · FN 65
  claude-opus-4-6-agentic-v1: F2 52.8 · F3 51.0 · recall 49.3% · prec 74.5% · TP 34 · FP 12 · FN 36
  claude-opus-4-7-agentic-v1: F2 58.8 · F3 58.3 · recall 57.9% · prec 65.6% · TP 40 · FP 23 · FN 30
  claude-sonnet-4-6-agentic-v1: F2 49.9 · F3 48.0 · recall 46.2% · prec 74.4% · TP 32 · FP 11 · FN 38
  gemini-3.1-pro-agentic-v1: F2 44.3 · F3 42.3 · recall 40.5% · prec 72.9% · TP 28 · FP 11 · FN 42
  glm-5-agentic-v1: F2 38.4 · F3 36.7 · recall 35.2% · prec 73.7% · TP 25 · FP 14 · FN 45
  glm-5.1-agentic-v1: F2 59.0 · F3 57.7 · recall 56.4% · prec 71.8% · TP 40 · FP 16 · FN 30
  grok-3-agentic-v1: F2 9.2 · F3 8.3 · recall 7.6% · prec 49.4% · TP 5 · FP 4 · FN 65
  grok-4.20-reasoning-agentic-v1: F2 11.5 · F3 10.4 · recall 9.5% · prec 100.0% · TP 7 · FP 0 · FN 63
  kimi-k2.5-agentic-v1: F2 37.3 · F3 35.9 · recall 34.8% · prec 57.2% · TP 24 · FP 21 · FN 46
  kimi-k2.6-agentic-v1: F2 48.1 · F3 46.4 · recall 44.8% · prec 70.5% · TP 31 · FP 13 · FN 39
  kolega-v0.0.1: F2 57.7 · F3 62.7 · recall 68.6% · prec 35.3% · TP 48 · FP 88 · FN 22
  minimax-m2.7-agentic-v1: F2 41.6 · F3 39.8 · recall 38.1% · prec 66.4% · TP 27 · FP 13 · FN 43
  qwen-3.5-397b-agentic-v1: F2 41.0 · F3 40.5 · recall 40.0% · prec 46.2% · TP 28 · FP 32 · FN 42
  semgrep: F2 21.8 · F3 23.6 · recall 25.7% · prec 13.5% · TP 18 · FP 115 · FN 52
  snyk: F2 31.9 · F3 33.1 · recall 34.3% · prec 25.0% · TP 24 · FP 72 · FN 46
  sonarqube: F2 18.2 · F3 16.9 · recall 15.7% · prec 50.0% · TP 11 · FP 11 · FN 59
python-app
  claude-haiku-4-5-agentic-v1: F2 48.2 · F3 46.5 · recall 45.0% · prec 68.2% · TP 9 · FP 4 · FN 11
  claude-opus-4-6-agentic-v1: F2 83.6 · F3 83.5 · recall 83.3% · prec 84.7% · TP 17 · FP 3 · FN 3
  claude-opus-4-7-agentic-v1: F2 72.5 · F3 72.5 · recall 72.5% · prec 72.6% · TP 14 · FP 6 · FN 6
  claude-sonnet-4-6-agentic-v1: F2 73.9 · F3 72.8 · recall 71.7% · prec 84.3% · TP 14 · FP 3 · FN 6
  gemini-3.1-pro-agentic-v1: F2 67.9 · F3 66.4 · recall 65.0% · prec 83.2% · TP 13 · FP 3 · FN 7
  glm-5-agentic-v1: F2 65.7 · F3 64.5 · recall 63.3% · prec 78.0% · TP 13 · FP 4 · FN 7
  glm-5.1-agentic-v1: F2 70.0 · F3 70.0 · recall 70.0% · prec 70.0% · TP 14 · FP 6 · FN 6
  grok-3-agentic-v1: F2 33.9 · F3 31.8 · recall 30.0% · prec 70.9% · TP 6 · FP 2 · FN 14
  grok-4.20-reasoning-agentic-v1: F2 40.0 · F3 37.4 · recall 35.0% · prec 94.4% · TP 7 · FP 0 · FN 13
  kimi-k2.5-agentic-v1: F2 57.8 · F3 58.0 · recall 58.3% · prec 55.9% · TP 12 · FP 9 · FN 8
  kimi-k2.6-agentic-v1: F2 33.9 · F3 33.1 · recall 32.5% · prec 40.6% · TP 6 · FP 10 · FN 14
  kolega-v0.0.1: F2 48.9 · F3 55.8 · recall 65.0% · prec 24.5% · TP 13 · FP 40 · FN 7
  minimax-m2.7-agentic-v1: F2 46.1 · F3 44.7 · recall 43.3% · prec 63.9% · TP 9 · FP 5 · FN 11
  qwen-3.5-397b-agentic-v1: F2 35.3 · F3 34.3 · recall 33.3% · prec 46.4% · TP 7 · FP 7 · FN 13
  sonarqube: F2 23.0 · F3 21.4 · recall 20.0% · prec 57.1% · TP 4 · FP 3 · FN 16
python-insecure-app 48.1
python-insecure-app / claude-haiku-4-5-agentic-v1
F2 Score50.6
F3 Score48.1
Recall45.8%
Precision86.7%
TP4
FP1
FN4
56.4
python-insecure-app / claude-haiku-4-5-v1
F2 Score59.0
F3 Score56.4
Recall54.2%
Precision94.4%
TP4
FP0
FN4
76.9
python-insecure-app / claude-opus-4-6-agentic-v1
F2 Score78.9
F3 Score76.9
Recall75.0%
Precision100.0%
TP6
FP0
FN2
56.2
python-insecure-app / claude-opus-4-7-agentic-v1
F2 Score58.5
F3 Score56.2
Recall54.2%
Precision87.8%
TP4
FP1
FN4
39.5
python-insecure-app / gemini-3.1-pro-agentic-v1
F2 Score41.7
F3 Score39.5
Recall37.5%
Precision75.0%
TP3
FP1
FN5
52.6
python-insecure-app / glm-5-agentic-v1
F2 Score55.6
F3 Score52.6
Recall50.0%
Precision100.0%
TP4
FP0
FN4
82.2
python-insecure-app / glm-5.1-agentic-v1
F2 Score83.2
F3 Score82.2
Recall81.2%
Precision93.8%
TP6
FP0
FN2
44.2
python-insecure-app / grok-3-agentic-v1
F2 Score47.1
F3 Score44.2
Recall41.7%
Precision100.0%
TP3
FP0
FN5
44.0
python-insecure-app / grok-4.20-reasoning-agentic-v1
F2 Score46.6
F3 Score44.0
Recall41.7%
Precision100.0%
TP3
FP0
FN5
50.8
python-insecure-app / kimi-k2.5-agentic-v1
F2 Score51.8
F3 Score50.8
Recall50.0%
Precision62.4%
TP4
FP3
FN4
55.8
python-insecure-app / kimi-k2.6-agentic-v1
F2 Score57.5
F3 Score55.8
Recall54.2%
Precision76.7%
TP4
FP1
FN4
72.2
python-insecure-app / kolega-v0.0.1
F2 Score61.4
F3 Score72.2
Recall87.5%
Precision28.0%
TP7
FP18
FN1
39.8
python-insecure-app / minimax-m2.7-agentic-v1
F2 Score42.3
F3 Score39.8
Recall37.5%
Precision87.5%
TP3
FP0
FN5
47.6
python-insecure-app / qwen-3.5-397b-agentic-v1
F2 Score49.6
F3 Score47.6
Recall45.8%
Precision73.3%
TP4
FP1
FN4
0.0
python-insecure-app / semgrep
F2 Score0.0
F3 Score0.0
Recall0.0%
Precision0.0%
TP0
FP0
FN8
13.3
python-insecure-app / snyk
F2 Score14.3
F3 Score13.3
Recall12.5%
Precision33.3%
TP1
FP2
FN7
0.0
python-insecure-app / sonarqube
F2 Score0.0
F3 Score0.0
Recall0.0%
Precision0.0%
TP0
FP0
FN8
pythonssti 50.9
pythonssti / claude-haiku-4-5-agentic-v1
F2 Score51.9
F3 Score50.9
Recall50.0%
Precision66.7%
TP1
FP1
FN1
50.9
pythonssti / claude-haiku-4-5-v1
F2 Score51.9
F3 Score50.9
Recall50.0%
Precision66.7%
TP1
FP1
FN1
100.0
pythonssti / claude-opus-4-7-agentic-v1
F2 Score100.0
F3 Score100.0
Recall100.0%
Precision100.0%
TP2
FP0
FN0
52.6
pythonssti / claude-sonnet-4-6-agentic-v1
F2 Score55.6
F3 Score52.6
Recall50.0%
Precision100.0%
TP1
FP0
FN1
51.7
pythonssti / gemini-3.1-pro-agentic-v1
F2 Score53.7
F3 Score51.7
Recall50.0%
Precision83.3%
TP1
FP0
FN1
52.6
pythonssti / glm-5-agentic-v1
F2 Score55.6
F3 Score52.6
Recall50.0%
Precision100.0%
TP1
FP0
FN1
100.0
pythonssti / glm-5.1-agentic-v1
F2 Score100.0
F3 Score100.0
Recall100.0%
Precision100.0%
TP2
FP0
FN0
52.6
pythonssti / grok-3-agentic-v1
F2 Score55.6
F3 Score52.6
Recall50.0%
Precision100.0%
TP1
FP0
FN1
52.6
pythonssti / grok-4.20-reasoning-agentic-v1
F2 Score55.6
F3 Score52.6
Recall50.0%
Precision100.0%
TP1
FP0
FN1
52.6
pythonssti / kimi-k2.5-agentic-v1
F2 Score55.6
F3 Score52.6
Recall50.0%
Precision100.0%
TP1
FP0
FN1
84.2
pythonssti / kimi-k2.6-agentic-v1
F2 Score85.2
F3 Score84.2
Recall83.3%
Precision100.0%
TP2
FP0
FN0
66.7
pythonssti / kolega-v0.0.1
F2 Score50.0
F3 Score66.7
Recall100.0%
Precision16.7%
TP2
FP10
FN0
52.6
pythonssti / minimax-m2.7-agentic-v1
F2 Score55.6
F3 Score52.6
Recall50.0%
Precision100.0%
TP1
FP0
FN1
52.6
pythonssti / qwen-3.5-397b-agentic-v1
F2 Score55.6
F3 Score52.6
Recall50.0%
Precision100.0%
TP1
FP0
FN1
52.6
pythonssti / semgrep
F2 Score55.6
F3 Score52.6
Recall50.0%
Precision100.0%
TP1
FP0
FN1
50.0
pythonssti / snyk
F2 Score50.0
F3 Score50.0
Recall50.0%
Precision50.0%
TP1
FP1
FN1
0.0
pythonssti / sonarqube
F2 Score0.0
F3 Score0.0
Recall0.0%
Precision0.0%
TP0
FP0
FN2
threatbyte 33.6
threatbyte / claude-haiku-4-5-agentic-v1
F2 Score35.4
F3 Score33.6
Recall31.9%
Precision62.7%
TP8
FP5
FN16
21.3
threatbyte / claude-haiku-4-5-v1
F2 Score21.8
F3 Score21.3
Recall20.8%
Precision27.2%
TP5
FP13
FN19
60.4
threatbyte / claude-opus-4-6-agentic-v1
F2 Score61.0
F3 Score60.4
Recall59.7%
Precision67.3%
TP14
FP7
FN10
54.4
threatbyte / claude-opus-4-7-agentic-v1
F2 Score56.2
F3 Score54.4
Recall52.8%
Precision75.7%
TP13
FP4
FN11
54.5
threatbyte / claude-sonnet-4-6-agentic-v1
F2 Score56.3
F3 Score54.5
Recall52.8%
Precision77.5%
TP13
FP4
FN11
42.9
threatbyte / gemini-3.1-pro-agentic-v1
F2 Score44.2
F3 Score42.9
Recall41.7%
Precision59.0%
TP10
FP7
FN14
46.2
threatbyte / glm-5-agentic-v1
F2 Score48.0
F3 Score46.2
Recall44.4%
Precision73.6%
TP11
FP4
FN13
58.2
threatbyte / glm-5.1-agentic-v1
F2 Score59.5
F3 Score58.2
Recall56.9%
Precision73.1%
TP14
FP5
FN10
25.5
threatbyte / grok-3-agentic-v1
F2 Score27.8
F3 Score25.5
Recall23.6%
Precision94.4%
TP6
FP0
FN18
21.0
threatbyte / grok-4.20-reasoning-agentic-v1
F2 Score22.9
F3 Score21.0
Recall19.4%
Precision82.2%
TP5
FP1
FN19
45.3
threatbyte / kimi-k2.5-agentic-v1
F2 Score46.1
F3 Score45.3
Recall44.5%
Precision54.4%
TP11
FP9
FN13
59.8
threatbyte / kimi-k2.6-agentic-v1
F2 Score61.4
F3 Score59.8
Recall58.3%
Precision77.8%
TP14
FP4
FN10
79.7
threatbyte / kolega-v0.0.1
F2 Score70.5
F3 Score79.7
Recall91.7%
Precision36.7%
TP22
FP38
FN2
30.8
threatbyte / minimax-m2.7-agentic-v1
F2 Score32.7
F3 Score30.8
Recall29.2%
Precision63.6%
TP7
FP4
FN17
36.3
threatbyte / qwen-3.5-397b-agentic-v1
F2 Score37.9
F3 Score36.3
Recall34.7%
Precision62.2%
TP8
FP5
FN16
8.6
threatbyte / semgrep
F2 Score8.8
F3 Score8.6
Recall8.3%
Precision11.8%
TP2
FP15
FN22
13.4
threatbyte / snyk
F2 Score14.4
F3 Score13.4
Recall12.5%
Precision37.5%
TP3
FP5
FN21
0.0
threatbyte / sonarqube
F2 Score0.0
F3 Score0.0
Recall0.0%
Precision0.0%
TP0
FP0
FN24
vampi 54.4
vampi / claude-haiku-4-5-agentic-v1
F2 Score55.0
F3 Score54.4
Recall53.8%
Precision61.8%
TP7
FP4
FN6
43.4
vampi / claude-haiku-4-5-v1
F2 Score43.3
F3 Score43.4
Recall43.6%
Precision43.3%
TP6
FP8
FN7
68.5
vampi / claude-opus-4-6-agentic-v1
F2 Score67.8
F3 Score68.5
Recall69.2%
Precision62.7%
TP9
FP5
FN4
72.2
vampi / claude-opus-4-7-agentic-v1
F2 Score72.5
F3 Score72.2
Recall71.8%
Precision75.6%
TP9
FP3
FN4
83.3
vampi / claude-sonnet-4-6-agentic-v1
F2 Score82.1
F3 Score83.3
Recall84.6%
Precision73.3%
TP11
FP4
FN2
75.7
vampi / gemini-3.1-pro-agentic-v1
F2 Score74.6
F3 Score75.7
Recall76.9%
Precision69.3%
TP10
FP5
FN3
51.8
vampi / grok-3-agentic-v1
F2 Score53.8
F3 Score51.8
Recall50.0%
Precision77.1%
TP6
FP2
FN6
40.9
vampi / grok-4.20-reasoning-agentic-v1
F2 Score43.5
F3 Score40.9
Recall38.5%
Precision94.4%
TP5
FP0
FN8
70.7
vampi / kimi-k2.5-agentic-v1
F2 Score69.8
F3 Score70.7
Recall71.8%
Precision70.4%
TP9
FP5
FN4
69.8
vampi / kimi-k2.6-agentic-v1
F2 Score70.5
F3 Score69.8
Recall69.2%
Precision76.2%
TP9
FP3
FN4
82.1
vampi / kolega-v0.0.1
F2 Score79.7
F3 Score82.1
Recall84.6%
Precision64.7%
TP11
FP6
FN2
67.0
vampi / minimax-m2.7-agentic-v1
F2 Score67.3
F3 Score67.0
Recall66.7%
Precision71.2%
TP9
FP4
FN4
48.3
vampi / qwen-3.5-397b-agentic-v1
F2 Score46.8
F3 Score48.3
Recall50.0%
Precision37.1%
TP6
FP11
FN6
0.0
vampi / semgrep
F2 Score0.0
F3 Score0.0
Recall0.0%
Precision0.0%
TP0
FP0
FN13
0.0
vampi / snyk
F2 Score0.0
F3 Score0.0
Recall0.0%
Precision0.0%
TP0
FP3
FN13
0.0
vampi / sonarqube
F2 Score0.0
F3 Score0.0
Recall0.0%
Precision0.0%
TP0
FP0
FN13
vfapi 56.6
vfapi / claude-haiku-4-5-agentic-v1
F2 Score57.7
F3 Score56.6
Recall55.6%
Precision69.4%
TP5
FP2
FN4
17.2
vfapi / claude-haiku-4-5-v1
F2 Score16.1
F3 Score17.2
Recall18.5%
Precision10.6%
TP2
FP15
FN7
87.0
vfapi / claude-opus-4-6-agentic-v1
F2 Score85.1
F3 Score87.0
Recall88.9%
Precision72.7%
TP8
FP3
FN1
86.3
vfapi / claude-opus-4-7-agentic-v1
F2 Score83.9
F3 Score86.3
Recall88.9%
Precision68.7%
TP8
FP4
FN1
82.4
vfapi / claude-sonnet-4-6-agentic-v1
F2 Score79.8
F3 Score82.4
Recall85.2%
Precision63.9%
TP8
FP4
FN1
67.0
vfapi / gemini-3.1-pro-agentic-v1
F2 Score67.2
F3 Score67.0
Recall66.7%
Precision69.9%
TP6
FP3
FN3
85.8
vfapi / glm-5-agentic-v1
F2 Score83.1
F3 Score85.8
Recall88.9%
Precision70.2%
TP8
FP4
FN1
93.0
vfapi / glm-5.1-agentic-v1
F2 Score90.0
F3 Score93.0
Recall96.3%
Precision72.8%
TP9
FP4
FN0
68.9
vfapi / grok-3-agentic-v1
F2 Score71.3
F3 Score68.9
Recall66.7%
Precision100.0%
TP6
FP0
FN3
57.9
vfapi / grok-4.20-reasoning-agentic-v1
F2 Score60.5
F3 Score57.9
Recall55.6%
Precision94.4%
TP5
FP0
FN4
86.9
vfapi / kimi-k2.5-agentic-v1
F2 Score79.2
F3 Score86.9
Recall96.3%
Precision46.7%
TP9
FP10
FN0
73.5
vfapi / kimi-k2.6-agentic-v1
F2 Score75.0
F3 Score73.5
Recall72.2%
Precision100.0%
TP6
FP0
FN2
81.1
vfapi / kolega-v0.0.1
F2 Score68.2
F3 Score81.1
Recall100.0%
Precision30.0%
TP9
FP21
FN0
60.3
vfapi / minimax-m2.7-agentic-v1
F2 Score59.9
F3 Score60.3
Recall61.1%
Precision70.0%
TP6
FP4
FN4
68.9
vfapi / qwen-3.5-397b-agentic-v1
F2 Score64.6
F3 Score68.9
Recall74.1%
Precision45.0%
TP7
FP9
FN2
12.2
vfapi / semgrep
F2 Score13.5
F3 Score12.2
Recall11.1%
Precision100.0%
TP1
FP0
FN8
11.8
vfapi / snyk
F2 Score12.5
F3 Score11.8
Recall11.1%
Precision25.0%
TP1
FP3
FN8
0.0
vfapi / sonarqube
F2 Score0.0
F3 Score0.0
Recall0.0%
Precision0.0%
TP0
FP0
FN9
vulnerable-api 47.5
vulnerable-api / claude-haiku-4-5-agentic-v1
F2 Score49.9
F3 Score47.5
Recall45.2%
Precision87.8%
TP6
FP1
FN8
61.1
vulnerable-api / claude-haiku-4-5-v1
F2 Score62.8
F3 Score61.1
Recall59.5%
Precision80.6%
TP8
FP2
FN6
72.3
vulnerable-api / claude-opus-4-6-agentic-v1
F2 Score73.2
F3 Score72.3
Recall71.4%
Precision81.2%
TP10
FP2
FN4
70.4
vulnerable-api / claude-opus-4-7-agentic-v1
F2 Score71.9
F3 Score70.4
Recall69.0%
Precision86.7%
TP10
FP2
FN4
69.4
vulnerable-api / claude-sonnet-4-6-agentic-v1
F2 Score69.7
F3 Score69.4
Recall69.0%
Precision73.6%
TP10
FP4
FN4
68.0
vulnerable-api / gemini-3.1-pro-agentic-v1
F2 Score69.3
F3 Score68.0
Recall66.7%
Precision82.3%
TP9
FP2
FN5
58.7
vulnerable-api / glm-5-agentic-v1
F2 Score60.3
F3 Score58.7
Recall57.1%
Precision77.6%
TP8
FP2
FN6
77.7
vulnerable-api / glm-5.1-agentic-v1
F2 Score76.9
F3 Score77.7
Recall78.6%
Precision71.5%
TP11
FP4
FN3
45.3
vulnerable-api / grok-3-agentic-v1
F2 Score47.9
F3 Score45.3
Recall42.9%
Precision91.7%
TP6
FP1
FN8
43.1
vulnerable-api / grok-4.20-reasoning-agentic-v1
F2 Score45.9
F3 Score43.1
Recall40.5%
Precision100.0%
TP6
FP0
FN8
53.7
vulnerable-api / kimi-k2.5-agentic-v1
F2 Score55.3
F3 Score53.7
Recall52.4%
Precision74.6%
TP7
FP3
FN7
62.8
vulnerable-api / kimi-k2.6-agentic-v1
F2 Score64.8
F3 Score62.8
Recall60.7%
Precision89.4%
TP8
FP1
FN6
80.2
vulnerable-api / kolega-v0.0.1
F2 Score70.7
F3 Score80.2
Recall92.9%
Precision36.1%
TP13
FP23
FN1
57.4
vulnerable-api / minimax-m2.7-agentic-v1
F2 Score57.8
F3 Score57.4
Recall57.1%
Precision68.5%
TP8
FP5
FN6
52.1
vulnerable-api / qwen-3.5-397b-agentic-v1
F2 Score54.4
F3 Score52.1
Recall50.0%
Precision85.8%
TP7
FP1
FN7
29.4
vulnerable-api / semgrep
F2 Score30.3
F3 Score29.4
Recall28.6%
Precision40.0%
TP4
FP6
FN10
15.5
vulnerable-api / snyk
F2 Score16.9
F3 Score15.5
Recall14.3%
Precision66.7%
TP2
FP1
FN12
0.0
vulnerable-api / sonarqube
F2 Score0.0
F3 Score0.0
Recall0.0%
Precision0.0%
TP0
FP0
FN14
vulnerable-flask-app 50.1
vulnerable-flask-app / claude-haiku-4-5-agentic-v1
F2 Score52.0
F3 Score50.1
Recall48.3%
Precision75.9%
TP10
FP3
FN10
32.3
vulnerable-flask-app / claude-haiku-4-5-v1
F2 Score33.0
F3 Score32.3
Recall31.7%
Precision39.9%
TP6
FP10
FN14
68.8
vulnerable-flask-app / claude-opus-4-6-agentic-v1
F2 Score69.4
F3 Score68.8
Recall68.3%
Precision74.7%
TP14
FP5
FN6
52.8
vulnerable-flask-app / claude-opus-4-7-agentic-v1
F2 Score54.0
F3 Score52.8
Recall51.7%
Precision66.9%
TP10
FP5
FN10
63.0
vulnerable-flask-app / claude-sonnet-4-6-agentic-v1
F2 Score64.4
F3 Score63.0
Recall61.7%
Precision80.5%
TP12
FP3
FN8
56.9
vulnerable-flask-app / gemini-3.1-pro-agentic-v1
F2 Score58.9
F3 Score56.9
Recall55.0%
Precision82.6%
TP11
FP2
FN9
66.9
vulnerable-flask-app / glm-5-agentic-v1
F2 Score69.0
F3 Score66.9
Recall65.0%
Precision92.4%
TP13
FP1
FN7
63.3
vulnerable-flask-app / glm-5.1-agentic-v1
F2 Score64.1
F3 Score63.3
Recall62.5%
Precision71.6%
TP12
FP5
FN8
24.9
vulnerable-flask-app / grok-3-agentic-v1
F2 Score26.6
F3 Score24.9
Recall23.3%
Precision62.5%
TP5
FP3
FN15
26.9
vulnerable-flask-app / grok-4.20-reasoning-agentic-v1
F2 Score29.2
F3 Score26.9
Recall25.0%
Precision88.9%
TP5
FP1
FN15
60.9
vulnerable-flask-app / kimi-k2.5-agentic-v1
F2 Score61.8
F3 Score60.9
Recall60.0%
Precision70.8%
TP12
FP5
FN8
57.8
vulnerable-flask-app / kimi-k2.6-agentic-v1
F2 Score57.6
F3 Score57.8
Recall58.3%
Precision66.4%
TP12
FP8
FN8
77.3
vulnerable-flask-app / kolega-v0.0.1
F2 Score67.7
F3 Score77.3
Recall90.0%
Precision34.0%
TP18
FP35
FN2
37.4
vulnerable-flask-app / minimax-m2.7-agentic-v1
F2 Score38.1
F3 Score37.4
Recall36.7%
Precision46.1%
TP7
FP8
FN13
42.8
vulnerable-flask-app / qwen-3.5-397b-agentic-v1
F2 Score44.0
F3 Score42.8
Recall41.7%
Precision56.7%
TP8
FP6
FN12
15.5
vulnerable-flask-app / semgrep
F2 Score16.0
F3 Score15.5
Recall15.0%
Precision21.4%
TP3
FP11
FN17
25.9
vulnerable-flask-app / snyk
F2 Score26.9
F3 Score25.9
Recall25.0%
Precision38.5%
TP5
FP8
FN15
0.0
vulnerable-flask-app / sonarqube
F2 Score0.0
F3 Score0.0
Recall0.0%
Precision0.0%
TP0
FP0
FN20
vulnerable-python-apps 69.9
vulnerable-python-apps / claude-opus-4-7-agentic-v1
F2 Score70.2
F3 Score69.9
Recall69.7%
Precision73.6%
TP15
FP5
FN7
55.8
vulnerable-python-apps / glm-5.1-agentic-v1
F2 Score57.0
F3 Score55.8
Recall54.5%
Precision70.9%
TP12
FP5
FN10
39.6
vulnerable-python-apps / kimi-k2.6-agentic-v1
F2 Score41.5
F3 Score39.6
Recall37.9%
Precision72.4%
TP8
FP3
FN14
65.8
vulnerable-python-apps / kolega-v0.0.1
F2 Score60.2
F3 Score65.8
Recall72.7%
Precision35.6%
TP16
FP29
FN6
9.5
vulnerable-python-apps / semgrep
F2 Score10.0
F3 Score9.5
Recall9.1%
Precision16.7%
TP2
FP10
FN20
9.6
vulnerable-python-apps / snyk
F2 Score10.1
F3 Score9.6
Recall9.1%
Precision18.2%
TP2
FP9
FN20
19.2
vulnerable-python-apps / sonarqube
F2 Score20.4
F3 Score19.2
Recall18.2%
Precision40.0%
TP4
FP6
FN18
vulnerable-tornado-app 51.7
vulnerable-tornado-app / claude-haiku-4-5-agentic-v1
F2 Score53.6
F3 Score51.7
Recall50.0%
Precision76.8%
TP7
FP2
FN7
29.8
vulnerable-tornado-app / claude-haiku-4-5-v1
F2 Score31.1
F3 Score29.8
Recall28.6%
Precision48.6%
TP4
FP4
FN10
65.0
vulnerable-tornado-app / claude-opus-4-6-agentic-v1
F2 Score65.7
F3 Score65.0
Recall64.3%
Precision72.1%
TP9
FP4
FN5
72.3
vulnerable-tornado-app / claude-opus-4-7-agentic-v1
F2 Score73.2
F3 Score72.3
Recall71.4%
Precision82.8%
TP10
FP2
FN4
57.6
vulnerable-tornado-app / claude-sonnet-4-6-agentic-v1
F2 Score58.0
F3 Score57.6
Recall57.1%
Precision61.5%
TP8
FP5
FN6
56.9
vulnerable-tornado-app / gemini-3.1-pro-agentic-v1
F2 Score59.3
F3 Score56.9
Recall54.8%
Precision90.0%
TP8
FP1
FN6
63.0
vulnerable-tornado-app / glm-5-agentic-v1
F2 Score64.1
F3 Score63.0
Recall61.9%
Precision77.0%
TP9
FP3
FN5
74.7
vulnerable-tornado-app / glm-5.1-agentic-v1
F2 Score75.6
F3 Score74.7
Recall73.8%
Precision84.7%
TP10
FP2
FN4
30.8
vulnerable-tornado-app / grok-3-agentic-v1
F2 Score33.3
F3 Score30.8
Recall28.6%
Precision100.0%
TP4
FP0
FN10
40.6
vulnerable-tornado-app / grok-4.20-reasoning-agentic-v1
F2 Score43.5
F3 Score40.6
Recall38.1%
Precision100.0%
TP5
FP0
FN9
53.6
vulnerable-tornado-app / kimi-k2.5-agentic-v1
F2 Score54.9
F3 Score53.6
Recall52.4%
Precision70.9%
TP7
FP3
FN7
54.2
vulnerable-tornado-app / kimi-k2.6-agentic-v1
F2 Score56.1
F3 Score54.2
Recall52.4%
Precision78.5%
TP7
FP2
FN7
88.1
vulnerable-tornado-app / kolega-v0.0.1
F2 Score78.7
F3 Score88.1
Recall100.0%
Precision42.4%
TP14
FP19
FN0
44.4
vulnerable-tornado-app / minimax-m2.7-agentic-v1
F2 Score46.2
F3 Score44.4
Recall42.9%
Precision66.7%
TP6
FP3
FN8
46.8
vulnerable-tornado-app / qwen-3.5-397b-agentic-v1
F2 Score48.4
F3 Score46.8
Recall45.2%
Precision68.0%
TP6
FP3
FN8
7.4
vulnerable-tornado-app / semgrep
F2 Score7.7
F3 Score7.4
Recall7.1%
Precision11.1%
TP1
FP8
FN13
0.0
vulnerable-tornado-app / snyk
F2 Score0.0
F3 Score0.0
Recall0.0%
Precision0.0%
TP0
FP0
FN14
0.0
vulnerable-tornado-app / sonarqube
F2 Score0.0
F3 Score0.0
Recall0.0%
Precision0.0%
TP0
FP0
FN14
vulnpy 48.5
vulnpy / claude-haiku-4-5-agentic-v1
F2 Score50.1
F3 Score48.5
Recall47.0%
Precision88.4%
TP37
FP7
FN41
46.6
vulnpy / claude-haiku-4-5-v1
F2 Score49.0
F3 Score46.6
Recall44.4%
Precision83.3%
TP35
FP7
FN43
72.4
vulnpy / claude-opus-4-6-agentic-v1
F2 Score73.6
F3 Score72.4
Recall71.4%
Precision84.6%
TP56
FP10
FN22
69.4
vulnpy / claude-sonnet-4-6-agentic-v1
F2 Score71.3
F3 Score69.4
Recall67.5%
Precision92.4%
TP53
FP5
FN25
86.9
vulnpy / gemini-3.1-pro-agentic-v1
F2 Score87.4
F3 Score86.9
Recall86.3%
Precision92.8%
TP67
FP6
FN11
71.8
vulnpy / glm-5-agentic-v1
F2 Score73.6
F3 Score71.8
Recall70.1%
Precision93.3%
TP55
FP4
FN23
63.1
vulnpy / glm-5.1-agentic-v1
F2 Score64.7
F3 Score63.1
Recall61.5%
Precision81.6%
TP48
FP11
FN30
5.6
vulnpy / grok-3-agentic-v1
F2 Score6.2
F3 Score5.6
Recall5.1%
Precision66.7%
TP4
FP2
FN74
34.7
vulnpy / grok-4.20-reasoning-agentic-v1
F2 Score35.2
F3 Score34.7
Recall34.2%
Precision94.2%
TP27
FP5
FN51
69.7
vulnpy / kimi-k2.5-agentic-v1
F2 Score71.6
F3 Score69.7
Recall68.0%
Precision91.4%
TP53
FP5
FN25
87.8
vulnpy / kimi-k2.6-agentic-v1
F2 Score87.3
F3 Score87.8
Recall88.5%
Precision84.3%
TP69
FP14
FN9
62.6
vulnpy / kolega-v0.0.1
F2 Score62.3
F3 Score62.6
Recall62.8%
Precision60.5%
TP49
FP32
FN29
63.0
vulnpy / minimax-m2.7-agentic-v1
F2 Score65.0
F3 Score63.0
Recall61.1%
Precision89.7%
TP48
FP5
FN30
56.7
vulnpy / qwen-3.5-397b-agentic-v1
F2 Score59.2
F3 Score56.7
Recall54.5%
Precision90.8%
TP42
FP4
FN36
17.6
vulnpy / semgrep
F2 Score18.7
F3 Score17.6
Recall16.7%
Precision37.1%
TP13
FP22
FN65
11.2
vulnpy / snyk
F2 Score12.4
F3 Score11.2
Recall10.3%
Precision72.7%
TP8
FP3
FN70
7.1
vulnpy / sonarqube
F2 Score7.9
F3 Score7.1
Recall6.4%
Precision100.0%
TP5
FP0
FN73
vulpy 19.4
vulpy / claude-haiku-4-5-agentic-v1
F2 Score21.2
F3 Score19.4
Recall17.9%
Precision80.1%
TP10
FP2
FN44
24.8
vulpy / claude-haiku-4-5-v1
F2 Score26.4
F3 Score24.8
Recall23.5%
Precision53.2%
TP13
FP11
FN41
50.5
vulpy / claude-opus-4-6-agentic-v1
F2 Score53.2
F3 Score50.5
Recall48.1%
Precision91.6%
TP26
FP2
FN28
31.3
vulpy / claude-opus-4-7-agentic-v1
F2 Score33.1
F3 Score31.3
Recall29.6%
Precision61.5%
TP16
FP10
FN38
34.4
vulpy / claude-sonnet-4-6-agentic-v1
F2 Score36.3
F3 Score34.4
Recall32.7%
Precision65.0%
TP18
FP10
FN36
36.6
vulpy / gemini-3.1-pro-agentic-v1
F2 Score38.8
F3 Score36.6
Recall34.6%
Precision77.4%
TP19
FP5
FN35
27.8
vulpy / glm-5-agentic-v1
F2 Score30.0
F3 Score27.8
Recall25.9%
Precision81.2%
TP14
FP3
FN40
35.8
vulpy / glm-5.1-agentic-v1
F2 Score37.8
F3 Score35.8
Recall34.0%
Precision70.5%
TP18
FP8
FN36
10.2
vulpy / grok-3-agentic-v1
F2 Score11.2
F3 Score10.2
Recall9.3%
Precision86.7%
TP5
FP1
FN49
9.5
vulpy / grok-4.20-reasoning-agentic-v1
F2 Score10.5
F3 Score9.5
Recall8.6%
Precision86.1%
TP5
FP1
FN49
25.0
vulpy / kimi-k2.5-agentic-v1
F2 Score26.8
F3 Score25.0
Recall23.5%
Precision65.6%
TP13
FP7
FN41
46.7
vulpy / kimi-k2.6-agentic-v1
F2 Score49.2
F3 Score46.7
Recall44.5%
Precision87.0%
TP24
FP4
FN30
78.5
vulpy / kolega-v0.0.1
F2 Score72.8
F3 Score78.5
Recall85.2%
Precision46.0%
TP46
FP54
FN8
37.4
vulpy / minimax-m2.7-agentic-v1
F2 Score39.9
F3 Score37.4
Recall35.2%
Precision86.4%
TP19
FP3
FN35
19.1
vulpy / qwen-3.5-397b-agentic-v1
F2 Score20.5
F3 Score19.1
Recall17.9%
Precision50.3%
TP10
FP10
FN44
22.6
vulpy / semgrep
F2 Score23.0
F3 Score22.6
Recall22.2%
Precision26.7%
TP12
FP33
FN42
13.8
vulpy / snyk
F2 Score14.8
F3 Score13.8
Recall13.0%
Precision35.0%
TP7
FP13
FN47
10.1
vulpy / sonarqube
F2 Score11.1
F3 Score10.1
Recall9.3%
Precision55.6%
TP5
FP4
FN49
AVERAGE (strict) 37.2
Average (strict) / claude-haiku-4-5-agentic-v1
Repos scored24 / 26
F3 (strict)37.2
F2 (strict)39.4
Recall35.2%
Precision74.8%
TP / FP / FN238 / 80 / 438
25.7
Average (strict) / claude-haiku-4-5-v1
Repos scored23 / 26
F3 (strict)25.7
F2 (strict)27.0
Recall24.4%
Precision47.7%
TP / FP / FN165 / 181 / 511
47.7
Average (strict) / claude-opus-4-6-agentic-v1
Repos scored19 / 26
F3 (strict)47.7
F2 (strict)49.9
Recall45.6%
Precision79.0%
TP / FP / FN309 / 82 / 368
48.6
Average (strict) / claude-opus-4-7-agentic-v1
Repos scored25 / 26
F3 (strict)48.6
F2 (strict)50.3
Recall46.9%
Precision71.2%
TP / FP / FN317 / 128 / 359
51.7
Average (strict) / claude-sonnet-4-6-agentic-v1
Repos scored23 / 26
F3 (strict)51.7
F2 (strict)53.8
Recall49.9%
Precision78.5%
TP / FP / FN337 / 92 / 339
51.0
Average (strict) / gemini-3.1-pro-agentic-v1
Repos scored24 / 26
F3 (strict)51.0
F2 (strict)53.0
Recall49.1%
Precision77.4%
TP / FP / FN332 / 97 / 344
45.8
Average (strict) / glm-5-agentic-v1
Repos scored22 / 26
F3 (strict)45.8
F2 (strict)47.8
Recall43.9%
Precision75.3%
TP / FP / FN296 / 97 / 379
58.2
Average (strict) / glm-5.1-agentic-v1
Repos scored25 / 26
F3 (strict)58.2
F2 (strict)59.6
Recall56.9%
Precision73.4%
TP / FP / FN386 / 140 / 292
21.3
Average (strict) / grok-3-agentic-v1
Repos scored21 / 26
F3 (strict)21.3
F2 (strict)23.3
Recall19.7%
Precision83.7%
TP / FP / FN133 / 26 / 542
28.4
Average (strict) / grok-4.20-reasoning-agentic-v1
Repos scored24 / 26
F3 (strict)28.4
F2 (strict)30.7
Recall26.3%
Precision92.7%
TP / FP / FN178 / 14 / 498
46.6
Average (strict) / kimi-k2.5-agentic-v1
Repos scored24 / 26
F3 (strict)46.6
F2 (strict)48.3
Recall45.0%
Precision68.3%
TP / FP / FN304 / 141 / 372
54.6
Average (strict) / kimi-k2.6-agentic-v1
Repos scored25 / 26
F3 (strict)54.6
F2 (strict)56.4
Recall52.9%
Precision77.1%
TP / FP / FN357 / 106 / 318
73.0
Average (strict) / kolega-v0.0.1
Repos scored26 / 26
F3 (strict)73.0
F2 (strict)66.5
Recall80.9%
Precision38.8%
TP / FP / FN547 / 862 / 129
39.0
Average (strict) / minimax-m2.7-agentic-v1
Repos scored22 / 26
F3 (strict)39.0
F2 (strict)41.0
Recall37.1%
Precision70.7%
TP / FP / FN251 / 104 / 425
38.1
Average (strict) / qwen-3.5-397b-agentic-v1
Repos scored24 / 26
F3 (strict)38.1
F2 (strict)39.8
Recall36.6%
Precision61.8%
TP / FP / FN247 / 153 / 428
17.7
Average (strict) / semgrep
Repos scored25 / 26
F3 (strict)17.7
F2 (strict)18.0
Recall17.5%
Precision20.5%
TP / FP / FN118 / 457 / 558
17.4
Average (strict) / snyk
Repos scored25 / 26
F3 (strict)17.4
F2 (strict)18.2
Recall16.7%
Precision28.2%
TP / FP / FN113 / 288 / 563
7.1
Average (strict) / sonarqube
Repos scored26 / 26
F3 (strict)7.1
F2 (strict)7.9
Recall6.5%
Precision61.1%
TP / FP / FN44 / 28 / 632
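The aggregate cards above can be cross-checked from their TP / FP / FN totals. The sketch below is a minimal illustration, not the benchmark's own scoring code: the function name `f_beta` and its signature are assumptions, and it uses the standard F-beta definition on a 0–100 scale. For the totals 547 / 862 / 129 it reproduces the listed F3 of 73.0 and F2 of 66.5, and for 44 / 28 / 632 it reproduces 7.1 and 7.9. Cards for the LLM scanners may not reproduce exactly from their rounded counts, since those figures appear to be averaged over several runs.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float) -> float:
    """Return the F-beta score on a 0-100 scale from raw counts.

    Hypothetical helper for illustration; assumes the standard definition
    F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    """
    # Guard against empty denominators (e.g. a scanner that reported nothing).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return 100.0 * (1 + b2) * precision * recall / (b2 * precision + recall)


# Totals taken from two of the "Average (strict)" cards above.
print(round(f_beta(547, 862, 129, beta=3), 1))  # F3 for 547 / 862 / 129
print(round(f_beta(547, 862, 129, beta=2), 1))  # F2 for the same totals
print(round(f_beta(44, 28, 632, beta=3), 1))    # F3 for 44 / 28 / 632
print(round(f_beta(0, 0, 14, beta=3), 1))       # no findings at all scores 0.0
```

With beta=3 the recall term carries nine times the weight of precision, which is why a card with 80.9% recall and 38.8% precision still scores well above cards with higher precision but lower recall.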
Cost Efficiency F2 Score vs Cost per Repo · LLM scanners only