F3 Score (0–100) measures how well a scanner finds vulnerabilities. It is an F-beta score with β = 3, so finding real issues (recall) counts 9× (β²) as much as avoiding false alarms (precision), because in high-risk industries missing a real vulnerability is far worse than a false positive. Strict mode penalizes scanners that fail or time out.
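
As a rough illustration of the weighting described above, the sketch below computes an F-beta score from raw TP/FP/FN counts. It assumes the F3 Score is the standard F-beta measure with β = 3; the function names are hypothetical, and strict-mode penalties for failed or timed-out scans are not modeled.

```python
def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Return the F-beta score on a 0-100 scale.

    precision and recall are fractions in [0, 1]. With beta = 3,
    recall is weighted beta**2 = 9 times as heavily as precision.
    """
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        # No true positives at all: define the score as 0.
        return 0.0
    return 100.0 * (1 + b2) * precision * recall / denom


def f_beta_from_counts(tp: int, fp: int, fn: int, beta: float = 3.0) -> float:
    """Compute F-beta from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return f_beta(precision, recall, beta)


# Example counts: TP 28, FP 28, FN 0 (precision 50%, recall 100%).
print(round(f_beta_from_counts(28, 28, 0), 1))           # 90.9
print(round(f_beta_from_counts(28, 28, 0, beta=2), 1))   # 83.3
```

With these counts, half of the reported findings are false alarms, yet the F3 value stays above 90 because nothing was missed. Note that some published cells may be averaged over several runs, so recomputing from the rounded counts shown will not always reproduce the listed score exactly.
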
| Repository | claude-haiku-4-5-agentic-v1 | claude-haiku-4-5-v1 | claude-opus-4-6-agentic-v1 | claude-opus-4-7-agentic-v1 | claude-sonnet-4-6-agentic-v1 | gemini-3.1-pro-agentic-v1 | glm-5-agentic-v1 | glm-5.1-agentic-v1 | grok-3-agentic-v1 | grok-4.20-reasoning-agentic-v1 | kimi-k2.5-agentic-v1 | kimi-k2.6-agentic-v1 | kolega-v0.0.1 | minimax-m2.7-agentic-v1 | qwen-3.5-397b-agentic-v1 | semgrep | snyk | sonarqube |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| damn-vulnerable-flask-application | 47.5 (F2 48.4, Recall 46.7%, Precision 57.1%, TP 7, FP 5, FN 8) | 42.9 (F2 43.6, Recall 42.2%, Precision 51.0%, TP 6, FP 6, FN 9) | 77.4 (F2 77.0, Recall 77.8%, Precision 74.3%, TP 12, FP 4, FN 3) | 62.4 (F2 62.5, Recall 62.2%, Precision 64.6%, TP 9, FP 5, FN 6) | 69.6 (F2 70.3, Recall 68.9%, Precision 77.3%, TP 10, FP 3, FN 5) | 58.8 (F2 59.9, Recall 57.8%, Precision 75.9%, TP 9, FP 2, FN 6) | 55.1 (F2 57.0, Recall 53.3%, Precision 79.8%, TP 8, FP 2, FN 7) | 88.8 (F2 87.7, Recall 90.0%, Precision 79.9%, TP 14, FP 4, FN 2) | 33.2 (F2 35.5, Recall 31.1%, Precision 82.2%, TP 5, FP 1, FN 10) | 37.9 (F2 40.6, Recall 35.6%, Precision 93.3%, TP 5, FP 0, FN 10) | 48.6 (F2 50.7, Recall 46.7%, Precision 77.8%, TP 7, FP 2, FN 8) | 76.0 (F2 76.4, Recall 75.6%, Precision 82.6%, TP 11, FP 2, FN 4) | 67.0 (F2 57.7, Recall 80.0%, Precision 27.3%, TP 12, FP 32, FN 3) | 44.3 (F2 45.4, Recall 43.3%, Precision 70.0%, TP 6, FP 4, FN 8) | 45.9 (F2 47.4, Recall 44.4%, Precision 65.8%, TP 7, FP 3, FN 8) | 32.1 (F2 30.9, Recall 33.3%, Precision 23.8%, TP 5, FP 16, FN 10) | 28.4 (F2 30.3, Recall 26.7%, Precision 66.7%, TP 4, FP 2, FN 11) | 0.0 (F2 0.0, Recall 0.0%, Precision 0.0%, TP 0, FP 0, FN 15) |
| damn-vulnerable-graphql-application | 26.8 (F2 27.9, Recall 25.7%, Precision 44.7%, TP 9, FP 11, FN 26) | 9.0 (F2 9.4, Recall 8.6%, Precision 16.1%, TP 3, FP 15, FN 32) | 45.5 (F2 46.8, Recall 44.3%, Precision 60.8%, TP 16, FP 10, FN 20) | 38.9 (F2 40.8, Recall 37.1%, Precision 67.1%, TP 13, FP 6, FN 22) | 40.8 (F2 42.7, Recall 39.1%, Precision 67.4%, TP 14, FP 7, FN 21) | 40.7 (F2 42.4, Recall 39.1%, Precision 65.1%, TP 14, FP 7, FN 21) | 42.5 (F2 42.1, Recall 42.9%, Precision 39.5%, TP 15, FP 23, FN 20) | 46.0 (F2 46.4, Recall 45.7%, Precision 52.5%, TP 16, FP 16, FN 19) | — | 20.7 (F2 22.6, Recall 19.1%, Precision 91.1%, TP 7, FP 1, FN 28) | 38.4 (F2 39.7, Recall 37.1%, Precision 55.1%, TP 13, FP 11, FN 22) | 44.5 (F2 46.4, Recall 42.9%, Precision 78.3%, TP 15, FP 6, FN 20) | 59.7 (F2 54.8, Recall 65.7%, Precision 32.9%, TP 23, FP 47, FN 12) | 38.0 (F2 39.0, Recall 37.1%, Precision 49.9%, TP 13, FP 14, FN 22) | 35.8 (F2 35.7, Recall 36.2%, Precision 46.4%, TP 13, FP 23, FN 22) | 6.2 (F2 6.7, Recall 5.7%, Precision 22.2%, TP 2, FP 7, FN 33) | 0.0 (F2 0.0, Recall 0.0%, Precision 0.0%, TP 0, FP 5, FN 35) | 0.0 (F2 0.0, Recall 0.0%, Precision 0.0%, TP 0, FP 0, FN 35) |
| djangoat | 26.3 (F2 28.2, Recall 24.7%, Precision 66.4%, TP 12, FP 6, FN 38) | 8.5 (F2 9.1, Recall 8.0%, Precision 21.7%, TP 4, FP 14, FN 46) | 44.2 (F2 46.7, Recall 42.0%, Precision 84.0%, TP 21, FP 4, FN 29) | 40.7 (F2 42.3, Recall 39.3%, Precision 62.6%, TP 20, FP 13, FN 30) | 35.7 (F2 37.7, Recall 34.0%, Precision 66.3%, TP 17, FP 9, FN 33) | 31.0 (F2 32.8, Recall 29.3%, Precision 65.8%, TP 15, FP 9, FN 35) | 33.2 (F2 35.2, Recall 31.3%, Precision 69.1%, TP 16, FP 7, FN 34) | 38.7 (F2 40.5, Recall 37.0%, Precision 66.6%, TP 18, FP 10, FN 32) | 10.2 (F2 11.3, Recall 9.3%, Precision 73.8%, TP 5, FP 2, FN 45) | 17.5 (F2 19.2, Recall 16.0%, Precision 100.0%, TP 8, FP 0, FN 42) | 33.5 (F2 35.1, Recall 32.0%, Precision 60.9%, TP 16, FP 12, FN 34) | 42.4 (F2 44.0, Recall 41.0%, Precision 62.1%, TP 20, FP 12, FN 30) | 62.1 (F2 54.5, Recall 72.0%, Precision 27.7%, TP 36, FP 94, FN 14) | — | 26.2 (F2 28.0, Recall 24.7%, Precision 62.5%, TP 12, FP 8, FN 38) | 20.0 (F2 20.1, Recall 20.0%, Precision 20.4%, TP 10, FP 39, FN 40) | 18.9 (F2 19.9, Recall 18.0%, Precision 34.6%, TP 9, FP 17, FN 41) | 0.0 (F2 0.0, Recall 0.0%, Precision 0.0%, TP 0, FP 0, FN 50) |
| dsvpwa | 39.9 (F2 42.6, Recall 37.5%, Precision 92.7%, TP 12, FP 1, FN 20) | 27.5 (F2 29.1, Recall 26.0%, Precision 55.0%, TP 8, FP 8, FN 24) | — | 63.2 (F2 63.9, Recall 62.5%, Precision 70.7%, TP 20, FP 8, FN 12) | 56.8 (F2 58.5, Recall 55.2%, Precision 77.2%, TP 18, FP 5, FN 14) | 56.9 (F2 58.8, Recall 55.2%, Precision 79.6%, TP 18, FP 4, FN 14) | 67.3 (F2 69.2, Recall 65.6%, Precision 88.8%, TP 21, FP 3, FN 11) | 77.4 (F2 77.7, Recall 77.1%, Precision 80.5%, TP 25, FP 6, FN 7) | 67.3 (F2 69.1, Recall 65.6%, Precision 87.5%, TP 21, FP 3, FN 11) | 67.3 (F2 69.1, Recall 65.6%, Precision 87.5%, TP 21, FP 3, FN 11) | 54.0 (F2 56.1, Recall 52.1%, Precision 81.1%, TP 17, FP 4, FN 15) | — | 80.0 (F2 73.7, Recall 87.5%, Precision 45.2%, TP 28, FP 34, FN 4) | 53.5 (F2 55.7, Recall 51.6%, Precision 86.6%, TP 16, FP 2, FN 16) | 37.4 (F2 39.6, Recall 35.4%, Precision 78.3%, TP 11, FP 4, FN 21) | 19.9 (F2 21.3, Recall 18.8%, Precision 46.2%, TP 6, FP 7, FN 26) | 10.2 (F2 11.1, Recall 9.4%, Precision 42.9%, TP 3, FP 4, FN 29) | 0.0 (F2 0.0, Recall 0.0%, Precision 0.0%, TP 0, FP 0, FN 32) |
| dsvw | 51.9 (F2 54.8, Recall 49.4%, Precision 97.4%, TP 13, FP 0, FN 14) | 25.7 (F2 26.8, Recall 24.7%, Precision 41.2%, TP 7, FP 10, FN 20) | — | 71.8 (F2 73.3, Recall 70.4%, Precision 88.3%, TP 19, FP 3, FN 8) | 75.8 (F2 77.7, Recall 74.1%, Precision 96.9%, TP 20, FP 1, FN 7) | 61.2 (F2 62.0, Recall 60.5%, Precision 69.2%, TP 16, FP 8, FN 11) | 60.3 (F2 62.8, Recall 58.0%, Precision 94.0%, TP 16, FP 1, FN 11) | 72.9 (F2 73.6, Recall 72.2%, Precision 79.5%, TP 20, FP 5, FN 8) | 49.6 (F2 52.5, Recall 46.9%, Precision 100.0%, TP 13, FP 0, FN 14) | 43.3 (F2 46.2, Recall 40.7%, Precision 100.0%, TP 11, FP 0, FN 16) | 63.5 (F2 65.5, Recall 61.7%, Precision 86.7%, TP 17, FP 3, FN 10) | 72.2 (F2 74.2, Recall 70.4%, Precision 95.0%, TP 19, FP 1, FN 8) | 87.7 (F2 83.3, Recall 92.6%, Precision 59.5%, TP 25, FP 17, FN 2) | 57.0 (F2 58.6, Recall 55.6%, Precision 75.1%, TP 15, FP 5, FN 12) | 59.0 (F2 60.1, Recall 58.0%, Precision 77.9%, TP 16, FP 6, FN 11) | 0.0 (F2 0.0, Recall 0.0%, Precision 0.0%, TP 0, FP 0, FN 27) | 27.3 (F2 28.9, Recall 25.9%, Precision 53.8%, TP 7, FP 6, FN 20) | 0.0 (F2 0.0, Recall 0.0%, Precision 0.0%, TP 0, FP 0, FN 27) |
| dvblab | 54.1 (F2 55.3, Recall 53.0%, Precision 67.5%, TP 12, FP 6, FN 10) | 37.0 (F2 37.6, Recall 36.4%, Precision 44.3%, TP 8, FP 10, FN 14) | — | 63.5 (F2 65.0, Recall 62.1%, Precision 80.7%, TP 14, FP 3, FN 8) | 69.1 (F2 70.1, Recall 68.2%, Precision 79.3%, TP 15, FP 4, FN 7) | 60.4 (F2 61.7, Recall 59.1%, Precision 75.0%, TP 13, FP 4, FN 9) | 63.2 (F2 65.2, Recall 61.4%, Precision 87.1%, TP 14, FP 2, FN 8) | 69.6 (F2 69.6, Recall 69.7%, Precision 69.8%, TP 15, FP 7, FN 7) | 46.3 (F2 49.0, Recall 43.9%, Precision 93.0%, TP 10, FP 1, FN 12) | 43.3 (F2 46.1, Recall 40.9%, Precision 97.2%, TP 9, FP 0, FN 13) | 58.8 (F2 60.1, Recall 57.6%, Precision 75.2%, TP 13, FP 5, FN 9) | 47.4 (F2 49.5, Recall 45.5%, Precision 86.6%, TP 10, FP 2, FN 12) | 73.6 (F2 64.2, Recall 86.4%, Precision 31.7%, TP 19, FP 41, FN 3) | 52.6 (F2 53.0, Recall 52.3%, Precision 56.2%, TP 12, FP 9, FN 10) | 61.3 (F2 63.7, Recall 59.1%, Precision 93.5%, TP 13, FP 1, FN 9) | 35.1 (F2 33.9, Recall 36.4%, Precision 26.7%, TP 8, FP 22, FN 14) | 37.0 (F2 37.7, Recall 36.4%, Precision 44.4%, TP 8, FP 10, FN 14) | 0.0 (F2 0.0, Recall 0.0%, Precision 0.0%, TP 0, FP 0, FN 22) |
| dvpwa | 33.9 (F2 36.3, Recall 31.8%, Precision 84.1%, TP 7, FP 1, FN 15) | 20.6 (F2 21.7, Recall 19.7%, Precision 37.7%, TP 4, FP 8, FN 18) | — | 37.0 (F2 39.4, Recall 34.8%, Precision 82.5%, TP 8, FP 2, FN 14) | 51.6 (F2 53.4, Recall 50.0%, Precision 73.3%, TP 11, FP 4, FN 11) | 45.3 (F2 46.7, Recall 43.9%, Precision 65.0%, TP 10, FP 5, FN 12) | 35.6 (F2 37.2, Recall 34.1%, Precision 61.9%, TP 8, FP 6, FN 14) | 58.6 (F2 59.6, Recall 57.6%, Precision 70.2%, TP 13, FP 6, FN 9) | 11.6 (F2 12.9, Recall 10.6%, Precision 100.0%, TP 2, FP 0, FN 20) | 24.6 (F2 26.9, Recall 22.7%, Precision 100.0%, TP 5, FP 0, FN 17) | 49.8 (F2 51.3, Recall 48.5%, Precision 70.2%, TP 11, FP 5, FN 11) | 57.5 (F2 58.2, Recall 56.8%, Precision 65.8%, TP 12, FP 7, FN 10) | 80.2 (F2 74.8, Recall 86.4%, Precision 48.7%, TP 19, FP 20, FN 3) | 25.3 (F2 26.5, Recall 24.2%, Precision 52.4%, TP 5, FP 7, FN 17) | 30.4 (F2 32.2, Recall 28.8%, Precision 61.6%, TP 6, FP 4, FN 16) | 9.6 (F2 10.2, Recall 9.1%, Precision 20.0%, TP 2, FP 8, FN 20) | 5.0 (F2 5.6, Recall 4.5%, Precision 100.0%, TP 1, FP 0, FN 21) | 0.0 (F2 0.0, Recall 0.0%, Precision 0.0%, TP 0, FP 0, FN 22) |
| extremely-vulnerable-flask-app | 34.1 (F2 36.4, Recall 32.1%, Precision 78.8%, TP 9, FP 3, FN 19) | 27.7 (F2 29.4, Recall 26.2%, Precision 60.8%, TP 7, FP 5, FN 21) | 55.4 (F2 57.3, Recall 53.6%, Precision 79.0%, TP 15, FP 4, FN 13) | 51.9 (F2 53.8, Recall 50.0%, Precision 77.8%, TP 14, FP 4, FN 14) | 59.6 (F2 62.2, Recall 57.1%, Precision 97.1%, TP 16, FP 0, FN 12) | 51.8 (F2 53.9, Recall 50.0%, Precision 79.5%, TP 14, FP 4, FN 14) | — | 53.5 (F2 55.4, Recall 51.8%, Precision 76.9%, TP 14, FP 4, FN 14) | — | 40.4 (F2 42.9, Recall 38.1%, Precision 90.1%, TP 11, FP 1, FN 17) | 44.9 (F2 47.0, Recall 42.9%, Precision 81.6%, TP 12, FP 3, FN 16) | 56.6 (F2 58.5, Recall 54.8%, Precision 81.4%, TP 15, FP 4, FN 13) | 90.9 (F2 83.3, Recall 100.0%, Precision 50.0%, TP 28, FP 28, FN 0) | 38.0 (F2 40.7, Recall 35.7%, Precision 90.9%, TP 10, FP 1, FN 18) | 28.0 (F2 30.1, Recall 26.2%, Precision 75.9%, TP 7, FP 2, FN 21) | 11.3 (F2 11.9, Recall 10.7%, Precision 21.4%, TP 3, FP 11, FN 25) | 19.2 (F2 20.8, Recall 17.9%, Precision 62.5%, TP 5, FP 3, FN 23) | 15.5 (F2 16.9, Recall 14.3%, Precision 66.7%, TP 4, FP 2, FN 24) |
| flask-xss | 30.5 (F2 32.8, Recall 28.6%, Precision 81.8%, TP 8, FP 2, FN 20) | 43.4 (F2 45.4, Recall 41.7%, Precision 70.9%, TP 12, FP 5, FN 16) | 45.1 (F2 47.6, Recall 42.9%, Precision 85.7%, TP 12, FP 2, FN 16) | 26.0 (F2 27.0, Recall 25.0%, Precision 84.2%, TP 7, FP 3, FN 21) | 41.5 (F2 44.0, Recall 39.3%, Precision 84.6%, TP 11, FP 2, FN 17) | 44.7 (F2 46.8, Recall 42.9%, Precision 75.7%, TP 12, FP 4, FN 16) | 50.8 (F2 53.0, Recall 48.8%, Precision 80.3%, TP 14, FP 3, FN 14) | 45.2 (F2 47.9, Recall 42.9%, Precision 90.8%, TP 12, FP 1, FN 16) | 15.6 (F2 17.0, Recall 14.3%, Precision 74.4%, TP 4, FP 1, FN 24) | 19.5 (F2 21.3, Recall 17.9%, Precision 100.0%, TP 5, FP 0, FN 23) | 42.2 (F2 44.2, Recall 40.5%, Precision 74.5%, TP 11, FP 5, FN 17) | 37.9 (F2 40.3, Recall 35.7%, Precision 83.3%, TP 10, FP 2, FN 18) | 75.9 (F2 68.2, Recall 85.7%, Precision 37.5%, TP 24, FP 40, FN 4) | 38.9 (F2 41.1, Recall 36.9%, Precision 75.5%, TP 10, FP 3, FN 18) | 30.5 (F2 32.8, Recall 28.6%, Precision 80.9%, TP 8, FP 2, FN 20) | 11.7 (F2 12.9, Recall 10.7%, Precision 75.0%, TP 3, FP 1, FN 25) | 3.9 (F2 4.4, Recall 3.6%, Precision 50.0%, TP 1, FP 1, FN 27) | 0.0 (F2 0.0, Recall 0.0%, Precision 0.0%, TP 0, FP 0, FN 28) |
| insecure-web | 68.7 (F2 70.9, Recall 66.7%, Precision 95.2%, TP 6, FP 0, FN 3) | 63.2 (F2 63.4, Recall 63.0%, Precision 65.3%, TP 6, FP 3, FN 3) | 76.1 (F2 74.5, Recall 77.8%, Precision 63.6%, TP 7, FP 4, FN 2) | 73.5 (F2 73.0, Recall 74.1%, Precision 69.4%, TP 7, FP 3, FN 2) | 70.6 (F2 70.9, Recall 70.4%, Precision 73.2%, TP 6, FP 2, FN 3) | 63.5 (F2 64.0, Recall 63.0%, Precision 72.8%, TP 6, FP 2, FN 3) | 72.4 (F2 70.9, Recall 74.1%, Precision 62.9%, TP 7, FP 4, FN 2) | 77.2 (F2 76.7, Recall 77.8%, Precision 72.6%, TP 7, FP 3, FN 2) | 68.7 (F2 70.9, Recall 66.7%, Precision 95.2%, TP 6, FP 0, FN 3) | 58.1 (F2 61.0, Recall 55.6%, Precision 100.0%, TP 5, FP 0, FN 4) | 65.8 (F2 65.1, Recall 66.7%, Precision 64.1%, TP 6, FP 4, FN 3) | 70.9 (F2 71.5, Recall 70.4%, Precision 76.4%, TP 6, FP 2, FN 3) | 79.6 (F2 66.2, Recall 100.0%, Precision 28.1%, TP 9, FP 23, FN 0) | — | 60.1 (F2 61.0, Recall 59.3%, Precision 70.8%, TP 5, FP 2, FN 4) | 51.5 (F2 48.1, Recall 55.6%, Precision 31.2%, TP 5, FP 11, FN 4) | 57.5 (F2 59.5, Recall 55.6%, Precision 83.3%, TP 5, FP 1, FN 4) | 35.3 (F2 37.5, Recall 33.3%, Precision 75.0%, TP 3, FP 1, FN 6) |
| intentionally-vulnerable-python-application | 63.0 (F2 64.2, Recall 61.9%, Precision 77.1%, TP 4, FP 1, FN 3) | 63.7 (F2 65.7, Recall 61.9%, Precision 86.7%, TP 4, FP 1, FN 3) | 72.5 (F2 73.7, Recall 71.4%, Precision 87.5%, TP 5, FP 1, FN 2) | 81.3 (F2 81.6, Recall 81.0%, Precision 84.9%, TP 6, FP 1, FN 1) | 58.8 (F2 60.6, Recall 57.1%, Precision 80.0%, TP 4, FP 1, FN 3) | 72.5 (F2 73.5, Recall 71.4%, Precision 83.3%, TP 5, FP 1, FN 2) | 62.9 (F2 61.6, Recall 64.3%, Precision 52.8%, TP 4, FP 4, FN 2) | 69.5 (F2 67.8, Recall 71.4%, Precision 58.4%, TP 5, FP 4, FN 2) | — | 54.1 (F2 56.0, Recall 52.4%, Precision 78.3%, TP 4, FP 1, FN 3) | 72.3 (F2 73.2, Recall 71.4%, Precision 85.7%, TP 5, FP 1, FN 2) | 71.9 (F2 72.5, Recall 71.4%, Precision 79.4%, TP 5, FP 1, FN 2) | 69.8 (F2 58.8, Recall 85.7%, Precision 26.1%, TP 6, FP 17, FN 1) | 58.8 (F2 60.6, Recall 57.1%, Precision 82.2%, TP 4, FP 1, FN 3) | 58.5 (F2 60.0, Recall 57.1%, Precision 75.6%, TP 4, FP 1, FN 3) | 29.4 (F2 30.3, Recall 28.6%, Precision 40.0%, TP 2, FP 3, FN 5) | 57.1 (F2 57.1, Recall 57.1%, Precision 57.1%, TP 4, FP 3, FN 3) | 0.0 (F2 0.0, Recall 0.0%, Precision 0.0%, TP 0, FP 0, FN 7) |
| lets-be-bad-guys | 54.9 (F2 57.2, Recall 52.8%, Precision 86.7%, TP 13, FP 2, FN 11) | 31.5 (F2 32.5, Recall 30.6%, Precision 44.7%, TP 7, FP 10, FN 17) | 76.8 (F2 78.7, Recall 75.0%, Precision 98.2%, TP 18, FP 0, FN 6) | 50.3 (F2 52.1, Recall 48.6%, Precision 74.0%, TP 12, FP 4, FN 12) | 64.7 (F2 67.0, Recall 62.5%, Precision 93.7%, TP 15, FP 1, FN 9) | 63.4 (F2 65.8, Recall 61.1%, Precision 95.7%, TP 15, FP 1, FN 9) | 53.4 (F2 55.6, Recall 51.4%, Precision 83.5%, TP 12, FP 3, FN 12) | 61.8 (F2 64.0, Recall 59.7%, Precision 90.1%, TP 14, FP 2, FN 10) | 41.6 (F2 43.8, Recall 39.6%, Precision 76.0%, TP 10, FP 3, FN 14) | 25.5 (F2 27.9, Recall 23.6%, Precision 100.0%, TP 6, FP 0, FN 18) | 58.1 (F2 59.3, Recall 57.0%, Precision 71.9%, TP 14, FP 6, FN 10) | 64.7 (F2 67.1, Recall 62.5%, Precision 95.7%, TP 15, FP 1, FN 9) | 88.8 (F2 82.7, Recall 95.8%, Precision 53.5%, TP 23, FP 20, FN 1) | 45.5 (F2 47.3, Recall 43.8%, Precision 70.1%, TP 10, FP 4, FN 14) | 45.5 (F2 46.5, Recall 44.4%, Precision 57.5%, TP 11, FP 8, FN 13) | 38.6 (F2 39.8, Recall 37.5%, Precision 52.9%, TP 9, FP 8, FN 15) | 35.4 (F2 37.7, Recall 33.3%, Precision 80.0%, TP 8, FP 2, FN 16) | 35.6 (F2 38.1, Recall 33.3%, Precision 88.9%, TP 8, FP 1, FN 16) |
| owasp-web-playground | — | — | — | 59.3 (F2 60.1, Recall 58.6%, Precision 70.1%, TP 17, FP 8, FN 12) | — | — | — | 59.8 (F2 60.9, Recall 58.6%, Precision 72.5%, TP 17, FP 6, FN 12) | — | — | — | 67.1 (F2 68.8, Recall 65.5%, Precision 86.2%, TP 19, FP 3, FN 10) | 83.3 (F2 75.4, Recall 93.1%, Precision 42.9%, TP 27, FP 36, FN 2) | — | — | 16.2 (F2 13.3, Recall 20.7%, Precision 5.5%, TP 6, FP 104, FN 23) | 10.6 (F2 8.5, Recall 13.8%, Precision 3.4%, TP 4, FP 114, FN 25) | 0.0 (F2 0.0, Recall 0.0%, Precision 0.0%, TP 0, FP 0, FN 29) |
| pygoat | 32.7 (F2 34.7, Recall 30.9%, Precision 67.5%, TP 22, FP 10, FN 48) | 7.2 (F2 7.8, Recall 6.7%, Precision 24.1%, TP 5, FP 15, FN 65) | 51.0 (F2 52.8, Recall 49.3%, Precision 74.5%, TP 34, FP 12, FN 36) | 58.3 (F2 58.8, Recall 57.9%, Precision 65.6%, TP 40, FP 23, FN 30) | 48.0 (F2 49.9, Recall 46.2%, Precision 74.4%, TP 32, FP 11, FN 38) | 42.3 (F2 44.3, Recall 40.5%, Precision 72.9%, TP 28, FP 11, FN 42) | 36.7 (F2 38.4, Recall 35.2%, Precision 73.7%, TP 25, FP 14, FN 45) | 57.7 (F2 59.0, Recall 56.4%, Precision 71.8%, TP 40, FP 16, FN 30) | 8.3 (F2 9.2, Recall 7.6%, Precision 49.4%, TP 5, FP 4, FN 65) | 10.4 (F2 11.5, Recall 9.5%, Precision 100.0%, TP 7, FP 0, FN 63) | 35.9 (F2 37.3, Recall 34.8%, Precision 57.2%, TP 24, FP 21, FN 46) | 46.4 (F2 48.1, Recall 44.8%, Precision 70.5%, TP 31, FP 13, FN 39) | 62.7 (F2 57.7, Recall 68.6%, Precision 35.3%, TP 48, FP 88, FN 22) | 39.8 (F2 41.6, Recall 38.1%, Precision 66.4%, TP 27, FP 13, FN 43) | 40.5 (F2 41.0, Recall 40.0%, Precision 46.2%, TP 28, FP 32, FN 42) | 23.6 (F2 21.8, Recall 25.7%, Precision 13.5%, TP 18, FP 115, FN 52) | 33.1 (F2 31.9, Recall 34.3%, Precision 25.0%, TP 24, FP 72, FN 46) | 16.9 (F2 18.2, Recall 15.7%, Precision 50.0%, TP 11, FP 11, FN 59) |
| python-app | 46.5 (F2 48.2, Recall 45.0%, Precision 68.2%, TP 9, FP 4, FN 11) | — | 83.5 (F2 83.6, Recall 83.3%, Precision 84.7%, TP 17, FP 3, FN 3) | 72.5 (F2 72.5, Recall 72.5%, Precision 72.6%, TP 14, FP 6, FN 6) | 72.8 (F2 73.9, Recall 71.7%, Precision 84.3%, TP 14, FP 3, FN 6) | 66.4 (F2 67.9, Recall 65.0%, Precision 83.2%, TP 13, FP 3, FN 7) | 64.5 (F2 65.7, Recall 63.3%, Precision 78.0%, TP 13, FP 4, FN 7) | 70.0 (F2 70.0, Recall 70.0%, Precision 70.0%, TP 14, FP 6, FN 6) | 31.8 (F2 33.9, Recall 30.0%, Precision 70.9%, TP 6, FP 2, FN 14) | 37.4 (F2 40.0, Recall 35.0%, Precision 94.4%, TP 7, FP 0, FN 13) | 58.0 (F2 57.8, Recall 58.3%, Precision 55.9%, TP 12, FP 9, FN 8) | 33.1 (F2 33.9, Recall 32.5%, Precision 40.6%, TP 6, FP 10, FN 14) | 55.8 (F2 48.9, Recall 65.0%, Precision 24.5%, TP 13, FP 40, FN 7) | 44.7 (F2 46.1, Recall 43.3%, Precision 63.9%, TP 9, FP 5, FN 11) | 34.3 (F2 35.3, Recall 33.3%, Precision 46.4%, TP 7, FP 7, FN 13) | — | — | 21.4 (F2 23.0, Recall 20.0%, Precision 57.1%, TP 4, FP 3, FN 16) |
| python-insecure-app | 48.1 (F2 50.6, Recall 45.8%, Precision 86.7%, TP 4, FP 1, FN 4) | 56.4 (F2 59.0, Recall 54.2%, Precision 94.4%, TP 4, FP 0, FN 4) | 76.9 (F2 78.9, Recall 75.0%, Precision 100.0%, TP 6, FP 0, FN 2) | 56.2 (F2 58.5, Recall 54.2%, Precision 87.8%, TP 4, FP 1, FN 4) | — | 39.5 (F2 41.7, Recall 37.5%, Precision 75.0%, TP 3, FP 1, FN 5) | 52.6 (F2 55.6, Recall 50.0%, Precision 100.0%, TP 4, FP 0, FN 4) | 82.2 (F2 83.2, Recall 81.2%, Precision 93.8%, TP 6, FP 0, FN 2) | 44.2 (F2 47.1, Recall 41.7%, Precision 100.0%, TP 3, FP 0, FN 5) | 44.0 (F2 46.6, Recall 41.7%, Precision 100.0%, TP 3, FP 0, FN 5) | 50.8 (F2 51.8, Recall 50.0%, Precision 62.4%, TP 4, FP 3, FN 4) | 55.8 (F2 57.5, Recall 54.2%, Precision 76.7%, TP 4, FP 1, FN 4) | 72.2 (F2 61.4, Recall 87.5%, Precision 28.0%, TP 7, FP 18, FN 1) | 39.8 (F2 42.3, Recall 37.5%, Precision 87.5%, TP 3, FP 0, FN 5) | 47.6 (F2 49.6, Recall 45.8%, Precision 73.3%, TP 4, FP 1, FN 4) | 0.0 (F2 0.0, Recall 0.0%, Precision 0.0%, TP 0, FP 0, FN 8) | 13.3 (F2 14.3, Recall 12.5%, Precision 33.3%, TP 1, FP 2, FN 7) | 0.0 (F2 0.0, Recall 0.0%, Precision 0.0%, TP 0, FP 0, FN 8) |
| pythonssti | 50.9 (F2 51.9, Recall 50.0%, Precision 66.7%, TP 1, FP 1, FN 1) | 50.9 (F2 51.9, Recall 50.0%, Precision 66.7%, TP 1, FP 1, FN 1) | — | 100.0 (F2 100.0, Recall 100.0%, Precision 100.0%, TP 2, FP 0, FN 0) | 52.6 (F2 55.6, Recall 50.0%, Precision 100.0%, TP 1, FP 0, FN 1) | 51.7 (F2 53.7, Recall 50.0%, Precision 83.3%, TP 1, FP 0, FN 1) | 52.6 (F2 55.6, Recall 50.0%, Precision 100.0%, TP 1, FP 0, FN 1) | 100.0 (F2 100.0, Recall 100.0%, Precision 100.0%, TP 2, FP 0, FN 0) | 52.6 (F2 55.6, Recall 50.0%, Precision 100.0%, TP 1, FP 0, FN 1) | 52.6 (F2 55.6, Recall 50.0%, Precision 100.0%, TP 1, FP 0, FN 1) | 52.6 (F2 55.6, Recall 50.0%, Precision 100.0%, TP 1, FP 0, FN 1) | 84.2 (F2 85.2, Recall 83.3%, Precision 100.0%, TP 2, FP 0, FN 0) | 66.7 (F2 50.0, Recall 100.0%, Precision 16.7%, TP 2, FP 10, FN 0) | 52.6 (F2 55.6, Recall 50.0%, Precision 100.0%, TP 1, FP 0, FN 1) | 52.6 (F2 55.6, Recall 50.0%, Precision 100.0%, TP 1, FP 0, FN 1) | 52.6 (F2 55.6, Recall 50.0%, Precision 100.0%, TP 1, FP 0, FN 1) | 50.0 (F2 50.0, Recall 50.0%, Precision 50.0%, TP 1, FP 1, FN 1) | 0.0 (F2 0.0, Recall 0.0%, Precision 0.0%, TP 0, FP 0, FN 2) |
| threatbyte | 33.6 threatbyte / claude-haiku-4-5-agentic-v1 F2 Score35.4 F3 Score33.6 Recall31.9% Precision62.7% TP8 FP5 FN16 |
21.3 threatbyte / claude-haiku-4-5-v1 F2 Score21.8 F3 Score21.3 Recall20.8% Precision27.2% TP5 FP13 FN19 |
60.4 threatbyte / claude-opus-4-6-agentic-v1 F2 Score61.0 F3 Score60.4 Recall59.7% Precision67.3% TP14 FP7 FN10 |
54.4 threatbyte / claude-opus-4-7-agentic-v1 F2 Score56.2 F3 Score54.4 Recall52.8% Precision75.7% TP13 FP4 FN11 |
54.5 threatbyte / claude-sonnet-4-6-agentic-v1 F2 Score56.3 F3 Score54.5 Recall52.8% Precision77.5% TP13 FP4 FN11 |
42.9 threatbyte / gemini-3.1-pro-agentic-v1 F2 Score44.2 F3 Score42.9 Recall41.7% Precision59.0% TP10 FP7 FN14 |
46.2 threatbyte / glm-5-agentic-v1 F2 Score48.0 F3 Score46.2 Recall44.4% Precision73.6% TP11 FP4 FN13 |
58.2 threatbyte / glm-5.1-agentic-v1 F2 Score59.5 F3 Score58.2 Recall56.9% Precision73.1% TP14 FP5 FN10 |
25.5 threatbyte / grok-3-agentic-v1 F2 Score27.8 F3 Score25.5 Recall23.6% Precision94.4% TP6 FP0 FN18 |
21.0 threatbyte / grok-4.20-reasoning-agentic-v1 F2 Score22.9 F3 Score21.0 Recall19.4% Precision82.2% TP5 FP1 FN19 |
45.3 threatbyte / kimi-k2.5-agentic-v1 F2 Score46.1 F3 Score45.3 Recall44.5% Precision54.4% TP11 FP9 FN13 |
59.8 threatbyte / kimi-k2.6-agentic-v1 F2 Score61.4 F3 Score59.8 Recall58.3% Precision77.8% TP14 FP4 FN10 |
79.7 threatbyte / kolega-v0.0.1 F2 Score70.5 F3 Score79.7 Recall91.7% Precision36.7% TP22 FP38 FN2 |
30.8 threatbyte / minimax-m2.7-agentic-v1 F2 Score32.7 F3 Score30.8 Recall29.2% Precision63.6% TP7 FP4 FN17 |
36.3 threatbyte / qwen-3.5-397b-agentic-v1 F2 Score37.9 F3 Score36.3 Recall34.7% Precision62.2% TP8 FP5 FN16 |
8.6 threatbyte / semgrep F2 Score8.8 F3 Score8.6 Recall8.3% Precision11.8% TP2 FP15 FN22 |
13.4 threatbyte / snyk F2 Score14.4 F3 Score13.4 Recall12.5% Precision37.5% TP3 FP5 FN21 |
0.0 threatbyte / sonarqube F2 Score0.0 F3 Score0.0 Recall0.0% Precision0.0% TP0 FP0 FN24 |
| vampi | 54.4 vampi / claude-haiku-4-5-agentic-v1 F2 Score55.0 F3 Score54.4 Recall53.8% Precision61.8% TP7 FP4 FN6 |
43.4 vampi / claude-haiku-4-5-v1 F2 Score43.3 F3 Score43.4 Recall43.6% Precision43.3% TP6 FP8 FN7 |
68.5 vampi / claude-opus-4-6-agentic-v1 F2 Score67.8 F3 Score68.5 Recall69.2% Precision62.7% TP9 FP5 FN4 |
72.2 vampi / claude-opus-4-7-agentic-v1 F2 Score72.5 F3 Score72.2 Recall71.8% Precision75.6% TP9 FP3 FN4 |
83.3 vampi / claude-sonnet-4-6-agentic-v1 F2 Score82.1 F3 Score83.3 Recall84.6% Precision73.3% TP11 FP4 FN2 |
75.7 vampi / gemini-3.1-pro-agentic-v1 F2 Score74.6 F3 Score75.7 Recall76.9% Precision69.3% TP10 FP5 FN3 |
— | — | 51.8 vampi / grok-3-agentic-v1 F2 Score53.8 F3 Score51.8 Recall50.0% Precision77.1% TP6 FP2 FN6 |
40.9 vampi / grok-4.20-reasoning-agentic-v1 F2 Score43.5 F3 Score40.9 Recall38.5% Precision94.4% TP5 FP0 FN8 |
70.7 vampi / kimi-k2.5-agentic-v1 F2 Score69.8 F3 Score70.7 Recall71.8% Precision70.4% TP9 FP5 FN4 |
69.8 vampi / kimi-k2.6-agentic-v1 F2 Score70.5 F3 Score69.8 Recall69.2% Precision76.2% TP9 FP3 FN4 |
82.1 vampi / kolega-v0.0.1 F2 Score79.7 F3 Score82.1 Recall84.6% Precision64.7% TP11 FP6 FN2 |
67.0 vampi / minimax-m2.7-agentic-v1 F2 Score67.3 F3 Score67.0 Recall66.7% Precision71.2% TP9 FP4 FN4 |
48.3 vampi / qwen-3.5-397b-agentic-v1 F2 Score46.8 F3 Score48.3 Recall50.0% Precision37.1% TP6 FP11 FN6 |
0.0 vampi / semgrep F2 Score0.0 F3 Score0.0 Recall0.0% Precision0.0% TP0 FP0 FN13 |
0.0 vampi / snyk F2 Score0.0 F3 Score0.0 Recall0.0% Precision0.0% TP0 FP3 FN13 |
0.0 vampi / sonarqube F2 Score0.0 F3 Score0.0 Recall0.0% Precision0.0% TP0 FP0 FN13 |
| vfapi | 56.6 vfapi / claude-haiku-4-5-agentic-v1 F2 Score57.7 F3 Score56.6 Recall55.6% Precision69.4% TP5 FP2 FN4 |
17.2 vfapi / claude-haiku-4-5-v1 F2 Score16.1 F3 Score17.2 Recall18.5% Precision10.6% TP2 FP15 FN7 |
87.0 vfapi / claude-opus-4-6-agentic-v1 F2 Score85.1 F3 Score87.0 Recall88.9% Precision72.7% TP8 FP3 FN1 |
86.3 vfapi / claude-opus-4-7-agentic-v1 F2 Score83.9 F3 Score86.3 Recall88.9% Precision68.7% TP8 FP4 FN1 |
82.4 vfapi / claude-sonnet-4-6-agentic-v1 F2 Score79.8 F3 Score82.4 Recall85.2% Precision63.9% TP8 FP4 FN1 |
67.0 vfapi / gemini-3.1-pro-agentic-v1 F2 Score67.2 F3 Score67.0 Recall66.7% Precision69.9% TP6 FP3 FN3 |
85.8 vfapi / glm-5-agentic-v1 F2 Score83.1 F3 Score85.8 Recall88.9% Precision70.2% TP8 FP4 FN1 |
93.0 vfapi / glm-5.1-agentic-v1 F2 Score90.0 F3 Score93.0 Recall96.3% Precision72.8% TP9 FP4 FN0 |
68.9 vfapi / grok-3-agentic-v1 F2 Score71.3 F3 Score68.9 Recall66.7% Precision100.0% TP6 FP0 FN3 |
57.9 vfapi / grok-4.20-reasoning-agentic-v1 F2 Score60.5 F3 Score57.9 Recall55.6% Precision94.4% TP5 FP0 FN4 |
86.9 vfapi / kimi-k2.5-agentic-v1 F2 Score79.2 F3 Score86.9 Recall96.3% Precision46.7% TP9 FP10 FN0 |
73.5 vfapi / kimi-k2.6-agentic-v1 F2 Score75.0 F3 Score73.5 Recall72.2% Precision100.0% TP6 FP0 FN2 |
81.1 vfapi / kolega-v0.0.1 F2 Score68.2 F3 Score81.1 Recall100.0% Precision30.0% TP9 FP21 FN0 |
60.3 vfapi / minimax-m2.7-agentic-v1 F2 Score59.9 F3 Score60.3 Recall61.1% Precision70.0% TP6 FP4 FN4 |
68.9 vfapi / qwen-3.5-397b-agentic-v1 F2 Score64.6 F3 Score68.9 Recall74.1% Precision45.0% TP7 FP9 FN2 |
12.2 vfapi / semgrep F2 Score13.5 F3 Score12.2 Recall11.1% Precision100.0% TP1 FP0 FN8 |
11.8 vfapi / snyk F2 Score12.5 F3 Score11.8 Recall11.1% Precision25.0% TP1 FP3 FN8 |
0.0 vfapi / sonarqube F2 Score0.0 F3 Score0.0 Recall0.0% Precision0.0% TP0 FP0 FN9 |
| vulnerable-api | 47.5 vulnerable-api / claude-haiku-4-5-agentic-v1 F2 Score49.9 F3 Score47.5 Recall45.2% Precision87.8% TP6 FP1 FN8 |
61.1 vulnerable-api / claude-haiku-4-5-v1 F2 Score62.8 F3 Score61.1 Recall59.5% Precision80.6% TP8 FP2 FN6 |
72.3 vulnerable-api / claude-opus-4-6-agentic-v1 F2 Score73.2 F3 Score72.3 Recall71.4% Precision81.2% TP10 FP2 FN4 |
70.4 vulnerable-api / claude-opus-4-7-agentic-v1 F2 Score71.9 F3 Score70.4 Recall69.0% Precision86.7% TP10 FP2 FN4 |
69.4 vulnerable-api / claude-sonnet-4-6-agentic-v1 F2 Score69.7 F3 Score69.4 Recall69.0% Precision73.6% TP10 FP4 FN4 |
68.0 vulnerable-api / gemini-3.1-pro-agentic-v1 F2 Score69.3 F3 Score68.0 Recall66.7% Precision82.3% TP9 FP2 FN5 |
58.7 vulnerable-api / glm-5-agentic-v1 F2 Score60.3 F3 Score58.7 Recall57.1% Precision77.6% TP8 FP2 FN6 |
77.7 vulnerable-api / glm-5.1-agentic-v1 F2 Score76.9 F3 Score77.7 Recall78.6% Precision71.5% TP11 FP4 FN3 |
45.3 vulnerable-api / grok-3-agentic-v1 F2 Score47.9 F3 Score45.3 Recall42.9% Precision91.7% TP6 FP1 FN8 |
43.1 vulnerable-api / grok-4.20-reasoning-agentic-v1 F2 Score45.9 F3 Score43.1 Recall40.5% Precision100.0% TP6 FP0 FN8 |
53.7 vulnerable-api / kimi-k2.5-agentic-v1 F2 Score55.3 F3 Score53.7 Recall52.4% Precision74.6% TP7 FP3 FN7 |
62.8 vulnerable-api / kimi-k2.6-agentic-v1 F2 Score64.8 F3 Score62.8 Recall60.7% Precision89.4% TP8 FP1 FN6 |
80.2 vulnerable-api / kolega-v0.0.1 F2 Score70.7 F3 Score80.2 Recall92.9% Precision36.1% TP13 FP23 FN1 |
57.4 vulnerable-api / minimax-m2.7-agentic-v1 F2 Score57.8 F3 Score57.4 Recall57.1% Precision68.5% TP8 FP5 FN6 |
52.1 vulnerable-api / qwen-3.5-397b-agentic-v1 F2 Score54.4 F3 Score52.1 Recall50.0% Precision85.8% TP7 FP1 FN7 |
29.4 vulnerable-api / semgrep F2 Score30.3 F3 Score29.4 Recall28.6% Precision40.0% TP4 FP6 FN10 |
15.5 vulnerable-api / snyk F2 Score16.9 F3 Score15.5 Recall14.3% Precision66.7% TP2 FP1 FN12 |
0.0 vulnerable-api / sonarqube F2 Score0.0 F3 Score0.0 Recall0.0% Precision0.0% TP0 FP0 FN14 |
| vulnerable-flask-app | 50.1 vulnerable-flask-app / claude-haiku-4-5-agentic-v1 F2 Score52.0 F3 Score50.1 Recall48.3% Precision75.9% TP10 FP3 FN10 |
32.3 vulnerable-flask-app / claude-haiku-4-5-v1 F2 Score33.0 F3 Score32.3 Recall31.7% Precision39.9% TP6 FP10 FN14 |
68.8 vulnerable-flask-app / claude-opus-4-6-agentic-v1 F2 Score69.4 F3 Score68.8 Recall68.3% Precision74.7% TP14 FP5 FN6 |
52.8 vulnerable-flask-app / claude-opus-4-7-agentic-v1 F2 Score54.0 F3 Score52.8 Recall51.7% Precision66.9% TP10 FP5 FN10 |
63.0 vulnerable-flask-app / claude-sonnet-4-6-agentic-v1 F2 Score64.4 F3 Score63.0 Recall61.7% Precision80.5% TP12 FP3 FN8 |
56.9 vulnerable-flask-app / gemini-3.1-pro-agentic-v1 F2 Score58.9 F3 Score56.9 Recall55.0% Precision82.6% TP11 FP2 FN9 |
66.9 vulnerable-flask-app / glm-5-agentic-v1 F2 Score69.0 F3 Score66.9 Recall65.0% Precision92.4% TP13 FP1 FN7 |
63.3 vulnerable-flask-app / glm-5.1-agentic-v1 F2 Score64.1 F3 Score63.3 Recall62.5% Precision71.6% TP12 FP5 FN8 |
24.9 vulnerable-flask-app / grok-3-agentic-v1 F2 Score26.6 F3 Score24.9 Recall23.3% Precision62.5% TP5 FP3 FN15 |
26.9 vulnerable-flask-app / grok-4.20-reasoning-agentic-v1 F2 Score29.2 F3 Score26.9 Recall25.0% Precision88.9% TP5 FP1 FN15 |
60.9 vulnerable-flask-app / kimi-k2.5-agentic-v1 F2 Score61.8 F3 Score60.9 Recall60.0% Precision70.8% TP12 FP5 FN8 |
57.8 vulnerable-flask-app / kimi-k2.6-agentic-v1 F2 Score57.6 F3 Score57.8 Recall58.3% Precision66.4% TP12 FP8 FN8 |
77.3 vulnerable-flask-app / kolega-v0.0.1 F2 Score67.7 F3 Score77.3 Recall90.0% Precision34.0% TP18 FP35 FN2 |
37.4 vulnerable-flask-app / minimax-m2.7-agentic-v1 F2 Score38.1 F3 Score37.4 Recall36.7% Precision46.1% TP7 FP8 FN13 |
42.8 vulnerable-flask-app / qwen-3.5-397b-agentic-v1 F2 Score44.0 F3 Score42.8 Recall41.7% Precision56.7% TP8 FP6 FN12 |
15.5 vulnerable-flask-app / semgrep F2 Score16.0 F3 Score15.5 Recall15.0% Precision21.4% TP3 FP11 FN17 |
25.9 vulnerable-flask-app / snyk F2 Score26.9 F3 Score25.9 Recall25.0% Precision38.5% TP5 FP8 FN15 |
0.0 vulnerable-flask-app / sonarqube F2 Score0.0 F3 Score0.0 Recall0.0% Precision0.0% TP0 FP0 FN20 |
| vulnerable-python-apps | — | — | — | 69.9 vulnerable-python-apps / claude-opus-4-7-agentic-v1 F2 Score70.2 F3 Score69.9 Recall69.7% Precision73.6% TP15 FP5 FN7 |
— | — | — | 55.8 vulnerable-python-apps / glm-5.1-agentic-v1 F2 Score57.0 F3 Score55.8 Recall54.5% Precision70.9% TP12 FP5 FN10 |
— | — | — | 39.6 vulnerable-python-apps / kimi-k2.6-agentic-v1 F2 Score41.5 F3 Score39.6 Recall37.9% Precision72.4% TP8 FP3 FN14 |
65.8 vulnerable-python-apps / kolega-v0.0.1 F2 Score60.2 F3 Score65.8 Recall72.7% Precision35.6% TP16 FP29 FN6 |
— | — | 9.5 vulnerable-python-apps / semgrep F2 Score10.0 F3 Score9.5 Recall9.1% Precision16.7% TP2 FP10 FN20 |
9.6 vulnerable-python-apps / snyk F2 Score10.1 F3 Score9.6 Recall9.1% Precision18.2% TP2 FP9 FN20 |
19.2 vulnerable-python-apps / sonarqube F2 Score20.4 F3 Score19.2 Recall18.2% Precision40.0% TP4 FP6 FN18 |
| vulnerable-tornado-app | 51.7 vulnerable-tornado-app / claude-haiku-4-5-agentic-v1 F2 Score53.6 F3 Score51.7 Recall50.0% Precision76.8% TP7 FP2 FN7 |
29.8 vulnerable-tornado-app / claude-haiku-4-5-v1 F2 Score31.1 F3 Score29.8 Recall28.6% Precision48.6% TP4 FP4 FN10 |
65.0 vulnerable-tornado-app / claude-opus-4-6-agentic-v1 F2 Score65.7 F3 Score65.0 Recall64.3% Precision72.1% TP9 FP4 FN5 |
72.3 vulnerable-tornado-app / claude-opus-4-7-agentic-v1 F2 Score73.2 F3 Score72.3 Recall71.4% Precision82.8% TP10 FP2 FN4 |
57.6 vulnerable-tornado-app / claude-sonnet-4-6-agentic-v1 F2 Score58.0 F3 Score57.6 Recall57.1% Precision61.5% TP8 FP5 FN6 |
56.9 vulnerable-tornado-app / gemini-3.1-pro-agentic-v1 F2 Score59.3 F3 Score56.9 Recall54.8% Precision90.0% TP8 FP1 FN6 |
63.0 vulnerable-tornado-app / glm-5-agentic-v1 F2 Score64.1 F3 Score63.0 Recall61.9% Precision77.0% TP9 FP3 FN5 |
74.7 vulnerable-tornado-app / glm-5.1-agentic-v1 F2 Score75.6 F3 Score74.7 Recall73.8% Precision84.7% TP10 FP2 FN4 |
30.8 vulnerable-tornado-app / grok-3-agentic-v1 F2 Score33.3 F3 Score30.8 Recall28.6% Precision100.0% TP4 FP0 FN10 |
40.6 vulnerable-tornado-app / grok-4.20-reasoning-agentic-v1 F2 Score43.5 F3 Score40.6 Recall38.1% Precision100.0% TP5 FP0 FN9 |
53.6 vulnerable-tornado-app / kimi-k2.5-agentic-v1 F2 Score54.9 F3 Score53.6 Recall52.4% Precision70.9% TP7 FP3 FN7 |
54.2 vulnerable-tornado-app / kimi-k2.6-agentic-v1 F2 Score56.1 F3 Score54.2 Recall52.4% Precision78.5% TP7 FP2 FN7 |
88.1 vulnerable-tornado-app / kolega-v0.0.1 F2 Score78.7 F3 Score88.1 Recall100.0% Precision42.4% TP14 FP19 FN0 |
44.4 vulnerable-tornado-app / minimax-m2.7-agentic-v1 F2 Score46.2 F3 Score44.4 Recall42.9% Precision66.7% TP6 FP3 FN8 |
46.8 vulnerable-tornado-app / qwen-3.5-397b-agentic-v1 F2 Score48.4 F3 Score46.8 Recall45.2% Precision68.0% TP6 FP3 FN8 |
7.4 vulnerable-tornado-app / semgrep F2 Score7.7 F3 Score7.4 Recall7.1% Precision11.1% TP1 FP8 FN13 |
0.0 vulnerable-tornado-app / snyk F2 Score0.0 F3 Score0.0 Recall0.0% Precision0.0% TP0 FP0 FN14 |
0.0 vulnerable-tornado-app / sonarqube F2 Score0.0 F3 Score0.0 Recall0.0% Precision0.0% TP0 FP0 FN14 |
| vulnpy | 48.5 vulnpy / claude-haiku-4-5-agentic-v1 F2 Score50.1 F3 Score48.5 Recall47.0% Precision88.4% TP37 FP7 FN41 |
46.6 vulnpy / claude-haiku-4-5-v1 F2 Score49.0 F3 Score46.6 Recall44.4% Precision83.3% TP35 FP7 FN43 |
72.4 vulnpy / claude-opus-4-6-agentic-v1 F2 Score73.6 F3 Score72.4 Recall71.4% Precision84.6% TP56 FP10 FN22 |
— | 69.4 vulnpy / claude-sonnet-4-6-agentic-v1 F2 Score71.3 F3 Score69.4 Recall67.5% Precision92.4% TP53 FP5 FN25 |
86.9 vulnpy / gemini-3.1-pro-agentic-v1 F2 Score87.4 F3 Score86.9 Recall86.3% Precision92.8% TP67 FP6 FN11 |
71.8 vulnpy / glm-5-agentic-v1 F2 Score73.6 F3 Score71.8 Recall70.1% Precision93.3% TP55 FP4 FN23 |
63.1 vulnpy / glm-5.1-agentic-v1 F2 Score64.7 F3 Score63.1 Recall61.5% Precision81.6% TP48 FP11 FN30 |
5.6 vulnpy / grok-3-agentic-v1 F2 Score6.2 F3 Score5.6 Recall5.1% Precision66.7% TP4 FP2 FN74 |
34.7 vulnpy / grok-4.20-reasoning-agentic-v1 F2 Score35.2 F3 Score34.7 Recall34.2% Precision94.2% TP27 FP5 FN51 |
69.7 vulnpy / kimi-k2.5-agentic-v1 F2 Score71.6 F3 Score69.7 Recall68.0% Precision91.4% TP53 FP5 FN25 |
87.8 vulnpy / kimi-k2.6-agentic-v1 F2 Score87.3 F3 Score87.8 Recall88.5% Precision84.3% TP69 FP14 FN9 |
62.6 vulnpy / kolega-v0.0.1 F2 Score62.3 F3 Score62.6 Recall62.8% Precision60.5% TP49 FP32 FN29 |
63.0 vulnpy / minimax-m2.7-agentic-v1 F2 Score65.0 F3 Score63.0 Recall61.1% Precision89.7% TP48 FP5 FN30 |
56.7 vulnpy / qwen-3.5-397b-agentic-v1 F2 Score59.2 F3 Score56.7 Recall54.5% Precision90.8% TP42 FP4 FN36 |
17.6 vulnpy / semgrep F2 Score18.7 F3 Score17.6 Recall16.7% Precision37.1% TP13 FP22 FN65 |
11.2 vulnpy / snyk F2 Score12.4 F3 Score11.2 Recall10.3% Precision72.7% TP8 FP3 FN70 |
7.1 vulnpy / sonarqube F2 Score7.9 F3 Score7.1 Recall6.4% Precision100.0% TP5 FP0 FN73 |
| vulpy | 19.4 vulpy / claude-haiku-4-5-agentic-v1 F2 Score21.2 F3 Score19.4 Recall17.9% Precision80.1% TP10 FP2 FN44 |
24.8 vulpy / claude-haiku-4-5-v1 F2 Score26.4 F3 Score24.8 Recall23.5% Precision53.2% TP13 FP11 FN41 |
50.5 vulpy / claude-opus-4-6-agentic-v1 F2 Score53.2 F3 Score50.5 Recall48.1% Precision91.6% TP26 FP2 FN28 |
31.3 vulpy / claude-opus-4-7-agentic-v1 F2 Score33.1 F3 Score31.3 Recall29.6% Precision61.5% TP16 FP10 FN38 |
34.4 vulpy / claude-sonnet-4-6-agentic-v1 F2 Score36.3 F3 Score34.4 Recall32.7% Precision65.0% TP18 FP10 FN36 |
36.6 vulpy / gemini-3.1-pro-agentic-v1 F2 Score38.8 F3 Score36.6 Recall34.6% Precision77.4% TP19 FP5 FN35 |
27.8 vulpy / glm-5-agentic-v1 F2 Score30.0 F3 Score27.8 Recall25.9% Precision81.2% TP14 FP3 FN40 |
35.8 vulpy / glm-5.1-agentic-v1 F2 Score37.8 F3 Score35.8 Recall34.0% Precision70.5% TP18 FP8 FN36 |
10.2 vulpy / grok-3-agentic-v1 F2 Score11.2 F3 Score10.2 Recall9.3% Precision86.7% TP5 FP1 FN49 |
9.5 vulpy / grok-4.20-reasoning-agentic-v1 F2 Score10.5 F3 Score9.5 Recall8.6% Precision86.1% TP5 FP1 FN49 |
25.0 vulpy / kimi-k2.5-agentic-v1 F2 Score26.8 F3 Score25.0 Recall23.5% Precision65.6% TP13 FP7 FN41 |
46.7 vulpy / kimi-k2.6-agentic-v1 F2 Score49.2 F3 Score46.7 Recall44.5% Precision87.0% TP24 FP4 FN30 |
78.5 vulpy / kolega-v0.0.1 F2 Score72.8 F3 Score78.5 Recall85.2% Precision46.0% TP46 FP54 FN8 |
37.4 vulpy / minimax-m2.7-agentic-v1 F2 Score39.9 F3 Score37.4 Recall35.2% Precision86.4% TP19 FP3 FN35 |
19.1 vulpy / qwen-3.5-397b-agentic-v1 F2 Score20.5 F3 Score19.1 Recall17.9% Precision50.3% TP10 FP10 FN44 |
22.6 vulpy / semgrep F2 Score23.0 F3 Score22.6 Recall22.2% Precision26.7% TP12 FP33 FN42 |
13.8 vulpy / snyk F2 Score14.8 F3 Score13.8 Recall13.0% Precision35.0% TP7 FP13 FN47 |
10.1 vulpy / sonarqube F2 Score11.1 F3 Score10.1 Recall9.3% Precision55.6% TP5 FP4 FN49 |
| AVERAGE (strict) | 37.2 Average (strict) Repos scored 24 / 26 F3 (strict) 37.2 F2 (strict) 39.4 Recall 35.2% Precision 74.8% TP / FP / FN 238 / 80 / 438 |
25.7 Average (strict) Repos scored 23 / 26 F3 (strict) 25.7 F2 (strict) 27.0 Recall 24.4% Precision 47.7% TP / FP / FN 165 / 181 / 511 |
47.7 Average (strict) Repos scored 19 / 26 F3 (strict) 47.7 F2 (strict) 49.9 Recall 45.6% Precision 79.0% TP / FP / FN 309 / 82 / 368 |
48.6 Average (strict) Repos scored 25 / 26 F3 (strict) 48.6 F2 (strict) 50.3 Recall 46.9% Precision 71.2% TP / FP / FN 317 / 128 / 359 |
51.7 Average (strict) Repos scored 23 / 26 F3 (strict) 51.7 F2 (strict) 53.8 Recall 49.9% Precision 78.5% TP / FP / FN 337 / 92 / 339 |
51.0 Average (strict) Repos scored 24 / 26 F3 (strict) 51.0 F2 (strict) 53.0 Recall 49.1% Precision 77.4% TP / FP / FN 332 / 97 / 344 |
45.8 Average (strict) Repos scored 22 / 26 F3 (strict) 45.8 F2 (strict) 47.8 Recall 43.9% Precision 75.3% TP / FP / FN 296 / 97 / 379 |
58.2 Average (strict) Repos scored 25 / 26 F3 (strict) 58.2 F2 (strict) 59.6 Recall 56.9% Precision 73.4% TP / FP / FN 386 / 140 / 292 |
21.3 Average (strict) Repos scored 21 / 26 F3 (strict) 21.3 F2 (strict) 23.3 Recall 19.7% Precision 83.7% TP / FP / FN 133 / 26 / 542 |
28.4 Average (strict) Repos scored 24 / 26 F3 (strict) 28.4 F2 (strict) 30.7 Recall 26.3% Precision 92.7% TP / FP / FN 178 / 14 / 498 |
46.6 Average (strict) Repos scored 24 / 26 F3 (strict) 46.6 F2 (strict) 48.3 Recall 45.0% Precision 68.3% TP / FP / FN 304 / 141 / 372 |
54.6 Average (strict) Repos scored 25 / 26 F3 (strict) 54.6 F2 (strict) 56.4 Recall 52.9% Precision 77.1% TP / FP / FN 357 / 106 / 318 |
73.0 Average (strict) Repos scored 26 / 26 F3 (strict) 73.0 F2 (strict) 66.5 Recall 80.9% Precision 38.8% TP / FP / FN 547 / 862 / 129 |
39.0 Average (strict) Repos scored 22 / 26 F3 (strict) 39.0 F2 (strict) 41.0 Recall 37.1% Precision 70.7% TP / FP / FN 251 / 104 / 425 |
38.1 Average (strict) Repos scored 24 / 26 F3 (strict) 38.1 F2 (strict) 39.8 Recall 36.6% Precision 61.8% TP / FP / FN 247 / 153 / 428 |
17.7 Average (strict) Repos scored 25 / 26 F3 (strict) 17.7 F2 (strict) 18.0 Recall 17.5% Precision 20.5% TP / FP / FN 118 / 457 / 558 |
17.4 Average (strict) Repos scored 25 / 26 F3 (strict) 17.4 F2 (strict) 18.2 Recall 16.7% Precision 28.2% TP / FP / FN 113 / 288 / 563 |
7.1 Average (strict) Repos scored 26 / 26 F3 (strict) 7.1 F2 (strict) 7.9 Recall 6.5% Precision 61.1% TP / FP / FN 44 / 28 / 632
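
The F2 and F3 values in each cell appear consistent with the standard F-beta formula applied to that cell's recall and precision. The sketch below is a minimal illustration under that assumption, not the leaderboard's actual scoring code; the function name `f_beta` is hypothetical. It takes precision and recall as inputs rather than deriving them from TP/FP/FN, because the displayed percentages do not always equal the ratios implied by the counts (the source does not say why).

```python
def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Return the F-beta score on a 0-100 scale.

    precision and recall are fractions in [0, 1]. With beta = 3, recall is
    weighted beta**2 = 9 times as heavily as precision.
    """
    # When both inputs are zero the formula's denominator is zero,
    # so the score is defined as 0 by convention.
    if precision <= 0.0 and recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return 100.0 * (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Check against the python-insecure-app / kolega-v0.0.1 cell:
# Recall 87.5%, Precision 28.0%, listed as F3 72.2 and F2 61.4.
print(round(f_beta(0.280, 0.875, beta=3), 1))
print(round(f_beta(0.280, 0.875, beta=2), 1))
```

The same call on the strict-average recall and precision of a column gives a quick consistency check on that column's averaged F3 figure.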