CustomBench

A full-stack LLM benchmarking platform built with Next.js 16, React 19, TypeScript, and Bun, enabling concurrent evaluation of 10+ LLMs per run via OpenRouter. Features custom Q&A dataset support, real-time execution streaming over Server-Sent Events, automated LLM-as-judge evaluation with structured Zod schemas, and consolidated result reporting with leaderboard rankings.

December 2025 - Present

The Challenge

Evaluating and comparing Large Language Models (LLMs) is a time-consuming and inconsistent process. Developers and researchers struggle to benchmark multiple models against custom test scenarios, often resorting to manual testing that doesn't scale. Existing benchmarking solutions lack real-time progress visibility, automated evaluation capabilities, and the flexibility to test against custom datasets that reflect real-world use cases.

The Solution

Built a full-stack LLM benchmarking platform that enables concurrent evaluation of 10+ models via OpenRouter. The platform features a dual-interface design supporting both a modern web UI for interactive workflows and a CLI for automation and CI/CD integration. Real-time Server-Sent Events (SSE) provide live progress tracking, while an automated LLM-as-judge evaluation system delivers consistent, structured verdicts using Zod schema validation.
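A concurrent fan-out over OpenRouter might look like the sketch below. The endpoint URL and request shape follow OpenRouter's chat-completions API; the function and type names (`buildOpenRouterRequest`, `evaluateAll`) are illustrative assumptions, not the project's actual code.

```typescript
// Builds a request for OpenRouter's chat-completions endpoint.
// Names here are illustrative; only the endpoint shape follows
// OpenRouter's public API.
interface BenchRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

function buildOpenRouterRequest(
  model: string,
  question: string,
  apiKey: string
): BenchRequest {
  return {
    url: "https://openrouter.ai/api/v1/chat/completions",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: question }],
    }),
  };
}

// Fans the same question out to many models at once. allSettled means
// one slow or failing model does not block the rest of the run.
async function evaluateAll(
  models: string[],
  question: string,
  apiKey: string,
  send: (req: BenchRequest) => Promise<string>
): Promise<PromiseSettledResult<string>[]> {
  return Promise.allSettled(
    models.map((m) => send(buildOpenRouterRequest(m, question, apiKey)))
  );
}
```

Injecting `send` keeps the fan-out logic testable without a live network call; in production it would wrap `fetch` against the built request.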

Technical Highlights

  • Architected real-time benchmark execution using Server-Sent Events (SSE) with 30-minute timeout safeguards for long-running evaluations
  • Engineered automated LLM-as-judge evaluation pipeline achieving 100% structured verdict output through Zod schema validation
  • Built concurrent model execution system supporting 10+ simultaneous LLM evaluations via OpenRouter API
  • Designed comprehensive leaderboard system with accuracy metrics, winner highlighting, and reproducible JSON result exports
  • Developed dual-interface platform with both web UI and CLI automation supporting research workflows and CI/CD integration
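The live-progress channel in the first highlight can be sketched as standard SSE framing, where each message is a `data:` line terminated by a blank line. The event and field names (`progress`, `done`, `completed`) are assumptions for illustration.

```typescript
// Minimal sketch of SSE framing for live benchmark progress.
// Event and payload field names are illustrative assumptions.
type ProgressEvent =
  | { type: "progress"; model: string; completed: number; total: number }
  | { type: "done"; results: string };

// SSE requires each message to end with a blank line ("\n\n");
// the optional "event:" line lets the client route by message type.
function formatSSE(event: ProgressEvent): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}
```

The 30-minute safeguard mentioned above could plausibly be enforced with `AbortSignal.timeout(30 * 60 * 1000)` on the server-side work, though that is an assumption rather than the project's confirmed mechanism.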

Key Results & Impact

  • Enables concurrent evaluation of 10+ LLMs in a single benchmark run
  • Achieves 100% structured output compliance through Zod schema validation
  • Supports custom Q&A datasets for domain-specific model evaluation
  • Provides real-time progress tracking with SSE-based live updates
  • Delivers reproducible results through JSON export and leaderboard rankings
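The structured-verdict guarantee above comes from Zod validation in the project itself. As a dependency-free stand-in, the type guard below mirrors what such a schema enforces (so this sketch runs without the `zod` package); the field names `verdict` and `reasoning` are illustrative assumptions.

```typescript
// Stand-in for a Zod schema like
// z.object({ verdict: z.enum(["correct", "incorrect"]), reasoning: z.string() }).
// Returns a typed verdict, or null when the judge's output is malformed.
interface JudgeVerdict {
  verdict: "correct" | "incorrect";
  reasoning: string;
}

function parseVerdict(raw: string): JudgeVerdict | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // judge emitted non-JSON; caller can retry or mark invalid
  }
  if (typeof data !== "object" || data === null) return null;
  const d = data as Record<string, unknown>;
  if (d.verdict !== "correct" && d.verdict !== "incorrect") return null;
  if (typeof d.reasoning !== "string") return null;
  return { verdict: d.verdict, reasoning: d.reasoning };
}
```

Rejecting malformed output (rather than accepting it loosely) is what makes a "100% structured verdicts" claim meaningful: every verdict that reaches the leaderboard has passed the schema.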

Business Impact

CustomBench democratizes LLM evaluation by providing researchers and developers with a production-ready benchmarking platform. The tool accelerates model selection decisions, enables data-driven comparisons across providers, and integrates seamlessly into development workflows. It showcases expertise in real-time systems, API orchestration, and full-stack development with modern TypeScript patterns.

Key Achievements

  • Built full-stack LLM benchmarking platform enabling concurrent evaluation of 10+ models per run via OpenRouter
  • Implemented real-time benchmark execution with Server-Sent Events (SSE) and 30-minute timeout safeguards
  • Engineered automated LLM-as-judge evaluation pipeline with 100% structured verdict output using Zod schemas
  • Designed leaderboard system with accuracy metrics, winner highlighting, and reproducible JSON result exports
  • Created dual-interface platform supporting both web UI workflows and CLI automation for research/CI use cases
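The leaderboard with accuracy metrics and winner highlighting could be computed as below. The types and names (`ModelResult`, `rankLeaderboard`) are assumptions for illustration, not the project's actual data model.

```typescript
// Sketch of leaderboard ranking from per-model judge verdicts.
interface ModelResult {
  model: string;
  verdicts: boolean[]; // true = judge marked the answer correct
}

interface LeaderboardRow {
  model: string;
  accuracy: number; // fraction of correct answers, 0..1
  isWinner: boolean;
}

function rankLeaderboard(results: ModelResult[]): LeaderboardRow[] {
  const rows: LeaderboardRow[] = results.map((r) => ({
    model: r.model,
    accuracy: r.verdicts.length
      ? r.verdicts.filter(Boolean).length / r.verdicts.length
      : 0,
    isWinner: false,
  }));
  rows.sort((a, b) => b.accuracy - a.accuracy);
  if (rows.length > 0) {
    const top = rows[0].accuracy;
    // Mark every model tied for the best accuracy as a winner.
    for (const row of rows) row.isWinner = row.accuracy === top;
  }
  return rows;
}
```

Because the rows are plain data, serializing them with `JSON.stringify` gives the reproducible export the section describes.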

Interested in Learning More?

Check out the source code or see the project in action