CustomBench
A full-stack LLM benchmarking platform built with Next.js 16, React 19, TypeScript, and Bun that evaluates 10+ LLMs concurrently per run via OpenRouter. Features include custom Q&A dataset support, real-time execution streaming over Server-Sent Events, automated LLM-as-judge evaluation with structured Zod schemas, and consolidated result reporting with leaderboard rankings.
The Challenge
Evaluating and comparing Large Language Models (LLMs) is a time-consuming and inconsistent process. Developers and researchers struggle to benchmark multiple models against custom test scenarios, often resorting to manual testing that doesn't scale. Existing benchmarking solutions lack real-time progress visibility, automated evaluation capabilities, and the flexibility to test against custom datasets that reflect real-world use cases.
The Solution
Built a full-stack LLM benchmarking platform that enables concurrent evaluation of 10+ models via OpenRouter. The platform features a dual-interface design supporting both a modern web UI for interactive workflows and a CLI for automation and CI/CD integration. Real-time Server-Sent Events (SSE) provide live progress tracking, while an automated LLM-as-judge evaluation system delivers consistent, structured verdicts using Zod schema validation.
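The judge step's structured verdicts could be modeled roughly as below. In the actual platform Zod enforces the schema; this dependency-free TypeScript sketch shows the same contract with a hand-written type guard (the field names here are illustrative assumptions, not the project's real schema):

```typescript
// Illustrative verdict shape; the real platform defines this with a Zod schema.
interface Verdict {
  correct: boolean;   // did the candidate answer match the reference?
  score: number;      // 0..1 quality score assigned by the judge model
  reasoning: string;  // judge's short justification
}

// Validate the judge model's raw JSON reply before recording it,
// so only well-structured verdicts enter the results.
function parseVerdict(raw: string): Verdict {
  const v = JSON.parse(raw);
  if (
    typeof v?.correct !== "boolean" ||
    typeof v?.score !== "number" || v.score < 0 || v.score > 1 ||
    typeof v?.reasoning !== "string"
  ) {
    throw new Error("judge reply does not match the verdict schema");
  }
  return v as Verdict;
}
```

With Zod, the same contract becomes a `z.object({ ... })` whose `.parse` both validates and types the reply in one step, which is what makes every recorded verdict schema-conformant.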
Technical Highlights
- Architected real-time benchmark execution using Server-Sent Events (SSE) with 30-minute timeout safeguards for long-running evaluations
- Engineered an automated LLM-as-judge evaluation pipeline that validates every verdict against Zod schemas, so only well-structured verdicts are recorded
- Built concurrent model execution system supporting 10+ simultaneous LLM evaluations via OpenRouter API
- Designed comprehensive leaderboard system with accuracy metrics, winner highlighting, and reproducible JSON result exports
- Developed dual-interface platform with both web UI and CLI automation supporting research workflows and CI/CD integration
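The concurrent execution and timeout safeguards above can be sketched in TypeScript. `callModel` is a hypothetical stand-in for an OpenRouter request, and the 30-minute budget mirrors the safeguard mentioned earlier; the platform's actual function and field names will differ:

```typescript
// Hypothetical stand-in for an OpenRouter chat-completion request.
async function callModel(model: string, prompt: string): Promise<string> {
  return `${model}: answer to "${prompt}"`;
}

// Reject if the wrapped promise does not settle within `ms` milliseconds,
// clearing the timer either way so it cannot keep the process alive.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms}ms`)),
      ms
    );
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}

// Fan the prompt out to all models at once. allSettled keeps a single
// failing or timed-out model from aborting the whole benchmark run.
async function runBenchmark(models: string[], prompt: string) {
  const results = await Promise.allSettled(
    models.map((m) => withTimeout(callModel(m, prompt), 30 * 60 * 1000))
  );
  return results.map((r, i) => ({
    model: models[i],
    ok: r.status === "fulfilled",
    output: r.status === "fulfilled" ? r.value : String(r.reason),
  }));
}
```

Using `Promise.allSettled` rather than `Promise.all` is what lets a 10-model run report partial results instead of failing wholesale when one provider errors out.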
Key Results & Impact
Business Impact
CustomBench democratizes LLM evaluation by providing researchers and developers with a production-ready benchmarking platform. The tool accelerates model selection decisions, enables data-driven comparisons across providers, and integrates seamlessly into development workflows. It showcases expertise in real-time systems, API orchestration, and full-stack development with modern TypeScript patterns.