CustomBench

A full-stack LLM benchmarking platform built with Next.js 16, React 19, TypeScript, and Bun, enabling concurrent evaluation of 10+ LLMs per run via OpenRouter. Features custom Q&A dataset support, real-time execution streaming over Server-Sent Events, automated LLM-as-judge evaluation with structured Zod schemas, and consolidated result reporting with leaderboard rankings.

December 2025 - Present

The Challenge

Evaluating and comparing Large Language Models (LLMs) is a time-consuming and inconsistent process. Developers and researchers struggle to benchmark multiple models against custom test scenarios, often resorting to manual testing that doesn't scale. Existing benchmarking solutions lack real-time progress visibility, automated evaluation capabilities, and the flexibility to test against custom datasets that reflect real-world use cases.

The Solution

Built a full-stack LLM benchmarking platform that enables concurrent evaluation of 10+ models via OpenRouter. The platform features a dual-interface design supporting both a modern web UI for interactive workflows and a CLI for automation and CI/CD integration. Real-time Server-Sent Events (SSE) provide live progress tracking, while an automated LLM-as-judge evaluation system delivers consistent, structured verdicts using Zod schema validation.
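A concurrent fan-out over OpenRouter might look like the sketch below. The endpoint URL and request shape follow OpenRouter's chat-completions API; the function and type names (`buildOpenRouterRequest`, `evaluateAll`) are illustrative assumptions, not the project's actual code.

```typescript
// Builds a request for OpenRouter's chat-completions endpoint.
// Names here are illustrative; only the endpoint shape follows
// OpenRouter's public API.
interface BenchRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

function buildOpenRouterRequest(
  model: string,
  question: string,
  apiKey: string
): BenchRequest {
  return {
    url: "https://openrouter.ai/api/v1/chat/completions",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: question }],
    }),
  };
}

// Fans the same question out to many models at once. allSettled means
// one slow or failing model does not block the rest of the run.
async function evaluateAll(
  models: string[],
  question: string,
  apiKey: string,
  send: (req: BenchRequest) => Promise<string>
): Promise<PromiseSettledResult<string>[]> {
  return Promise.allSettled(
    models.map((m) => send(buildOpenRouterRequest(m, question, apiKey)))
  );
}
```

Injecting `send` keeps the fan-out logic testable without a live network call; in production it would wrap `fetch` against the built request.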

Technical Highlights

  • Architected real-time benchmark execution using Server-Sent Events (SSE) with 30-minute timeout safeguards for long-running evaluations
  • Engineered automated LLM-as-judge evaluation pipeline achieving 100% structured verdict output through Zod schema validation
  • Built concurrent model execution system supporting 10+ simultaneous LLM evaluations via OpenRouter API
  • Designed comprehensive leaderboard system with accuracy metrics, winner highlighting, and reproducible JSON result exports
  • Developed dual-interface platform with both web UI and CLI automation supporting research workflows and CI/CD integration
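The live-progress channel in the first highlight can be sketched as standard SSE framing, where each message is a `data:` line terminated by a blank line. The event and field names (`progress`, `done`, `completed`) are assumptions for illustration.

```typescript
// Minimal sketch of SSE framing for live benchmark progress.
// Event and payload field names are illustrative assumptions.
type ProgressEvent =
  | { type: "progress"; model: string; completed: number; total: number }
  | { type: "done"; results: string };

// SSE requires each message to end with a blank line ("\n\n");
// the optional "event:" line lets the client route by message type.
function formatSSE(event: ProgressEvent): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}
```

The 30-minute safeguard mentioned above could plausibly be enforced with `AbortSignal.timeout(30 * 60 * 1000)` on the server-side work, though that is an assumption rather than the project's confirmed mechanism.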

Key Results & Impact

  • Enables concurrent evaluation of 10+ LLMs in a single benchmark run
  • Achieves 100% structured output compliance through Zod schema validation
  • Supports custom Q&A datasets for domain-specific model evaluation
  • Provides real-time progress tracking with SSE-based live updates
  • Delivers reproducible results through JSON export and leaderboard rankings
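The structured-verdict guarantee above comes from Zod validation in the project itself. As a dependency-free stand-in, the type guard below mirrors what such a schema enforces (so this sketch runs without the `zod` package); the field names `verdict` and `reasoning` are illustrative assumptions.

```typescript
// Stand-in for a Zod schema like
// z.object({ verdict: z.enum(["correct", "incorrect"]), reasoning: z.string() }).
// Returns a typed verdict, or null when the judge's output is malformed.
interface JudgeVerdict {
  verdict: "correct" | "incorrect";
  reasoning: string;
}

function parseVerdict(raw: string): JudgeVerdict | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // judge emitted non-JSON; caller can retry or mark invalid
  }
  if (typeof data !== "object" || data === null) return null;
  const d = data as Record<string, unknown>;
  if (d.verdict !== "correct" && d.verdict !== "incorrect") return null;
  if (typeof d.reasoning !== "string") return null;
  return { verdict: d.verdict, reasoning: d.reasoning };
}
```

Rejecting malformed output (rather than accepting it loosely) is what makes a "100% structured verdicts" claim meaningful: every verdict that reaches the leaderboard has passed the schema.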

Business Impact

CustomBench democratizes LLM evaluation by providing researchers and developers with a production-ready benchmarking platform. The tool accelerates model selection decisions, enables data-driven comparisons across providers, and integrates seamlessly into development workflows. It showcases expertise in real-time systems, API orchestration, and full-stack development with modern TypeScript patterns.

Key Achievements

  • Built full-stack LLM benchmarking platform enabling concurrent evaluation of 10+ models per run via OpenRouter
  • Implemented real-time benchmark execution with Server-Sent Events (SSE) and 30-minute timeout safeguards
  • Engineered automated LLM-as-judge evaluation pipeline with 100% structured verdict output using Zod schemas
  • Designed leaderboard system with accuracy metrics, winner highlighting, and reproducible JSON result exports
  • Created dual-interface platform supporting both web UI workflows and CLI automation for research/CI use cases
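The leaderboard with accuracy metrics and winner highlighting could be computed as below. The types and names (`ModelResult`, `rankLeaderboard`) are assumptions for illustration, not the project's actual data model.

```typescript
// Sketch of leaderboard ranking from per-model judge verdicts.
interface ModelResult {
  model: string;
  verdicts: boolean[]; // true = judge marked the answer correct
}

interface LeaderboardRow {
  model: string;
  accuracy: number; // fraction of correct answers, 0..1
  isWinner: boolean;
}

function rankLeaderboard(results: ModelResult[]): LeaderboardRow[] {
  const rows: LeaderboardRow[] = results.map((r) => ({
    model: r.model,
    accuracy: r.verdicts.length
      ? r.verdicts.filter(Boolean).length / r.verdicts.length
      : 0,
    isWinner: false,
  }));
  rows.sort((a, b) => b.accuracy - a.accuracy);
  if (rows.length > 0) {
    const top = rows[0].accuracy;
    // Mark every model tied for the best accuracy as a winner.
    for (const row of rows) row.isWinner = row.accuracy === top;
  }
  return rows;
}
```

Because the rows are plain data, serializing them with `JSON.stringify` gives the reproducible export the section describes.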

Interested in Learning More?

Check out the source code or see the project in action