Show HN: EvalsHub: Your AI is failing in production and you don't know it

3 points - yesterday at 6:34 PM


I was tired of stitching together Langfuse for tracing, promptfoo for red teaming and evals, and custom scripts for CI/CD. It was a mess, so I built EvalsHub.

EvalsHub does all of it in one place. Automatic production scoring, red teaming, prompt versioning, and CI/CD integration. Zero to full eval coverage in 30 minutes.

Would love brutal feedback from anyone shipping AI in production.

evalshub.ai

Source

Comments

AgentTax yesterday at 9:31 PM
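The rubric-underspecification point can be made concrete. Below is a minimal sketch (not EvalsHub's actual API; every name here is hypothetical) of what such tooling might do: render a rubric into a judge prompt that forces one pass/fail verdict per criterion, and run a naive lint that flags criteria leaning on vague adjectives with no observable test.

```python
# Hypothetical sketch of rubric-driven LLM-as-a-judge tooling.
# Assumption: a rubric is a list of plain-text criteria strings.

# Adjectives that tend to signal an underspecified criterion: the judge
# has no observable test to apply, so scores drift run to run.
VAGUE_TERMS = {"good", "appropriate", "reasonable", "high quality", "helpful"}

def is_underspecified(criterion: str) -> bool:
    """Flag criteria that rely on vague adjectives without a concrete example."""
    text = criterion.lower()
    return any(term in text for term in VAGUE_TERMS) and "e.g." not in text

def build_judge_prompt(rubric: list[str], output: str) -> str:
    """Render a rubric into a judge prompt demanding one PASS/FAIL per criterion."""
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(rubric))
    return (
        "Score the response against each criterion below. "
        "Answer PASS or FAIL per line, with a one-sentence reason.\n\n"
        f"Criteria:\n{criteria}\n\nResponse:\n{output}"
    )

rubric = [
    "Cites at least one source URL from the provided context",
    "Tone is appropriate",  # vague: no observable test for the judge to apply
]
flagged = [c for c in rubric if is_underspecified(c)]
```

Here `flagged` contains only the second criterion; the first is checkable because it names an observable condition. A linter like this is a crude proxy, but it illustrates the kind of feedback a platform could surface before a team trusts judge scores in CI.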
The consolidation angle makes sense — the Langfuse + promptfoo + custom scripts stack is genuinely painful. The question I'd ask is whether the tradeoff is worth it. Each of those tools is deep in its specific domain. What does EvalsHub sacrifice to cover all three, and where does it still defer to specialists? Also curious how you handle the rubric quality problem. LLM-as-a-judge is only as good as the criteria — do you have tooling to help teams know when their rubrics are underspecified?