You can't unit test for taste

195 points - yesterday at 8:54 AM

Comments

trjordan today at 1:18 PM

You can't unit test for taste if you haven't written down what you mean by taste. If you can externalize it, then you can.

Follow this line of thinking, and the AI-friendly answer is easy: we just have to externalize everything we know, so Claude can implement what I want.

Except that I can't fully externalize myself. Debugging a system takes more resources than running the system. If I could write down everything I know and hand it to a machine, I'd do that, but it's impossible.

People aren't books or hashmaps. If you want to build something, you need to use the tools, not teach the tools to use you.

[edit: I'm trying to figure out if there's something to be done about this. Email me if you want to chat -- tr at tern dot sh]

jt2190 today at 5:45 PM

> Overall the evaluation of success was one of the most challenging parts of the project. As a developer, I’m used to building features that either work or don’t and there is often an objective way to measure how well a feature performs. For messy real world data it was hard to evaluate how good or bad the pipeline was. Furthermore, it was easy to start optimising for a specific parameter or route and find later that this work led to severe degradations in other areas.

> Verification becomes hard to reason about because there is no ground truth for points of interest, there are no red/green unit tests for taste. I’m sure these are familiar challenges to data scientists and that there are frameworks and evals for working on them. This will require more iteration and manual overrides. Hopefully with feedback and collaboration from the community. But for now I’ve shipped V1…

I suspect LLMs may be able to help us quantify our taste because they can keep track of so many data points all at once, where we have to lossily abstract these details away.

HoldOnAMinute today at 6:07 PM

I am quite confident I could take a series of photos of various designs and classify them as "tacky" or not, and train a neural network to recognize tackiness.

zamalek today at 3:40 PM

Unrelated to code, but along the same lines. I've been keeping track of the Reckless Ben case to fuel my unhealthy indignation, and we just had a like-for-like comparison between a human and an LLM.

Human: well-scoped argument that does just enough to get the job done with minimal risk.

AI: Extremely clever and correct legal argument that almost any lawyer would have said not to file (at least as written). It tries to burn the world and seriously risks pissing off the judge.

https://www.youtube.com/watch?v=YRXJnKP6Tu0

Gosper today at 1:55 PM

Language count is a decent notoriety signal though pretty coarse. The OP/author should take a look at QRank: https://qrank.toolforge.org/

> QRank is a ranking signal for Wikidata entities. It gets computed by aggregating page view statistics for Wikipedia, Wikitravel, Wikibooks, Wikispecies and other Wikimedia projects

from https://github.com/brawer/wikidata-qrank/blob/main/doc/desig...

pjmlp today at 2:18 PM

Exactly one of the reasons I never went along with all the TDD dogma of only writing code to fix broken tests.

There is a reason conference talks are always about plain algorithms and data structures.

timroman today at 2:05 PM

https://pureinference.com/insights/taste-is-the-new-skill

I wrote about this a few months back. Rick Rubin is famous for this. I do think it is something that can be trained though, it just needs a lot more context. Taste builds over time through lots of unit tests, through lots of content writing, through an accumulation of product decisions. It’s hard to put it in the individual spec, but it can be teased out of 100 project specs. And when you get to that scale the AI starts to do it pretty well.

ChrisMarshallNY today at 2:30 PM

> but it ended up merely in a supporting role

This has been my experience, as well, but it’s a really big support. It just needs adult supervision. I can’t understand how vibe-coded apps actually work.

As far as “taste” goes, I test my stuff constantly, checking for even minor “friction points,” and sometimes refactoring back to design in order to resolve issues that many folks would ship. I’m pretty anal, and want my work to be the best experience possible.

I can’t see any LLM coming close to being able to evaluate the user experience, like I can.

TimXare today at 1:41 PM

Taste is mostly the part of the spec you forgot to write down, plus the part you couldn't write down even if you tried.

layer8 today at 4:22 PM

You can’t even unit-test for correct program logic, unless you’re able to enumerate all possible inputs and states within a short time frame.
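
A minimal sketch of the one case where enumeration is feasible, assuming a hypothetical 8-bit saturating add (`saturating_add_u8` is an illustrative name, not from the thread). Two 8-bit inputs give only 65,536 pairs; two 64-bit inputs would give 2^128, which is where this approach stops working.

```python
from itertools import product


def saturating_add_u8(a, b):
    """Add two unsigned 8-bit values, clamping at 255 instead of wrapping."""
    total = a + b
    return 255 if total > 255 else total


def check_exhaustively():
    """Compare against a reference for every one of the 256 * 256 input pairs."""
    checked = 0
    for a, b in product(range(256), repeat=2):
        assert saturating_add_u8(a, b) == min(a + b, 255)
        checked += 1
    return checked
```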

fotoblur today at 2:58 PM

No but you can add selection as part of your workflow. Governance is something AI agents have allowed me to focus on more and more and this IMHO is where taste lands for me: https://github.com/lramoth/infoPipeline/blob/main/governance...

chantepierre today at 1:15 PM

It makes me smile when runners use "X is a marathon, not a sprint" to hint at an effort that accumulates over time and an optimal use of energy.

I do it too because it's a common expression, and a marathon is of course longer than a sprint, but both have in common that properly raced, they are absolutely brutal efforts that leave you without a single additional drop at the end. The effort length and instantaneous power output change, of course. Maybe "it's a marathon build, not the race" would be more precise at the loss of nearly all its expressive power (but with a lot more pedanticism points) :-p .

Nice project !

thomasfl today at 2:49 PM

That's what linters are for. Linters can prevent SQL code from spilling out to code outside the model layer. Even more important when vibecoding.
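
A rough sketch of such a rule, assuming a project where only files under `models/` may contain SQL. The function name `find_sql_leaks`, the directory convention, and the regex heuristic are all assumptions made for illustration; a real linter plugin would be more careful about false positives.

```python
import ast
import re

# Heuristic only: string literals that start like a SQL statement.
SQL_PATTERN = re.compile(
    r"^\s*(SELECT\b.+\bFROM\b|INSERT\s+INTO\b|UPDATE\s+\w+\s+SET\b|DELETE\s+FROM\b)",
    re.IGNORECASE | re.DOTALL,
)


def find_sql_leaks(source, path, allowed_dir="models/"):
    """Return (line, snippet) for SQL-looking string literals outside allowed_dir."""
    if allowed_dir in path:
        return []
    leaks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if SQL_PATTERN.match(node.value):
                leaks.append((node.lineno, node.value.strip()[:40]))
    return leaks
```

Run over every file in CI, a non-empty result fails the build, which gives a red/green signal for one narrow, written-down piece of taste.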

a_c today at 1:38 PM

I like to think of testing as making sure things are not wrong, not as making them right.

Working, useful, delightful, in that order. Testing can make things more likely to work, that's it.

jpadkins today at 2:27 PM

I think another important question is can you distill taste? (another comment uses the phrase "externalize", which might mean something similar).

I think people have been trying for the written word, with some degree of success (anti-slop skills). I have been trying for visuals, and it's pretty meh. It's easy to get a multimodal LLM to follow a style guide, but a style guide doesn't capture everything that accounts for taste. And anything that is dynamic (not a screenshot test) seems really hard or really expensive.

tuo-lei today at 2:47 PM

the taste part for me is cutting what the agent generated. 200 lines come back, i keep 80, no test for which 80.

carra today at 2:03 PM

So now we need a framework for unit tastes

esafak today at 1:47 PM

We can encode taste -- generative AI depends on it. Ask people to compare two examples and pick the one with better taste. You can even ask them to rate multiple subjective criteria at once. Use that to learn a scoring function based on the rating labels, and raw features. Now you can write tests.
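
A minimal sketch of that pipeline, assuming a Bradley-Terry model fit on pairwise votes. The design names, the votes, and the test are made up for illustration, and the raw features mentioned above are left out to keep it short.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the standard minorization-maximization update. Every item should
    win and lose at least once, otherwise its strength drifts to an extreme.
    """
    items = sorted({x for pair in comparisons for x in pair})
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i
            )
            updated[i] = wins[i] / denom if denom else strength[i]
        # Rescale so the strengths always sum to the number of items.
        total = sum(updated.values())
        strength = {k: v * len(items) / total for k, v in updated.items()}
    return strength


# Hypothetical votes: each tuple is (preferred design, rejected design).
votes = [
    ("clean", "busy"), ("clean", "busy"), ("busy", "clean"),
    ("clean", "tacky"), ("clean", "tacky"), ("tacky", "clean"),
    ("busy", "tacky"), ("busy", "tacky"), ("tacky", "busy"),
]
scores = fit_bradley_terry(votes)


def test_shipped_design_outscores_tacky_baseline():
    # The "unit test for taste": the learned score must beat a known-bad baseline.
    assert scores["clean"] > scores["tacky"]
```

The test only checks agreement with the raters who produced the votes, so it is as good as the labels behind it.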

throw93949444 today at 1:11 PM

> For example, my native Iceland had a nice mix of nature, historical sites and populated places.

You absolutely can unit test for taste: just put an agent into a loop, and write into the prompt what you like. Then do scoring...

Iceland is a really bad example; it basically has one populated site (the capital) and a circular road that goes around the island.