Kotlin creator's new language: talk to LLMs in specs, not English
251 points - today at 2:22 PM
SourceComments
Joel Spolsky, stackoverflow.com founder, Talk at Yale: Part 1 of 3 https://www.joelonsoftware.com/2007/12/03/talk-at-yale-part-...
The idea, IIUC, seems to be that instead of directly telling an LLM agent how to change the code, you keep markdown "spec" files describing what the code does and then the "codespeak" tool runs a diff on the spec files and tells the agent to make those changes; then you check the code and commit both updated specs and code.
It has the advantage that the prompts are all saved along with the source rather than lost, and in a format that lets you also look at the whole current specification.
The limitation seems to be that you can't modify the code yourself if you want the spec to reflect it (and also can't do LLM-driven changes that refer to the actual code), and also that in general it's not guaranteed that the spec actually reflects all important things about the program, so the code does also potentially contain "source" information (for example, maybe your want the background of a GUI to be white and it is so because the LLM happened to choose that, but it's not written in the spec).
The latter can maybe be mitigated by doing multiple generations and checking them all, but that multiplies LLM and verification costs.
Also it seems that the tool severely limits the configurability of the agentic generation process, although that's just a limitation of the specific tool.
* This isn't a language, it's some tooling to map specs to code and re-generate
* Models aren't deterministic - every time you would try to re-apply you'd likely get different output (without feeding the current code into the re-apply and let it just recommend changes)
* Models are evolving rapidly, this months flavour of Codex/Sonnet/etc would very likely generate different code from last months
* Text specifications are always under-specified, lossy and tend to gloss over a huge amount of details that the code has to make concrete - this is fine in a small example, but in a larger code base?
* Every non-trivial codebase would be made up of of hundreds of specs that interact and influence each other - very hard (and context - heavy) to read all specs that impact functionality and keep it coherent
I do think there are opportunities in this space, but what I'd like to see is:
* write text specifications
* model transforms text into a *formal* specification
* then the formal spec is translated into code which can be verified against the spec
2 and three could be merged into one if there were practical/popular languages that also support verification, in the vain of ADA/Spark.
But you can also get there by generating tests from the formal specification that validate the implementation.
LLMs works on both translation steps. But you end up with an healthy amount of tests.
I tagged each tests with the id of the spec so I do get spec to test coverage as well.
Beside standard code coverage given by the tests.
Definitely won't use it for prod ofc but may try it out for a side-project.
It seems that this is more or less:
- instead of modules, write specs for your modules
- on the first go it generates the code (which you review)
- later, diffs in the spec are translated into diffs in the code (the code is *not* fully regenerated)
this actually sounds pretty usable, esp. if someone likes writing. And wherever you want to dive deep, you can delve down into the code and do "microoptimizations" by rolling something on your own (with what seems to be called here "mixed projects").That said, not sure if I need a separate tool for this, tbh. Instead of just having markdown files and telling cause to see the md diff and adjust the code accordingly.
Instead of imperatively letting the agents hammer your codebase into shape through a series of prompts, you declare your intent, observe the outcome and refine the spec.
The agents then serve as a control plane, carrying out the intent.
My quick understanding is that isn't really trying to utilize any formal specification but is instead trying to more-clearly map the relationship between, say, an individual human-language requirement you have of your application, and the code which implements that requirement.
* Yes, this is a language, no its not a programming language you are used to, but a restricted/embellished natural language that (might) make things easier to express to an LLM, and provides a framework for humans who want to write specifications to get the AI to write code.
* Models aren't deterministic, but they are persistent (never gonna give up!). If you generate tests from your specification as well as code, you can use differential testing to get some measure (although not perfect) of correctness. Never delete the code that was generated before, if you change the spec, have your model fix the existing code rather than generate new code.
* Specifications can actually be analyzed by models to determine if they are fully grounded or not. An ungrounded specification is going to not be a good experience, so ask the model if it thinks your specification is grounded.
* Use something like a build system if you have many specs in your code repository and you need to keep them in sync. Spec changes -> update the tests and code (for example).
I'm writing a language spec for an LLM runner that has the ability to chain prompts and hooks into workflows.
https://github.com/AlexChesser/ail
I'm writing the tool as proof of the spec. Still very much a pre-alpha phase, but I do have a working POC in that I can specify a series of prompts in my YAML language and execute the chain of commands in a local agent.
One of the "key steps" that I plan on designing is specifically an invocation interceptor. My underlying theory is that we would take whatever random series of prose that our human minds come up with and pass it through a prompt refinement engine:
> Clean up the following prompt in order to convert the user's intent > into a structured prompt optimized for working with an LLM > Be sure to follow appropriate modern standards based on current > prompt engineering reasech. For example, limit the use of persona > assignment in order to reduce hallucinations. > If the user is asking for multiple actions, break the prompt > into appropriate steps (**etc...)
That interceptor would then forward the well structured intent-parsed prompt to the LLM. I could really see a step where we say "take the crap I just said and turn it into CodeSpeak"
What a fantastic tool. I'll definitely do a deep dive into this.
The demo I've briefly seen was very very far from being impressive.
Got rejected, perhaps for some excessive scepticism/overly sharp questions.
My scepticism remains - so far it looks like an orchestrator to me and does not add enough formalism to actually call it a language.
I think that the idea of more formal approach to assisted coding is viable (think: you define data structures and interfaces but don't write function bodies, they are generated, pinned and covered by tests automatically, LLMs can even write TLA+/formal proofs), but I'm kinda sceptical about this particular thing. I think it can be made viable but I have a strong feeling that it won't be hard to reproduce that - I was able to bake something similar in a day with Claude.
And whatever codespeak offers is like a weird VCS wrapper around this. I can already version and diff my skills, plans properly and following that my LLM generated features should be scoped properly and be worked on in their own branches. This imo will just give rise to a reason for people to make huge 8k-10k line changes in a commit.
I've been working on this from the other direction — instead of formalizing how you talk to the model, structure the knowledge the model has access to. When you actually measure what proportion of your domain knowledge frontier models can produce on their own (we call this the "esoteric knowledge ratio"), it's often only 40-55% for well-documented open source projects. For proprietary products it's even lower. No amount of spec formalism fixes that gap — you need to get the missing knowledge into context.
I've had good success getting LLMs to write complicated stuff in haskell, because at the end of the day I am less worried about a few errant LLM lines of code passing both the type checking and the test suite and causing damage.
It is both amazing and I guess also not surprising that most vibe coding is focused on python and javascript, where my experience has been that the models need so much oversight and handholding that it makes them a simple liability.
The ideal programming language is one where a program is nothing but a set of concise, extremely precise, yet composable specifications that the _compiler_ turns into efficient machine code. I don't think English is that programming language.
I presume this is temporary since the project is still in alpha, but I'm curious why this requires use of an API at all and what's special about it that it can't leverage injecting the prompt into a Claude Code or other LLM coding tool session.
[0]: https://codespeak.dev/blog/greenfield-project-tutorial-20260...
Is it a code generator tool from specs? Ugh. Why not push for the development of the protocol itself then?
The other piece that has always struck me as a huge inefficiency with current usage of LLMs is the hoops they have to jump through to make sense of existing file formats - especially making sense of (or writing) complicated semi-proprietary formats like PDF, DOC(X), PPT(X), etc.
Long-term prediction: for text, we'll move away from these formats and towards alternatives that are designed to be optimal for LLMs to interact with. (This could look like variants of markdown or JSON, but could also be Base64 [0] or something we've not even imagined yet.)
Of course an expert would throw it out and design/write it properly so they know it works.
https://www.zmescience.com/science/news-science/polish-effec...
i guess you can build a cli toolchain for it, but as a technique it’s a bit early to crystallize into a product imo, i fully expect overcoding to be a standard technique in a few years, it’s the only way i’ve been able to keep up with AI-coded files longer than 1500 lines
You write a markdown spec.
The script takes it and feeds it to an LLM API.
The API generates code.
Okay? Where is this "next-generation programming language" they talk about?
Or Lojban?
However, there is no case for more complicated, multi-file changes or architecture stuff.
"[1] When computing LOC, we strip blank lines and break long lines into many"
This feels wrong, as the spec doesn't consistently generate the same output.
But upon reflection, "source of truth" already refers to knowledge and intent, not machine code.
Instant tab close!
Also, the examples feel forced, as if you use external libraries, you don't have to write your own "Decode RFC 2047"
So for example, if you refactor a program, make the LLM do anything but keep the logic of the program intact.
There you have it: Code laundering as a service. I guess we have to avoid Kotlin, too.
...which seems to suggest that the authors themselves don't dogfood their own software. Please tell me that Codespeak was written entirely with Codespeak!
Instead of that json, which is so last year, why not use an agent to create an MD file to setup another agent, that will compile another MD file and feed it to the third agent, that... It is turtles, I mean agents, all the way down!
I'm hoping for a framework that expands upon Behavior Driven Development (BDD) or a similar project-management concept. Here's a promising example that is ripe for an Agentic AI implementation, https://behave.readthedocs.io/en/stable/philosophy/#the-gher...
Does this make it a 6th generation language?
The site does describe it as a "programming language," which feels like a novel use of the term to me. The borders around a term like "programming language" are inherently fuzzy, but something like "code generation tool" better describes CodeSpeak IMHO.
Programming is in the end math, the model is defined and, when done correctly follows common laws.
I know dark mode is really popular with the youngens but I regularly have to reach for reader mode for dark web pages, or else I simply cannot stand reading the contents.
Unfortunately, this site does not have an obvious way of reading it black-on-white, short of looking at the HTML source (CTRL+U), which - in fact - I sometimes do.
Also, English is really too verbose and imprecise for coding, so we developed a programming language you can use instead.
Now, this gives me a business idea: are you tired of using CodeSpeak? Just explain your idea to our product in English and we'll generate CodeSpeak for you.