HackerRank open sourced its ATS. My resume scored 90/100. Oh wait 74. No, 88
861 points - today at 1:44 AM
> temperature 0.1 - low, supposedly nudging the model toward deterministic outputs
This is not correct (and is briefly touched on later in the piece when he sets temperature to 0). Temperature is not some kind of "deterministic" switch; rather, it affects the sampling distribution, which becomes more "spiky" but is still very much a distribution.
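A minimal sketch of that point, assuming made-up logits for three candidate tokens (the numbers and the function name are illustrative, not taken from any model or from the hiring-agent repo). Dividing the logits by a small temperature sharpens the distribution, but the less likely tokens keep a non-zero probability:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; T=0 is usually special-cased as argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.5, 0.5]

probs_default = softmax_with_temperature(logits, 1.0)
probs_low = softmax_with_temperature(logits, 0.1)

for label, probs in (("T=1.0", probs_default), ("T=0.1", probs_low)):
    print(label, [round(p, 4) for p in probs])
```

At T=0.1 the top token dominates, yet the runner-up is still sampled some of the time, and over six LLM calls per resume those small chances compound.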
To be clear:
- randomly filtering "too many" resumes is pretty much allowed (I think)
- but it must be actually random, independent of the resume (and can be done in multiple layers, i.e. random filter > pre-select > random filter > select)
- this isn't the case for AI, as the random aspect is not independent of the actual resume evaluation
- in general you can't make sure the AI doesn't apply systematic biases, and there are strong indications that it does
- for humans you can train them and order them to ignore their biases. This won't work reliably either, _but now you have delegated the responsibility for illegal biases to the hiring personnel violating the order_. With AI usage you are responsible no matter what you tell it. Lastly, you can technically show/prove that a specific AI is highly biased in a specific context, which for human employees is technically possible but not really practical. So this moves "specific, mostly deniable" cases into "systematically proven bias" territory. In other words, legal risk goes from "limited/no issue" to "people can systematically f-you over if they know you use AI for hiring".
As someone who's run hiring pipelines for technical roles in the past few years, that's actually a fantastic number. I objectively hate saying that, but it's true.
35% chance of elevating a technical individual to the next stage with no effort? I've seen as many as 100+ applicants an hour even when including a domain specific screener question. That's 35 "screened" applicants in an hour. Were valid candidates screened out? Yes. Do you still have a candidate pool 35x larger than you need? Unfortunately, also yes.
The volume of applicants is SO HIGH that your chances of getting moved to the next stage are actually markedly worse if AI isn't involved. If you didn't apply immediately (using an AI bot) there's 50+ people ahead of you, and an exhausted technical leader if they ever make it to your resume.
Referral bonuses exist for a reason.
For example, 65 points are given for a mix of personal projects and open source contributions. Which is great if your one and only interest is in tech, and you don't have a family, dependents or a second/third job. If you have any of those other things, well the odds seem like they're incredibly stacked against you.
And it makes me wonder how many of these systems are stacked in favour of wealthy people with a near special interest level of obsession with tech and no worries outside of going to college/working a single job in their industry of choice.
Determinism matters for reproducibility, but do you really want these outputs to be reproducible in this particular case? Making LLM outputs deterministic is relatively trivial: you have to use batch-invariant kernels (if you use batching) and either set the temperature to 0 (don't do that, randomized sampling is there for a reason) or fix the seed (better). It's readily available in a few systems. But this won't make the result more useful, it will just obscure the fact that the agent is genuinely not sure about it - look at the range of the scores it gives! It still won't predict anything, but the score will stay the same each time. Do you really want that?
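To illustrate with a toy (this is a seeded random number generator standing in for a scorer, not an LLM; `noisy_score`, the base value of 80, and the noise width are all assumptions made up for the sketch): fixing the seed makes the number repeatable, but the spread across seeds is still there, just hidden.

```python
import random

def noisy_score(resume_text, seed):
    """Hypothetical stand-in for an LLM scorer: a base value plus sampling noise."""
    # A string seed makes the draw depend on both the seed and the input text.
    rng = random.Random(f"{seed}:{resume_text}")
    base = 80  # assumed central tendency, purely illustrative
    return round(base + rng.gauss(0, 6))

resume = "example resume text"

# Same seed, same input: the score never changes.
repeated = {noisy_score(resume, 42) for _ in range(5)}

# Different seeds, same input: the underlying uncertainty shows up again.
across_seeds = [noisy_score(resume, seed) for seed in range(20)]

print("fixed seed:", repeated)
print("range across seeds:", min(across_seeds), "to", max(across_seeds))
```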
What happens here is they're supplying too little information (just a resume, which is almost at the noise level) and expecting a reply with too broad implications. This is a basic design mistake regardless of whether it uses LLMs. All surveys, tests, laws, and voting systems are extremely sensitive to framing because they work off too little information. But they also don't exist in a vacuum, unlike this thing.
After a few runs it picked things up appropriately. I always got dinged on formal education though.
This stuff is gross.
That's a tiny model. No LLM is going to be a perfect and repeatable judge, but a tiny 4B model is like plugging an RNG into this system.
This whole exercise feels like someone vibe coded an ATS and got it to the point where the tests were passing because they decided they should have an open source ATS project.
https://neonrocket.com/2014/05/rescued-from-the-ashes-i-dont...
There is another name for it: a waste of electricity.
But wait, not waste! Consumers paid for it fully, with nice profit margins.
You and me, paid.
Try using google flights, or booking.com: the prices shown in search results list are frequently significantly different from those in a single result. It's a nondeterministic compute when it's easy to spot it. But it's not always that easy.
It's all sad, to be honest.
BONUS POINTS: 5.0
------------------------------
Google Summer of Code (GSoC) participation: +5
Even though I've never done this, and don't claim to have done it in my CV.
But logical inference itself is limited. You still have to find out if p is true or not - the ground truth.
How do you find that? You would be able to define in the prompt that if resume has p, infer q and do this. But determining the truth value of p is something LLM cannot do.
It's not a limitation of the LLM. It's the limitation of logic itself. You take 10 humans and give them the resumes with the same rubrics as the LLM. You'll get a similar range of scores because everyone would assign different values.
The issue is not in logical inference. It's in determining the value of p, which takes much more than logic. And current LLMs are limited to being logical.
In my experience, cold-applying has always worked essentially as a black hole, and LLMs haven't changed that much. The reality is that alternative avenues are always necessary to get the job you want. That could be a third-party recruiter; reaching out to a hiring manager on LinkedIn; or using your network to get referrals. Those continue to work whether the company is using a bone-headed tool like this or not.
> *SCORES MUST NEVER DEPEND ON THE FOLLOWING FACTORS:*
> - College, university, or educational institution name
> - CGPA, GPA, or academic grades
I don't understand why they would omit these factors from the evaluation.
Well done you! It is difficult to avoid architectural complexity, but imho well worth it.
Which sort of sounds workable until you scale it up to larger datasets, where at some point compute/time/energy costs will render it non-viable.
I am sure there's some reasonable rule of thumb estimation on distribution that could be applied based off fewer runs per data artifact, but you're always going to be trading off against confidence by doing this.
Beyond this, I'd bet that almost no implemented systems that use LLMs for scoring, ranking, or decision making use such a multi-run approach. Partly because people don't understand their behaviour is stochastic, perhaps because a lot of people without a background in statistics don't understand what stochastic actually means, and no doubt partly because of budget concerns: if you have to ask an LLM to do the same thing 10, 50, 100 times to get a sufficiently good result, then the cost saving argument is either weakened or completely destroyed.
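As a rough sketch of that trade-off, here are the three scores from the thread title run through the standard-library `statistics` module. The function names and the target precision of 2 points are assumptions for illustration; the only real data is 90, 74, 88.

```python
import math
import statistics

def summarize_runs(scores):
    """Mean, sample standard deviation, and standard error of repeated scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    sd = statistics.stdev(scores)
    return {
        "n": n,
        "mean": statistics.mean(scores),
        "stdev": sd,
        "stderr": sd / math.sqrt(n),  # uncertainty of the mean shrinks with sqrt(n)
    }

def runs_needed(scores, target_stderr):
    """Runs required to bring the standard error of the mean under a target."""
    return math.ceil(statistics.variance(scores) / target_stderr ** 2)

runs = [90, 74, 88]  # the three scores quoted in the thread title
summary = summarize_runs(runs)
needed = runs_needed(runs, target_stderr=2)  # hypothetical precision target

print(summary)
print("runs needed for a standard error of 2 points:", needed)
```

Because the standard error only falls with the square root of the run count, halving the uncertainty costs four times the calls, which is where the cost-saving argument starts to erode.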
There is at least one more aspect worth considering in the specific case of resumes/CVs: is the inconsistency of scoring by LLM worse than the inconsistency of scoring by a human following a similar process?
Because the reality is that, even for an experienced recruiter, reviewing hundreds or thousands of resumes or CVs gets pretty fatiguing. People get hungry, bored, tired, restless, irritable, etc.
That inevitably leads to inconsistencies creeping in, so there's always an element of "luck" (or, perhaps better, uncertainty) as to whether your resume/CV passes screening.
So is that inconsistency better or worse with LLM screening? I don't know. But, at least, if it's not worse maybe it doesn't matter for this specific use case. And if it's notably better then maybe it's raised the bar on what "good enough" screening looks like?
(And I'm sure other use cases warrant similar, "does it matter?", questions, with the answers no doubt landing differently.)
This isn't to diminish the whispernet. Rather, it shows just how many important signals cannot be quantized.
I am not currently looking for employment, nor am I currently particularly worried about future prospects if I was suddenly in the position of looking for employment.
But if I ended up in a position with nothing to lean on but scattering my CV everywhere, well...
A lot of my major contributions are littered across the internet, private, or even just verbal/consultancy. They're things I did for free, in my spare time.
I also avoid GitHub. If you just look at my GitHub page for extra context, you would likely miss that delivering that very GitHub page likely involved a few bits of code I wrote.
Now, I could do a better job of trying to document this stuff, so it could be easier to find... But also I can't quite imagine how that would work.
> 30 for personal projects
These are insane weights for scoring a software engineer's resume.
> 35 points for open source contributions
> 30 for personal projects
I don't contribute to open source or have personal projects because I don't spend my free time doing what I do 40 hours a week to make a living. My 15 years of work experience is worth a maximum of 25%, so any company using this idiotic system would pass on me immediately. Open source and personal projects are fine, but in no sane world are they worth 65% of a resume's score.
Well, I think I found your problem
Why is it so hard to write out an acronym once...
- Varies from 102.0/100 to 100.0/100
- Missed lots of OSS work
- Misinterprets GSoC work (thinks that projects I started, which were contributed to in GSoC, imply that I received a GSoC stipend)
- Areas for improvement vary inconsistently (from "there's not enough project detail" to "there's too much project detail")
I still don't make companies' cutoffs ¯\_(ツ)_/¯
Is it possible the senior/principal jobs are not being applied to at a rate where LLM tools like this are required? Maybe star devs are getting recruiter referrals and this kind of tool is mostly used for filtering new grads?
Either way, perfectly dystopian.
Even better, Wikipedia lists the abbreviation I am familiar with but gives a different interpretation of the same words:
Is it working for anyone, on any level?
[0] https://github.com/interviewstreet/hiring-agent/blob/main/pr...
In no particular order:
1. The prompt is trying to get the system to do all of the evaluation steps at once. Instead, the system should break down the task of resume evaluation into its subcomponents and have separate prompts for each component. Like "evaluating open source contributions" should be its own task. Same with "assessing the complexity of software projects on the resume." Fwiw, each of the tasks contained within the prompt is woefully underspecified.
2. The prompt leaves spreads of ~10 points up to the LLM, when it's doubtful that humans are that well calibrated. Take for example:
> SCORING CRITERIA Open Source (0-35 points)
HIGH SCORES (25-35 points):
- Contributions to popular open source projects (1000+ stars)
- Significant contributions to well-known projects
- Google Summer of Code (GSoC) participation
- Substantial community involvement
Are all of these 35-point examples? Is one a 26-point example? If not, what's the difference? If an expert can't reliably make the judgement, the LLM is going to struggle too. One partial fix is to get rid of the ranges and just say all of these are worth 30 points. An additive point scheme would be better...
3. The authors of this prompt have left an incredible number of judgement calls up to the LLM, when that's the very thing you want to minimize. Using the same example as above...
- Are all contributions to open source projects with 1000+ stars equal?
- What counts as a "significant contribution"? Doesn't that imply that the LLM has to know or read through all of the commits in like the last ~6 months at minimum for the project to understand what the given contribution meant to the project? That itself isn't impossible with tool usage, but again, that'd be a separate task.
- What on earth counts as "Substantial community involvement"? Why didn't the prompt authors define this, or at least give a few examples?
Honestly at this point maybe someone should build a tool that scans prompts for adjectives...
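For what it's worth, the additive scheme from point 2 could look something like this. The 35-point cap comes from the quoted prompt; the criterion names and per-criterion point values are invented for illustration and are not from the repo. The idea is that the LLM only answers yes/no questions and plain code does the arithmetic:

```python
OPEN_SOURCE_CAP = 35  # category maximum from the quoted prompt

# Hypothetical criteria and point values, chosen only to show the shape.
OPEN_SOURCE_POINTS = {
    "merged_pr_in_1000_star_project": 15,
    "gsoc_participant": 10,
    "maintains_project_with_external_contributors": 10,
    "reviewed_prs_or_triaged_issues": 5,
}

def open_source_score(findings):
    """Sum fixed points for each criterion marked True, capped at the category max."""
    unknown = set(findings) - set(OPEN_SOURCE_POINTS)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    total = sum(
        points
        for name, points in OPEN_SOURCE_POINTS.items()
        if findings.get(name, False)
    )
    return min(total, OPEN_SOURCE_CAP)

print(open_source_score({"gsoc_participant": True}))
print(open_source_score({name: True for name in OPEN_SOURCE_POINTS}))
```

Each criterion still needs a concrete definition, but at least the same set of yes/no answers always produces the same score.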
4. This sort of thing is just asking for trouble:
> SCORES MUST NEVER DEPEND ON:
Candidate's name, gender, or personal demographic information
Just remove this stuff before you send the rest of the resume to the LLM. Even if you ask it not to, it's not a person, it's a very fancy statistical distribution generator. All of the input (including the name) will affect the distribution that gets generated. (This one is not unlike Andreessen's "don't be a sycophant" prompt.)
5. Obviously this one depends on the LLM in question, but instead of writing things like:
> DO NOT RETURN A RESUME SUMMARY. RETURN ONLY THE SCORING EVALUATION IN THE SPECIFIED JSON FORMAT. Analyze the following resume and provide a JSON response with this EXACT structure (all fields are required):...
The system should utilize the "structured output" option, which guarantees a fixed output format. Also, fwiw, the JSON should force the LLM to pick between categorical options as much as possible. Forced-choice structured output should, at least in theory, cut down on hallucinatory responses and constrain judgement calls.
6. One major thing that's not in the prompt is anything about traceability. This system should be designed so that humans can review the logs and make sure this is working as intended.
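Points 4 to 6 could be sketched together roughly like this. Everything here is an assumption for illustration: the field names, the three-level scale, and the `redact` helper are hypothetical, and this validates the JSON after the fact with the standard library rather than using any provider's structured-output feature.

```python
import json
import re
from enum import Enum

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(resume_text, candidate_name):
    """Strip the candidate's name and email addresses before building the prompt."""
    text = re.sub(re.escape(candidate_name), "[CANDIDATE]", resume_text, flags=re.IGNORECASE)
    return EMAIL_RE.sub("[EMAIL]", text)

class Level(str, Enum):
    """Forced-choice answer; anything else is rejected instead of silently scored."""
    NONE = "none"
    SOME = "some"
    STRONG = "strong"

REQUIRED_FIELDS = ("open_source", "projects", "evidence")

def parse_evaluation(raw_json):
    """Validate a model response and keep the quoted evidence for later human review."""
    data = json.loads(raw_json)
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {
        "open_source": Level(data["open_source"]),  # raises ValueError on free text
        "projects": Level(data["projects"]),
        "evidence": str(data["evidence"]),
    }

cleaned = redact("Jane Doe\njane.doe@example.com\nBuilt a parser in Rust", "Jane Doe")
record = parse_evaluation(
    '{"open_source": "some", "projects": "strong", "evidence": "Built a parser in Rust"}'
)
print(cleaned)
print(record)
```

Logging `cleaned` alongside `record` gives a reviewer the exact input and the evidence the score was tied to. Name and email are only the obvious identifiers; a real redaction pass would need to cover far more.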
7. Another thing that is missing in the file is what I'll call evidence of a theory of coding / coder quality. Most of the examples are designed to have the LLM assess proxies for code quality, not code quality itself. Surely both should be taken into account?
I'm not an expert at evaluating coders. But two pretty basic LLM-answerable things I would ask are: How well do a candidate's 5 most recent commit messages match the contents of those commits? Do the claimed technical skills on the resume match their GitHub code? (i.e., if they say they know R, is there any evidence of that on their GitHub?)
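The second question doesn't even need an LLM for a first pass. A minimal sketch, assuming you already have a per-repo language listing from somewhere (the data below is made up, and `unsupported_skills` is a hypothetical helper, not part of any tool mentioned here):

```python
def unsupported_skills(claimed_skills, repo_languages):
    """Return claimed skills that appear in none of the candidate's public repos."""
    observed = {
        language.lower()
        for languages in repo_languages.values()
        for language in languages
    }
    return sorted(skill for skill in claimed_skills if skill.lower() not in observed)

# Made-up example data: skills from a resume, languages detected per repository.
claimed = ["Python", "R", "Go"]
repos = {
    "data-tools": ["Python", "Shell"],
    "personal-site": ["TypeScript"],
}

gaps = unsupported_skills(claimed, repos)
print(gaps)
```

A gap here is a prompt for a human to ask about it, not proof of anything: plenty of real work lives in private repos.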
8. The prompt also seems unaware of what it's asking the LLM to do:
> LIVE DEMO BONUS: Projects with working live demos should receive 10-20% higher scores
This implies that the LLM can use tools, but even then, I'd be pretty wary of its ability to fully execute this part of the prompt without more detailed instructions, examples, and guidance. There are very likely tons of edge cases here.
Your resume's reception is always affected by random factors, only now you are able to test, debug and technically critique the randomness.
While resumes are being filtered left and right, they just make TikToks on the company's dime [1]. What a sad state of affairs.
It took more time than if I just reviewed the 40 CVs myself, but that was an experiment, and I think it shows the AIs can be trained on your comments. And if there is enough training and a good knowledge system that allows AI to apply the learning in those trainings, it can eventually become a lot more accurate at this task?
Typically, retrieval should be tied to evaluation metrics, evidence should be linked to scores, and you also need to account for parsing errors.
But personally, I'm weak to these kinds of ATS systems (ugly appearance, non-native English speaker, didn't go to a good university), so if this kind of filtering existed, I probably would have never had a job in my entire life. Come to think of it, even now I don't have a proper job. I just bid on projects at the lowest price and implement them. So maybe it doesn't really matter whether such a system exists or not.
Hooray for incidental non-determinism.
The scoring is out of 100, with up to 20 bonus points on top:
35 points for open source contributions
30 for personal projects
25 for work experience
10 for technical skills
Up to 20 bonus points for startup experience, a portfolio site, a technical blog, etc.
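Spelled out as arithmetic, using only the weights listed above (the variable names are mine):

```python
# Category weights and bonus cap as listed in this comment.
WEIGHTS = {
    "open_source": 35,
    "personal_projects": 30,
    "work_experience": 25,
    "technical_skills": 10,
}
MAX_BONUS = 20

base_total = sum(WEIGHTS.values())

# Share of the base score that depends on work done outside a day job.
side_project_share = (WEIGHTS["open_source"] + WEIGHTS["personal_projects"]) / base_total

# Best base score for someone with no open source work and no personal projects.
ceiling_without_side_projects = WEIGHTS["work_experience"] + WEIGHTS["technical_skills"]

print(f"base total: {base_total}, maximum with bonus: {base_total + MAX_BONUS}")
print(f"open source + personal projects: {side_project_share:.0%} of the base score")
print(f"ceiling with neither: {ceiling_without_side_projects}/{base_total} before bonus")
```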
All the AI is doing is trying to suss out the candidate's portfolio, which is really what we should be submitting when we apply for a position instead of being forced to somehow condense it to a set of BS business-speak bullet points. Especially when employers are now deploying AI systems just to figure out what's in a candidate's portfolio to begin with.
When all you have is a hammer every problem is a nail. The process itself is broken. We need to kill the outdated concept of resumes before it kills the industry.
Sounds like they have replicated the existing recruitment process
You see a lot of frameworks for things like spec-driven development make use of scoring how good the spec/design/plan is and it's like, uhhh...
https://github.com/interviewstreet/hiring-agent/blob/main/pr...
One of the weird properties of other people using LLMs is the potential of having oracle access to your opponent. Even if you don't have their exact LLM a good guess at it may be a better model of the opponent than you ever had before.
> LLM is called six times to extract structured information
Followed by
> The default model is gemma3:4b, running at temperature 0.1 - low, supposedly nudging the model toward deterministic outputs.
This is exactly why hiring is even more broken: the people looking for candidates are just as unqualified, if not more so.
Using much weaker LLMs to replace the person in charge of the final judgement call is the wrong solution as this is a plain old social problem.
Even if you wanted to use LLMs for this case, the default configuration and model choice are laughably flawed. This LLM can't be trusted as it doesn't even know what it is reading.
The correct solution is either advanced OCR with keyword ranking with a basic filter or a far stronger LLM that excels at document / vision parsing benchmarks with an experienced person making the final judgement call in case the technology misses a critical detail.
Rather than using this less accurate one that hallucinates out its decision depending on a dice roll.
Speculative thought only, of course.
But the big question I want answered is "Why did I score that?" And these slop machines absolutely cannot explain anything. That's the root problem with LLMs as a whole. There is no way to describe WHY an LLM makes a decision.
Was it because they are a woman? Does the woman's name have more pregnancies than other names? Was it because their job history makes the person look older (over 40)? Is the person black, or do they have a black-sounding name? Is the name or address attributed to higher criminal tendencies?
But no, you don't get to know ANY of that. Slop machine says 66/100, if you're lucky to even get a number. Usually it's a 30 second rejection, or a rejection at GMT 0:00 when the batch is processed and you summarily failed.
Hmm, well, maybe with a nuance of elite class reproduction (which doesn't prevent a few class-crossers from being showcased in case anyone criticizes the perfect meritocracy being run), that's basically what people get. A crude truth, but truth nonetheless.
Oh don't take it personally. Your own bespoke hand-tailored process is of course different; it gives everyone the opportunity to reach the most accomplished version of themselves, beyond what they ever dared to dream.
It won't help, though, with the systemic failure to provide an accessible path to flourishing for everyone, leaving no one behind.
Again, this is no fault of any specific player, but as long as a majority feel compelled to move within the frame of a game with a few winners who merit all they got, in contrast to a large stock of inept losers, the outcomes are no wonder.
https://mathpix.com/careers/apply
Then internally we have dashboards and sorting based on AI agent scoring. I noticed the scoring is imperfect but still saves a lot of time. Candidates scored at or below 2/5 are reliably bad and candidates above 4/5 are consistently impressive and leave thoughtful answers.
The biggest thing is not using resumes. You can't reliably gauge applicants without a writing sample and resumes are the worst form of writing sample. Also you need to be intentional about who you're hiring for, both to craft the questions as well as grade the responses.
There's no better feeling than building something open source and watching it take off. Nine months ago, I built a simple hiring agent to solve one very real problem.
Things it is not: It's not an ATS. We don't use it to screen our open roles. Our customers don't use it either.
Here's what it is: Every year at HackerRank, we get 50,000 to 60,000 intern applications. No human can read that many resumes well. So I built something to rank them, helping me decide which resumes to read first.
[This was before we built AI Interviewer (Chakra) to automate the first round of interviews, so candidates are no longer rejected based on their resumes alone.]
Two things worth clarifying since I've seen them come up in this thread:
The default model is gemma3:4b because it's what runs locally on most laptops - no cloud API needed. Actual resumes are evaluated using a top Gemini model. The repo ships with a demo config, not the production one.
The cutoff score was set very low: the system was designed to rank resumes, not reject them. Only resumes at the very bottom of the distribution were filtered out. The vast majority passed through to human review, where the real decisions were made.
Over the last week, it's taken on a life of its own. People are cloning it, running their own resumes through it, opening issues, sending PRs.
I contributed to open source a lot in college. Somewhere along the way, I drifted away from it. This week reminded me how good that feeling is. This thread has also given me more ideas than I expected. The critiques here are sharp and I'm already thinking about how to act on them. Improvements are coming.