Hacker Newsnew | past | comments | ask | show | jobs | submit | proxysna's commentslogin

Alpine is my go-to nowadays for everything in my homelab except desktop (I use Void btw), because of how dirty the setup to make GPU's work with musl kernel.

Looks really nice, but 10 fps in Firefox.

Buttery smooth for me in Firefox (mac)

It's just a DGX spark with faster memory and a windows boot?

There is a docker registry under the same name. https://zotregistry.dev


I think hashicorp still have an implementation for vaults seal/unseal process. Unless something changed ofc


They still do indeed.


Feels about right.

I've launched an internal demo of Claude Code and Deepseek on the same day and we burned through our monthly allowance for Claude in just over a week, with more than a half of that budget being spent in one day. With DS people are unable to go through that same amount of money in a month, not even close.

With that Claude feels like an expensive toy, while DS is a shovel, purely because developers do not feel like they are eating into a precious resource while using it. Also it does not feel like there is much of a difference in capability between Claude and DS-pro. DS-pro and flash do feel like sonnet/opus and haiku, but flash is still very-very capable.


I rage canceled Claude today.

After 2 weeks of Claude getting progressively worse and worse, today was the final straw.

I don't care if they have a phone app. The model is COMPLETE garbage after you subscribe long enough and they think they've "got you".

I can't code on my phone if the model literally moves in the wrong direction and does the opposite of what I tell it to. If I wanted to make my code worse, I'd just randomly commit garbage. I don't need a mobile app for that.


I've seen a lot of this sentiment over the previous six months from people on reddit. I have yet to experience this myself as a developer with over 20 years of experience.


As always, I think this happen more to vibe coder. They don't understand that bigger project means worse AI performance. On top of that Opus felt being nerfed at understanding prompt so if your spec is bad you won't get good result.


What it does seem like is that they're tuning some knobs up and down or releasing new versions of models or system prompts that result in the model getting dumber and smarter in waves.

Opus has been dumb this week.

Claude was having a lot of capacity problems and downtime and then this week that has been much less obvious... and the model is dumber.

It could also just be luck and my impressions are false... who knows.


It’s because it’s not true, there’s no evidence for it that passes the sniff test. No lab is “shipping a worse model once they’ve got you”. People have a bad few days and blame the model providers instead of stepping back to fix their workflow.


When it comes to something with random results (unfortunately that's what LLMs are), people will think the odds are rigged against them.

It's a good thing that hype-chasers are cancelling though. So we can use the services with a reasonable latency.


Opus 4.7 has been a real downgrade for me. I’m back to mid 2025 when I had to catch all the completely intermediary goals/assumptions the model is creating for itself


You can still use older versions of Opus if they work better for you. Just need to set the environment variable.


I felt that but find it worked way better by invoking it with `claude --effort max` only


Yup I run Max only now. It's closer to the "old" Claude but still nerfed in my opinion.


My "mental scratchpad" needs to be as sharp as possible to maximize my intelligence. I think of the LLM as a scratchpad for my thinking, I hope the Anthropic team can see this.


it's sort of good at thinking, writing specs, etc.. Also debugging. But as a coder: I see no advantage to opus 4.6 and I preferred sonnet most times already over opus 4.6.


I see a lot of the "4.7 is a downgrade" sentiment. 4.7 does (mostly) what you ask it to do. 4.6 does what it thinks it should do. As someone with 20 years writing my own code I want the former, but the loud contingent online wants the latter.

When you're on a mature codebase with 500k+ lines of code, I haven't seen anything else be as effective as 4.7.


I can tell you for a fact, Claude 4.7 was NOT doing what I told it to do (in fact the clear and complete opposite - repeatedly), a pretty simple architectural refactor, and that Codex did better and DeepSeek much better.

It was given very simple ways to verify success. It simply didn't do that and said it's at a good stopping point, despite moving in the WRONG direction not even doing 1% of the task, and being told to see the task through to completion.

Meanwhile, Codex broke it down into 3 steps and just got it done...

No, "I'm going to give it to you straight, this is a large risky commit that could go sideways, so I'm just not going to do anything instead."

Claude worked on it for almost 200 commits over 2 weeks, needing to typically prompt it 3x to even TRY to make any progress instead of just wasting tokens to ignore me and tell me how big and risky it is.

Maybe Claude is just particularly terrible at this type of refactor. I'm not sure why that would be.


It's the same phenomenon as when you learn a new vocabulary word you see it everywhere.

People heard "Claude is nerfed" and now they see it everywhere, they notice failures a lot more than they would have otherwise.

Doesn't matter that Claude is not, in fact, nerfed. Perception is powerful and most humans are not rational.


Oh Opus is nerfed sure, but not that hard. Early this year opus 4.6 can understand your prompt and your intention easily, it got worse around mid April. Opus 4.7 even worse than that.

However that's just it, you just need to improve and make clearer of your prompt and it will perform just as good.


Or just switch over to OpenAI. Codex-5.5 is quite good.


This account is an LLM-hype peddler, shilling for Anthropic (check comment history). If they say that Claude is not nerfed, then most likely it is, in fact, nerfed.


I wouldn't call correcting misinformation and FUD "peddling hype" or "shilling" but I suppose we are in a post-truth world, where if you push back against the anti-AI emotions and vibes with grounded facts, you must be a shill.

Anyways, please take your discourse of calling people you disagree with "shills" back to Reddit. I'd much rather engage with someone debating the merits of an argument.


If you are an LLM-hype peddler, you really should not be offended at being called out. Also, this is the merit you are ostensibly looking for — since you are a shill, everyone should know this first before taking your words seriously.

You should also check your LLM prompt for HN comments, because the original comment you replied to was not anti-AI, and, in fact, very much pro-AI. The only criticism it had was about model being degraded, so they could not go as hard at AI-assisted development anymore as they used to before. I guess it's a bit difficult for LLMs to spot the difference and make proper conclusion for now.

Also even if taking you seriously — how does writing "no, model performance is not degraded because I say so" serve as correcting misinformation? It only does if you are shilling for Anthropic (which you do), otherwise it's just hot air.


Not offended at all, but just ranting about how someone is a shill instead of responding to the substance of their argument is simply not the kind of discussion we have on HN. Read the guidelines.

> "no, model performance is not degraded because I say so" serve as correcting misinformation?

Because zero evidence has been provided other than feelings. That is not evidence of degradation, and we know they don't serve quants.


You are an Anthropic shill, and this is an explicit marker that needs to be added to all of your comments, so that all information you provide can be adjusted for that bias. But I do understand why you ignore this point since it devalues all your comments (as it should), and instead cling to "ranting how someone is a shill bla-bla-bla".

Those people, unlike you, are actually using AI in development. And it is not a singular person who reports their frustration with the model being degraded after a certain period of time, so the anecdata does gradually become data. Your attempts at gaslighting are weak, you should really ask your bosses for a new guidebook on how to deal with reports of models performing at worse levels than before. Just writing "because I say so" is not cutting it.

> "we know they don't serve quants"

How do you know that unless you are working at Antrhopic? Yet another evidence of you being an Anthropic shill.


You have no substantive arguments other than calling people you disagree with shills.

> so the anecdata does gradually become data.

No, it does not. Countless social phenomena demonstrate how factually incorrect misconceptions spread rapidly. Frequency illusion is real and contagious.

> How do you know that [they are not serving quants]

Lots of ways to tell, if you weren't busy calling people shills.

First, Anthropic and OpenAI have both stated they don't serve quants. Weak protection, but it's there.

Second, no one has shown an A/B or eval proving a regression.

Third, and most importantly, the actual output measurably changes. Quants have a lower latency, higher TPS, and different token distribution. Despite having access to this data, no one has any evidence proving a quant has been served.

> You are an Anthropic shill

I'd explain the reasons I favor Anthropic over the others, but you'd just go back to yelling "shill" instead of engaging in a real conversation. That said, I am a fan of GDM as well, and think Gemini is better than Anthropic for everything other than code.

I've seen nothing resembling sane, reasoned thought from you in this thread. Just anger.

You haven't substantively debated a single point, it's like "shill" is the only word in your vocabulary. Again, this isn't Reddit.


Nothing to do with disagreement, I only call "Anthropic shills" people who are explicitly and shamelessly shilling for Anthropic. You still ignore the point that shilling adds bias to all your comments, so other readers have to actively keep it in mind to adjust for it. Stating that you are an Anthropic shill helps everyone around. And somehow you managed to be peddling LLM-hype shit so hard, that you are the only one called out on that by me.

> No, it does not.

Yes, it does, it is literally the definition of data - collection of points, observations, anything really. Try gaslighting harder, Anthropic shill. As I said, ask for better playbook on how to deal with people actually experiencing degradation before replying again.

> First, Anthropic and OpenAI have both stated they don't serve quants.

What's the point of stating this other than trying to pad your baseless "proof"? LLM-level argument.

> Second, there have not been evals showing a real regression test proving that a quant was served

This is how I know you have no idea what you are talking about and resort to LLMs for all your argumentation. Benchmarks are gamed so hard that even quantized models would achieve on them non-quantized level reliably. Moreover, benchmarks (that matter) are not run continuously all the time.

> Third, and most importantly, the actual output measurably changes. Quants have a lower latency, higher TPS, and different token distribution. Despite having access to this data, no one has any evidence proving a quant has been served.

You really are an LLM. What do you think different token distribution means? It literally means different, arguably worse performance in coding tasks. The evidence is in your face, but you have to keep it straight, since you are an Anthropic shill. You wrote yourself an argument why the models ARE quantized over time and did not even understand it. Makes sense, since you are paid to not understand stuff but peddle LLM-hype for Anthropic instead.

> I'd explain the reasons I favor Anthropic over the others

It is perfectly visible why you favor Anthropic, because you are an Anthropic shill and they pay you your salary, duh.

> real conversation

This is the type of conversation everyone should have whenever they read something written by an Antrhopic shill. You are actively poisoning this forum by astroturfing for Antrhopic, so we should take measures against it.

> You haven't substantively debated a single point

Obviously an Anthtropic shill would ignore everything of substance I wrote and instead focus on being called out. Fortunately, it is not you who I have to convince of anything, since your very well-being relies on getting salary from Anthropic peddling LLM-hype on HN and elsewhere, so you are physically incapable of understanding pretty much anything that contradicts your talking points.


> Yes, it does, it is literally the definition of data

No, feelings are not reliable data when frequency bias and misinformation exist. There is a reason most experiments isolate out bias as much as possible.

> Moreover, benchmarks (that matter) are not run continuously all the time.

So there's no data?

> What do you think different token distribution means?

You clearly did not understand anything I said. Stated simply: If you were being served a quant, you'd be able to tell by looking at the token distribution, latency, and TPS. You don't need to trust the labs' word for it.

> they pay you your salary, duh.

In fact, I get paid by a FAANG, though I do use Anthropic products heavily. Further, I don't really need money, I have more than enough. So much for reading my history.

> You are actively poisoning this forum

Your degenerate discussion - calling people shills instead of engaging with the argument, insulting them when your arguments are disproven, your inability to hold a rational debate that's not angry and emotionally charged - that is what is poisoning this forum.

Frankly, if you react this angrily and emotionally to a simple rational premise (that frequency bias leads to the perception of models being worse than them actually being worse), you're ngmi unless you're already independently wealthy.

I would recommend a therapist, it helped me when I had similar behavioral issues. (Claude is a great therapist, by the way ;)


> Feelings

Nice gaslighting, Anthopic shill. No one said a word about feelings, only you (to derail the conversation). People reported their own experience and frustration with the model being unable to complete tasks they previously could. I said, get a better playbook before coming back. Or is it the best LLMs can do for now? Sad, then.

> No data

There is data, which you try to gaslight into being "feelings", Anthopic shill.

> Stated simply: If you were being served a quant, you'd be able to tell by looking at the token distribution, latency, and TPS.

Did you just repeat what you said before while ignoring the actual meaning of the words and my explanation of what YOU wrote? Is it what LLM told you to do, Anthropic shill? And you claim I have no substance. Maybe spend a week or so getting educated, before blindly copying and pasting LLM output, Anthropic shill?

> I get paid by a FAANG

Yeah, in your dreams maybe, Anthropic shill. I did read your comment history, and this is likely part of the story you try to build around your Anthropic shilling persona. Not a single fact that would prove that and believe me, I tried looking for it. Only endless claims of "I work at a FAANG" (no one who actually works here writes it like this).

> I use Anthropic products heavily

This is obvious, as 90% of your comments are LLM generated, Anthropic shill.

> calling people shills

Clanker, I called only you a shill, not people, tell your LLM to update its context. And I called you shill not because of any arguments, but because of your comment history unapologetically shilling for Anthropic and peddling LLM hype.

> arguments are disproven

You ignored half of my arguments, and for the rest you just repeated what you wrote before, not even understanding what the words you typed meant. Nice gaslighting, Anthropic shill.

> insulting

And you said you were not offended. Once again, Anthropic shill, being called a shill is not an insult. This is your fate, to be called an Anthropic shill, while you are on their payroll, astroturfing online communities with your LLM-bullshit peddling. Or do you expect being a propagandist to be a pleasant experience? People with no morals like you coming into this forum spreading their employer's bullshit deserve all the hate they get and more.

> you're ngmi. Hope you're already independently wealthy.

Your LLM outputs the same thing as in other comment for no good reason. Can't Anthropic afford good models for its shills, or is it the best SOTA can do now?

I would recommend you abandon this account, because it's now burned for all shilling intents and purposes.


Again, you're just interpreting anything that goes against the "AI bad" grain as shilling.

> There is data

Please show it.

> while ignoring the actual meaning of the words

It was an incoherent mess of insults, so I am still not sure what you're trying to say.

> Yeah, in your dreams maybe

So now I'm lying about my employment on an anonymous forum for... what, exactly? If you are actually this conspiratorial IRL, get help.


> Again, you're just interpreting anything that goes against the "AI bad" grain as shilling.

Once again, putting "AI bad" into my words. No, Anthropic shill, this is not what I am saying. Is your LLM malfunctioning or are you not really getting it? Stop gaslighting, Anthropic shill, and try to stick to the actual words I am saying. I understand that this is hard for you, because then you would have no real argument to make, but please try, Anthropic shill.

> Please show it.

I used an LLM to count actual experiences of people reporting their experience with Opus 4.6 being degraded. There are literally several hundreds of such data points. This is data. People, who are employed and actually use LLMs for coding, unlike you, Anthropic shill, who uses it only to poison online communities. Are you really going to disregard all that to claim it is mass-psychosis or something? I guess you would, Anthropic shill, because that's your job, to peddle bullshit LLM-hype unbased on anything in reality.

> It was an incoherent mess of insults, so I am still not sure what you're trying to say.

Repeat after me, Anthropic shill: being called a shill is not an insult. You are a shill, stop being obtuse and at least take some pride in your work of promoting LLM-hype. So once again you are providing nothing to the conversation except for baseless accusations of insulting, which I did not do, and refuse to answer to the actual arguments I made. I can provide it again, but you would likely ignore it because it just showcases how you are clueless about the topic.

Your words, not mine:

> Third, and most importantly, the actual output measurably changes. Quants have a lower latency, higher TPS, and different token distribution.

I asked if you understood what "different token distribution" meant. I can tell you what it means: models performing worse at coding tasks. So people report models being worse at coding tasks, YOU write that indeed quantization leads to that and then just "forgot" about it? Nice level of "objective" discussion, Anthropic shill.

> So now I'm lying about my employment on an anonymous forum for... what, exactly?

It is not anonymous forum, as much as you would have liked it to be, so that your shilling could not be dismissed as easily, Anthropic shill. For what? So that people would fall for the bullshit you are peddling. Are you really this dense, Anthropic shill?


I must admit I skimmed most of your comment because it is largely an incoherent rant, but I will address some points:

> This is data.

Nope. Because frequency bias is a thing. If you hear on Twitter "model X got nerfed," your brain will look for that pattern and notice it more than usual. This will then confirm your suspicion, which leads to a vicious cycle. Then you tell your friends and the same phenomenon repeats.

None of this requires the model to get worse. It's a well understood psychological phenomenon.

> I can tell you what it means: models performing worse at coding tasks. So people report models being worse at coding tasks

The perception of a model performing worse at some coding task is not what "different token distribution" means. You should ask AI to explain my comment ;)

Latency and TPS can also tell you if you're getting a quant.

Anyways you should really get some help. Praying for you!


> frequency bias

Gaslighting again, Anthropic shill. What does frequency bias have to do with the objective fact that hundreds of people reported their own experiences with LLMs being degraded over a short period of time? The very same tasks that the very same LLMs could do, they no longer can? You seem to ignore this FACT, this DATA, and instead have to gaslight and divert into "frequency bias" nonsense. I do understand, why you are doing it, Anthropic shill, but at least have guts to admit it.

> perception of a model

You once again ignore what your LLM outputted and you typed yourself and divert into "perception", Anthropic shill. You do not need to sample entire output for tokens to notice the distribution moving. If the LLM used to be able to achieve set goals and no longer could, it is already a sign of the distribution shift. And as you said yourself, different token distribution = model being quantized. Which is reported in hundreds of separate instances. Which is more than enough to conclude that the model was, in fact, quantized, and no amount of gaslighting can change that. But you are an Anthropic shill, so you have to peddle your bullshit, trying to twist facts to support your employer's narrative. And you deserve being called out on that, Anthropic shill.


> What does frequency bias have to do with the objective fact that hundreds of people reported

This isn't hard to understand

"Model is nerfed" claim hits social media

Someone else sees it, frequency bias makes them think their model is also nerfed, and they amplify the claim

Now it spreads, like a virus, even if the model never changed

Social dynamics like this are well understood psychologically

> If the LLM used to be able to achieve set goals and no longer could, it is already a sign of the distribution shift.

The more likely explanation is that you're looking at older LLMs with rose tinted glasses, and misremembering what it could achieve

Otherwise you could measure the token shift and see the better tps and latency

Your own evals would trend down

But no one, not one person, has presented empirical evidence of being served a quant. Just vibes.


Did your LLM context get blown up or why does your comment read like linkedin-style post with one-liner sentences structure?

Did you really just claim that people are so gullible that it was social media or whatever that made them believe their LLM could no longer achieve the tasks it yesterday could, and not the actual FACT of LLM not being able to do it that they, you know, verified before complaining online? I guess if you gaslight everything like that, then indeed no matter what the facts are, you will never be convinced in anything.

You see, because of outlandish claims and reasoning (or rather lack of) like that, everyone sees that you are an Anthropic shill.


Having read this whole conversation (much like one feels the need to stare at a car accident), you sound truly insane. I hope that’s you were aiming for, otherwise you really need help


My god, man. Go read the HN guidelines, this method of communication isn't only insufferable to read, but is actively making this place worse to be a participant of.


All these tools have almost feature parity. The GitHub cli allows remote sessions and can run anthropic models anyway


When you say "code on your phone" ... you don't mean what I think you mean do you? Like, are you actually using your phone to make code commits?


Yes, you can do that with Claude Code.

Tell it what to do.

Commit, push to origin, review on GitHub.

Tell it to make changes, amend the commit, push --force-with-lease.

I'm attempting to make a memory safe language like Rust but with a substantially lower learning curve and added safety (but non-zero cost abstractions) fully with AI, almost entirely from my phone, commuting, getting coffee, walking the dog, between sets at the gym, replacing doom scrolling before bed and during lunch, etc.

Mostly to test how much LLMs can actually scale development.

Depending on how long it takes them to clean up some architectural slop in the MIR lowering phase, the results could either be very impressive or not.

From a purely cost basis perspective, it's hard to argue they aren't killing it.

But from a multiplier perspective, it's up in the air how great they are.

It's proven to be a really nice experiment, because much of what I wanted to solve with a language is the problems inherent to LLM development.

So at the self hosting phase, I get a great opportunity to see if the language can actually deliver on what I dream for.


You review the code on github also from your phone?


#1 -> part of scaling is you can't review every single line of code.

LLMs don't really scale if you're still the bottlneck, or they only scale as much as you reviewing every line of code - that's not that much scaling...

So I try to only review certain parts, like making sure they aren't changing tests to allow architecturally broken code to slip through (because they regularly try, even when given explicit instructions not to). Or if I'm watching them make changes on my phone and see that they are clearly doing the exact opposite of what they're supposed to be doing (regularly if I'm watching).

#2 -> if commits are small, GitHub's setup is good enough that you can review code on your phone.

#3 -> if they're huge, I can just review on my laptop at lunch or something.

Theoretically, all of this can be solved easily with orchestration and require minimal oversight.

If you're using LLMs to write code and you're carefully reviewing every line with a jade-handled magnifying glass, you're not really scaling - at least to the degree I'm interested in.


> LLMs don't really scale if you're still the bottlneck

This only works if there's no consequences if your code breaks. In the eyes of other humans you're responsible for what you commit. No amount of "scaling" will change that.


> This only works if there's no consequences if your code breaks. In the eyes of other humans you're responsible for what you commit. No amount of "scaling" will change that.

You're only responsible for what you merge to master, not everything you commit to a feature branch no one is looking at...

If you have the testing and the infrastructure in place such that you can't ship broken code, then you just need to make sure your invariants are upheld - not that every single line is beautiful and perfect.

Further, I am working on a set of metrics that seems pretty good at identifying sloppy architecture. There's decent prior art at many different components of what "sloppy" architecture actually is, and ways to visualize it.

If you can rely on the consensus of several different models, plus your own judgement with the design and the testing in place to verify its implemented correct...

Then, 1) you don't need to code. 2) you only need to review 1/10th or less of the code written. That scales. Reading every line of code line by line doesn't really scale. LLMs aren't very fast at implementation outside of green-field projects. So you can often times implement something faster than they will if you did it by hand. Reviewing can take just as much time as implementing...


So you're making a programming language... and you don't want to read code. Have I got the gist of it?


At a proof of concept stage for a research language? Why on earth would I?!

For one, a project of this scale is literally completely unfeasible if you're reviewing every line hand by hand - or at least for me when this is a very minor side project...

It's a 1M+ line scope...

The odds it gets completed is not 100%.

The odds it turns out as well as I would like is not 100%.

The odds anyone ever uses it besides me even if it does is not high.

And the entire purpose of the project is to test how well LLMs can actually scale, which, as I mentioned, REQUIRES not reviewing every line with a jade handled magnifying glass...


Is there a place you are tracking or publishing your findings throughout this "test" ?


Considered Gemini?


Gemini got a big reduction in usage limits this week. There was backlash and they added 3x usage for Antigravity a day later but I haven't really tried it out to get a feel for it yet.


Google rug pulled Code Assist and Gemini CLI. They're moving everything to Antigravity and we would need to reinstall all our tooling, reconfigure any automations, and the mechanism to subscribe via GCP is much clunkier.

This was all supposed to be worked out prior to Cloud Next, but it wasn't. Ironically, they mentioned Claude in a few of their presentations at next.

And that was our solution. We are a big GCP customer but our whole team is on Claude now and much happier.


Google has burnt all of its goodwill in dev communities so no, I don't think Gemini is worth consideration.


“Expert” that does not know what a Terraform is. lol, lmao even


You need to set sampling parameters for the llm. Had the same issue with Qwen3.5 when i first started. You can grab them off the model card page usually.

From Qwen3.6 page:

Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

Instruct (or non-thinking) mode: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0


min_p author here. min_p is strictly better than top_p and top_k. The big labs don't know shit about sampling, and give absolutely nuts recommendations like this.

set min_p to like 0.3 and ignore top_p and top_k and you'll be fine.

There's better samplers now like top N sigma, top-h, P-less decoding, etc, but they're often not available in your LLM inference engine (i.e. vLLM)


I’m wondering though, what does extra creativity in code generation actually look like? How is the creativity expressed in code? Does the LLM reach for Bubble Sort instead of Quicksort? Maybe it decides that sorting only the first 10 elements of an array is enough? Funny variable names? Cursing in comments?


In this case, we are not arguing that min_p is better for "creative code" (you really don't want high temperature anywhere near your code generation, despite the "turning up the heat" framing of our paper) - at least in my post claiming min_p is strictly better than top_p above.

We are instead arguing that min_p handles truncating tokens that are more likely to lead to degeneration/looping because it is partially distribution aware. Fully distribution aware samplers like the ones I mentioned above (i.e. P-less decoding) are strictly superior due to using the whole distribution to decide the truncation at every time step.

Code hallucinations, like many LLM hallucinations, can be seen as accumulation of small amounts of "sampling errors".


Cool, i am mostly a plumber for these things, but do you have any sort of reading that i can go through to wrap my head around it to some degree?


Yes, have tried all of these (as per the docs). Have you actually tried these? Because I have tried all 3 configurations with agentic coding that you mentioned and have the same result - loops.


I've used only Qwen3.5 so far for work and was, after initial struggles, successful with GPU setup, no mlx. Ngl the fact that they are using `presence_penalty: 0` and no `max_tokens` is weird after that exact setup caused me "initial struggles", but i've set up a simple docker-compose with vllm and qwen3.6 right now to test it out and it worked perfectly fine for me.

Gist with the compose and example of an output. https://gist.github.com/meaty-popsicle/f883f4a118ff345b430c3...


Qwen3.5-27B with a 4bit quant can be run on a 24G card with no problem. With 2 Nvidia L4 cards and some additional vllm flags, i am serving 10 developers at 20-25tok/sek, off-peak is around 40tok/sek. Developers are ok with that performance, but ofc they requested more GPU's for added throughput.


What would be these additional vllm flags, if you don't mind sharing?


This is from an example from my Nomad cluster with two a5000's, which is a bit different what i have at work, but it will mostly apply to most modern 24G vram nvidia gpu.

"--tensor-parallel-size", "2" - spread the LLM weights over 2 GPU's available

"--max-model-len", "90000" - I've capped context window from ~256k to 90k. It allows us to have more concurrency and for our use cases it is enough.

"--kv-cache-dtype", "fp8_e4m3", - On an L4 cuts KV cache size in half without a noticeable drop in quality, does not work on a5000, as it has no support for native FP8. Use "auto" to see what works for your gpu or try "tq3" once vllm people merge into the nightly.

"--enable-prefix-caching" - Improves time to first output.

"--speculative-config", "{\"method\":\"qwen3_next_mtp\",\"num_speculative_tokens\":2}", - Speculative mutli-token prediction. Qwen3.5 specific feature. In some cases provides a speedup of up to 40%.

"--language-model-only" - does not load vision encoder. Since we are using just the LLM part of the model. Frees up some VRAM.


> "--speculative-config",

Regarding that last option: speculation helps max concurrency when it replaces many memory-expensive serial decode rounds with fewer verifier rounds, and the proposer is cheap enough. It hurts when you are already compute-saturated or the acceptance rate is too low. Good idea to benchmark a workload with and without speculative decoding.


Thank you!


Just curious, what's your setup like? How do the devs interact with the model?


OpenWebUI with postgres and vllm for inference, searxng for websearch a few other mcp's for tools.


question: why not use something like Claude? is it for security reasons?


Some people would rather not hand over all of their ability to think to a single SaaS company that arbitrarily bans people, changes token limits, tweaks harnesses and prompts in ways that cause it to consume too many tokens, or too few to complete the task, etc.

I don't use any non-FLOSS dev tools; why would I suddenly pay for a subscription to a single SaaS provider with a proprietary client that acts in opaque and user hostile ways?


I think, we're seeing very clearly, the problem with the Cloud (as usual) is it locks you into a service that only functions when the Cloud provides it.

But further, seeing with Claude, your workflow, or backend or both, arn't going anywhere if you're building on local models. They don't suddenly become dumb; stop responding, claim censorship, etc. Things are non-determinant enough that exposing yourself to the business decisions of cloud providers is just a risk-reward nightmare.

So yeah, privacy, but also, knowing you don't have to constantly upgrade to another model forced by a provider when whatever you're doing is perfectly suitable, that's untolds amount of value. Imagine the early npm ecosystem, but driven now by AI model FOMO.


We do make Claude and Mistral available to our developers too. But, like you said, security. I, personally, do not understand how people in tech, put any amount of trust in businesses that are working in such a cutthroat and corrupt environment. But developers want to try new things and it is better to set up reasonable guardrails for when they want to use these thing by setting up a internal gateway and a set of reasonable policies.

And the other thing is that i want people to be able to experiment and get familiar with LLM's without being concerned about security, price or any other factor.


Because it's a great tool and the second it's not we can just do what you're doing :)


What a great write up, and a video too! Even though Minecraft stuff ofc was a bit of a bait, it would be interesting see the answer to "Can it run Doom?".


From the article:

Only 40,960 words of memory. That’s only 90kb total memory to split between our code and the memory it needs at runtime.

Looking at a copy of Doom on the Internet Archive (https://ia800404.us.archive.org/view_archive.php?archive=/15...), DOOM.EXE is about 709k, and DOOM.WAD is about 11159k.

I think that's a pretty solid no.

Also it's a 250khz CPU. Not megahertz. Kilohertz. It's slower than the 1MHZ 8-bit home computers like the Apple ][ or c64.

"Running" Doom might be possible with some insane hack that offloads storage and/or processing to more modern hardware crammed into the UNIVAC case but given that this is one of two UNIVACs in the entire world, and the only one that actually runs, I don't think the museum is gonna let anyone cram a Raspberry Pi up in there.


> a bit of a bait

"a bit" is doing a lot of work there. It was absolute nonsense. They were no closer to running a Minecraft server than I am to running UKGOV.


They hosted a program that allowed minecraft clients to connect... I'd class that as a minecraft server, even if it wasn't a very good one


> They hosted a program that allowed minecraft clients to connect...

Connect in the sense of receiving a login packet and saying "yes". That's it. Steps 1, 2, 3, 9, 10 of [0] (they didn't mention encryption or compression, I'm assuming they didn't implement it.)

They didn't mention anything about any of the steps past 10 - again, assuming they didn't implement them.

It's a trivial thing they've implemented - good work, sure, but a Minecraft server? Absolutely not.

[0] https://minecraft.wiki/w/Java_Edition_protocol/FAQ#What's_th...?


Not enough dedotated wam for all that.


Yeah, my thought exactly, execution lacked, but i do admire the attempt.


Not Doom, but a ZMachine interpreter might run with:

- Zork I-III

- Calypso

- Tristam Island

- All the Z3 machine games at IF archive

- The rest of Infocom propietary games

https://www.ifwiki.org/List_of_Z-machine_interpreters

Also: https://ifdb.org/viewgame?id=lkr2jf03np19ieix

Now, if the game was libre software it could be improved and ported to Puny Inform (a 'lite' version of Inform6 tuned for smaller machines) creating a really small Z3 file being able to play it from the PDP10 and 8 bit microcomputers to anything from today. From smartphones to PDA's to GNU/Linux with Frotz to Winfrotz and Lectrote and Fabularium for Android/Mac and iOS.

So, 'does it run Doom'? Man, you can play Zork in a pen with writting detection. How cool is that?


It could probably run the code for doom, once recompiled for the risc-v emulator, but given that the only output is a paper teletype, displaying it would be a problem


> but given that the only output is a paper teletype, displaying it would be a problem

You are in a maze of twisty passages, all alike. A cacodaemon floats by, hissing.


I wonder which would be faster: computing a frame, or printing it? If you could print one frame at a time, you could make a flip-book animation.


And given the NES emulator example, take half an hour per frame.


Feels kind of like when Usagi Eletric got "Doom" running on a vacuum tube computer with a teletype interface without support for even ASCII, but it was just an imitation of the background music.

Anything for the thumbnail.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: