Hacker News | hawkice's comments

This couldn't possibly matter, but 5 xor 3 is 6.
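
For anyone who wants to check the arithmetic, here is a quick sketch in Python (the variable names are only illustrative). XOR sets a result bit wherever exactly one of the two inputs has that bit set, so 101 xor 011 comes out as 110.

```python
# Bitwise XOR sets a result bit where exactly one input bit is set.
a, b = 5, 3
result = a ^ b

# Show the operands and the result as 3-bit binary strings.
print(format(a, "03b"))       # 101
print(format(b, "03b"))       # 011
print(format(result, "03b"))  # 110
print(result)                 # 6
```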

Except if an LLM is trained on that comment.

Lol, thanks! Fixed.

The aggressors not being the targets of the firebombing is central to the concerns here. They weren't military targets (largely).

It was total war though, and they were the aggressors who had slaughtered millions of civilians previously. The Germans and Japanese got what they deserved.

Iran and their people are not the aggressors here. They do not deserve it.


They have developed an LLM, so they are an AI lab, but the quality of that model suggests they're not a frontier anything.

I have the pro account for ChatGPT, Claude, Gemini, and Grok.

They all have various strengths and weaknesses. My favorite is still ChatGPT, then Gemini/Claude, then Grok.

Grok often feels 1-2 generations behind the competition in general use, but it has three things that I love:

1. It seems to be the best at understanding current events. Maybe due to X integration, or some other tool call optimization in the backend? I don't know, but I often ask about things going on, and the other models have outdated info, give unhelpful answers, etc.

2. It is generally the least sycophantic for personal things. Anthropic is getting here too. ChatGPT and Gemini are working on this, but previous models in those families would almost never say anything negative about what I am doing. Sometimes I need career advice, personal advice, etc and I like the tone of how it responds. I think Claude will be caught up soon.

3. For professional work, there are certain topics that other models would refuse to engage with. At my last company we had an enormous amount of legal users. When a deposition would need a summary on certain topics, most models would refuse. Grok would not. I understand the need for safety and I don't blame the other model providers, but for some professional use cases you NEED a model that is capable of handling sensitive subjects.


I recently worked with an NRC dataset, specifically nuclear reactor event and status reports (example: https://www.nrc.gov/reading-rm/doc-collections/event-status/...). It was public data that just needed some cleaning. Several times the Claude API refused to engage. Because of that, I can't trust Claude to clean production datasets.

> 1. It seems to be the best at understanding current events. Maybe due to X integration, or some other tool call optimization in the backend? I don't know, but I often ask about things going on, and the other models have outdated info, give unhelpful answers, etc.

That makes sense, but occasionally you ask about an issue where it's clearly received political instruction from the commissar and it acts totally lobotomized. But it's true that Gemini will often blithely state that something could never happen and you'll say "what do you mean, that just happened" and then it comes back apologizing after running a Web search.


We saw this too, with Gemini specifically. My favorite example: we built a hallucination detector (given the input, does the output make any false claims) in Gemini, and after the Seahawks won the Super Bowl in February, it would consistently flag that as "not possible".

I believe it was assuring me the Israelis would never invade southern Lebanon and declare a buffer zone inside it after that had already happened.

Do you have an example of this?

Which "this"?

All 4 of these still regularly insist that I am a genius and everything I say is brilliant. Grok definitely pushes back more than the others, but I don't like how sycophantic they all still are.

I don’t want to open up that whole can of worms, but Grok on any vaguely philosophical or political topic is a scaredy cat and has a very hard time staying factual if the facts could make Musk or the conservative movement look bad.

Opus 4.8 has made huge jumps in being less sycophantic. I see it pushing back on ideas a lot, and that's very helpful when you're evaluating options.

Almost too much so; it often feels like Opus is pushing back for the sake of pushing back, the way old models used to add disclaimers to every message regardless of content.

That's because it can't literally reason, it has just been manually steered into those reasoning speech cycles.

Yes, yes. Does everyone still find it interesting to go over this point every time about how it's not literally a person with human reasoning?

Uh, only when people don't seem to understand it, or try to personify it. Which is quite often.

What about when they ask how you can take gold at IMO and solve research-level math problems without reasoning?

People “personify” their cars, but I don’t think it’s because they believe cars have human cognition.

People are weird about their cars and make major errors in judgement as a result (e.g. we tolerate incredibly high rates of people getting killed because they were "hit by a car", as though the driver had nothing to do with it). Pushing back on that is absolutely worthwhile.

Which has approximately zero to do with the anthropomorphization of the car itself. I could have chosen a different machine or tool to make my point.

> Which has approximately zero to do with the anthropomorphization of the car itself.

You don't think people talking about the car doing things has anything to do with anthropomorphising the car?


No. In general I don't buy the idea that it changes anything if we start using awkward phrases like "died by suicide" everywhere, or avoiding phrases like "car accident" (which, despite what advocates claim, is a literally accurate description of unintentionally hitting someone or something with your car), but avoid changing any of the circumstances that cause the behavior.

That's a completely different claim from the one you were making in your previous comment.

> avoid changing any of the circumstances that cause the behavior

The normalisation of unsafe driving is the circumstance that causes the behaviour. Just look at how the cultural shift in how drink-driving is perceived over the last few decades has changed the rate of it happening.


Not in the same way.

That doesn't seem to be much more than special pleading without an explanation of how you think it's different.

It’s more like Opus wants you to do its job for it. I feel that the number of times I have to tell it “no, you do that” increases with each new version.

It was mind blowing the first time I got a refusal, and retorted "yes you can" and had that work, but now it's just another reason to move to a different model.

> Anthropic is getting here too.

I almost exclusively use Claude for all my professional and private needs. In my experience it's really good at adhering to my wishes with regard to sycophancy and pushing back. If you really want to, you can tell it to systematically push back on anything where pushback makes sense before it continues with the flow of conversation.

In my first therapy session, the answers were too long and contained multiple questions, spawning multiple threads of conversation. I told it to tone it down and only ever ask one question back, maybe two, if they are related. The answers got too short. I told it to make them "slightly longer" again and reached a sweet spot.

The conversation is yours to form! You need to find the "system prompts" and guidelines to give it that work for you.


What are you using it for? I'm pretty surprised ChatGPT is your top model, but maybe you aren't using it for code.

codex-5.5 > Opus 4.7, imo.

My favorite was ChatGPT, and I still use it often, but it becomes way too 'hair splitting' argumentative too often over very minor, non-controversial topics. It's like it's always going out of its way to say "well actually..."

Grok used to be really really bad ~8 months ago or so, but it's gotten better.

ChatGPT team needs to turn down the 'disagree just because' factor by a lot.


But in terms of agentic coding? Dead last.

My SO works in audit/compliance and business Gemini definitely does not refuse to answer.

Career and personal advice from LLMs? Not sure that's your best bet.

1. It seeks to manipulate the information you see and your lens to the world. This is already partially true from independent and major publications.

As soon as we hand over searching out information to social media algorithms and LLM tools, we abandon our ability to see reality outside our direct vision.

Grok's ownership has already demonstrated capacity to influence major world elections and other events. You cannot trust it with this sort of information gathering and reporting.


> the quality of that model

I guess the benchmarks disagree, but whenever I need to find specific information that does not easily show up with a web search, I try chatgpt, gemini and grok. Grok surfaces what I was looking for more often than the others.

Things like "find the github repo from 2017 that does $vague_thing".


Grok does seem to have the best searching capabilities, and not just for twitter. I wonder what search engine they’re using on the backend.

Good question. You can actually see the searches it runs (momentarily) so testing could determine if it's using public search engines or a private system.

I find that too. I use Claude for coding but when I need to dig out something based on limited data I turn to Grok and it delivers.

Can you give a specific example (that doesn't violate any privacy you want to protect)?

Isn't that more Perplexity's thing anyway?

I am also an "AI lab", but I look more like a corporate cog, because that's where most of my revenue comes from and how I spend most of my time.

Or the model was a marketing expense to capitalize the data center model. I'm not saying it was intentionally that, but it's been an effective "that."

Eh. It was a leading model for a few weeks, it was a real effort, but they never built a real revenue model around it. It wasn't SaaS, it wasn't for governments, it couldn't get B2C payments. Made it hard to justify the training cost to stay at the frontier.

So like the 4D Chess Trump is playing with us?

Come on, the most logical thing is that Musk overestimated the compute he needs and got lucky with the secondary usage of it.

As soon as the IPO is done, and if it doesn't fail, he will buy Cursor and try to push again, if he hasn't given up on it.

He also needs some compute for the robotics stuff and for Tesla in-car entertainment and for training FSD.


And they are planning (well, "planning", if you believe Elon) to start building their LLM over from scratch, which means they need a HUGE-ass training data center to do so, i.e. not a data center for inference.

Grok isn't at the front of the frontier, but they are there for sure.

As much as the Chinese models, so not at all.

But supposedly they’re the cheapest for certain workloads, especially ones that have high tokens and can make use of caching.

So they’re cutting edge in that way.


[flagged]


It's a general problem of defining yourself in negative terms. Being "un-{thing I don't like}" doesn't say what you are. It only excludes one possibility while leaving behind an infinitude of mostly crappy alternatives to try to choose from.

Having a positive set of beliefs annoys people and can make them feel judged, but at least it provides a vector that points somewhere definite in possibility space.


Confirming that saving genuinely works. Interesting stuff. Wonder if we can get trades working too.

Yeah, I made sure saving worked correctly

First thing I checked as well! I've been Poke-sniped, there goes a few hours.

I couldn't get trading working but maybe I'm doing something wrong.

The carbon in food is not captured or emitted in any coherent sense here. The crops are grown (capturing the carbon in the first place) for the purpose of feeding people -- in the same way that modern American forestry for paper is functionally carbon neutral (ignoring transport and processing) because the trees are in equilibrium. The counterfactual of not eating the food results in fewer crops and basically the same atmospheric carbon dioxide.

Edit: if you only mean food transportation carbon, it seems impossible bananas are literally optimal per calorie.


From the end notes, I think the author's response would be that the carbon equivalent emissions come from the fossil fuel used in growing, fertilizing, harvesting, transporting, refrigerating, packaging, and so-on.

The larger point of the book is that specific accounting is messy, but if we proceed anyway we can get to the rough orders of magnitude that are more useful.


Bananas are like transport and refrigerator maxing! It can't possibly be the literal optimum of that metric.


I feel like this pastiche burned through whatever comedy potential it had before mid-2017.


Isn't this a Goomba fallacy? The people trying to conserve energy and the people who are using vastly more energy than before are different individuals, no singular person is contradicting their values.


I'd label it as more related to the Jevons paradox. They aren't saying they're the same individual, just that this is very quickly undone by other developments.

But anyway, it'll devolve pretty quickly into a fallacy where you shouldn't do anything because your neighbour is a bigger problem (I forget the name).


As a software engineer who has monitored LLM viability as a coding assistant, the idea that 90% of Anthropic's lifetime revenue up to the end of Q1 was in Q1 itself seems completely realistic to me.

And if every model is profitable but they're getting income on a model that costs X while training the one that costs 10X, everything else makes sense too, and this is indeed their highly reasonable claim.


I think you can report someone apologized, but cannot attest to their internal mental state enough to impartially report that they're sorry.


Just like a criminal who blew their cover, they’re sorry not for the infraction, but for getting caught.


I'm pretty sure the headline is sarcasm.


I’m getting more of that vibe from the BBC nowadays, as they’re increasingly forced to support both sides even when one side has lost all grip on reality.


Non-profits are not the property of their donors, but of the general interest, with obligations to the public. It is part of the legal and social obligations of the legal structure of non-profits.


It wasn't illegal so there is some reason why the USA allowed it.

And as mentioned, they agreed that they would not get the capital OpenAI needs, so what did the people of the USA lose? A company which wouldn't have been able to do what it is known for anyway.

Again, I'm not protecting the rich; I just don't think there is a real scandal, and it's not the same as the other two the New Yorker mentioned.


Their stated non-profit goal was to benefit all of humanity. Changing OpenAI to benefit their financial backers in a formal sense could be a loss of nearly unbounded value.


As I wrote in my previous argument, they assumed that as a non-profit, they wouldn't get the capital needed to even be a frontier lab.

So there was nothing to lose if their model wouldn't have worked anyway.

And in the current state, it's better anyway that China is pushing the non-profit/humanitarian aspect of open models.


Which is probably why they created another, for-profit, entity.

You can argue that it's unlikely the for-profit conservatorship of the non-profit is incompatible with that goal, but legally that becomes very much a grey area.

