Google's hypothesis was "Does Bing use Google data in its rankings?" That hypothesis was proven, because there was no way to get those crazy links other than clickstream data showing users going from a Google search for those nonsense terms to those sites.

If you want to explore a very different hypothesis, namely "Does Bing single out Google in its weightings of clickstream data?" then I suggest you go here:

http://projectgus.com/files/googlebing/seaport-trace.txt

That's a packet capture of some clickstream data. That should be more than enough to forge as much data as you like. You can then make up a nonsense search term, like "doesb1ngtrustgmorethananyoneelse" that should get zero results, then forge clickstreams going from a google.com search to "yes.com" as well as an equal number of searches for that term going from some other search engine to "no.com" and you can explore the weights as much as you want.
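To make the design of that test concrete, here is a minimal sketch of the two balanced arms. It only builds records in memory and sends nothing anywhere. The `ClickEvent` type, its field names, and the `other-engine.example` control referrer are hypothetical illustrations, not the actual format in the packet capture.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickEvent:
    """One hypothetical clickstream record: search on an engine, then a click."""
    referrer_engine: str
    query: str
    clicked_site: str


def build_balanced_trial(query, n_per_arm, control_engine="other-engine.example"):
    """Return equal numbers of events for the Google arm and the control arm."""
    google_arm = [ClickEvent("google.com", query, "yes.com") for _ in range(n_per_arm)]
    control_arm = [ClickEvent(control_engine, query, "no.com") for _ in range(n_per_arm)]
    return google_arm + control_arm


def interpret(top_result):
    """Map the site that ends up ranked first to a cautious reading."""
    if top_result == "yes.com":
        return "consistent with Google clicks being weighted more heavily"
    if top_result == "no.com":
        return "consistent with the control engine being weighted more heavily"
    return "inconclusive"


events = build_balanced_trial("doesb1ngtrustgmorethananyoneelse", n_per_arm=20)
arm_sizes = Counter(e.clicked_site for e in events)
```

Because both arms are the same size, a single win for either site could still come down to tie-breaking, so it would be worth repeating the run with fresh nonsense terms and with the "yes"/"no" sites swapped between arms.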

That said, I'd say that the fact that they weight Google highly enough that they'd take their word for it that a clearly irrelevant term should be mapped to some site is strong evidence, I think.

Mind you, I don't think it's "wrong" exactly for Bing to do this. I'm not worried about it destroying search, either. The spammers/SEO types will make it useless soon enough.



Good points. I'm the one who took the packet capture you've linked, and I'd be really interested to see what happens if someone runs the test you describe.

> That said, I'd say that the fact that they weight Google highly enough that they'd take their word for it that a clearly irrelevant term should be mapped to some site is strong evidence, I think.

The evidence was that this occurred in 9% of the cases they tested, and only for terms that don't exist anywhere else on the net (i.e., the only place they could have been found in Bing's dataset was the Google query URL).

Bing didn't weight Google ahead of a single other source of information, so I don't think this really comprises evidence of high weighting at all. To make that claim, someone needs to run a controlled test like the one you describe here.

> spammers/SEO types will make it useless soon enough

Yeah, one thing that surprised me greatly is how easily it looks like this clickstream data could be injected. Although I don't really know what half the fields do, so there could be something clever going on in there.


> Bing didn't weight Google ahead of a single other source of information, so I don't think this really comprises evidence of high weighting at all. To make that claim, someone needs to run a controlled test like the one you describe here.

Strictly speaking, I think you're right, but it was interesting that it would accept mappings that didn't exist elsewhere, rather than requiring the same mapping from at least one other source. But you're right that a more complete test is warranted. And Bing may have already adjusted things to prevent this, so we may never know.

> Although I don't really know what half the fields do, so there could be something clever going on in there.

True, but the mysterious parts I see appear to be constants (I may have overlooked something, though, so feel free to point out any mysterious dynamic bits I'm ignoring). Those should be out-and-out harvestable from actual Bing toolbars. With constants, I strongly suspect that we only need to harvest valid data, rather than figure out how to generate our own.
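One way to check which parts are constant is to compare several captured requests field by field. This is only a sketch: it assumes each request can be reduced to a URL-style query string, which may not match the real capture, and the field names in the sample (`ver`, `uid`, `url`) are made up for illustration.

```python
from urllib.parse import parse_qsl


def split_constant_fields(captured_queries):
    """Separate fields that never change from fields that vary or go missing."""
    seen = {}  # field name -> list of values, one per request containing it
    for query in captured_queries:
        for key, value in dict(parse_qsl(query, keep_blank_values=True)).items():
            seen.setdefault(key, []).append(value)

    total = len(captured_queries)
    constant, dynamic = {}, set()
    for key, values in seen.items():
        # Constant means: present in every request, always with the same value.
        if len(values) == total and len(set(values)) == 1:
            constant[key] = values[0]
        else:
            dynamic.add(key)
    return constant, dynamic


# Hypothetical sample data, not taken from the real trace.
sample = [
    "ver=1&uid=abc&url=http%3A%2F%2Fa.example",
    "ver=1&uid=abc&url=http%3A%2F%2Fb.example",
]
constant, dynamic = split_constant_fields(sample)
```

A field that looks constant across a handful of requests from one machine could still vary per install or per session, so captures from more than one toolbar would be needed before trusting the split.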

Worse for Bing, my experience in removing viruses says that "people who install lots of toolbars" and "people who get viruses/malware/joined to a botnet" are sets that overlap to a significant degree, so it might become hard to separate actual user clickstream data from forged clickstream data when they're both coming from the same infected computer.

Finally, I'm sorry I forgot to give you credit. If it's any consolation, I think I did give you credit last time I posted your link. Thank you for doing a better technical analysis than I've seen from anyone else on this story.


> Finally, I'm sorry I forgot to give you credit.

Ah, that's no problem. If you'd copied it to your domain and linked it there I might have minded, but if you're posting a link to my site I consider that plenty of credit.


> Google's hypothesis was "Does Bing use Google data in its rankings?"

I see. I would contend that that hypothesis is misleading at best, in bad faith at worst.


I think the problem wasn't so much the hypothesis, but the spin on the conclusions.



