The Big List of Naughty Strings (github.com/minimaxir)
409 points by polm23 on May 24, 2020 | 92 comments


Solid list for a quick SQL injection and XSS reference, with lots of examples. Even the Unicode, accented, and two-byte characters are super useful for checking handling all the way from the front end to persistent storage (DB, etc.).

Lost it laughing at the "Human Injection" section:

> # Strings which may cause human to reinterpret worldview

> If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you.


I would wake up if I could, but I opened this string in vim and I can't remember how to exit.


That was a bastardization of the original one on 4chan's /x/. This one is the real one:

> It has been reported that some victims of torture, during the act, would retreat into a fantasy world from which they could not WAKE UP. In this catatonic state, the victim lived in a world just like their normal one, except they weren’t being tortured. The only way that they realized they needed to WAKE UP was a note they found in their fantasy world. It would tell them about their condition, and tell them to WAKE UP. Even then, it would often take months until they were ready to discard their fantasy world and PLEASE WAKE UP.


> This one is the real one

Are you sure? I think this idea has been explored in several novels and films over the past several decades.


The /x/ copypasta was the one I saw first, sometime around 2007, and only later did I see the one quoted in the file. It's possible that the file's story was earlier.


Never seen it before on film. The Manchurian Candidate isn't it. Can you give me some pointers?


I know the Futurama episode "The Sting" explores this idea.


Doesn’t The Matrix explore this idea?


See also Ghost in the Shell 2: Innocence. There's a lot of hacking going on in that one leading to some very interesting scenes.

Granted, they're not comatose, but the sensations their minds are receiving do not match reality.


Wake up, Neo... The Matrix has you...


Sucker Punch


I know I first came across BLNS about 5 years ago because I was 20 years old when I first saw that.

Suffice to say it scared the shit out of me!


> Please wake up, we miss you.

I think that sentence gives itself away as modern. Were comma splices in common use 20 years ago?


Yes they were. IIRC at least one of the major manuals of style endorsed them at least in some situations.


I've been curious about this sort of thing, but I haven't been able to find good info on it. For example, it seems to me that the format of "It's not x, it's y" is widely accepted. I'm not sure if that's by simple common use or by way of some nuance that I'm not aware of. Certainly in some cases it can be difficult to avoid both comma splices and sounding stilted. I'd love to know more -- I don't seem to recall anything about it from when I had a copy of the Chicago Manual of Style.


No one accepts "It's not x, it's y" as correct. Well, maybe some illiterate Philistines do. But no one really thinks that's correct.

You have to put a semicolon in place of the comma or it is just plain wrong.


That’s interesting. Maybe it’s considered hyper correct?

Which style manual are you referring to?


Don't remember. This was 30 or 35 years ago.

There was a long tradition in English and maybe also in other European languages of choosing punctuation depending on how long a pause a speaker would take at the point, with a comma denoting a shorter pause than a semicolon, which in turn denoted a shorter pause than a colon, which in turn denoted a shorter pause than a period -- instead of using punctuation to convey some semantic distinction as most writers do today. The comma splice might be a vestige of that tradition.


But the pauses are semantic in a way. Those concepts cannot be well separated. Not a linguist, but I can tell pauses are used by speakers and listeners for a variety of reasons, linked with both semantics and delivery -- to separate parts, to give and demarcate "processing time" for speakers and listeners, to create and satisfy expectation, etc. The cadence of speech is semantic (though its semantics are not of a simple propositional logic type).


I'm reading Permutation City right now and that message is really messing with my head.


This messes with my mind


This is deliciously ironic:

> Also, do not send a null character (U+0000) string, as it changes the file format on GitHub to binary and renders it unreadable in pull requests.


For a long time I had the same problem with Japanese-language files being shown as binary in GitHub diffs. It is solved by putting something like this in the .gitattributes file:

    *.php diff

Overall I am amazed that everything shows properly in GitHub. https://github.com/minimaxir/big-list-of-naughty-strings/blo...



OK, seeing "﷽" [1] was unexpected :). For those who do not know, it's very important for Muslims and it's all over the Quran.

[1] https://github.com/minimaxir/big-list-of-naughty-strings/blo...


It's a very unusual "character", as it can sometimes be rendered very, very wide: the same width as maybe 20 or so "normal" characters (whatever "normal" means; the more I learn about Unicode...), i.e. the same width as if you had typed the phrase out in a sentence.

It breaks a lot of UIs that use character count as a proxy for string width.
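
To make that failure mode concrete, here is a minimal Python sketch. The helper `naive_width` is hypothetical and stands in for any UI code that treats the number of code points as the rendered width. It only shows that the string is a single code point; the actual pixel width depends on the font and renderer, which this sketch does not measure.

```python
import unicodedata

# The single code point discussed above (U+FDFD).
BASMALA = "\ufdfd"


def naive_width(text: str) -> int:
    """Hypothetical width estimate: assumes one code point == one column."""
    return len(text)


if __name__ == "__main__":
    print(unicodedata.name(BASMALA))      # official Unicode character name
    print(naive_width(BASMALA))           # number of code points
    print(len(BASMALA.encode("utf-8")))   # number of UTF-8 bytes
```

Any layout logic built on `naive_width` would reserve one column for a glyph that may render many times wider.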


Given that writing systems we have today are pre-digital systems of marks made with the human body and tools, all encoding of text in our HCI is a simulation of writing. Simulations by nature cannot be 100% accurate. Choices have to be made about what is represented and simulated. Choices have to be made about what to leave out. Simulations are a form of communication; they put forward a perspective about what is the important part of what is being simulated. Seeing a symbol like "﷽" really brought home for me some of the assumptions about writing that I take for granted, and that are inherently "argued for" by the way text is handled in our digital systems.


commit that added it if anyone is interested https://github.com/minimaxir/big-list-of-naughty-strings/com...


Can't it be made up of individual characters or is it stylized in a unique way?


what does it mean?


"In the name of God, the Most Gracious, the Most Merciful."

From Wikipedia: https://en.wikipedia.org/wiki/Basmala

Disclaimer: I'm not a Muslim, I don't know Arabic.



I had to zoom in to 400% to be able to see the detail there.


https://www.urbandictionary.com/define.php?term=%EF%B7%BD

Fun fact: it’s a single Unicode character.


Yup, you can put 280 of it into a single tweet.


I don’t think that’s right. I looked into the way Twitter counts characters when I was trying to work out the largest prime number that could be written out in full, in base ten, in a single tweet[1]; the rules are more complicated than you might expect, and have changed several times.

The current rule seems to be that all Unicode characters count as two, except for the ranges 0–4351, 8192–8205, 8208–8223 and 8242–8247 which count as one.

[1] In case you’re wondering, I think it’s, arguably: https://twitter.com/robinhouston/status/1197294154738544641
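
As a rough illustration of the rule as stated in this comment, here is a Python sketch. The names `SINGLE_WEIGHT_RANGES`, `weighted_length` and `max_repeats` are made up for this example, and the ranges are copied from the comment rather than from any official specification. The real counter may have further special cases that are ignored here.

```python
# Code point ranges that count as one, as listed in the comment above.
# Everything else is assumed to count as two.
SINGLE_WEIGHT_RANGES = [(0, 4351), (8192, 8205), (8208, 8223), (8242, 8247)]


def weighted_length(text: str) -> int:
    """Sum per-code-point weights according to the rule quoted above."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in SINGLE_WEIGHT_RANGES):
            total += 1
        else:
            total += 2
    return total


def max_repeats(ch: str, limit: int = 280) -> int:
    """How many copies of `ch` fit under the weighted limit."""
    return limit // weighted_length(ch)


if __name__ == "__main__":
    print(weighted_length("hello"))
    print(max_repeats("\ufdfd"))  # U+FDFD lies outside all four ranges
```

Under this rule U+FDFD (decimal 65021) falls outside every single-weight range, so it would count as two.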


Good point! Still, I could swear I saw someone (@FakeUnicode?) do exactly this once, but of course I can’t find that tweet any more, partly because it turns out that search engines don’t handle ﷽ well at all, and I don’t feel like testing it on my own followers somehow.

Edit: it looks like it might count it as two characters, so that’s only 140 per tweet.


That’s definitely possible! @FakeUnicode mentioned in the discussion that, when 280-character tweets were first introduced in September 2017, it was possible to tweet 280 single-codepoint emoji using TweetDeck.

https://twitter.com/fakeunicode/status/1197282221503041537

There are several amusing examples in the thread linked from this tweet.


I can confirm that I tried it and found the max to be 140 times ﷽


yeah I didn't know it until i tried to copy-paste to post here :)


I encountered an amusing instance of this recently watching my six-year-old son playing music on the kitchen Alexa. Alexa felt it was necessary to censor the name of a children’s song entitled, “Pussy Cat, Pussy Cat.”


When I saw the title I thought it was a list of profanity that one might want to filter out from an open web application (i.e. a list that also includes swear words from multiple languages).


That’s almost worse than not censoring it heh.


Also tangentially related: the big list of usernames that should be disallowed in any online system: https://github.com/forwardemail/reserved-email-addresses-lis...


Ugh, that list might be why my email address mail@[personal domain] is forbidden more and more often.


I use a catch-all and mark some handles as spam once a website has been hacked, or the email has been caught by spammers.

I trust you've probably adopted a similar workaround for problem websites.

For throwaway websites (where I use a password I have no intention of keeping track of), I often sign up as "spam@domain". This is, surprisingly, blocked in a lot of instances.


This looks overly opinionated: I can’t have co-op but I could have accounting?

I’d rather see a list with justifications.


Why would they disallow every language name? This list says it's for "security concerns"... are they trying to curb social engineering by just disallowing words?

[append]

Hmm, I guess this list should be declared as "should not be allowed to be created by users of SHARED email hosts".


Interesting that "zuck" is disallowed.



What? Only 151 Russian words? The Russians have their own dedicated sub-language that consists solely of bad words. No idea or concept is too complicated to be expressed in bad words alone. They switch from normal Russian to bad-words Russian as soon as the situation allows it.


[citation needed], please, I wanna read about this beautiful gem of a sub-language!

All I can find with some quick searching is Wikipedia's page on 'mat'[1], which seems to be pretty similar to Carlin's list of Seven Words You Can't Say On TV[2] rather than an entire language of vulgarity.

1: https://en.wikipedia.org/wiki/Mat_(Russian_profanity) 2: https://en.wikipedia.org/wiki/Seven_dirty_words


Unusual richness of bad Russian words is, I believe, mostly a myth from our stand-up comedians based on the loose facts that (1) there's not one central f-word, but several equally important roots, and (2) technically a lot of Russian words can correspond to one English word due to heavier use of prefixes ("fuck off" is written without whitespace in Russian).


Damn, I was suspecting something like that was the case from reading the WP pages I linked to and some of the ones they reference! Handing my SO an article about the intricate sub-language of Russian invective would have been a PERFECT piece of flirtation. (Yes we are nerds.)


I don't really have a citation, just personal experiences. But your wiki link already states: "David Remnick believes that mat has thousands of variations"

The German version of the wiki article has some examples:

пиздеть (pisdet′) = to (tell a) lie, but also possible: to steal.

пиздец (pisdez) = ruin or catastrophe, fubar, fucked up

As you can see, small variations of the same word have different meanings. And the meaning can vary with context.

Edit:

The Russian version of the article has some example verbs (all derived from the same word). I tried to translate them as far as I understand:

ебануть = ?,

ебануться = to get stupid ( to say or do something stupid ),

ебаться = to do something - very generic,

ебиздить = ?,

ёбнуть = ?,

ёбнуться = ?,

ебстись = ?,

въебать = to get something into something?,

выебать = to get something out of something?,

выёбываться = to fuck around. To decline something to somebody.

доебать = ?

доебаться = to annoy somebody by asking too many questions or trying to get something out of somebody.

доёбывать = to bring something to an end. to finish something,

заебать = to annoy somebody, to get to somebody

заебаться = to get fed up with something, to get tired of something,

наебать = to betray somebody?,

наебаться = to get saturated by something, to get enough of something

наебнуть = to cheat somebody,

наебнуться = for the fun of it, just do it,

объебать = to get a piece of something, to get familiar with something?,

объебаться = to get into something nasty, to fuck up,

остоебенить = blow your mind (not sure about this one)

остоебеть = ?,

отъебать = to fuck with something/somebody, to fight, to degrade somebody,

отъебаться = get rid of something, for example to pass exams or get rid of obligations,

переебать = (maybe) to understand something,

переебаться = to fuck over something, to get something done,

поебать = do something (depending on context),

поебаться = to do nonsense, try hard to no avail, fool around,

подъебать = ?,

подъебаться = ?,

подъебнуть = to make a joke, for example on April 1st, even flirt?,

разъебать = to (accidentally or purposefully) break something,

разъебаться = to clear a relationship, to bring everything to order?,

съебать = to get away (maybe in a cool way),

съебаться = to fuck off, to stop annoying,

уебать = to run away (maybe after stealing something)


This list mostly corresponds to my personal experiences as a native Russian speaker, and I've heard 70% used in the wild. To fill two of the ?'s:

  ёбнуть = to f*ck

  ёбнуться = to stumble or injure oneself

What's interesting is that this whole list only uses one root word, of which there are 3 more.


Thanks! Shit, that’s a fuckton of sweary words.


Hilarious, but also important!


.

.

.

.

.

.

ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็

Wow, what's this? :)


Layers upon layers of combining diacritics.


Have you tried using regular expressions to parse HTML?

https://stackoverflow.com/questions/1732348/regex-match-open...


Thai tone marks. On my browser they’re spaced out using a dotted circle, as you should really only ever have one per character
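
For anyone who wants to detect this kind of stacking in input, here is a minimal Python sketch. The function name `longest_mark_run` and the sample string are assumptions made for illustration; the sample is a shortened reconstruction, not the exact string posted above. It counts how many consecutive characters in a Unicode "Mark" category follow each other.

```python
import unicodedata


def longest_mark_run(text: str) -> int:
    """Length of the longest run of consecutive mark characters (categories M*)."""
    longest = current = 0
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# One Thai base letter followed by ten stacked marks (illustrative only).
sample = "\u0e14" + "\u0e49" * 5 + "\u0e47" * 5

if __name__ == "__main__":
    print(longest_mark_run(sample))
    print(longest_mark_run("plain ascii"))
```

A validator could flag input where this number is implausibly high, though what counts as "implausible" is a judgment call that depends on the scripts you need to support.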


It reminds me a little bit of Feynman diagrams.


I don't quite understand the purpose of this list. It contains potentially malicious input, but also emoticons based on Unicode characters that are completely harmless and used in every second post on Reddit.


I made the list while I was a Software QA Engineer at Apple, since there were a bunch of fun Unicode strings causing particular issues there, which gave me the idea.


I think the purpose is to run these strings through your inputs and make sure it doesn’t behave in unexpected ways.


It's essentially a test suite for character encoding all throughout your application. If you input all those strings (e.g. send chat message) and they arrive incorrectly at some other end (e.g. other user receiving chat message) then there's a problem somewhere.
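
A minimal Python sketch of that idea follows. The `SAMPLES` list is a handful of illustrative strings chosen for this example, not the list itself, and `survives_round_trip` and `lossy_latin1_hop` are hypothetical helpers. The second one simulates a buggy layer that decodes UTF-8 bytes as Latin-1, which is one common way a string arrives incorrectly at the other end.

```python
import json

# Illustrative inputs only; a real test would feed in the whole list.
SAMPLES = [
    "plain ascii",
    "caf\u00e9",          # accented Latin letter
    "\u7530\u4e2d",       # CJK characters
    "\ufdfd",             # the wide ligature discussed in this thread
    "\U0001F60D",         # non-BMP emoji
    "'; DROP TABLE users;--",
]


def survives_round_trip(s: str) -> bool:
    """True if the string is unchanged after a UTF-8 hop and a JSON hop."""
    via_utf8 = s.encode("utf-8").decode("utf-8")
    via_json = json.loads(json.dumps(s))
    return via_utf8 == s and via_json == s


def lossy_latin1_hop(s: str) -> str:
    """Simulates a buggy layer that reads UTF-8 bytes as Latin-1."""
    return s.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    for s in SAMPLES:
        print(repr(s), survives_round_trip(s), lossy_latin1_hop(s) == s)
```

In a real application the two hops would be replaced by whatever path the data actually takes, for example the chat message example above.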


That makes sense. Thanks a lot! Of course, it's very useful for testing. I erroneously assumed it was for input validation.


There are a lot of ways to mishandle Unicode. Checking that non-BMP characters work, that emojis in various sections work, and that emojis with modifiers work are all good tests.
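
A small Python sketch of why these cases are worth testing: the same string has different "lengths" depending on whether you count code points or UTF-16 code units. The helper names `utf16_units` and `is_non_bmp` and the two sample constants are made up for this example.

```python
def utf16_units(s: str) -> int:
    """Number of UTF-16 code units needed to encode the string."""
    return len(s.encode("utf-16-le")) // 2


def is_non_bmp(ch: str) -> bool:
    """True if a single character lies outside the Basic Multilingual Plane."""
    return ord(ch) > 0xFFFF


HEART_EYES = "\U0001F60D"               # one non-BMP emoji
THUMBS_MEDIUM = "\U0001F44D\U0001F3FD"  # emoji followed by a skin-tone modifier

if __name__ == "__main__":
    for label, s in [("emoji", HEART_EYES), ("emoji+modifier", THUMBS_MEDIUM)]:
        print(label, len(s), utf16_units(s))
```

Code that truncates or indexes by UTF-16 units can cut such a string in the middle of a surrogate pair or between an emoji and its modifier.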


It's useful for testing a variety of things that take text/string inputs, such as forms in web applications. It's a handy tool for testing a site (preferably one you have permission to test) for XSS or SQL injection, character encoding problems, or even just form length problems.


Strongly advise not using cat on the list, you will get beeped at.


would that be considered animal abuse :D


Yup, can't view the file using the GitHub app for Android


Out of curiosity, what happens when you try to do so?


Something went wrong

<button>TRY AGAIN<button>

Edit: as far as I could see, it's only opening blns.txt that causes this error; the other files are fine in the app.


The latest commit message of README is "Merge branch 'master' into master" [1]. As someone who doesn't do git, what does that even mean? Does git allow multiple branches with the same name?

[1]: https://github.com/minimaxir/big-list-of-naughty-strings/com...


The GitHub flavor of git essentially has two parts:

Repository (i.e., minimaxir/big-list-of-naughty-strings), and branch (i.e., master)

In this case, someone merged eliabieri/big-list-of-naughty-strings at master to minimaxir/big-list-of-naughty-strings at master via a pull request [1].

[1] https://github.com/minimaxir/big-list-of-naughty-strings/com...


Usually it means they merged 'origin/master' with their local 'master'.

Git is distributed, so everyone can have their own copy of the same branch.


Jimmy Clitheroe - the Clitheroe Kid. That brings back some memories. It's also nice to see that England is suitably represented in the place names, obviously Scunthorpe is the classic. I'll tender Somerset for first amongst equals for daft and downright odd place names.


We need a similar list of weird unconventional emails to make sure every new registration form won't erroneously reject a valid email.

The number of times I get validation errors or some unexpected crashes when I enter my fairly pedestrian email with + sign in it... Jeez.
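
A sketch of how this typically goes wrong, in Python. Both patterns are invented for this example: `TOO_STRICT` mimics the kind of hand-rolled validator that forgets about `+`, and `PERMISSIVE` is a deliberately loose check that only asks for something before and after an `@`, with a dot in the domain part. Neither is meant as a complete email validator.

```python
import re

# Hypothetical over-strict pattern: the local part does not allow '+'.
TOO_STRICT = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical loose pattern: just "something@something.something".
PERMISSIVE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def accepts(pattern: re.Pattern, address: str) -> bool:
    """True if the whole address matches the pattern."""
    return pattern.fullmatch(address) is not None


if __name__ == "__main__":
    address = "jane+news@example.com"
    print(accepts(TOO_STRICT, address))
    print(accepts(PERMISSIVE, address))
```

A list of unusual but valid addresses could be run through `accepts` for every form's validator in the same way the naughty strings are run through inputs.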



This one got me:

> "If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you."

Strangely terrifying....


Eh, it's just a meme:

https://www.reddit.com/r/copypasta/comments/5we0ny/if_youre_...

I guess it is intriguing, in a Roko's Basilisk sort of way.


The question is, which one of us is the message meant for?


Why would there be more than one of you?


Why are there so many of me posting in this thread?


Username checks out.


Who says it's only meant for one?


the real question is: Inception. Can it be done?


I did feel sort of let down that they didn't have man-hole cover.

on edit: yeah, I'm not gonna send a pull request on that one.


Neat. I recently found that googling the Japanese Post Office emoji results in a totally borked SERP (cross-browser, desktop, including desktop mode on Android Chrome). I assume there are other characters as well.


I just do:

  .matches("[a-zA-Z0-9.\\-]+")

And prepared statements or my own NoSQL.
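
For comparison, a Python sketch of the same two ideas. The character class is copied from the snippet above and checked with `re.fullmatch`, which requires the whole string to match; the parameterized query uses the standard-library `sqlite3` module with an in-memory database. The table name `notes` and the helper names are assumptions made for this example.

```python
import re
import sqlite3

# Same allow-list as the snippet above.
ALLOWED = re.compile(r"[a-zA-Z0-9.\-]+")


def is_allowed(value: str) -> bool:
    """True only if every character is in the allow-list."""
    return ALLOWED.fullmatch(value) is not None


def store_and_fetch(value: str) -> str:
    """Insert a value with a parameterized query and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE notes (body TEXT)")
        # The '?' placeholder keeps the value out of the SQL text itself.
        conn.execute("INSERT INTO notes (body) VALUES (?)", (value,))
        row = conn.execute("SELECT body FROM notes").fetchone()
        return row[0]
    finally:
        conn.close()


if __name__ == "__main__":
    hostile = "Robert'); DROP TABLE notes;--"
    print(is_allowed(hostile))
    print(store_and_fetch(hostile) == hostile)
```

Note the trade-off: the allow-list rejects every non-ASCII string on the list outright, while the parameterized query stores hostile input unchanged without executing it.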



