The Big List of Naughty Strings

folkhack · on May 24, 2020

Solid list for a quick SQL injection and XSS reference with lots of examples. Even unicode/accents/two-byte characters etc are super useful to check handling on all the way from the front-end to the persistent storage solution (DB, etc).

Lost it laughing at "Human Injection" section:

> # Strings which may cause human to reinterpret worldview

> If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you.

yosito · on May 24, 2020

I would wake up if I could, but I opened this string in vim and I can't remember how to exit.

Konohamaru · on May 25, 2020

That was a bastardization of the original one on 4chan's /x/. This one is the real one:

> It has been reported that some victims of torture, during the act, would retreat into a fantasy world from which they could not WAKE UP. In this catatonic state, the victim lived in a world just like their normal one, except they weren’t being tortured. The only way that they realized they needed to WAKE UP was a note they found in their fantasy world. It would tell them about their condition, and tell them to WAKE UP. Even then, it would often take months until they were ready to discard their fantasy world and PLEASE WAKE UP.

yakshaving_jgt · on May 25, 2020

> This one is the real one

Are you sure? I think this idea has been explored in several novels and films over the past several decades.

Konohamaru · on May 25, 2020

The /x/ copypasta was the one I saw first, sometime around 2007, and only later did I see the one quoted in the file. It's possible that the file's story was earlier.

dejj · on May 25, 2020

Never seen it before on film. The Manchurian Candidate isn't it. Can you give me some pointers?

dorkwood · on May 26, 2020

I know the Futurama episode "The Sting" explores this idea.

yakshaving_jgt · on May 25, 2020

Doesn’t The Matrix explore this idea?

james-skemp · on May 25, 2020

See also Ghost in the Shell 2: Innocence. There's a lot of hacking going on in that one leading to some very interesting scenes.

Granted, they're not comatose, but the sensations their minds are receiving do not match reality.

rsecora · on May 25, 2020

Wake up, Neo... The Matrix has you...

HeadsUpHigh · on May 25, 2020

Sucker Punch

ancarda · on May 25, 2020

I know I first came across BLNS about 5 years ago because I was 20 years old when I first saw that.

Suffice to say it scared the shit out of me!

foresto · on May 24, 2020

> Please wake up, we miss you.

I think that sentence gives itself away as modern. Were comma splices in common use 20 years ago?

frank2 · on May 24, 2020

Yes they were. IIRC at least one of the major manuals of style endorsed them at least in some situations.

cperrine · on May 24, 2020

I've been curious about this sort of thing, but I haven't been able to find good info on it. For example, it seems to me that the format of "It's not x, it's y" is widely accepted. I'm not sure if that's by simple common use or by way of some nuance that I'm not aware of. Certainly in some cases it can be difficult to avoid both comma splices and sounding stilted. I'd love to know more -- I don't seem to recall anything about it from when I had a copy of the Chicago Manual of Style.

ianamartin · on May 25, 2020

No one accepts "It's not x, it's y" as correct. Well, maybe some illiterate Philistines do. But no one really thinks that's correct.

You have to put a semicolon in place of the comma or it is just plain wrong.

maxfan8 · on May 24, 2020

That’s interesting. Maybe it’s considered hyper correct?

Which style manual are you referring to?

frank2 · on May 25, 2020

Don't remember. This was 30 or 35 years ago.

There was a long tradition in English and maybe also in other European languages of choosing punctuation depending on how long a pause a speaker would take at the point, with a comma denoting a shorter pause than a semicolon, which in turn denoted a shorter pause than a colon, which in turn denoted a shorter pause than a period -- instead of using punctuation to convey some semantic distinction as most writers do today. The comma splice might be a vestige of that tradition.

gnramires · on May 25, 2020

But the pauses are semantic in a way. Those concepts cannot be well separated. Not a linguist, but I can tell pauses are used by speakers and listeners for a variety of reasons, linked with both semantics and delivery -- to separate parts, to give and demarcate "processing time" for speakers and listeners, to create and satisfy expectation, etc. The cadence of speech is semantic (though its semantics are not of a simple propositional logic type).

Timpy · on May 25, 2020

I'm reading Permutation City right now and that message is really messing with my head.

niklasbuschmann · on May 24, 2020

This messes with my mind

afandian · on May 24, 2020

This is deliciously ironic:

> Also, do not send a null character (U+0000) string, as it changes the file format on GitHub to binary and renders it unreadable in pull requests.

Seb-C · on May 25, 2020

For a long time I had the same problem with Japanese language files being shown as binary in the GitHub diffs, and it was solved by having something like this in the .gitattributes file:

    *.php diff

Overall I am amazed that everything shows properly in GitHub. https://github.com/minimaxir/big-list-of-naughty-strings/blo...

dang · on May 24, 2020

See also:

2018 https://news.ycombinator.com/item?id=18466787

2017 https://news.ycombinator.com/item?id=13406119

Show HN from 2015: https://news.ycombinator.com/item?id=10035008

harunurhan · on May 24, 2020

OK, seeing "﷽" [1] was unexpected :). For those who do not know, it's very important for Muslims and it's all over the Quran.

[1] https://github.com/minimaxir/big-list-of-naughty-strings/blo...

nwallin · on May 25, 2020

It's a very unusual "character" as it can sometimes be rendered very, very wide: the same width as maybe 20 or so "normal" characters (whatever "normal" means; the more I learn about Unicode...). That is the same width as if you had typed the phrase out in a sentence.

It breaks a lot of UIs that use character count as a proxy for string width.
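To illustrate the point about character count being a poor proxy for width, here is a minimal Python sketch. The helper name `size_report` is made up for this example; it reports three common "length" measures for U+FDFD, none of which says anything about how wide a font will draw it.

```python
# Illustrative only: three ways of "counting" the single character U+FDFD.
basmala = "\uFDFD"  # ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM


def size_report(text: str) -> dict:
    """Return several length measures; none of them is the rendered width."""
    return {
        "code_points": len(text),                            # Python str length
        "utf8_bytes": len(text.encode("utf-8")),             # storage size in UTF-8
        "utf16_units": len(text.encode("utf-16-le")) // 2,   # e.g. JavaScript .length
    }


print(size_report(basmala))
print(size_report("hello"))
```

A layout that reserves space from any of these numbers will treat ﷽ as roughly one column wide, which is why measuring the rendered text is usually the safer approach.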

totetsu · on May 25, 2020

Given that writing systems we have today are pre-digital systems of marks made with the human body and tools, all encoding of text in our HCI is a simulation of writing. Simulations by nature cannot be 100% accurate. Choices have to be made about what is represented and simulated. Choices have to be made about what to leave out. Simulations are a form of communication; they put forward a perspective about what is the important part of what is being simulated. Seeing a symbol like "﷽" really brought home for me some of the assumptions about writing that I take for granted, and are inherently "argued for" by the way text is handled in our digital systems.

voxic11 · on May 24, 2020

commit that added it if anyone is interested https://github.com/minimaxir/big-list-of-naughty-strings/com...

lopmotr · on May 24, 2020

Can't it be made up of individual characters or is it stylized in a unique way?

cheez · on May 24, 2020

what does it mean?

gnulinux · on May 24, 2020

"In the name of God, the Most Gracious, the Most Merciful."

From Wikipedia: https://en.wikipedia.org/wiki/Basmala

Disclaimer: I'm not a Muslim, I don't know Arabic.

mmastrac · on May 24, 2020

https://charbase.com/fdfd-unicode-arabic-ligature-bismillah-...

Also fun is ﷺ, (https://charbase.com/fdfa-unicode-arabic-ligature-sallallaho...) which has the longest unicode decomposition IIRC.
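The decomposition claim can be checked with Python's standard `unicodedata` module. This is a small sketch; the helper name `nfkd_expansion` is hypothetical, and the exact result depends on the Unicode data shipped with your Python build, though U+FDFA is the usual example of an 18-code-point expansion under NFKD.

```python
import unicodedata


def nfkd_expansion(ch: str) -> int:
    """Number of code points a string becomes after NFKD normalization."""
    return len(unicodedata.normalize("NFKD", ch))


salla = "\uFDFA"  # the ligature mentioned above
print(unicodedata.name(salla))
print(nfkd_expansion(salla))   # one code point in, many out
print(nfkd_expansion("a"))     # plain ASCII is unchanged
```

This matters for anything that normalizes input before storing it: a field sized for N characters can overflow after normalization.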

beobab · on May 24, 2020

I had to zoom in to 400% to be able to see the detail there.

ctdonath · on May 24, 2020

https://www.urbandictionary.com/define.php?term=%EF%B7%BD

Fun fact: it’s a single Unicode character.

atomwaffel · on May 24, 2020

Yup, you can put 280 of it into a single tweet.

robinhouston · on May 24, 2020

I don’t think that’s right. I looked into the way Twitter counts characters when I was trying to work out the largest prime number that could be written out in full, in base ten, in a single tweet[1]; the rules are more complicated than you might expect, and have changed several times.

The current rule seems to be that all Unicode characters count as two, except for the ranges 0–4351, 8192–8205, 8208–8223 and 8242–8247 which count as one.

[1] In case you’re wondering, I think it’s, arguably: https://twitter.com/robinhouston/status/1197294154738544641

atomwaffel · on May 24, 2020

Good point! Still, I could swear I saw someone (@FakeUnicode?) do exactly this once, but of course I can’t find that tweet any more, partly because it turns out that search engines don’t handle ﷽ well at all, and I don’t feel like testing it on my own followers somehow.

Edit: it looks like it might count it as two characters, so that’s only 140 per tweet.

robinhouston · on May 24, 2020

That’s definitely possible! @FakeUnicode mentioned in the discussion that, when 280-character tweets were first introduced in September 2017, it was possible to tweet 280 single-codepoint emoji using TweetDeck.

https://twitter.com/fakeunicode/status/1197282221503041537

There are several amusing examples in the thread linked from this tweet.

cmehdy · on May 24, 2020

I can confirm that I tried it and found the max to be 140 times ﷽
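The counting rule quoted above can be sketched in a few lines of Python. This is only an approximation of the rule as robinhouston states it: the names `weighted_length` and `fits_in_tweet` are made up, and the sketch ignores URLs, emoji sequences, and normalization, which a real counter would also have to handle.

```python
# Code point ranges that count as one, as listed in the comment above.
SINGLE_WEIGHT_RANGES = [(0, 4351), (8192, 8205), (8208, 8223), (8242, 8247)]
LIMIT = 280  # assumed limit, per the thread


def weighted_length(text: str) -> int:
    """Count each code point as 1 if it falls in a listed range, else 2."""
    total = 0
    for ch in text:
        cp = ord(ch)
        is_single = any(lo <= cp <= hi for lo, hi in SINGLE_WEIGHT_RANGES)
        total += 1 if is_single else 2
    return total


def fits_in_tweet(text: str) -> bool:
    return weighted_length(text) <= LIMIT


print(weighted_length("\uFDFD" * 140), fits_in_tweet("\uFDFD" * 140))
print(weighted_length("\uFDFD" * 141), fits_in_tweet("\uFDFD" * 141))
```

U+FDFD is code point 65021, outside every listed range, so under this rule it counts as two, which is consistent with the 140 limit reported above.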

harunurhan · on May 24, 2020

yeah I didn't know it until i tried to copy-paste to post here :)

dhosek · on May 24, 2020

I encountered an amusing instance of this recently watching my six-year-old son playing music on the kitchen Alexa. Alexa felt it was necessary to censor the name of a children’s song entitled, “Pussy Cat, Pussy Cat.”

inetsee · on May 24, 2020

When I saw the title I thought it was a list of profanity that one might want to filter out from an open web application (i.e. a list that also includes swear words from multiple languages).

swiley · on May 25, 2020

That’s almost worse than not censoring it heh.

jzl · on May 24, 2020

Also tangentially related: the big list of usernames that should be disallowed in any online system: https://github.com/forwardemail/reserved-email-addresses-lis...

DominikPeters · on May 24, 2020

Ugh, that list might be why my email address mail@[personal domain] is forbidden more and more often.

Mandatum · on May 25, 2020

I use a catch-all and mark some handles as spam once a website has been hacked, or the email has been caught by spammers.

I trust you've probably adopted a similar workaround for problem websites.

For throwaway websites (where I use a password I have no intention of keeping track of), I often sign up as "spam@domain". This is blocked in a surprising number of instances.

owenmarshall · on May 24, 2020

This looks overly opinionated: I can’t have co-op but I could have accounting?

I’d rather see a list with justifications.

CGamesPlay · on May 25, 2020

Why would they disallow every language name? This list says it's for "security concerns"... are they trying to curb social engineering by just disallowing words?

[append]

Hmm, I guess this list should be declared as "should not be allowed to be created by users of SHARED email hosts".

chucksmash · on May 25, 2020

Interesting that "zuck" is disallowed.

montroser · on May 24, 2020

Almost related: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and...

dorgo · on May 24, 2020

What? Only 151 Russian words? The Russians have their own dedicated sub-language which consists solely of bad words. No idea or concept is too complicated to be expressed in bad words alone. They switch from normal Russian to bad-words Russian as soon as the situation allows it.

egypturnash · on May 25, 2020

[citation needed], please, I wanna read about this beautiful gem of a sub-language!

All I can find with some quick searching is Wikipedia's page on 'mat'[1], which seems to be pretty similar to Carlin's list of Seven Words You Can't Say On TV[2] rather than an entire language of vulgarity.

1: https://en.wikipedia.org/wiki/Mat_(Russian_profanity) 2: https://en.wikipedia.org/wiki/Seven_dirty_words

owl57 · on May 25, 2020

Unusual richness of bad Russian words is, I believe, mostly a myth from our stand-up comedians based on the loose facts that (1) there's not one central f-word, but several equally important roots, and (2) technically a lot of Russian words can correspond to one English word due to heavier use of prefixes ("fuck off" is written without whitespace in Russian).

egypturnash · on May 25, 2020

Damn, I was suspecting something like that was the case from reading the WP pages I linked to and some of the ones they reference! Handing my SO an article about the intricate sub-language of Russian invective would have been a PERFECT piece of flirtation. (Yes we are nerds.)

dorgo · on May 25, 2020

I don't really have a citation, just personal experiences. But your wiki link already states: "David Remnick believes that mat has thousands of variations"

The German version of the wiki article has some examples:

пиздеть (pisdet′) = to (tell a) lie, but also possible: to steal.

пиздец (pisdez) = ruin or catastrophe, fubar, fucked up

As you can see, small variations of the same word have different meanings. And the meaning can vary with context.

Edit:

The Russian version of the article has some example verbs (all derived from the same word). I tried to translate them as far as I understand.

ебануть = ?,

ебануться = to get stupid ( to say or do something stupid ),

ебаться = to do something - very generic,

ебиздить = ?,

ёбнуть = ?,

ёбнуться = ?,

ебстись = ?,

въебать = to get something into something?,

выебать = to get something out of something?,

выёбываться = to fuck around. To decline something to somebody.

доебать = ?

доебаться = to annoy somebody by asking too many questions or trying to get something out of somebody.

доёбывать = to bring something to an end. to finish something,

заебать = to annoy somebody, to get to somebody

заебаться = to get fed up with something, to get tired of something,

наебать = to betray somebody?,

наебаться = to get saturated by something, to get enough of something

наебнуть = to cheat somebody,

наебнуться = for the fun of it, just do it,

объебать = to get a piece of something, to get familiar with something?,

объебаться = to get into something nasty, to fuck up,

остоебенить = blow your mind (not sure about this one)

остоебеть = ?,

отъебать = to fuck with something/somebody, to fight, to degrade somebody,

отъебаться = get rid of something, for example to pass exams or get rid of obligations,

переебать = (maybe) to understand something,

переебаться = to fuck over something, to get something done,

поебать = do something (depending on context),

поебаться = to do nonsense, try hard to no avail, fool around,

подъебать = ?,

подъебаться = ?,

подъебнуть = to make a joke, for example on April 1st, even flirt?,

разъебать = to (accidentally or purposefully) break something,

разъебаться = to clear a relationship, to bring everything to order?,

съебать = to get away (maybe in a cool way),

съебаться = to fuck off, to stop annoying,

уебать = to run away (maybe after stealing something)

teetertater · on May 25, 2020

This list mostly corresponds to my personal experiences as a native Russian speaker, and I've heard 70% used in the wild. To fill two of the ?'s:

  ёбнуть = to f*ck

  ёбнуться = to stumble or injure oneself

What's interesting is that this whole list only uses one root word, of which there are 3 more

egypturnash · on May 25, 2020

Thanks! Shit, that’s a fuckton of sweary words.

jzl · on May 24, 2020

Hilarious, but also important!

Udik · on May 24, 2020

.

ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็

Wow, what's this? :)

majewsky · on May 24, 2020

Layers upon layers of combining diacritics.
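One rough way to spot input like this is to measure the longest run of combining marks. The sketch below is only illustrative: the function name `longest_mark_run` is hypothetical, and the sample string is a shortened stand-in built from the same Thai code points (U+0E14 followed by U+0E49 and U+0E47), not the exact string posted above.

```python
import unicodedata


def longest_mark_run(text: str) -> int:
    """Length of the longest run of consecutive combining marks (category M*)."""
    longest = current = 0
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# One base letter followed by ten stacked marks.
sample = "\u0E14" + "\u0E49" * 5 + "\u0E47" * 5
print(longest_mark_run(sample))
print(longest_mark_run("hello"))
```

The check uses the general category rather than `unicodedata.combining()`, because some marks (U+0E47 among them) have a combining class of 0 and would be missed otherwise. Any threshold for "too many" is a judgment call, since legitimate text in some scripts stacks more than one mark.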

nwallin · on May 25, 2020

Have you tried using regular expressions to parse HTML?

https://stackoverflow.com/questions/1732348/regex-match-open...

peteretep · on May 25, 2020

Thai tone marks. On my browser they’re spaced out using a dotted circle, as you should really only ever have one per character

EvanAnderson · on May 24, 2020

It reminds me a little bit of Feynman diagrams.

13415 · on May 24, 2020

I don't quite understand the purpose of this list. It contains potentially malicious input, but also emoticons based on Unicode characters that are completely harmless and used in every second post on Reddit.

minimaxir · on May 24, 2020

I made the list while I was a Software QA Engineer at Apple, since there were a bunch of fun Unicode strings causing particular issues there, which gave me the idea.

kube-system · on May 24, 2020

I think the purpose is to run these strings through your inputs and make sure it doesn’t behave in unexpected ways.

MauranKilom · on May 24, 2020

It's essentially a test suite for character encoding all throughout your application. If you input all those strings (e.g. send chat message) and they arrive incorrectly at some other end (e.g. other user receiving chat message) then there's a problem somewhere.
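A round-trip check of this kind might look like the following Python sketch, which uses an in-memory SQLite database as a stand-in for "the other end". The table name, the `round_trip` helper, and the sample strings are all assumptions made for illustration; the samples are in the spirit of the list, not copied from it.

```python
import sqlite3

# Hypothetical samples: injection-style text, markup, and awkward Unicode.
SAMPLES = [
    "Robert'); DROP TABLE messages;--",
    "<script>alert(1)</script>",
    "\uFDFD",
    "\u0E14" + "\u0E49" * 5,
    "\U0001F469\U0001F3FD",  # non-BMP emoji with a skin-tone modifier
]


def round_trip(strings):
    """Store each string, read it back, and return the ones that changed."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
    failures = []
    for s in strings:
        # Parameterized query: the string is passed as data, never as SQL.
        cur = conn.execute("INSERT INTO messages (body) VALUES (?)", (s,))
        row = conn.execute(
            "SELECT body FROM messages WHERE id = ?", (cur.lastrowid,)
        ).fetchone()
        if row[0] != s:
            failures.append(s)
    conn.close()
    return failures


print(round_trip(SAMPLES))  # an empty list means everything survived intact
```

In a real application the interesting failures tend to appear further along the path (HTTP layers, templating, other services), so the same comparison is worth running end to end rather than against the database alone.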

13415 · on May 24, 2020

That makes sense. Thanks a lot! Of course, it's very useful for testing. I erroneously assumed it was for input validation.

Dylan16807 · on May 24, 2020

There's a lot of ways to mishandle unicode. Checking that non-BMP characters work, that emojis in various sections work, and that emojis with modifiers work are all good tests.

toomanybeersies · on May 24, 2020

It's useful for testing a variety of things that take text/string inputs, such as forms in web applications. It's a handy tool for testing a site (preferably one you have permission to test) for XSS or SQL injection, character encoding problems, or even just form length problems.

chris_wot · on May 24, 2020

Strongly advise not using cat on the list, you will get beeped at.

fareesh · on May 24, 2020

would that be considered animal abuse :D

monax · on May 24, 2020

Yup, can't view the file using the GitHub app for Android

minimaxir · on May 24, 2020

Out of curiosity, what happens when you try to do so?

Johnjonjoan · on May 24, 2020

Something went wrong

<button>TRY AGAIN<button>

Edit: as far as I could see, it's only opening blns.txt that causes this error; the other files are fine in the app.

gitnewbie · on May 25, 2020

The latest commit message of README is "Merge branch 'master' into master" [1]. As someone who doesn't do git, what does that even mean? Does git allow multiple branches with the same name?

[1]: https://github.com/minimaxir/big-list-of-naughty-strings/com...

colinchartier · on May 25, 2020

The GitHub flavor of git essentially has two parts:

Repository (e.g., minimaxir/big-list-of-naughty-strings), and branch (e.g., master)

In this case, someone merged eliabieri/big-list-of-naughty-strings at master to minimaxir/big-list-of-naughty-strings at master via a pull request [1].

[1] https://github.com/minimaxir/big-list-of-naughty-strings/com...

kenhwang · on May 25, 2020

Usually it means they merged 'origin/master' with their local 'master'.

Git is distributed, so everyone can have their own copy of the same branch.

gerdesj · on May 24, 2020

Jimmy Clitheroe - the Clitheroe Kid. That brings back some memories. It's also nice to see that England is suitably represented in the place names, obviously Scunthorpe is the classic. I'll tender Somerset for first amongst equals for daft and downright odd place names.

bloody-crow · on May 25, 2020

We need a similar list of weird unconventional emails to make sure every new registration form won't erroneously reject a valid email.

The number of times I get validation errors or some unexpected crashes when I enter my fairly pedestrian email with + sign in it... Jeez.

toolslive · on May 24, 2020

https://github.com/minimaxir/big-list-of-naughty-strings/blo...

just lovely ;)

duggable · on May 24, 2020

This one got me:

> "If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you."

Strangely terrifying....

ball_of_lint · on May 24, 2020

Eh, it's just a meme:

https://www.reddit.com/r/copypasta/comments/5we0ny/if_youre_...

I guess it is intriguing, in a Roko's Basilisk sort of way.

willismichael · on May 24, 2020

The question is, which one of us is the message meant for?

naniwaduni · on May 24, 2020

Why would there be more than one of you?

myself248 · on May 24, 2020

Why are there so many of me posting in this thread?

jbay808 · on May 24, 2020

Username checks out.

was_boring · on May 24, 2020

Who says it's only meant for one?

bryanrasmussen · on May 24, 2020

the real question is: Inception. Can it be done?

bryanrasmussen · on May 24, 2020

I did feel sort of let down that they didn't have man-hole cover.

on edit: yeah, I'm not gonna send a pull request on that one.

hunter2_ · on May 25, 2020

Neat. I recently found that googling the Japanese Post Office emoji results in a totally borked SERP (cross-browser, desktop, including desktop mode on Android Chrome). I assume there are other characters as well.

bullen · on May 25, 2020

I just do:

  .matches("[a-zA-Z0-9.\\-]+")

And prepared statements or my own NoSQL.
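For comparison, here is a rough Python equivalent of that allowlist check, assuming the snippet above is Java's `String.matches`, which only succeeds when the whole input matches. The names `ALLOWED` and `is_allowed` are made up for this sketch.

```python
import re

# Same character class as the snippet above: ASCII letters, digits, dot, hyphen.
ALLOWED = re.compile(r"[a-zA-Z0-9.\-]+")


def is_allowed(value: str) -> bool:
    """True only if the entire string consists of allowlisted characters."""
    # fullmatch (not match or search) mirrors Java's whole-string semantics
    # and also rejects a trailing newline.
    return ALLOWED.fullmatch(value) is not None


for candidate in ["user-name.01", "", "a b", "abc\n", "\uFDFD"]:
    print(repr(candidate), is_allowed(candidate))
```

An allowlist this strict rejects every non-ASCII name, so it suits identifiers such as usernames or hostnames better than free text, where parameterized queries and output escaping do the protective work.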