Solid list for a quick SQL injection and XSS reference with lots of examples. Even the Unicode, accented, and two-byte characters are super useful for checking handling all the way from the front end to the persistent storage solution (DB, etc.).
Lost it laughing at "Human Injection" section:
> # Strings which may cause human to reinterpret worldview
> If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you.
That was a bastardization of the original one on 4chan's /x/. This one is the real one:
> It has been reported that some victims of torture, during the act, would retreat into a fantasy world from which they could not WAKE UP. In this catatonic state, the victim lived in a world just like their normal one, except they weren’t being tortured. The only way that they realized they needed to WAKE UP was a note they found in their fantasy world. It would tell them about their condition, and tell them to WAKE UP. Even then, it would often take months until they were ready to discard their fantasy world and PLEASE WAKE UP.
The /x/ copypasta was the one I saw first, sometime around 2007, and only later did I see the one quoted in the file. It's possible that the file's story was earlier.
I've been curious about this sort of thing, but I haven't been able to find good info on it. For example, it seems to me that the format of "It's not x, it's y" is widely accepted. I'm not sure if that's by simple common use or by way of some nuance that I'm not aware of. Certainly in some cases it can be difficult to avoid both comma splices and sounding stilted. I'd love to know more -- I don't seem to recall anything about it from when I had a copy of the Chicago Manual of Style.
There was a long tradition in English and maybe also in other European languages of choosing punctuation depending on how long a pause a speaker would take at the point, with a comma denoting a shorter pause than a semicolon, which in turn denoted a shorter pause than a colon, which in turn denoted a shorter pause than a period -- instead of using punctuation to convey some semantic distinction as most writers do today. The comma splice might be a vestige of that tradition.
But the pauses are semantic in a way. Those concepts cannot be well separated. Not a linguist, but I can tell pauses are used by speakers and listeners for a variety of reasons, linked with both semantics and delivery -- to separate parts, to give and demarcate "processing time" for speakers and listeners, to create and satisfy expectation, etc. The cadence of speech is semantic (though its semantics are not of a simple propositional logic type).
For a long time I had the same problem with Japanese-language files being shown as binary in GitHub diffs, and it was solved by having something like this in the .gitattributes file
It's a very unusual "character", as it can sometimes be rendered very, very wide: the same width as maybe 20 or so "normal" characters (whatever "normal" means; the more I learn about Unicode...). The same width as if you had typed it out in a sentence.
It breaks a lot of UIs that use character count as a proxy for string width.
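To make that concrete, here is a minimal Python sketch (standard library only) of a few common ways to "measure" a string. The `describe` helper is a made-up name for illustration. None of these numbers says anything about rendered width, which depends on the font and the text shaping engine.

```python
BASMALA = "\uFDFD"  # the single code point U+FDFD, "﷽"

def describe(s: str) -> dict:
    """Return several different notions of the 'length' of a string."""
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
    }

print(describe(BASMALA))
print(describe("a"))
```

Both strings report one code point, so any layout that budgets space by `len()` treats them the same even though one renders many times wider.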
Given that the writing systems we have today are pre-digital systems of marks made with the human body and tools, all encoding of text in our HCI is a simulation of writing. Simulations by nature cannot be 100% accurate. Choices have to be made about what is represented and simulated. Choices have to be made about what to leave out. Simulations are a form of communication; they put forward a perspective about what is the important part of what is being simulated.
Seeing a symbol like "﷽" really brought home for me some of the assumptions about writing that I take for granted, and are inherently "argued for" by the way text is handled in our digital systems.
I don’t think that’s right. I looked into the way Twitter counts characters when I was trying to work out the largest prime number that could be written out in full, in base ten, in a single tweet[1]; the rules are more complicated than you might expect, and have changed several times.
The current rule seems to be that all Unicode characters count as two, except for the ranges 0–4351, 8192–8205, 8208–8223 and 8242–8247 which count as one.
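A rough sketch of that rule in Python, using only the ranges quoted above. The function name `weighted_length` is made up for illustration, and this deliberately ignores anything else the real counting may do (normalization, URL handling, emoji sequences), so treat it as an approximation rather than Twitter's actual implementation.

```python
# Inclusive code point ranges that count as one, as quoted in the comment above.
SINGLE_WEIGHT_RANGES = [(0, 4351), (8192, 8205), (8208, 8223), (8242, 8247)]

def weighted_length(text: str) -> int:
    """Count each code point as 1 if it falls in a listed range, else 2."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in SINGLE_WEIGHT_RANGES):
            total += 1
        else:
            total += 2
    return total

print(weighted_length("hello"))
print(weighted_length("\uFDFD" * 140))
```

Under this rule "﷽" (U+FDFD, decimal 65021) falls outside every listed range and counts as two, so 140 of them would use up a 280-character limit.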
Good point! Still, I could swear I saw someone (@FakeUnicode?) do exactly this once, but of course I can’t find that tweet any more, partly because it turns out that search engines don’t handle ﷽ well at all, and I don’t feel like testing it on my own followers somehow.
Edit: it looks like it might count it as two characters, so that’s only 140 per tweet.
That’s definitely possible! @FakeUnicode mentioned in the discussion that, when 280-character tweets were first introduced in September 2017, it was possible to tweet 280 single-codepoint emoji using TweetDeck.
I encountered an amusing instance of this recently watching my six-year-old son playing music on the kitchen Alexa. Alexa felt it was necessary to censor the name of a children’s song entitled, “Pussy Cat, Pussy Cat.”
When I saw the title I thought it was a list of profanity that one might want to filter out from an open web application (i.e. a list that also includes swear words from multiple languages).
I use a catch-all and mark some handles as spam once a website has been hacked, or the email has been caught by spammers.
I trust you've probably adopted a similar workaround for problem websites.
For throwaway websites (where I use a password I have no intention of keeping track of), I often sign up as "spam@domain". This is blocked in a surprising number of instances.
Why would they disallow every language name? This list says it's for "security concerns"... are they trying to curb social engineering by just disallowing words?
[append]
Hmm, I guess this list should be declared as "should not be allowed to be created by users of SHARED email hosts".
What? Only 151 Russian words? The Russians have their own dedicated sub-language which consists solely of bad words. No idea or concept is too complicated to be expressed in bad words alone. They switch from normal Russian to bad-words Russian as soon as the situation allows it.
[citation needed], please, I wanna read about this beautiful gem of a sub-language!
All I can find with some quick searching is Wikipedia's page on 'mat'[1], which seems to be pretty similar to Carlin's list of Seven Words You Can't Say On TV[2] rather than an entire language of vulgarity.
Unusual richness of bad Russian words is, I believe, mostly a myth from our stand-up comedians based on the loose facts that (1) there's not one central f-word, but several equally important roots, and (2) technically a lot of Russian words can correspond to one English word due to heavier use of prefixes ("fuck off" is written without whitespace in Russian).
Damn, I was suspecting something like that was the case from reading the WP pages I linked to and some of the ones they reference! Handing my SO an article about the intricate sub-language of Russian invective would have been a PERFECT piece of flirtation. (Yes we are nerds.)
I don't really have a citation, just personal experiences. But your wiki link already states: "David Remnick believes that mat has thousands of variations"
The German version of the wiki article has some examples:
пиздеть (pisdet′) = to (tell a) lie, but also possible: to steal.
пиздец (pisdez) = ruin or catastrophe, fubar, fucked up
As you can see, small variations of the same word have different meanings. And the meaning can vary with context.
Edit:
The Russian version of the article has some example verbs (all derived from the same word):
I tried to translate as far as I understand.
ебануть = ?,
ебануться = to get stupid ( to say or do something stupid ),
ебаться = to do something - very generic,
ебиздить = ?,
ёбнуть = ?,
ёбнуться = ?,
ебстись = ?,
въебать = to get something into something?,
выебать = to get something out of something?,
выёбываться = to fuck around. To decline something to somebody.
доебать = ?
доебаться = to annoy somebody by asking too many questions or trying to get something out of somebody.
доёбывать = to bring something to an end. to finish something,
заебать = to annoy somebody, to get to somebody
заебаться = to get fed up with something, to get tired of something,
наебать = to betray somebody?,
наебаться = to get saturated by something, to get enough of something
наебнуть = to cheat somebody,
наебнуться = for the fun of it, just do it,
объебать = to get a piece of something, to get familiar with something?,
объебаться = to get into something nasty, to fuck up,
остоебенить = blow your mind (not sure about this one)
остоебеть = ?,
отъебать = to fuck with something/somebody, to fight, to degrade somebody,
отъебаться = get rid of something, for example to pass exams or get rid of obligations,
переебать = (maybe) to understand something,
переебаться = to fuck over something, to get something done,
поебать = do something (depending on context),
поебаться = to do nonsense, try hard to no avail, fool around,
подъебать = ?,
подъебаться = ?,
подъебнуть = to make a joke, for example on April 1st, even flirt?,
разъебать = to (accidentally or purposefully) break something,
разъебаться = to clear a relationship, to bring everything to order?,
съебать = to get away (maybe in a cool way),
съебаться = to fuck off, to stop annoying,
уебать = to run away (maybe after stealing something)
I don't quite understand the purpose of this list. It contains potentially malicious input, but also emoticons based on Unicode characters that are completely harmless and used in every second post on Reddit.
I made the list while I was a Software QA Engineer at Apple, since there were a bunch of fun Unicode strings causing particular issues there, which gave me the idea.
It's essentially a test suite for character encoding all throughout your application. If you input all those strings (e.g. send chat message) and they arrive incorrectly at some other end (e.g. other user receiving chat message) then there's a problem somewhere.
There's a lot of ways to mishandle unicode. Checking that non-BMP characters work, that emojis in various sections work, and that emojis with modifiers work are all good tests.
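As a small illustration of those three cases, here is a Python sketch that compares code point counts with UTF-16 code unit counts and checks an encode/decode round trip. The sample strings and helper names are my own picks for illustration, not taken from the list itself.

```python
SAMPLES = {
    "bmp": "\u00E9",                                # é, inside the Basic Multilingual Plane
    "non_bmp": "\U0001F600",                        # 😀, outside the BMP
    "emoji_with_modifier": "\U0001F44D\U0001F3FD",  # 👍 followed by a skin tone modifier
}

def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to store s in UTF-16."""
    return len(s.encode("utf-16-le")) // 2

def survives_round_trip(s: str, encoding: str) -> bool:
    """True if encoding and then decoding returns the original string."""
    return s.encode(encoding).decode(encoding) == s

for name, s in SAMPLES.items():
    print(name, len(s), utf16_units(s), survives_round_trip(s, "utf-8"))
```

Anything that truncates or indexes by UTF-16 code units can cut the non-BMP samples in half, which is one common way these strings end up arriving mangled.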
It's useful for testing a variety of things that take text/string inputs, such as forms in web applications. It's a handy tool for testing a site (preferably one you have permission to test) for XSS or SQL injection, character encoding problems, or even just form length problems.
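For what it's worth, here is a minimal sketch of that kind of check using Python's built-in `sqlite3` module and an in-memory database. The table name, the `round_trip` helper, and the handful of sample strings are assumptions for illustration; a real run would load the full list from the repository and go through your application's actual input path.

```python
import sqlite3

# A few illustrative inputs only; not the real list.
NAUGHTY = [
    "'; DROP TABLE messages; --",
    "<script>alert(1)</script>",
    "\uFDFD",
    "\U0001F44D\U0001F3FD",
    "",
]

def round_trip(strings):
    """Store each string via a parameterized query; return those that come back changed."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
    failures = []
    for s in strings:
        cur = conn.execute("INSERT INTO messages (body) VALUES (?)", (s,))
        row = conn.execute(
            "SELECT body FROM messages WHERE id = ?", (cur.lastrowid,)
        ).fetchone()
        if row[0] != s:
            failures.append(s)
    conn.close()
    return failures

print(round_trip(NAUGHTY))
```

An empty result means every string survived storage unchanged. The `?` placeholders are the important part: the injection-style strings are stored as plain data instead of being spliced into the SQL text.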
The latest commit message of the README is "Merge branch 'master' into master" [1]. As someone who doesn't do git, what does that even mean? Does git allow multiple branches with the same name?
The GitHub flavor of git essentially has two parts:
Repository (i.e., minimaxir/big-list-of-naughty-strings), and branch (i.e., master)
In this case, someone merged eliabieri/big-list-of-naughty-strings at master to minimaxir/big-list-of-naughty-strings at master via a pull request [1].
Jimmy Clitheroe - the Clitheroe Kid. That brings back some memories. It's also nice to see that England is suitably represented in the place names, obviously Scunthorpe is the classic. I'll tender Somerset for first amongst equals for daft and downright odd place names.
> "If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you."
Neat. I recently found that googling the Japanese Post Office emoji results in a totally borked SERP (cross-browser, desktop, including desktop mode on Android Chrome). I assume there are other characters as well.