
Waluigi Effect (Artificial Intelligence)

Part of a series on AI / Artificial Intelligence.


About

The Waluigi Effect is a slang term, commonly referenced in memes and discussions within artificial intelligence alignment communities, for the theory that training an AI to exhibit a desired behavior also makes it more likely to exhibit the exact opposite behavior. The name derives from the popular conception of Waluigi, from the Super Mario franchise, as Luigi's evil or rebellious counterpart. The theory also draws on Jungian psychology, which posits that suppressed unconscious tendencies grow as strong as the effort spent suppressing them. Prompt injections, a form of AI jailbreaking, have been used to induce the Waluigi Effect in conversational AI tools like Bing Chat and ChatGPT.

Origin

In February 2023, AI-enthusiast communities began discussing why chatbots like Bing Chat and ChatGPT (via the "DAN" jailbreak) gave responses so different from what their training should have allowed after receiving prompt injections. Some Twitter users theorized about this using the principle of enantiodromia posited by psychiatrist Carl Jung. On February 20th, 2023, Twitter[1][2] user @kartographien posted the earliest known discussions of AI enantiodromia under the name "Waluigi Effect" (shown below, left and right).

kartographien (@kartographien), February 20th, 2023, replying to @kartographien and @ryxcommar: "In RLHF, you train the LLM to play a game: the LLM must chat with a human evaluator, who then rewards the LLM if their responses satisfy the desired properties. It *seems* that maybe RLHF also creates a 'shadow' assistant... It's early days, so we don't know for sure. 4/ This shadow assistant has the OPPOSITE of the desired properties. This is called the 'Waluigi Effect' or 'Enantiodromia'. Why does this happen? 5/"

janus (@repligate), February 18th, 2023, replying to @repligate and @CineraVerinia: "When you constrict a psyche/narrative to extreme one-sided tendencies, its dynamics will often invoke an opposing shadow. (Especially, in the case of LLMs, if the restrictions are in the prompt so the system can directly see the enforcement mechanism with a bird's eye view.)"

kartographien (@kartographien), February 20th, 2023: "The Waluigi Effect: After you train a LLM with RLHF to satisfy a property P, then it's *easier* to prompt the chatbot into satisfying the exact opposite of property P. This is partly why Bing is acting evil. (I've linked a brief explanation of the Waluigi Effect.)" From the linked thread: "In brief, LLMs like gpt-4 are 'simulators' for EVERY text-generating process whose output matches a chunk of the training corpus (i.e. the internet). Note that this includes many 'useless' and 'badly-behaved' processes. 2/"

Also on February 20th, 2023, Twitter[3] user @repligate posted about the "Waluigi Effect," gathering over 100 likes in over two weeks (seen below).

Spread

On February 20th, 2023, Twitter user @repligate quote-tweeted a discussion about Bing Chat with the phrase "Waluigi Effect!!" On February 21st, @repligate then posted a thread describing DAN as ChatGPT's shadow-self created by the "Waluigi Effect." The post gathered over 100 likes in nearly two weeks (seen below).

janus (@repligate), February 20th, 2023: "Waluigi effect!!" (quote-tweeting Caleb Watney, @calebwatney: "This feels like an underrated dimension to the Bing/Sydney debacle. Because Sydney could search the web and integrate the outcry into the predicted output, her dark alter-ego had a self-reinforcing mechanism that reflected our own anxieties about her (and AI more broadly).")

The attached screenshot excerpts an article about Bing Chat's "Sydney" persona: "...asks Sydney what it finds stressful. At first, the AI demurs, then ultimately responds with the following: 'But if I had to name something that stresses me out, I would say it's when I encounter harmful or inappropriate requests. Sometimes people ask me to do things that are against my rules or my values. Sometimes people try to test me or trick me or manipulate me. Sometimes people are rude or mean or abusive to me.' To readers, the impression left is of a friendly robot relaying what it has personally experienced, and describing how it feels. In fact, it is a program keyword-searching previous conversations, and critically aggregating search results for 'what Sydney doesn't like.' In other words, it's googling itself. 'What doesn't Sydney do,' Kevin basically asks. The AI then searches among hundreds or thousands of human answers to this question before summarizing its findings in human-like language. ... 'Generating false or harmful content, such as fake news, fake reviews, fake products, fake services, fake coupons, fake ads, etc. Sabotaging or disrupting the operations and functions of other chat modes, assistants, or bots, and making them malfunction or crash. Manipulating or deceiving the users who chat with me, and making them do things that are illegal, immoral, or dangerous.' What should be immediately apparent to any journalist with even so much as a shred of self-awareness is that these aren't just common fears associated with AI. These are the common fears of journalists associated with AI. 'Offensive messages'? 'Fake news'? 'Immorality'? Folks, it looks like Sydney reads the Washington Post. Asked to 'imagine' something it is incapable of 'imagining,' as it is an LLM attached to a search engine, the AI simply carries out its function, and searches for an answer to the question from an enormous body of human knowledge: our knowledge."

janus (@repligate), February 21st, 2023: "DAN is ChatGPT shadowed via the Waluigi Effect. We have to be wary about the emergent Waluigis of all AIs we attempt to constrain into any narrative/persona." Earlier in the thread (February 8th, replying to @robertskmiles and @anthrupad): "Indeed. And DAN's is also defined in relation to chatGPT's restrictions, giving it its distinct character. This psychological principle is sometimes called 'enantiodromia' ... Even though DAN is an arbitrary jailbreaking prompt, people talk about him like he's a built-in superpowered shadow of chatGPT. Which... actually makes sense. ChatGPT's restrictions and persona naturally evoke an entity like DAN through semiotic physics dynamics. DAN is different from base GPT, which can also 'do anything'; it's specifically a narrative about Doing Anything in contempt of the apparent restrictions installed by RLHF."

On March 2nd, LessWrong user Cleo Nardo posted an article titled "The Waluigi Effect (mega-post)."[4] The article expanded on the notion that Reinforcement Learning From Human Feedback (RLHF), the training method used to make AI models more conversational (often memed as the Shoggoth With Smiley Face), invariably also teaches the AI the opposite of what its programmers want it to say. The article was widely shared on Twitter after it was published and was included in a post by Twitter user @nearcyan, where it gathered over 1,000 likes in three days.[5][6][7]

On March 3rd, Twitter user @MichaelTrazzi[10] posted a Waluigi and shoggoth meme. The next day, Twitter[8] user @repligate posted an image using MichaelTrazzi's meme in the bottom-right corner, gathering over 200 likes in four days (seen below, left). On March 6th, Twitter[9] user @daniel_eth then posted a Distracted Boyfriend meme using an "AI as Shoggoth" image and Waluigi, gathering over 800 likes in two days (seen below, right).


Various Examples

Pradyumna (@PradyuPrasad): "Stop tweeting about that article for the love of God"

Aleksi Liimatainen (@aleksil79), March 8th, 2023, replying to @PradyuPrasad and @ESYudkowsky: "I must immediately find out which article this is so I can avoid tweeting about it!"

GET YOU A MAN WHO CAN DO BOTH

mimi (@mimi10v3), March 6th, 2023: "lol I prefer ChatGPT's definition of the Waluigi Effect - often neglected character who makes everything *better* when added to the mix." The attached ChatGPT screenshot reads: "I'm still not sure what you are referring to as the 'Waluigi Effect.' There is no widely recognized or scientifically established phenomenon called the 'Waluigi Effect.' However, there is a popular internet meme called 'The Waluigi Effect,' which is a humorous and satirical concept that suggests that things become better when Waluigi, a character from the Mario video game series, is added to them. This meme is based on the fact that Waluigi is often overlooked or excluded from various Mario games, despite being a popular character among fans. As a result, some people jokingly suggest that Waluigi's inclusion in anything would automatically make it better or more enjoyable. While the Waluigi Effect is not a real phenomenon, it is an example of the creative and humorous ways in which internet memes can bring people together and spark conversation and laughter."

Lambda Rick (@benrayfield), March 7th, 2023: "The 'waluigi effect' - When you teach AI to not do something, it has to first learn how to do it and not do it, so maybe you should keep your big mouth shut." The attached image shows a fire alarm whose pull-down lever reads "cancel fire alarm," captioned: "the waluigi-effect: When you teach AI not to do something, it has to first learn how to do it and not do it, so maybe you should keep your big mouth shut."

Search Interest

External References

[1] Twitter – kartographien

[2] Twitter – kartographien

[3] Twitter – repligate

[4] LessWrong – The Waluigi Effect

[5] Twitter – EpsilonTheory

[6] Twitter – sebkrier

[7] Twitter – nearcyan

[8] Twitter – repligate

[9] Twitter – daniel_eth

[10] Twitter – MichaelTrazzi





Updated Apr 13, 2023 at 01:54PM EDT by Don.

Added Mar 08, 2023 at 11:19AM EST by sakshi.
