Chapter 11: Big Language

Photo by Jacqueline Brandwayn on Unsplash

Advances in natural language processing (NLP) and Big Data techniques have allowed us to learn about the human mind through one of its richest outputs – language. In this chapter, we introduce the field of computational linguistics and work through examples of how to find natural language data and how to interpret the complexities within it. The chapter discusses the major state-of-the-art methods in NLP and how they can be applied to psychological questions, including statistical learning, n-gram models, word embedding models, large language models (LLMs), topic modeling, and sentiment analysis. The chapter concludes with ethical questions about the proliferation of chat "bots" that pervade our social networks and the importance of balanced training sets for NLP models.

Learn how large language models work

Check out this blog post by Microsoft data scientist Andreas Stoffelbauer on how LLMs work.
Check out this blog by Jay Alammar as well; he also has a short YouTube tutorial series and a printed book on LLMs.

Try out examples of LLMs and natural language processing

Here is a list of current LLMs and LLM tools to try. Because the number of LLMs is exploding, these systems are constantly changing (and the space is getting a little crowded!):

ChatGPT - one of the best-known LLMs, developed by OpenAI
Claude - developed by Anthropic
Gemini - developed by Google
Mistral
AI Dungeon - have an AI-generated choose-your-path adventure
Dishgen - use AI for meal planning

Visualize N-grams, embeddings, and analogies

Visualize trends in N-gram usage in text across time here with the Google Books Ngram Viewer
Visualize word embeddings and analogies in a geometric space with this word embedding demo by Prof. Dave Touretzky.
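
If you would like to see the ideas behind these two tools in code, here is a minimal Python sketch. It is not how the Ngram Viewer or the embedding demo are implemented. The first function counts word n-grams in a short text. The second solves an analogy ("man is to king as woman is to ?") by vector arithmetic and cosine similarity. The two-dimensional vectors in `toy_vectors` are invented purely for illustration (real embeddings are learned from large text corpora and have hundreds of dimensions), and the function names are my own.

```python
from collections import Counter
import math


def ngram_counts(text, n=2):
    """Count word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical toy "embeddings", made up for this example.
# Dimension 1 loosely stands for royalty, dimension 2 for gender.
toy_vectors = {
    "king": (0.9, 0.9),
    "queen": (0.9, -0.9),
    "man": (0.1, 0.9),
    "woman": (0.1, -0.9),
    "apple": (-0.8, 0.0),
}


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    target = tuple(
        vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    # Exclude the three query words so the answer is a new word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


if __name__ == "__main__":
    counts = ngram_counts("the cat sat on the mat the cat slept", n=2)
    print(counts.most_common(1))
    print(analogy("man", "king", "woman", toy_vectors))
```

With these toy vectors the analogy lands on "queen" because the vectors were constructed so that the "royalty" and "gender" directions are cleanly separated; learned embeddings are much noisier, which is part of what makes the demo above fun to explore.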

Is the text you're reading AI-generated?

Interestingly, since I wrote about bots fighting bots for the book, the two main services that used AI to spot fake AI-generated reviews (Review Meta and FakeSpot) have been shut down. It will be interesting to see whether a new service fills this void. One option seems to be Fake Find, an AI-powered service for spotting fake (AI-generated) reviews, though I have not tested its quality.
There are also several services aimed at detecting AI-generated text more broadly. The only one I have used is GPTZero, but the space seems to be getting crowded! (For the time being, I will not list others here because I haven't tested their quality.)

How can biased training sets impact NLP tools?

Tatman (2017) reports a dialect-based bias in AI-based automatic captioning systems. She tested captioning on videos of the "accent tag challenge" and found that automatic captioning was worst for Scottish dialects and for women's voices, even though those are equally valid examples of English speech. See an example here: