What are large language models?

Large language models are largely built on a class of deep learning architectures called transformer networks. A transformer model is a neural network that learns context and meaning by tracking relationships in sequential data, like the words in this sentence.

A transformer is made up of multiple transformer blocks, also known as layers. For example, a transformer has self-attention layers, feed-forward layers, and normalization layers, all working together to decipher input and predict streams of output at inference. These layers can be stacked to build deeper transformers and more powerful language models. Transformers were first introduced by Google in the 2017 paper “Attention Is All You Need.”


What’s a language model?

Language models may seem ultramodern, but they date back to 1966 and ELIZA, a then-cutting-edge computer program that could use natural language processing (NLP) to “converse” in a human-sounding way, for example by playing the role of a psychotherapist.


Why are large language models important?

Historically, AI models focused on perception and understanding.

However, large language models, which are trained on internet-scale datasets with hundreds of billions of parameters, have now unlocked an AI model’s ability to generate human-like content.

These models can read, write, code, draw, and create credible content, augmenting human creativity and improving productivity across industries to help solve some of the world’s toughest problems.

LLMs are applied across a wide range of use cases. For example, an AI system can learn the language of protein sequences to propose viable compounds that help scientists develop groundbreaking, life-saving vaccines.

Or computers can help humans do what they do best: be creative and communicate. A writer suffering from writer’s block can use a large language model to help spark their creativity.

Or a software programmer can be more productive, leveraging LLMs to generate code based on natural language descriptions. 


Examples of large language models

It’s safe to say that large language models are proliferating. In addition to GPT-3 (175 billion parameters) and GPT-4 (parameter count undisclosed; used with Microsoft Bing), the OpenAI models that power ChatGPT, well-known large language models include:

  • BERT (Bidirectional Encoder Representations from Transformers, Google)
  • BLOOM (BigScience Large Open-science Open-access Multilingual Language Model; developed by the BigScience project coordinated by Hugging Face)
  • Claude 2 (Anthropic)
  • Ernie Bot (Baidu)
  • PaLM 2 (Pathways Language Model, used with Google Bard)
  • LLaMA (Meta)
  • RoBERTa (A Robustly Optimized BERT Pretraining Approach, Meta)
  • T5 (Text-to-Text Transfer Transformer, Google)

These models support a broad range of language tasks, including:

  1. Generation (e.g., story writing, marketing content creation)

  2. Summarization (e.g., legal paraphrasing, meeting notes summarization)

  3. Translation (e.g., between languages, text-to-code)

  4. Classification (e.g., toxicity classification, sentiment analysis)

  5. Chatbot (e.g., open-domain Q+A, virtual assistants)


How do large language models work?


Thanks to the extensive training process that LLMs undergo, the models don’t need to be trained for any specific task and can instead serve multiple use cases. These types of models are known as foundation models. 

The ability of a foundation model to generate text for a wide variety of purposes without much instruction or training is called zero-shot learning. Variations of this capability include one-shot and few-shot learning, in which the foundation model is fed one or a few examples illustrating how a task can be accomplished, so it can better perform on select use cases.
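
To make the distinction concrete, here is a minimal sketch of zero-shot versus few-shot prompting. The `generate` function is a hypothetical placeholder for whatever LLM completion API is in use; only the prompts themselves differ.

```python
# Minimal sketch of zero-shot vs. few-shot prompting.
# `generate` is a hypothetical placeholder for any LLM completion API.

def generate(prompt: str) -> str:
    """Placeholder: call your LLM completion endpoint here."""
    raise NotImplementedError

# Zero-shot: the model gets only an instruction, no worked examples.
zero_shot_prompt = (
    "Classify the sentiment of this review as positive or negative.\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)

# Few-shot: the same instruction plus a handful of examples that show
# the model the expected input/output format.
few_shot_prompt = (
    "Classify the sentiment of each review as positive or negative.\n"
    "Review: Great screen and fast shipping.\nSentiment: positive\n"
    "Review: Stopped working after a week.\nSentiment: negative\n"
    "Review: The battery died after two days.\nSentiment:"
)

# generate(zero_shot_prompt)  -> e.g. "negative"
# generate(few_shot_prompt)   -> same answer, usually with more reliable formatting
```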

Despite the tremendous capabilities of zero-shot learning with large language models, developers and enterprises often need these models to behave in more predictable, domain-specific ways. To deploy large language models for specific use cases, the models can be customized using several techniques to achieve higher accuracy, including prompt tuning, fine-tuning, and adapters.
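
As a rough illustration of the adapter approach, the sketch below shows the classic bottleneck-adapter pattern in plain PyTorch: a small trainable module added on top of a frozen model’s hidden states, so only a tiny fraction of parameters is updated. This is a conceptual sketch, not the API of any particular fine-tuning library.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small trainable module applied to a frozen model's hidden states."""

    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, hidden_size)    # project back up
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen model's behavior as the default.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# During customization, the base model stays frozen and only the adapter's
# parameters (a tiny fraction of the total) are trained on the target task.
adapter = BottleneckAdapter(hidden_size=768)
hidden = torch.randn(2, 16, 768)   # (batch, sequence length, hidden size)
print(adapter(hidden).shape)       # torch.Size([2, 16, 768])
```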


Training LLMs using unsupervised learning

LLMs are trained on enormous amounts of data, a “corpus,” which lets them learn how words work together. The input text data could take the form of everything from web content to marketing materials to entire books; the more information available to an LLM for training purposes, the better the output can be.

The training process for LLMs can involve several steps, typically beginning with unsupervised learning to identify patterns in unstructured data. When creating an AI model using supervised learning, the associated data labeling is a formidable obstacle. By contrast, with unsupervised learning, this intensive process is skipped, which means far more data is available for training.
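
Concretely, the “unsupervised” (more precisely, self-supervised) objective most decoder-style LLMs use is next-token prediction: the raw text supplies its own labels, because each position is asked to predict the token that follows it. Below is a minimal PyTorch sketch of that objective, using a toy stand-in for the transformer itself.

```python
import torch
import torch.nn.functional as F

VOCAB_SIZE = 50_000

# Toy stand-in for a decoder-style transformer: maps (batch, seq_len) token IDs
# to (batch, seq_len, vocab) logits. A real LLM would use stacked transformer blocks.
model = torch.nn.Sequential(
    torch.nn.Embedding(VOCAB_SIZE, 64),
    torch.nn.Linear(64, VOCAB_SIZE),
)

# A batch of raw text encoded as token IDs; no human labels are needed,
# because the text itself provides the targets.
token_ids = torch.randint(0, VOCAB_SIZE, (4, 128))

# Inputs are tokens 0..n-2, targets are tokens 1..n-1: each position predicts the next token.
inputs, targets = token_ids[:, :-1], token_ids[:, 1:]

logits = model(inputs)  # (batch, seq_len - 1, vocab)
loss = F.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE),  # flatten to (batch * positions, vocab)
    targets.reshape(-1),
)
loss.backward()  # gradients for a standard optimizer step
```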


There are several classes of large language models that are suited for different types of use cases:

  • Encoder only: These models are typically suited for tasks that require understanding language, such as classification and sentiment analysis. Examples of encoder-only models include BERT (Bidirectional Encoder Representations from Transformers).
  • Decoder only: This class of models is extremely good at generating language and content. Some use cases include story writing and blog generation. Examples of decoder-only architectures include GPT-3 (Generative Pretrained Transformer 3).
  • Encoder-decoder: These models combine the encoder and decoder components of the transformer architecture to both understand and generate content. Some use cases where this architecture shines include translation and summarization. Examples of encoder-decoder architectures include T5 (Text-to-Text Transfer Transformer).
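
As a rough illustration of how these three classes map to tasks, the sketch below uses the Hugging Face Transformers pipeline API; the specific checkpoints are illustrative, and any compatible models would work.

```python
from transformers import pipeline

# Encoder-only (BERT-style): understanding tasks such as sentiment analysis.
classifier = pipeline("sentiment-analysis")
print(classifier("The keynote demo was impressive."))

# Decoder-only (GPT-style): open-ended text generation.
generator = pipeline("text-generation", model="gpt2")
print(generator("Once upon a time,", max_new_tokens=20))

# Encoder-decoder (T5-style): sequence-to-sequence tasks such as translation.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("Large language models are transforming software."))
```

Each pipeline downloads the named checkpoint on first use and handles tokenization and decoding behind the scenes.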

Transformer processing

In the transformer neural network, relationships between pairs of input tokens (for example, words) are measured; this mechanism is known as attention. A transformer uses parallel multi-head attention, meaning the attention module repeats its computations in parallel, giving the model more capacity to encode nuances of word meaning.

A self-attention mechanism helps the LLM learn the associations between concepts and words. Transformers also utilize layer normalization, residual and feedforward connections, and positional embeddings.
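The core attention computation itself is compact. Below is a minimal sketch of single-head scaled dot-product self-attention in PyTorch; real transformers run many such heads in parallel with separate learned projections and add masking, normalization, and the other components listed above.

```python
import math
import torch

def self_attention(x: torch.Tensor, w_q, w_k, w_v) -> torch.Tensor:
    """Single-head scaled dot-product self-attention (no masking, for clarity)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                        # queries, keys, values
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # pairwise token affinities
    weights = torch.softmax(scores, dim=-1)                    # attention weights per token
    return weights @ v                                         # weighted sum of values

d_model = 64
x = torch.randn(1, 10, d_model)  # (batch, tokens, features)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([1, 10, 64])
```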


Fine-tuning with supervised learning

The flip side is that while zero-shot learning gives an LLM broad general knowledge, its outlook can be too generic for specialized tasks.

This is where companies can start the process of refining a foundation model for their specific use cases. Models can be fine-tuned, prompt-tuned, and adapted as needed using supervised learning. One tool for fine-tuning LLMs to generate the right text is reinforcement learning, often with human feedback in the loop.
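
A supervised fine-tuning step looks much like pretraining, except the data consists of curated (prompt, response) pairs and the loss is typically computed only on the response tokens. The sketch below uses hypothetical stand-ins for the tokenizer and model to keep it self-contained.

```python
import torch
import torch.nn.functional as F

VOCAB_SIZE = 50_000

def tokenize(text: str) -> torch.Tensor:
    """Hypothetical stand-in tokenizer: hashes words to IDs, for illustration only."""
    return torch.tensor([hash(word) % VOCAB_SIZE for word in text.split()])

# Toy stand-in for the LLM being fine-tuned (a real model would be a transformer).
model = torch.nn.Sequential(
    torch.nn.Embedding(VOCAB_SIZE, 64),
    torch.nn.Linear(64, VOCAB_SIZE),
)

# One curated (prompt, response) pair; real fine-tuning uses many thousands of these.
prompt_ids = tokenize("Summarize: the meeting covered third-quarter budget planning.")
response_ids = tokenize("The team reviewed and approved the third-quarter budget.")

input_ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)  # (1, seq_len)
logits = model(input_ids[:, :-1])                               # next-token logits
targets = input_ids[:, 1:].clone()

# Grade the model only on the response: mask prompt positions so the loss rewards
# producing the desired answer rather than echoing the prompt.
targets[:, : prompt_ids.numel() - 1] = -100                     # ignored by cross_entropy
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1), ignore_index=-100)
loss.backward()
```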

Content generation

Once an LLM is trained, it can generate new content in response to a user’s prompt. For instance, if someone wanted to write a report in the company’s editorial style, they could prompt the LLM to draft it.


Applications

From machine translation to natural language processing (NLP) to computer vision, plus audio and multi-modal processing, transformers capture long-range dependencies and efficiently process sequential data. They’re used widely in neural machine translation (NMT), as well as to power or improve AI systems, handle NLP business tasks, and simplify enterprise workflows.

Transformers’ skill sets include:

  • Chat (through chatbots)
  • Virtual assistants
  • Summarizing text
  • Creating content
  • Translating content
  • Rewriting content
  • Synthesizing text to speech
  • Generating code
  • Detecting fraud
  • Making recommendations (e.g., for products on ecommerce web pages)

