Pretraining and Finetuning
What is Pretraining
Pretraining refers to training a base model on large-scale datasets, letting the model learn general knowledge and patterns. This base model can then be used for various downstream tasks.
Analogy: Pretraining is like letting a person extensively read various books first, learning general knowledge and thinking methods, then later learning specific domain expertise.
Pretraining Data
Large language models' pretraining data typically includes:
- Web text (Wikipedia, news sites, blogs, etc.)
- Books
- Code repositories
- Papers
- Social media content
Data Scale: Modern large language models are typically pretrained on trillions of tokens (LLaMA 2, for example, was trained on about 2 trillion tokens).
Pretraining Objectives
The main objectives of the pretraining phase are for the model to learn:
- Language Knowledge: Grammar, vocabulary, semantics
- World Knowledge: Facts, concepts, relationships
- Reasoning Ability: Logic, causality, analogy
- General Capabilities: Understanding, generation, reasoning, etc.
Pretraining Objective Functions
Masked Language Modeling (MLM)
Representative Model: BERT
Method:
- Randomly mask some words in input sequence
- Let model predict these masked words
- Model needs to understand context to accurately predict
Example:
Input: The cat sat on the [MASK]
Prediction: mat
Advantages:
- Can utilize both directions of context
- Suitable for understanding tasks
Disadvantages:
- Not suitable for generation tasks
- Lower training efficiency (only the masked positions, typically about 15% of tokens, contribute to the loss)
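The masking procedure above can be sketched in a few lines. This is a toy illustration, not BERT's exact recipe (BERT also sometimes keeps or randomly replaces the selected token); the `MASK_ID` value and the `-100` "ignore" label are conventions borrowed from common implementations:

```python
import random

MASK_ID = 103  # hypothetical id for the [MASK] token

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Randomly replace tokens with MASK_ID and record labels.

    Labels are -100 (a conventional "ignore" value) everywhere except
    the masked positions, where they hold the original token to predict.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            masked.append(MASK_ID)   # hide the token from the model
            labels.append(tid)       # ...but remember it as the target
        else:
            masked.append(tid)
            labels.append(-100)      # this position is not scored
    return masked, labels

# "The cat sat on the mat" as toy token ids
masked, labels = mask_tokens([0, 1, 2, 3, 0, 4], mask_prob=0.3)
```

The model then sees `masked` as input and is trained to recover the original tokens at exactly the positions where `labels` is not -100.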
Causal Language Modeling (CLM)
Representative Models: GPT series, Claude
Method:
- Given preceding words, predict next word
- Can only utilize preceding context
- Autoregressive generation
Example:
Input: The cat sat on the
Prediction: mat
Advantages:
- Suitable for generation tasks
- High training efficiency (every position in the sequence contributes to the loss)
Disadvantages:
- Can only utilize unidirectional context
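The autoregressive objective can be made concrete by expanding a sequence into its training pairs. A minimal sketch (toy token ids, not a real tokenizer):

```python
def causal_lm_pairs(token_ids):
    """Turn a token sequence into (context, next-token) training pairs.

    At each step the model may look only at the preceding tokens --
    the left-to-right autoregressive objective used by GPT-style models.
    """
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

# "The cat sat on the mat" as toy token ids
pairs = causal_lm_pairs([0, 1, 2, 3, 0, 4])
# first pair: ([0], 1) -- given "The", predict "cat"
```

In practice all of these predictions are computed in one forward pass with a causal attention mask, which is where the training-efficiency advantage comes from.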
What is Finetuning
Finetuning refers to further training a pretrained model using specific task data, making the model adapt to specific tasks or domains.
Analogy: Finetuning is like a person who has extensively read, now learning specific domain expertise (like medicine, law, etc.).
Types of Finetuning
Full Finetuning: Update all parameters
- Advantages: Usually the best results
- Disadvantages: High cost, requires large amounts of data
Partial Finetuning: Only update some parameters
- Advantages: Lower cost
- Disadvantages: Results may be slightly worse
Parameter-Efficient Finetuning: Only update a small number of parameters
- Representative methods: LoRA, Prefix Tuning, Adapter
- Advantages: Very low cost, minimal data requirements
- Disadvantages: Results may be slightly worse
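The three regimes above differ mainly in what fraction of parameters gets updated. A toy sketch (the component names and parameter counts below are invented for illustration):

```python
# hypothetical parameter counts per component of a toy model
model = {"embedding": 50_000, "layer_0": 120_000, "layer_1": 120_000, "head": 50_000}

def trainable_fraction(trainable_names):
    """Fraction of all parameters that a finetuning run actually updates."""
    total = sum(model.values())
    trainable = sum(size for name, size in model.items() if name in trainable_names)
    return trainable / total

full = trainable_fraction(set(model))    # full finetuning: everything (1.0)
partial = trainable_fraction({"head"})   # partial: only the output head (~0.15)
```

Parameter-efficient methods like LoRA push this fraction below 1% by adding a small number of new trainable parameters instead of touching existing ones.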
Instruction Tuning
Instruction tuning is a special finetuning method that uses "instruction-response" pairs to train the model, enabling it to understand and follow natural language instructions.
Instruction Tuning Data
Instruction tuning data typically contains:
- Instruction: Natural language description of task
- Input: Specific input for task
- Output: Expected output
Example:
Instruction: Translate the following sentence to English
Input: 我喜欢人工智能
Output: I like artificial intelligence
Role of Instruction Tuning
- Improve Instruction-following Ability: Let model understand and execute natural language instructions
- Improve Conversation Ability: Make model better at multi-round conversations
- Enhance Generalization: Let model handle unseen instructions
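For training, the instruction/input/output fields are typically concatenated into a single string. A sketch using an Alpaca-style template (the "### ..." headers are one common convention, not a requirement of any particular model):

```python
record = {
    "instruction": "Translate the following sentence to English",
    "input": "我喜欢人工智能",
    "output": "I like artificial intelligence",
}

def format_example(rec):
    """Concatenate the three fields into one supervised training string."""
    return (
        f"### Instruction:\n{rec['instruction']}\n\n"
        f"### Input:\n{rec['input']}\n\n"
        f"### Response:\n{rec['output']}"
    )

text = format_example(record)
# during training, the loss is usually applied only to the response part
```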
RLHF (Reinforcement Learning from Human Feedback)
RLHF is a method using human feedback to optimize models, making model outputs better align with human preferences.
RLHF Steps
- Collect Human Preference Data: Have humans rank different model outputs
- Train Reward Model: Train a reward model based on human preference data
- Optimize with Reinforcement Learning: Use reward model as reward signal to optimize language model
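Step 2 is commonly implemented with a pairwise (Bradley-Terry style) loss: the reward model should score the human-preferred response above the rejected one. A minimal numerical sketch of that loss:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the chosen
    response out-scores the rejected one by a wide margin."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# the loss falls as the reward margin grows
assert preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0)
```

Minimizing this loss over many ranked pairs teaches the reward model to reproduce human preference orderings, which step 3 then uses as the reinforcement-learning reward signal.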
Role of RLHF
- Improve Output Quality: Make model outputs more helpful, honest, harmless
- Align with Human Preferences: Make model outputs better match human values
- Reduce Harmful Outputs: Lower probability of model generating harmful content
Parameter-Efficient Finetuning Methods
LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient finetuning method that adapts to new tasks by adding low-rank matrices rather than updating all parameters.
Principle:
- Add low-rank matrices beside original weight matrices
- Only train these low-rank matrices
- Original weights remain unchanged
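The principle above can be sketched numerically. Toy sizes only; in real models the low-rank update is typically applied to the attention weight matrices, with the scaling factor alpha/r as a tunable hyperparameter:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                          # toy hidden size and low rank

W = rng.normal(size=(d, d))          # pretrained weight, kept frozen
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def lora_forward(x, alpha=16.0):
    """Frozen path plus scaled low-rank update: (W + (alpha/r) * B A) x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# zero-initializing B means the adapted model starts exactly at the base model
assert np.allclose(lora_forward(x), W @ x)
# the adapter trains far fewer parameters than the full weight matrix
assert A.size + B.size < W.size
```

Because only A and B are saved per task, swapping tasks means swapping a tiny pair of matrices while the frozen base model is shared.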
Advantages:
- Few parameters (usually less than 1% of original model)
- Low training cost
- Can easily switch between different tasks
Applications:
- Adapt to specific domains
- Personalize models
- Rapid experimentation
Other Methods
- Prefix Tuning: Prepend trainable prefix vectors to the activations at each layer
- Adapter: Insert small trainable adapter layers between the model's existing layers
- Prompt Tuning: Learn soft prompt embeddings at the input, leaving the model weights unchanged
Challenges and Limitations of Finetuning
Challenges
- Data Requirements: Need high-quality domain data
- Computational Resources: Even parameter-efficient finetuning requires some resources
- Catastrophic Forgetting: May forget knowledge learned during pretraining
- Overfitting: Easy to overfit on small datasets
Limitations
- Performance Limitations: Parameter-efficient finetuning may not match the quality of full finetuning
- Domain Drift: Need frequent finetuning when domain changes quickly
- Evaluation Difficulty: How to evaluate finetuning effects is a challenge
Practical Applications
Pretrained Models
- GPT-4: OpenAI's general large language model
- Claude: Anthropic's AI assistant
- LLaMA: Meta's openly released model family
Finetuned Models
- Code Llama: Meta's LLaMA model further trained specifically for code
- Med-PaLM: Google's PaLM model finetuned for the medical domain
- BloombergGPT: Bloomberg's financial-domain model (actually pretrained from scratch on a mix of financial and general data, rather than finetuned from an existing base model)
Summary
Pretraining and finetuning are core processes of large language model development:
- Pretraining: Learn general knowledge and capabilities on large-scale data
- Finetuning: Adapt to specific tasks or domains
- Instruction Tuning: Improve instruction-following ability
- RLHF: Align with human preferences
Understanding pretraining and finetuning helps:
- Choose appropriate models
- Decide whether finetuning is needed
- Understand model capabilities and limitations
Next Steps
- Context Window - Learn how models process long texts
- Tokenization - Learn how models process text