Pretraining and Finetuning
What is Pretraining
Pretraining refers to training a base model on large-scale datasets, letting the model learn general knowledge and patterns. This base model can then be used for various downstream tasks.
Analogy: Pretraining is like letting a person extensively read various books first, learning general knowledge and thinking methods, then later learning specific domain expertise.
Pretraining Data
Large language models' pretraining data typically includes:
- Web text (Wikipedia, news sites, blogs, etc.)
- Books
- Code repositories
- Papers
- Social media content
Data Scale: Modern large language models are typically pretrained on trillions of tokens (LLaMA 2, for example, was trained on about 2 trillion tokens).
Pretraining Objectives
The main objectives of the pretraining phase are for the model to learn:
- Language Knowledge: Grammar, vocabulary, semantics
- World Knowledge: Facts, concepts, relationships
- Reasoning Ability: Logic, causality, analogy
- General Capabilities: Understanding, generation, reasoning, etc.
Pretraining Objective Functions
Masked Language Modeling (MLM)
Representative Model: BERT
Method:
- Randomly mask some words in input sequence
- Let model predict these masked words
- Model needs to understand context to accurately predict
Example:
Input: The cat sat on the [MASK]
Prediction: mat
Advantages:
- Can utilize both directions of context
- Suitable for understanding tasks
Disadvantages:
- Not suitable for generation tasks
- Lower training efficiency (only the masked positions, typically about 15% of tokens, contribute to the loss)
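The masking procedure above can be sketched in a few lines. This is a toy illustration, not BERT's exact recipe (BERT also sometimes keeps or randomly replaces the selected token); the `MASK_ID` value and the `-100` "ignore" label are conventions borrowed from common implementations:

```python
import random

MASK_ID = 103  # hypothetical id for the [MASK] token

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Randomly replace tokens with MASK_ID and record labels.

    Labels are -100 (a conventional "ignore" value) everywhere except
    the masked positions, where they hold the original token to predict.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            masked.append(MASK_ID)   # hide the token from the model
            labels.append(tid)       # ...but remember it as the target
        else:
            masked.append(tid)
            labels.append(-100)      # this position is not scored
    return masked, labels

# "The cat sat on the mat" as toy token ids
masked, labels = mask_tokens([0, 1, 2, 3, 0, 4], mask_prob=0.3)
```

The model then sees `masked` as input and is trained to recover the original tokens at exactly the positions where `labels` is not -100.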
Causal Language Modeling (CLM)
Representative Models: GPT series, Claude
Method:
- Given preceding words, predict next word
- Can only utilize preceding context
- Autoregressive generation
Example:
Input: The cat sat on the
Prediction: mat
Advantages:
- Suitable for generation tasks
- High training efficiency (every position in the sequence contributes to the loss)
Disadvantages:
- Can only utilize unidirectional context
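The autoregressive objective can be made concrete by expanding a sequence into its training pairs. A minimal sketch (toy token ids, not a real tokenizer):

```python
def causal_lm_pairs(token_ids):
    """Turn a token sequence into (context, next-token) training pairs.

    At each step the model may look only at the preceding tokens --
    the left-to-right autoregressive objective used by GPT-style models.
    """
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

# "The cat sat on the mat" as toy token ids
pairs = causal_lm_pairs([0, 1, 2, 3, 0, 4])
# first pair: ([0], 1) -- given "The", predict "cat"
```

In practice all of these predictions are computed in one forward pass with a causal attention mask, which is where the training-efficiency advantage comes from.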
What is Finetuning
Finetuning refers to further training a pretrained model using specific task data, making the model adapt to specific tasks or domains.
Analogy: Finetuning is like a person who has extensively read, now learning specific domain expertise (like medicine, law, etc.).
Types of Finetuning
Full Finetuning: Update all parameters
- Advantages: Usually the best results
- Disadvantages: High cost, requires large amounts of data
Partial Finetuning: Only update some parameters
- Advantages: Lower cost
- Disadvantages: Results may be slightly worse
Parameter-Efficient Finetuning: Only update a small number of parameters
- Representative methods: LoRA, Prefix Tuning, Adapter
- Advantages: Very low cost, minimal data requirements
- Disadvantages: Results may be slightly worse
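The three regimes above differ mainly in what fraction of parameters gets updated. A toy sketch (the component names and parameter counts below are invented for illustration):

```python
# hypothetical parameter counts per component of a toy model
model = {"embedding": 50_000, "layer_0": 120_000, "layer_1": 120_000, "head": 50_000}

def trainable_fraction(trainable_names):
    """Fraction of all parameters that a finetuning run actually updates."""
    total = sum(model.values())
    trainable = sum(size for name, size in model.items() if name in trainable_names)
    return trainable / total

full = trainable_fraction(set(model))    # full finetuning: everything (1.0)
partial = trainable_fraction({"head"})   # partial: only the output head (~0.15)
```

Parameter-efficient methods like LoRA push this fraction below 1% by adding a small number of new trainable parameters instead of touching existing ones.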
Instruction Tuning
Instruction tuning is a special finetuning method that uses "instruction-response" pairs to train the model, enabling it to understand and follow natural language instructions.
Instruction Tuning Data
Instruction tuning data typically contains:
- Instruction: Natural language description of task
- Input: Specific input for task
- Output: Expected output
Example:
Instruction: Translate the following sentence to English
Input: 我喜欢人工智能
Output: I like artificial intelligence
Role of Instruction Tuning
- Improve Instruction-following Ability: Let model understand and execute natural language instructions
- Improve Conversation Ability: Make model better at multi-round conversations
- Enhance Generalization: Let model handle unseen instructions
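For training, the instruction/input/output fields are typically concatenated into a single string. A sketch using an Alpaca-style template (the "### ..." headers are one common convention, not a requirement of any particular model):

```python
record = {
    "instruction": "Translate the following sentence to English",
    "input": "我喜欢人工智能",
    "output": "I like artificial intelligence",
}

def format_example(rec):
    """Concatenate the three fields into one supervised training string."""
    return (
        f"### Instruction:\n{rec['instruction']}\n\n"
        f"### Input:\n{rec['input']}\n\n"
        f"### Response:\n{rec['output']}"
    )

text = format_example(record)
# during training, the loss is usually applied only to the response part
```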
RLHF (Reinforcement Learning from Human Feedback)
RLHF is a method using human feedback to optimize models, making model outputs better align with human preferences.
RLHF Steps
- Collect Human Preference Data: Have humans rank different model outputs
- Train Reward Model: Train a reward model based on human preference data
- Optimize with Reinforcement Learning: Use reward model as reward signal to optimize language model
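Step 2 is commonly implemented with a pairwise (Bradley-Terry style) loss: the reward model should score the human-preferred response above the rejected one. A minimal numerical sketch of that loss:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the chosen
    response out-scores the rejected one by a wide margin."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# the loss falls as the reward margin grows
assert preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0)
```

Minimizing this loss over many ranked pairs teaches the reward model to reproduce human preference orderings, which step 3 then uses as the reinforcement-learning reward signal.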
Role of RLHF
- Improve Output Quality: Make model outputs more helpful, honest, harmless
- Align with Human Preferences: Make model outputs better match human values
- Reduce Harmful Outputs: Lower probability of model generating harmful content
Parameter-Efficient Finetuning Methods
LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient finetuning method that adapts to new tasks by adding low-rank matrices rather than updating all parameters.
Principle:
- Add low-rank matrices beside original weight matrices
- Only train these low-rank matrices
- Original weights remain unchanged
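The principle above can be sketched numerically. Toy sizes only; in real models the low-rank update is typically applied to the attention weight matrices, with the scaling factor alpha/r as a tunable hyperparameter:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                          # toy hidden size and low rank

W = rng.normal(size=(d, d))          # pretrained weight, kept frozen
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def lora_forward(x, alpha=16.0):
    """Frozen path plus scaled low-rank update: (W + (alpha/r) * B A) x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# zero-initializing B means the adapted model starts exactly at the base model
assert np.allclose(lora_forward(x), W @ x)
# the adapter trains far fewer parameters than the full weight matrix
assert A.size + B.size < W.size
```

Because only A and B are saved per task, swapping tasks means swapping a tiny pair of matrices while the frozen base model is shared.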
Advantages:
- Few parameters (usually less than 1% of original model)
- Low training cost
- Can easily switch between different tasks
Applications:
- Adapt to specific domains
- Personalize models
- Rapid experimentation
Other Methods
- Prefix Tuning: Prepend trainable prefix vectors to the activations at each layer
- Adapter: Insert small trainable adapter layers between the model's existing layers
- Prompt Tuning: Learn soft prompt embeddings at the input, leaving the model weights unchanged
Challenges and Limitations of Finetuning
Challenges
- Data Requirements: Need high-quality domain data
- Computational Resources: Even parameter-efficient finetuning requires some resources
- Catastrophic Forgetting: May forget knowledge learned during pretraining
- Overfitting: Easy to overfit on small datasets
Limitations
- Performance Limitations: Parameter-efficient finetuning may not match the quality of full finetuning
- Domain Drift: Need frequent finetuning when domain changes quickly
- Evaluation Difficulty: How to evaluate finetuning effects is a challenge
Practical Applications
Pretrained Models
- GPT-4: OpenAI's general large language model
- Claude: Anthropic's AI assistant
- LLaMA: Meta's openly released model family
Finetuned Models
- Code Llama: Meta's LLaMA model further trained specifically for code
- Med-PaLM: Google's PaLM model finetuned for the medical domain
- BloombergGPT: Bloomberg's financial-domain model (actually pretrained from scratch on a mix of financial and general data, rather than finetuned from an existing base model)
Summary
Pretraining and finetuning are core processes of large language model development:
- Pretraining: Learn general knowledge and capabilities on large-scale data
- Finetuning: Adapt to specific tasks or domains
- Instruction Tuning: Improve instruction-following ability
- RLHF: Align with human preferences
Understanding pretraining and finetuning helps:
- Choose appropriate models
- Decide whether finetuning is needed
- Understand model capabilities and limitations
Next Steps
- Context Window - Learn how models process long texts
- Tokenization - Learn how models process text