Source: Coggle Data Science
In today's AI landscape, developing large language models (LLMs) has become a hot topic. By learning from massive amounts of text data, these models can generate natural language and carry out a wide range of complex tasks such as writing, translation, and question answering.
https://github.com/FareedKhan-dev/train-llm-from-scratch
This article walks you through a simple, direct path from downloading the data to generating text, building a large language model step by step.
Step 1: GPU Setup
Before you start training a language model, you should have a basic grasp of object-oriented programming (OOP), neural networks (NNs), and PyTorch.
Training a language model demands serious compute, especially a GPU. GPUs differ in memory capacity and compute power and therefore suit models of different scales. The GPU comparison table below is meant to help you pick suitable hardware; a short snippet for checking your local GPU follows the table.
[GPU comparison table: requirements for 13M LLM training vs. 2B LLM training]
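Before committing to hardware, it helps to confirm what PyTorch can actually see on your machine. The check below is a minimal sketch using standard PyTorch calls; it is not part of the original repository.
import torch  # only needed here if you run this check before the imports in Step 2
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, memory: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU detected; training on CPU will be extremely slow.")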
Step 2: Import the Libraries
Before we begin, we need to import a few essential Python libraries. They will help us process the data, build the model, and train it.
# PyTorch for deep learning functions and tensors
import torch
import torch.nn as nn
import torch.nn.functional as F
# Numerical operations and arrays handling
import numpy as np
# Handling HDF5 files
import h5py
# Operating system and file management
import os
# Command-line argument parsing
import argparse
# HTTP requests and interactions
import requests
# Progress bar for loops
from tqdm import tqdm
# JSON handling
import json
# Zstandard compression library
import zstandard as zstd
# Tokenization library for large language models
import tiktoken
# Math operations (used for advanced math functions)
import math
Step 3: Load the Dataset
The Pile is a large-scale, diverse, open-source dataset designed for language model training. It is composed of 22 sub-datasets covering books, articles, Wikipedia, code, news, and many other kinds of text.
# Download validation dataset
!wget https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/val.jsonl.zst
# Download the first part of the training dataset
!wget https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/00.jsonl.zst
# Download the second part of the training dataset
!wget https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/01.jsonl.zst
# Download the third part of the training dataset
!wget https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/02.jsonl.zst
The processed dataset records look like this (a sketch of the preprocessing code follows the example output):
#### OUTPUT ####
Line: 0
{
"text":"Effect of sleep quality ... epilepsy.",
"meta": {
"pile_set_name":"PubMed Abstracts"
}
}
Line: 1
{
"text":"LLMops a new GitHub Repository ...",
"meta": {
"pile_set_name":"Github"
}
}
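The downloaded files are Zstandard-compressed JSONL. The sketch below shows one way to stream such a file, tokenize each document with tiktoken, and store the token ids in an HDF5 file, using the libraries imported in Step 2. It is a simplified stand-in for the repository's preprocessing script: the input name val.jsonl.zst, the output name pile_val.h5, and the max_lines cap are illustrative assumptions.
import io
import json
import zstandard as zstd
import tiktoken
import h5py
enc = tiktoken.get_encoding('r50k_base')  # same tokenizer used later for generation
def process_file(in_path, out_path, max_lines=10_000):
    """Stream a .jsonl.zst file, tokenize each document, and append the ids to an HDF5 dataset."""
    dctx = zstd.ZstdDecompressor()
    with open(in_path, 'rb') as f, h5py.File(out_path, 'w') as h5f:
        tokens = h5f.create_dataset('tokens', shape=(0,), maxshape=(None,), dtype='i4')
        text_stream = io.TextIOWrapper(dctx.stream_reader(f), encoding='utf-8')
        for i, line in enumerate(text_stream):
            if i >= max_lines:
                break
            doc = json.loads(line)
            ids = enc.encode(doc['text']) + [enc.eot_token]  # append the end-of-text token
            tokens.resize(tokens.shape[0] + len(ids), axis=0)
            tokens[-len(ids):] = ids
process_file('val.jsonl.zst', 'pile_val.h5')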
Step 4: The Transformer Architecture
A Transformer works by breaking text into smaller units called tokens and predicting the next token in the sequence. It consists of multiple layers, called Transformer blocks, stacked one on top of another, followed by a final layer that makes the prediction.
Each Transformer block contains two main components:
1. Self-Attention Heads
Self-attention heads determine which parts of the input matter most to the model. When processing a sentence, for example, they can highlight relationships between words, such as the link between a pronoun and the noun it refers to. This helps the model understand the sentence's structure and meaning.
2. Multi-Layer Perceptron (MLP)
The MLP is a simple feed-forward neural network. It takes the information emphasized by the self-attention heads and processes it further. The MLP consists of a linear layer that expands the embedding, a ReLU activation, and a projection layer that maps back to the original embedding size.
Step 5: The Multi-Layer Perceptron (MLP)
The multi-layer perceptron (MLP) is the core of the feed-forward network (FFN) in the Transformer architecture. Its main job is to introduce non-linearity and learn complex relationships within the embedded representations. When defining the MLP module, one important parameter is n_embed, the dimensionality of the input embedding.
This sequence of transformations lets the MLP further refine the representations learned by the attention mechanism. Concretely, the hidden linear layer expands the embedding to four times its size, a ReLU activation adds non-linearity, and a projection layer maps it back to the original embedding dimension:
# --- MLP (Multi-Layer Perceptron) Class ---
class MLP(nn.Module):
"""
A simple Multi-Layer Perceptron with one hidden layer.
This module is used within the Transformer block for feed-forward processing.
It expands the input embedding size, applies a ReLU activation, and then projects it back
to the original embedding size.
"""
def __init__(self, n_embed):
super().__init__()
self.hidden = nn.Linear(n_embed, 4 * n_embed) # Linear layer to expand embedding size
self.relu = nn.ReLU() # ReLU activation function
self.proj = nn.Linear(4 * n_embed, n_embed) # Linear layer to project back to original size
def forward(self, x):
"""
Forward pass through the MLP.
Args:
x (torch.Tensor): Input tensor of shape (B, T, C), where B is batch size,
T is sequence length, and C is embedding size.
Returns:
torch.Tensor: Output tensor of the same shape as the input.
"""
x = self.forward_embedding(x)
x = self.project_embedding(x)
        return x
def forward_embedding(self, x):
"""
Applies the hidden linear layer followed by ReLU activation.
Args:
x (torch.Tensor): Input tensor.
Returns:
torch.Tensor: Output after the hidden layer and ReLU.
"""
x = self.relu(self.hidden(x))
        return x
def project_embedding(self, x):
"""
Applies the projection linear layer.
Args:
x (torch.Tensor): Input tensor.
Returns:
torch.Tensor: Output after the projection layer.
"""
x = self.proj(x)
        return x
Step 6: Single-Head Attention
The attention head is at the heart of the Transformer model; its job is to let the model focus on the parts of the input sequence that are most relevant to the task at hand. When defining the attention-head module, the important parameters are head_size, n_embed, and context_length.
Inside the attention head we initialize three bias-free linear layers (nn.Linear) for the key, query, and value projections. We also register a context_length x context_length lower-triangular matrix (tril) as a buffer to implement causal masking, which prevents the attention mechanism from attending to future tokens.
# --- Attention Head Class ---
class Head(nn.Module):
def __init__(self, head_size, n_embed, context_length):
super().__init__()
self.key = nn.Linear(n_embed, head_size, bias=False) # Key projection
        self.query = nn.Linear(n_embed, head_size, bias=False)  # Query projection
        self.value = nn.Linear(n_embed, head_size, bias=False)  # Value projection
# Lower triangular matrix for causal masking
self.register_buffer('tril', torch.tril(torch.ones(context_length, context_length)))
def forward(self, x):
B, T, C = x.shape
k = self.key(x) # (B, T, head_size)
q = self.query(x) # (B, T, head_size)
scale_factor = 1 / math.sqrt(C)
# Calculate attention weights: (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
attn_weights = q @ k.transpose(-2, -1) * scale_factor
# Apply causal masking
        attn_weights = attn_weights.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
attn_weights = F.softmax(attn_weights, dim=-1)
v = self.value(x) # (B, T, head_size)
# Apply attention weights to values
        out = attn_weights @ v  # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out
Step 7: Multi-Head Attention
Multi-head attention is a key mechanism in the Transformer architecture for capturing diverse relationships within the input sequence. By running several independent attention heads in parallel, the model can attend to different aspects of the input simultaneously and build a richer understanding of the sequence.
class MultiHeadAttention(nn.Module):
"""
Multi-Head Attention module.
This module combines multiple attention heads in parallel. The outputs of each head
are concatenated to form the final output.
"""
def __init__(self, n_head, n_embed, context_length):
super().__init__()
        self.heads = nn.ModuleList([Head(n_embed // n_head, n_embed, context_length) for _ in range(n_head)])
def forward(self, x):
"""
Forward pass through the multi-head attention.
Args:
x (torch.Tensor): Input tensor of shape (B, T, C).
Returns:
torch.Tensor: Output tensor after concatenating the outputs of all heads.
"""
# Concatenate the output of each head along the last dimension (C)
        x = torch.cat([h(x) for h in self.heads], dim=-1)
        return x
Step 8: The Transformer Block
The Transformer block is the core unit of the architecture. It combines multi-head attention with a feed-forward network (MLP), applying layer normalization and residual connections so the model can process the input and learn complex patterns.
Each Transformer block contains a layer norm followed by multi-head attention, and a second layer norm followed by an MLP, with a residual connection around each of the two sub-layers:
class Block(nn.Module):
def __init__(self, n_head, n_embed, context_length):
super().__init__()
self.ln1 = nn.LayerNorm(n_embed)
self.attn = MultiHeadAttention(n_head, n_embed, context_length)
self.ln2 = nn.LayerNorm(n_embed)
self.mlp = MLP(n_embed)
def forward(self, x):
# Apply multi-head attention with residual connection
        x = x + self.attn(self.ln1(x))
        # Apply MLP with residual connection
        x = x + self.mlp(self.ln2(x))
        return x
def forward_embedding(self, x):
        res = x + self.attn(self.ln1(x))
        x = self.mlp.forward_embedding(self.ln2(res))
        return x, res
Step 9: The Full Model
So far we have written the individual building blocks of the Transformer, such as multi-head attention and the MLP. Now we assemble them into a complete Transformer model that performs sequence-to-sequence prediction. To do so we define several key parameters: n_head, n_embed, context_length, vocab_size, and N_BLOCKS.
# --- Transformer Model Class ---
class Transformer(nn.Module):
"""
The main Transformer model.
This class combines token and position embeddings with a sequence of Transformer blocks
and a final linear layer for language modeling.
"""
def __init__(self, n_head, n_embed, context_length, vocab_size, N_BLOCKS):
super().__init__()
self.context_length = context_length
self.N_BLOCKS = N_BLOCKS
self.token_embed = nn.Embedding(vocab_size, n_embed)
self.position_embed = nn.Embedding(context_length, n_embed)
        self.attn_blocks = nn.ModuleList([Block(n_head, n_embed, context_length) for _ in range(N_BLOCKS)])
self.layer_norm = nn.LayerNorm(n_embed)
self.lm_head = nn.Linear(n_embed, vocab_size)
self.register_buffer('pos_idxs', torch.arange(context_length))
def _pre_attn_pass(self, idx):
B, T = idx.shape
tok_embedding = self.token_embed(idx)
pos_embedding = self.position_embed(self.pos_idxs[:T])
        return tok_embedding + pos_embedding
def forward(self, idx, targets=None):
x = self._pre_attn_pass(idx)
        for block in self.attn_blocks:
x = block(x)
x = self.layer_norm(x)
logits = self.lm_head(x)
loss = None
        if targets is not None:
            B, T, C = logits.shape
            flat_logits = logits.view(B * T, C)
            targets = targets.view(B * T).long()
            loss = F.cross_entropy(flat_logits, targets)
        return logits, loss
def forward_embedding(self, idx):
x = self._pre_attn_pass(idx)
residual = x
        for block in self.attn_blocks:
x, residual = block.forward_embedding(x)
returnx, residual
def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
idx_cond = idx[:, -self.context_length:]
logits, _ = self(idx_cond)
logits = logits[:, -1, :]
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)
idx = torch.cat((idx, idx_next), dim=1)
        return idx
Step 10: Training Configuration
Now that the model code is complete, we need to define the training parameters, including the number of attention heads, the number of Transformer blocks, and so on, along with the data paths and related settings, as sketched below.
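The values below are illustrative defaults rather than the repository's official configuration (the hyperparameter values, file names, and upper-case constant names are assumptions); adjust them to your hardware, roughly at the 13M-parameter scale from the table in Step 1.
# --- Training configuration (illustrative values; tune for your hardware) ---
VOCAB_SIZE = 50257        # vocabulary size of the r50k_base tokenizer
CONTEXT_LENGTH = 128      # maximum sequence length fed to the model
N_EMBED = 128             # embedding dimension
N_HEAD = 8                # attention heads per Transformer block
N_BLOCKS = 8              # number of Transformer blocks
# Optimization hyperparameters (assumed, not taken from the original repository)
T_BATCH_SIZE = 32
LEARNING_RATE = 1e-3
MAX_ITERS = 5000
EVAL_INTERVAL = 500
# Data and checkpoint paths (assumed file names)
TRAIN_PATH = "pile_train.h5"
VAL_PATH = "pile_val.h5"
CHECKPOINT_PATH = "model_checkpoint.pt"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"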
Step 11: Model Training
We use the AdamW optimizer, an improved variant of Adam that works well for deep-learning workloads. A condensed training loop is sketched below.
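This sketch ties the configuration above to the model defined in Step 9. The get_batch helper is a simplified stand-in for the repository's HDF5 data loader (its name and behavior are assumptions); the optimizer is torch.optim.AdamW as described above, and the checkpoint is saved in the format expected by generate_text in Step 12.
model = Transformer(N_HEAD, N_EMBED, CONTEXT_LENGTH, VOCAB_SIZE, N_BLOCKS).to(DEVICE)
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
def get_batch(h5_path, batch_size, context_length):
    """Sample random (input, target) sequences from the tokenized HDF5 file."""
    with h5py.File(h5_path, 'r') as f:
        tokens = f['tokens']
        starts = np.random.randint(0, len(tokens) - context_length - 1, size=batch_size)
        x = torch.stack([torch.from_numpy(tokens[s:s + context_length].astype(np.int64)) for s in starts])
        y = torch.stack([torch.from_numpy(tokens[s + 1:s + context_length + 1].astype(np.int64)) for s in starts])
    return x.to(DEVICE), y.to(DEVICE)
for step in range(MAX_ITERS):
    xb, yb = get_batch(TRAIN_PATH, T_BATCH_SIZE, CONTEXT_LENGTH)
    logits, loss = model(xb, yb)               # forward pass returns logits and cross-entropy loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % EVAL_INTERVAL == 0:
        print(f"step {step}: train loss {loss.item():.4f}")
# Save a checkpoint in the format loaded by generate_text below
torch.save({'model_state_dict': model.state_dict()}, CHECKPOINT_PATH)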
Step 12: Generating Text
Next, we create a generate_text function that produces text from a saved model. It takes the path to the saved checkpoint and an input prompt, and returns the generated text. We also compare how a model with millions of parameters and one with billions of parameters behave when generating text.
def generate_text(model_path, input_text, max_length=512, device="cuda"):
# Load the model checkpoint
checkpoint = torch.load(model_path)
    # Initialize the model (hyperparameters must match those used during training; here we reuse the Step 10 configuration)
    model = Transformer(N_HEAD, N_EMBED, CONTEXT_LENGTH, VOCAB_SIZE, N_BLOCKS).to(device)
# Load the model's state dictionary
model.load_state_dict(checkpoint['model_state_dict'])
# Load the tokenizer for the GPT model (we use 'r50k_base' for GPT models)
enc = tiktoken.get_encoding('r50k_base')
# Encode the input text along with the end-of-text token
input_ids = torch.tensor(
enc.encode(input_text, allowed_special={'<|endoftext|>'}),
dtype=torch.long
)[None, :].to(device) # Add batch dimension and move to the specified device
# Generate text with the model using the encoded input
with torch.no_grad():
# Generate up to 'max_length' tokens of text
generated_output = model.generate(input_ids, max_length)
# Decode the generated tokens back into text
generated_text = enc.decode(generated_output[0].tolist())
    return generated_text
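A quick usage example (the checkpoint path and the prompt are placeholders):
text = generate_text(CHECKPOINT_PATH, "Once upon a time", max_length=100, device=DEVICE)
print(text)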