LLMs From Scratch
Tokenization and dataset preparation for GPT-style language models, following the Build a Large Language Model (From Scratch) book by Sebastian Raschka.
Project Structure
src/
data_utils.py — download and load the-verdict.txt
tokenizer.py — vocabulary building and Tokenizer class
dataset.py — GPTDataset (PyTorch Dataset) and GPTLoader
main.py — demo runner
requirements.txt
Setup
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
Run
cd src
python main.py
What It Does
- Downloads the-verdict.txt (a short story used as training data)
- Builds a character-level vocabulary from the text
- Encodes the full text into integer token IDs
- Demonstrates next-token prediction (context → target pairs)
- Creates a PyTorch DataLoader using a sliding-window approach
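
The steps above can be sketched in plain Python. This is a minimal illustration, not the code in src/: the function names (build_vocab, encode, decode, sliding_windows), the sample string, and the context_length and stride parameters are all assumptions chosen for the example. The project's GPTDataset presumably wraps similar windows in PyTorch tensors, but that is not shown here.

```python
def build_vocab(text):
    """Map each distinct character to an integer ID (sorted for determinism)."""
    chars = sorted(set(text))
    return {ch: i for i, ch in enumerate(chars)}


def encode(text, stoi):
    """Turn a string into a list of integer token IDs."""
    return [stoi[ch] for ch in text]


def decode(ids, stoi):
    """Invert encode(): turn token IDs back into a string."""
    itos = {i: ch for ch, i in stoi.items()}
    return "".join(itos[i] for i in ids)


def sliding_windows(token_ids, context_length, stride):
    """Build (input, target) windows; each target is its input shifted by one token."""
    inputs, targets = [], []
    for start in range(0, len(token_ids) - context_length, stride):
        inputs.append(token_ids[start:start + context_length])
        targets.append(token_ids[start + 1:start + context_length + 1])
    return inputs, targets


# Hypothetical sample text standing in for the-verdict.txt.
sample = "hello world"
stoi = build_vocab(sample)
ids = encode(sample, stoi)

# Next-token prediction: every prefix of a window predicts the token that follows it.
for i in range(1, 5):
    context, target = ids[:i], ids[i]
    print(repr(decode(context, stoi)), "->", repr(decode([target], stoi)))

inputs, targets = sliding_windows(ids, context_length=4, stride=4)
```

With stride equal to context_length the windows do not overlap. A smaller stride yields more, overlapping training examples from the same text.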