LLMs From Scratch

Tokenization and dataset preparation for GPT-style language models, following the book Build a Large Language Model (From Scratch) by Sebastian Raschka.

Project Structure

src/
  data_utils.py   — download and load the-verdict.txt
  tokenizer.py    — vocabulary building and Tokenizer class
  dataset.py      — GPTDataset (PyTorch Dataset) and GPTLoader
  main.py         — demo runner
requirements.txt

Setup

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt

Run

cd src
python main.py

What It Does

Downloads the-verdict.txt (a short story used as training data)
Builds a character-level vocabulary from the text
Encodes the full text into integer token IDs
Demonstrates next-token prediction (context → target pairs)
Creates a PyTorch DataLoader using a sliding-window approach
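
Sketch of the pipeline

The steps above can be sketched in plain Python. This is a minimal illustration, not the code in src/. The names build_vocab, encode, decode, sliding_windows, context_length, and stride are hypothetical and chosen only for this example. The real GPTDataset presumably wraps the same windows in PyTorch tensors.

```python
def build_vocab(text):
    """Map each distinct character to an integer ID (sorted for determinism)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def encode(text, vocab):
    """Turn a string into a list of integer token IDs."""
    return [vocab[ch] for ch in text]


def decode(ids, vocab):
    """Invert encode(): turn token IDs back into a string."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


def sliding_windows(token_ids, context_length, stride):
    """Build (inputs, targets) pairs; targets are inputs shifted by one token."""
    pairs = []
    # Stop early enough that every target window is full length.
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start : start + context_length]
        targets = token_ids[start + 1 : start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


# Tiny stand-in for the-verdict.txt (assumption: any string works here).
text = "hello world"
vocab = build_vocab(text)
ids = encode(text, vocab)
pairs = sliding_windows(ids, context_length=4, stride=2)

for inputs, targets in pairs:
    print(repr(decode(inputs, vocab)), "->", repr(decode(targets, vocab)))
```

Each target window is the input window shifted one position to the right, so position i of the target is the token the model should predict after seeing the input up to position i. A stride smaller than context_length makes consecutive windows overlap. A stride equal to context_length gives non-overlapping windows.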