From-Scratch GPT: Building a Mini LLM and End-to-End MLOps Pipeline
A fully custom GPT implementation from first principles — including tokenizer, Transformer, training, API, MLOps tracking, and an interactive web UI.
Problem
Most developers interact with Large Language Models (LLMs) as black boxes — fine-tuning, prompting, or deploying pre-trained systems like GPT-4 or LLaMA without understanding their inner mechanics.
I wanted to demystify these systems by implementing a miniature GPT architecture entirely from scratch, covering every step from tokenization to serving, while keeping it light enough to train on a local GPU (2GB VRAM) or CPU.
The goal was to build an educational yet functional mini-LLM pipeline that mirrors real-world systems — including data processing, model training, inference APIs, and lightweight MLOps tracking.
Solution Overview
The project implements a simplified GPT-style language model that learns next-token prediction on small text corpora (e.g., Tiny Shakespeare).
It provides an end-to-end training and serving workflow:
- Upload a raw text corpus
- Train a custom BPE tokenizer
- Encode the corpus into token IDs
- Train a mini GPT model using PyTorch
- Serve the model through a FastAPI backend
- Interact with the system via a clean HTML UI
- Track experiments and checkpoints with DVC & MLflow
This structure reflects a complete production-like ML system, scaled down for accessibility and clarity.
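As a sketch, the same workflow can be driven over HTTP. The endpoint names below come from the architecture diagram in the next section; the payload fields, filenames, and port are illustrative assumptions, not the project's exact schema:

```python
import requests

BASE = "http://localhost:8000"                  # assumed local server address

# 1) Upload a raw text corpus
with open("tinyshakespeare.txt", "rb") as f:
    requests.post(f"{BASE}/upload", files={"file": f})

# 2) Learn the BPE vocabulary, then 3) train the model
requests.post(f"{BASE}/build_tokenizer", json={"vocab_size": 2000})
requests.post(f"{BASE}/train", json={"steps": 5000})

# 4) Generate text from a prompt
resp = requests.post(f"{BASE}/generate",
                     json={"prompt": "Once upon a time", "max_new_tokens": 100})
print(resp.json())
```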
Architecture & Design
┌──────────────────────────────────────────────────────────┐
│                         UI Layer                         │
│            (HTML/JS) — Upload | Train | Test             │
└─────────────────────────────┬────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────┐
│                     FastAPI Backend                      │
│  /upload → /build_tokenizer → /train → /generate         │
│  Handles orchestration, caching, and model loading       │
└─────────────────────────────┬────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────┐
│                        Core Logic                        │
│  ├── tokenizer/bpe_tokenizer.py                          │
│  ├── gpt/model.py    ← Transformer decoder               │
│  ├── gpt/train.py    ← Training loop (AdamW + scheduler) │
│  ├── gpt/generate.py ← Sampling (top-k, nucleus, temp)   │
│  └── data/encode.py  ← Prepare dataset batches           │
└─────────────────────────────┬────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────┐
│                       MLOps Layer                        │
│  - DVC pipeline (tokenize → encode → train)              │
│  - MLflow experiment logging (loss, perplexity)          │
│  - Dockerfile for serving container                      │
└──────────────────────────────────────────────────────────┘
Technical Highlights
Byte-Pair Encoding (BPE) Tokenizer
- Implemented from scratch with vocabulary learning and merge rules.
- Trains on any text file and exports vocab.json / merges.txt.
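To illustrate the core idea, here is a minimal sketch of BPE merge learning, the heart of the tokenizer; the function names and toy corpus are illustrative, not the project's actual identifiers:

```python
from collections import Counter

def get_pair_counts(words: dict[tuple, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words: dict[tuple, int], pair: tuple) -> dict[tuple, int]:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Learn merges: start from characters, repeatedly merge the most frequent pair.
corpus = "low lower lowest low low".split()
words = Counter(tuple(w) for w in corpus)
merges = []
for _ in range(10):
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]
    merges.append(best)
    words = merge_pair(words, best)
print(merges[:5])  # e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e'), ...]
```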
Decoder-only Transformer
- Multi-head self-attention, learned positional embeddings, LayerNorm, GELU activations.
- Supports variable block sizes for efficient mini-batch training.
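The sketch below shows the shape of one such decoder block in PyTorch, assuming a pre-norm layout and illustrative hyperparameter names (n_embd, n_head, block_size); the project's actual module lives in gpt/model.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # fused Q, K, V projection
        self.proj = nn.Linear(n_embd, n_embd)
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape to (B, n_head, T, head_dim) for per-head attention.
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        # Causal mask: each position may attend only to itself and the past.
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        y = F.softmax(att, dim=-1) @ v
        return self.proj(y.transpose(1, 2).contiguous().view(B, T, C))

class Block(nn.Module):
    """Pre-norm decoder block: LayerNorm, attention, LayerNorm, GELU MLP."""
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))      # residual around attention
        return x + self.mlp(self.ln2(x))    # residual around MLP
```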
Training Pipeline
- AdamW optimizer with weight decay.
- Warmup and cosine learning rate scheduler.
- Checkpoint saving every N steps.
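A condensed sketch of this loop follows; `model`, `get_batch`, and all hyperparameter values are placeholders rather than the project's exact configuration:

```python
import math
import torch
import torch.nn.functional as F

# Placeholder config; `model` and `get_batch` are assumed to come from
# gpt/model.py and data/encode.py respectively.
max_steps, warmup, save_every = 5000, 100, 500
max_lr, min_lr = 3e-4, 3e-5

def lr_at(step: int) -> float:
    if step < warmup:                                  # linear warmup
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / (max_steps - warmup)         # progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, weight_decay=0.1)

for step in range(max_steps):
    xb, yb = get_batch("train")                        # (B, T) token-ID tensors
    logits = model(xb)                                 # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
    for g in optimizer.param_groups:                   # apply scheduled LR
        g["lr"] = lr_at(step)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % save_every == 0:                         # checkpoint every N steps
        torch.save({"step": step, "model": model.state_dict()}, f"ckpt_{step}.pt")
```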
Text Generation
- Configurable temperature, top-k, and top-p (nucleus) sampling.
- Deterministic seeding for reproducible outputs.
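A minimal sketch of one sampling step combining all three controls; the function signature and default values are illustrative, and `logits` is the model's output for the last position (the actual logic lives in gpt/generate.py):

```python
import torch

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.9, seed=None):
    """Draw one token ID from a (vocab_size,) logits vector."""
    if seed is not None:
        torch.manual_seed(seed)            # deterministic, reproducible draws
    logits = logits / temperature          # sharpen or flatten the distribution
    if top_k is not None:                  # keep only the k most likely tokens
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    if top_p is not None:                  # nucleus: smallest set with mass >= top_p
        sorted_probs, idx = torch.sort(probs, descending=True)
        cum = torch.cumsum(sorted_probs, dim=-1)
        sorted_probs[cum - sorted_probs > top_p] = 0.0   # drop the tail
        probs = torch.zeros_like(probs).scatter_(0, idx, sorted_probs)
        probs = probs / probs.sum()        # renormalize after filtering
    return torch.multinomial(probs, num_samples=1).item()
```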
MLOps Integration
- DVC to version and reproduce the full pipeline (tokenize → encode → train).
- MLflow to log hyperparameters, losses, and perplexity metrics.
- Docker container for running the FastAPI inference server.
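For illustration, the MLflow side might look like the sketch below; these are real mlflow API calls, but the run name, parameter keys, and values are examples rather than the project's exact configuration (the DVC stages themselves live in a dvc.yaml, not shown here):

```python
import math
import mlflow

losses = [2.31, 1.92, 1.54]                     # example per-step losses
with mlflow.start_run(run_name="mini-gpt"):
    mlflow.log_params({"n_layer": 4, "n_head": 4, "n_embd": 128, "lr": 3e-4})
    for step, loss in enumerate(losses):
        mlflow.log_metric("train_loss", loss, step=step)
        mlflow.log_metric("perplexity", math.exp(loss), step=step)
    mlflow.log_artifact("ckpt_final.pt")        # attach the checkpoint file
```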
Visual UI
- Single-page web app (HTML/CSS/JS) served by FastAPI.
- Upload corpus → Build vocab → Train model → Generate text interactively.
My Contributions
This was a solo project, and I personally implemented and integrated every component:
- Designed the full PyTorch GPT architecture from scratch (no Hugging Face dependency).
- Built the BPE tokenizer and text encoding pipeline.
- Implemented training loop, loss tracking, and sampling algorithms.
- Developed a FastAPI backend with modular endpoints for each phase.
- Created a custom web UI for visualizing the workflow interactively.
- Added DVC and MLflow for experiment tracking and reproducibility.
- Wrote pytest tests and structured the project for scalability and clarity.
Results & Impact
- Trained a working mini GPT model on Tiny Shakespeare that can generate coherent text.
- Ran training and inference smoothly on both CPU-only and low-VRAM GPU (2GB) setups.
- Delivered a self-contained educational LLM system, bridging deep learning and MLOps concepts.
- Used the project as a teaching and portfolio artifact to demonstrate end-to-end ML system design.
Example output:
Prompt: "Once upon a time"
Model: "Once upon a time there was a noble prince, and the wind whispered soft verses of love..."
What I Learned
- Deepened my understanding of how tokenization, attention, and sampling interact in GPT architectures.
- Learned how to structure ML pipelines reproducibly using DVC and MLflow.
- Improved skills in serving ML models efficiently through FastAPI.
- Gained appreciation for system thinking — connecting modeling, engineering, and user interaction.