From-Scratch GPT: Building a Mini LLM and End-to-End MLOps Pipeline
A fully custom GPT implementation from first principles — including tokenizer, Transformer, training, API, MLOps tracking, and an interactive web UI.
Problem
Most developers interact with Large Language Models (LLMs) as black boxes — fine-tuning, prompting, or deploying pre-trained systems like GPT-4 or LLaMA without understanding their inner mechanics.
I wanted to demystify these systems by implementing a miniature GPT architecture entirely from scratch, covering every step from tokenization to serving, while keeping it light enough to train on a local GPU (2GB VRAM) or CPU.
The goal was to build an educational yet functional mini-LLM pipeline that mirrors real-world systems — including data processing, model training, inference APIs, and lightweight MLOps tracking.
Solution Overview
The project implements a simplified GPT-style language model that learns next-token prediction on small text corpora (e.g., Tiny Shakespeare).
It provides an end-to-end training and serving workflow:
- Upload a raw text corpus
- Train a custom BPE tokenizer
- Encode the corpus into token IDs
- Train a mini GPT model using PyTorch
- Serve the model through a FastAPI backend
- Interact with the system via a clean HTML UI
- Track experiments and checkpoints with DVC & MLflow
This structure reflects a complete production-like ML system, scaled down for accessibility and clarity.
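As a sketch, the same workflow can be driven over HTTP. The endpoint names below come from the architecture diagram in the next section; the payload fields, filenames, and port are illustrative assumptions, not the project's exact schema:

```python
import requests

BASE = "http://localhost:8000"                  # assumed local server address

# 1) Upload a raw text corpus
with open("tinyshakespeare.txt", "rb") as f:
    requests.post(f"{BASE}/upload", files={"file": f})

# 2) Learn the BPE vocabulary, then 3) train the model
requests.post(f"{BASE}/build_tokenizer", json={"vocab_size": 2000})
requests.post(f"{BASE}/train", json={"steps": 5000})

# 4) Generate text from a prompt
resp = requests.post(f"{BASE}/generate",
                     json={"prompt": "Once upon a time", "max_new_tokens": 100})
print(resp.json())
```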
Architecture & Design
┌──────────────────────────────────────────────────────────┐
│                         UI Layer                         │
│            (HTML/JS) — Upload | Train | Test             │
└─────────────────────────────┬────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────┐
│                     FastAPI Backend                      │
│  /upload → /build_tokenizer → /train → /generate         │
│  Handles orchestration, caching, and model loading       │
└─────────────────────────────┬────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────┐
│                        Core Logic                        │
│  ├── tokenizer/bpe_tokenizer.py                          │
│  ├── gpt/model.py    ← Transformer decoder               │
│  ├── gpt/train.py    ← Training loop (AdamW + scheduler) │
│  ├── gpt/generate.py ← Sampling (top-k, nucleus, temp)   │
│  └── data/encode.py  ← Prepare dataset batches           │
└─────────────────────────────┬────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────┐
│                       MLOps Layer                        │
│  - DVC pipeline (tokenize → encode → train)              │
│  - MLflow experiment logging (loss, perplexity)          │
│  - Dockerfile for serving container                      │
└──────────────────────────────────────────────────────────┘
Technical Highlights
Byte-Pair Encoding (BPE) Tokenizer
- Implemented from scratch with vocabulary learning and merge rules.
- Trains on any text file and exports vocab.json / merges.txt.
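To illustrate the core idea, here is a minimal sketch of BPE merge learning, the heart of the tokenizer; the function names and toy corpus are illustrative, not the project's actual identifiers:

```python
from collections import Counter

def get_pair_counts(words: dict[tuple, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words: dict[tuple, int], pair: tuple) -> dict[tuple, int]:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Learn merges: start from characters, repeatedly merge the most frequent pair.
corpus = "low lower lowest low low".split()
words = Counter(tuple(w) for w in corpus)
merges = []
for _ in range(10):
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]
    merges.append(best)
    words = merge_pair(words, best)
print(merges[:5])  # e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e'), ...]
```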
Decoder-only Transformer
- Multi-head self-attention, learned positional embeddings, LayerNorm, GELU activations.
- Supports variable block sizes for efficient mini-batch training.
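The sketch below shows the shape of one such decoder block in PyTorch, assuming a pre-norm layout and illustrative hyperparameter names (n_embd, n_head, block_size); the project's actual module lives in gpt/model.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # fused Q, K, V projection
        self.proj = nn.Linear(n_embd, n_embd)
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape to (B, n_head, T, head_dim) for per-head attention.
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        # Causal mask: each position may attend only to itself and the past.
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        y = F.softmax(att, dim=-1) @ v
        return self.proj(y.transpose(1, 2).contiguous().view(B, T, C))

class Block(nn.Module):
    """Pre-norm decoder block: LayerNorm, attention, LayerNorm, GELU MLP."""
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))      # residual around attention
        return x + self.mlp(self.ln2(x))    # residual around MLP
```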
Training Pipeline
- AdamW optimizer with weight decay.
- Warmup and cosine learning rate scheduler.
- Checkpoint saving every N steps.
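A condensed sketch of this loop follows; `model`, `get_batch`, and all hyperparameter values are placeholders rather than the project's exact configuration:

```python
import math
import torch
import torch.nn.functional as F

# Placeholder config; `model` and `get_batch` are assumed to come from
# gpt/model.py and data/encode.py respectively.
max_steps, warmup, save_every = 5000, 100, 500
max_lr, min_lr = 3e-4, 3e-5

def lr_at(step: int) -> float:
    if step < warmup:                                  # linear warmup
        return max_lr * (step + 1) / warmup
    t = (step - warmup) / (max_steps - warmup)         # progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, weight_decay=0.1)

for step in range(max_steps):
    xb, yb = get_batch("train")                        # (B, T) token-ID tensors
    logits = model(xb)                                 # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
    for g in optimizer.param_groups:                   # apply scheduled LR
        g["lr"] = lr_at(step)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % save_every == 0:                         # checkpoint every N steps
        torch.save({"step": step, "model": model.state_dict()}, f"ckpt_{step}.pt")
```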
Text Generation
- Configurable temperature, top-k, and top-p (nucleus) sampling.
- Deterministic seeding for reproducible outputs.
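A minimal sketch of one sampling step combining all three controls; the function signature and default values are illustrative, and `logits` is the model's output for the last position (the actual logic lives in gpt/generate.py):

```python
import torch

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.9, seed=None):
    """Draw one token ID from a (vocab_size,) logits vector."""
    if seed is not None:
        torch.manual_seed(seed)            # deterministic, reproducible draws
    logits = logits / temperature          # sharpen or flatten the distribution
    if top_k is not None:                  # keep only the k most likely tokens
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    if top_p is not None:                  # nucleus: smallest set with mass >= top_p
        sorted_probs, idx = torch.sort(probs, descending=True)
        cum = torch.cumsum(sorted_probs, dim=-1)
        sorted_probs[cum - sorted_probs > top_p] = 0.0   # drop the tail
        probs = torch.zeros_like(probs).scatter_(0, idx, sorted_probs)
        probs = probs / probs.sum()        # renormalize after filtering
    return torch.multinomial(probs, num_samples=1).item()
```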
MLOps Integration
- DVC to version and reproduce the full pipeline (tokenize → encode → train).
- MLflow to log hyperparameters, losses, and perplexity metrics.
- Docker container for running the FastAPI inference server.
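For illustration, the MLflow side might look like the sketch below; these are real mlflow API calls, but the run name, parameter keys, and values are examples rather than the project's exact configuration (the DVC stages themselves live in a dvc.yaml, not shown here):

```python
import math
import mlflow

losses = [2.31, 1.92, 1.54]                     # example per-step losses
with mlflow.start_run(run_name="mini-gpt"):
    mlflow.log_params({"n_layer": 4, "n_head": 4, "n_embd": 128, "lr": 3e-4})
    for step, loss in enumerate(losses):
        mlflow.log_metric("train_loss", loss, step=step)
        mlflow.log_metric("perplexity", math.exp(loss), step=step)
    mlflow.log_artifact("ckpt_final.pt")        # attach the checkpoint file
```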
Visual UI
- Single-page web app (HTML/CSS/JS) served by FastAPI.
- Upload corpus → Build vocab → Train model → Generate text interactively.
My Contributions
This was a solo project, and I personally implemented and integrated every component:
- Designed the full PyTorch GPT architecture from scratch (no Hugging Face dependency).
- Built the BPE tokenizer and text encoding pipeline.
- Implemented training loop, loss tracking, and sampling algorithms.
- Developed a FastAPI backend with modular endpoints for each phase.
- Created a custom web UI for visualizing the workflow interactively.
- Added DVC and MLflow for experiment tracking and reproducibility.
- Wrote pytest tests and structured the project for scalability and clarity.
Results & Impact
- Trained a working mini GPT model on Tiny Shakespeare that can generate coherent text.
- Ran training and inference smoothly on both CPU-only and low-VRAM GPU (2GB) setups.
- Delivered a self-contained educational LLM system, bridging deep learning and MLOps concepts.
- Used the project as a teaching and portfolio artifact to demonstrate end-to-end ML system design.
Example output:
Prompt: "Once upon a time"
Model: "Once upon a time there was a noble prince, and the wind whispered soft verses of love..."
What I Learned
- Deepened my understanding of how tokenization, attention, and sampling interact in GPT architectures.
- Learned how to structure ML pipelines reproducibly using DVC and MLflow.
- Improved skills in serving ML models efficiently through FastAPI.
- Gained appreciation for system thinking — connecting modeling, engineering, and user interaction.