Large Language Models
Part-I: Practical LLM & Part-II: Theoretical Framework for LLM
Why It Matters: RAG combines LLMs with retrieval systems to reduce hallucinations and improve factual accuracy.
Introduction to RAG
Content:
Three-Step Workflow: Retrieve → Augment → Generate.
Demo: Build a RAG pipeline using Llama-3 and FAISS for vector search.
Code walkthrough for document chunking, embedding, and query augmentation.
Learning Outcome: Implement a basic RAG system from scratch.
Video: https://youtu.be/DBprEyQBeKQ?si=HeILh01l6SxkWgBs
Code Walkthrough: https://www.quantacosmos.com/2024/06/rag-retrieval-augmented-generation-llm.html
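Below is a minimal sketch of the retrieve → augment → generate loop with FAISS. The embedding model (all-MiniLM-L6-v2), the sample chunks, and the prompt template are illustrative assumptions, not taken from the linked walkthrough; the augmented prompt would then be sent to Llama-3 for generation.

# Minimal RAG sketch: embed chunks, index them with FAISS, retrieve, and augment the prompt.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

chunks = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "Llama-3 is a family of open-weight large language models.",
    "RAG augments a prompt with retrieved documents before generation.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")        # assumed embedding model
vectors = np.asarray(embedder.encode(chunks), dtype="float32")

index = faiss.IndexFlatL2(vectors.shape[1])               # exact L2 index over chunk embeddings
index.add(vectors)

query = "What does RAG do?"
q_vec = np.asarray(embedder.encode([query]), dtype="float32")
_, ids = index.search(q_vec, 2)                           # retrieve the top-2 chunks

context = "\n".join(chunks[i] for i in ids[0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"
# `prompt` is what would be passed to Llama-3 (e.g., via llama-cpp-python) for generation.
print(prompt)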
Why It Matters: Basic RAG struggles with complex queries. These methods add structure to retrieval.
2.1. Graph-Based RAG (GraphRAG)
Part 1: Theory
Content:
Represent documents as knowledge graphs (entities + relationships).
Use graph traversal for context-aware retrieval.
Code Walkthrough: https://www.quantacosmos.com/2024/06/rag-retrieval-augmented-generation-llm.html
Part 2: Implementation
Content:
Offline demo with Llama-3 and NetworkX for graph operations.
Querying subgraphs for precise context extraction.
Video: https://youtu.be/pbhRFZwmOvU?si=7lXozQwyxkacZ4We
Code Walkthrough: https://www.quantacosmos.com/2024/06/rag-retrieval-augmented-generation-llm.html
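A minimal sketch of the graph side of GraphRAG, assuming NetworkX: entities and relations extracted from documents become nodes and edges, and an ego subgraph around the query entity is serialized as context. The entities and relations below are made up for illustration.

# GraphRAG sketch: represent documents as (entity, relation, entity) triples and
# retrieve a local subgraph around the query entity as prompt context.
import networkx as nx

G = nx.DiGraph()
G.add_edge("Company A", "Company B", relation="acquired")
G.add_edge("Company B", "Product X", relation="develops")
G.add_edge("Company A", "Country Y", relation="headquartered_in")

def subgraph_context(graph, entity, hops=2):
    """Collect triples within `hops` of the entity for prompt augmentation."""
    sub = nx.ego_graph(graph, entity, radius=hops)
    return [f"{u} --{d['relation']}--> {v}" for u, v, d in sub.edges(data=True)]

for triple in subgraph_context(G, "Company A"):
    print(triple)
# The resulting triples are serialized into the LLM prompt instead of raw text chunks.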
2.2. Knowledge Hypergraphs
Content:
Extend graphs to n-ary relationships (e.g., "Company A acquires Company B for $X in Year Y").
Demo: Storing hyperedges in a graph database (e.g., Neo4j).
Video: https://youtu.be/SPt5O3rpHIo?si=VZuPc_y_Pfs5K0_o
Code Walkthrough: https://www.quantacosmos.com/2024/06/knowledge-hyper-graph-with-llm-rag.html
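One common way to store an n-ary fact is to reify it as a dedicated hyperedge node linked to every participant with a role. The sketch below uses NetworkX purely for illustration; in Neo4j the same pattern would be a dedicated node (e.g., an Acquisition node) with one relationship per role. The placeholder values come from the example above.

# Knowledge-hypergraph sketch: an n-ary fact ("Company A acquires Company B for $X in Year Y")
# is reified as a single hyperedge node linked to every participant with a role.
import networkx as nx

H = nx.Graph()
H.add_node("acq_1", kind="hyperedge", relation="acquisition")
participants = [
    ("Company A", "acquirer"),
    ("Company B", "target"),
    ("$X", "price"),
    ("Year Y", "year"),
]
for entity, role in participants:
    H.add_node(entity, kind="entity")
    H.add_edge("acq_1", entity, role=role)

# Reconstruct the full n-ary fact from the hyperedge node.
fact = {H.edges["acq_1", e]["role"]: e for e in H.neighbors("acq_1")}
print(fact)   # {'acquirer': 'Company A', 'target': 'Company B', 'price': '$X', 'year': 'Year Y'}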
2.3. Zero-Shot & One-Shot RAG
Zero-Shot:
Content: Answer queries without task-specific training (e.g., "Explain quantum physics to a 5-year-old").
Code Walkthrough: https://www.quantacosmos.com/2024/06/zero-shot-llm-rag-with-knowledge-graph.html
One-Shot:
Content: Adapt to custom tasks with a single example (e.g., "Generate a sales email using this template").
Video: https://youtu.be/AusPKVSkvGI?si=OICT124ec2_LRUT8
Code Walkthrough: https://www.quantacosmos.com/2024/06/one-shot-llm-rag-with-knowledge-graph.html
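The difference between the two modes is mostly in the prompt: a zero-shot prompt sends only the retrieved context and the task, while a one-shot prompt prepends a single worked example. The templates below are illustrative assumptions, not the exact prompts used in the linked posts.

# Prompt construction sketch: zero-shot vs. one-shot on top of the same retrieved context.
def zero_shot_prompt(context: str, task: str) -> str:
    return f"Context:\n{context}\n\nTask: {task}\nAnswer:"

def one_shot_prompt(context: str, task: str, example_in: str, example_out: str) -> str:
    return (
        f"Example task: {example_in}\nExample answer: {example_out}\n\n"
        f"Context:\n{context}\n\nTask: {task}\nAnswer:"
    )

ctx = "Retrieved facts about the customer and product go here."
print(zero_shot_prompt(ctx, "Explain quantum physics to a 5-year-old."))
print(one_shot_prompt(ctx, "Write a sales email for Product X.",
                      "Write a sales email for Product Z.",
                      "Hi <name>, here is why Product Z saves you time..."))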
Customize LLMs for Domain-Specific Tasks
Why It Matters: Full fine-tuning is resource-heavy. PEFT methods reduce costs while retaining performance.
3.1.1. LoRA (Low-Rank Adaptation)
Content:
Inject low-rank matrices into transformer layers.
Mathematical intuition behind rank reduction (SVD analogy).
Code Walkthrough: https://www.quantacosmos.com/2024/06/lora-qlora-and-fine-tuning-large.html
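A small numerical sketch of the intuition: instead of updating the full d×k weight matrix, LoRA learns a rank-r update ΔW = (α/r)·B·A, so only d·r + r·k parameters are trainable. The sizes below are arbitrary.

# LoRA intuition: the update Delta_W = B @ A has rank r << min(d, k),
# so only (d*r + r*k) parameters are trained instead of d*k.
import numpy as np

d, k, r = 1024, 1024, 8
W = np.random.randn(d, k) * 0.02       # frozen pretrained weight
A = np.random.randn(r, k) * 0.01       # trainable, r x k
B = np.zeros((d, r))                   # trainable, d x r (zero-init so Delta_W starts at 0)
alpha = 16

delta_W = (alpha / r) * (B @ A)        # low-rank update, rank <= r
W_adapted = W + delta_W

full_params = d * k
lora_params = d * r + r * k
print(f"full fine-tune params: {full_params:,}, LoRA params: {lora_params:,} "
      f"({lora_params / full_params:.2%})")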
3.1.2. QLoRA (Quantized LoRA)
Content:
4-bit quantization + LoRA for memory-efficient training.
Benchmark comparisons: QLoRA vs. LoRA vs. full fine-tuning.
Video: https://youtu.be/24Px6Gr5uiQ?si=VCdldpU84genKJUo
Code Walkthrough: https://www.quantacosmos.com/2024/06/lora-qlora-and-fine-tuning-large.html
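A hedged sketch of the usual Hugging Face recipe for QLoRA: load the frozen base model in 4-bit NF4 via bitsandbytes, then attach LoRA adapters with peft. The model id and hyperparameters are example choices, not necessarily those used in the linked walkthrough.

# QLoRA sketch: 4-bit quantized base model + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 even though weights are 4-bit
    bnb_4bit_use_double_quant=True,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",           # example model id (gated on Hugging Face)
    quantization_config=bnb_config,
    device_map="auto",
)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()          # only the LoRA adapters are trainable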
3.1.3. DORA (Dynamic Low-Rank Adaptation)
Content:
Automatically adjust the rank of LoRA matrices during training.
When to prefer DORA over static LoRA.
Video: https://youtu.be/PAalu1hKTy4?si=QOr_c1MeR8SHRygA
Code Walkthrough: https://www.quantacosmos.com/2024/07/finetune-large-language-models-with.html
For High-Resource Scenarios
Fine-Tuning Llama-3 Locally
Content:
Hardware Setup: GPU/CPU requirements, RAM optimization.
Data preparation: Formatting instruction datasets (e.g., Alpaca-style).
Code: Training loops, checkpointing, and evaluation.
Video: https://www.youtube.com/watch?v=H1x7Y-6B6Y0
Code Walkthrough: https://www.quantacosmos.com/2024/06/fine-tune-pretrained-large-language.html
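A small sketch of the data-preparation step: turning one Alpaca-style record (instruction / input / output) into the single prompt string the tokenizer sees during training. The record is made up; the template wording follows the commonly used Alpaca format.

# Alpaca-style instruction formatting: one JSON record -> one training prompt string.
record = {
    "instruction": "Summarize the following support ticket in one sentence.",
    "input": "Customer reports the mobile app crashes when uploading photos.",
    "output": "The mobile app crashes during photo uploads.",
}

def format_alpaca(rec: dict) -> str:
    if rec.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{rec['instruction']}\n\n"
            f"### Input:\n{rec['input']}\n\n"
            f"### Response:\n{rec['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{rec['instruction']}\n\n### Response:\n{rec['output']}"
    )

print(format_alpaca(record))   # this string is what the tokenizer sees during fine-tuning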
Innovate with Hybrid AI Systems
Why It Matters: Combine neural networks’ pattern recognition with symbolic logic’s reasoning.
Algorithmic Trading Case Study
Content:
Symbolic Component: Rule-based market indicators (e.g., moving averages).
Neural Component: LLM analyzing news sentiment.
Fusion: Decision engine balancing both inputs.
Video: https://youtu.be/5qEXCxsV4Og?si=3tenzF8wDtcZQohE
Code Walkthrough: https://www.quantacosmos.com/2025/02/enhancing-algorithmic-trading-with.html
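A toy sketch of the fusion idea: a rule-based moving-average crossover signal and an LLM-derived sentiment score are combined by a weighted decision rule. Prices, weights, thresholds, and the hard-coded sentiment value are placeholders; in the full system the sentiment comes from an LLM reading news.

# Hybrid decision sketch: symbolic (moving-average crossover) + neural (news sentiment).
import numpy as np

prices = np.array([100, 101, 103, 102, 105, 107, 106, 109, 111, 110], dtype=float)

def sma(x, window):
    return x[-window:].mean()

symbolic_signal = 1.0 if sma(prices, 3) > sma(prices, 8) else -1.0   # rule-based: +1 buy, -1 sell

# In the full system this score comes from an LLM reading news headlines;
# here it is a hard-coded stand-in in [-1, 1].
llm_sentiment = 0.4

w_rule, w_news = 0.6, 0.4
fused = w_rule * symbolic_signal + w_news * llm_sentiment

decision = "BUY" if fused > 0.2 else "SELL" if fused < -0.2 else "HOLD"
print(f"fused score = {fused:+.2f} -> {decision}")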
Why It Matters: Quantization shrinks models enough to deploy LLMs on edge devices (e.g., laptops, phones).
Quantization Basics
Content:
8-bit vs. 4-bit precision tradeoffs.
Tools: GGUF, bitsandbytes, and llama.cpp.
Video: https://youtu.be/yNNNfFiuKAI?si=9fBEj3EXIRw2_52a
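The precision tradeoff in one small example: symmetric int8 quantization of a few weights and the reconstruction error it introduces. 4-bit formats apply the same idea with only 16 levels, trading more error for another 2x memory saving.

# Quantization basics: symmetric int8 round-trip and its reconstruction error.
import numpy as np

w = np.random.randn(8).astype(np.float32)        # a few "weights"

scale = np.abs(w).max() / 127.0                  # map the largest weight to +/-127
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale             # dequantized weights used at inference

print("original :", np.round(w, 4))
print("int8     :", q)
print("restored :", np.round(w_hat, 4))
print("max error:", np.abs(w - w_hat).max())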
Hands-On Support for Real-World Projects
Why It Matters: Avoid cloud costs and privacy risks by running models offline.
Step-by-Step Setup
Content:
Downloading Llama-3 weights (via Hugging Face or direct links).
Using llama-cpp-python for CPU inference.
Optimizing inference speed with Metal (Mac) or CUDA (NVIDIA).
Video Guide: https://youtu.be/AaoxeuQD-Sg?si=ijxRbynG2B98nvt3
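A minimal sketch of offline inference with llama-cpp-python; the GGUF file path is a placeholder for whatever quantized checkpoint you downloaded, and n_gpu_layers only has an effect when the package is built with Metal or CUDA support.

# Offline inference sketch with llama-cpp-python and a quantized GGUF checkpoint.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path to a downloaded GGUF file
    n_ctx=4096,        # context window
    n_threads=8,       # CPU threads for inference
    n_gpu_layers=0,    # >0 (or -1 for all) offloads layers to Metal/CUDA if compiled with GPU support
)

out = llm(
    "Q: What is retrieval-augmented generation?\nA:",
    max_tokens=128,
    stop=["Q:"],
    temperature=0.2,
)
print(out["choices"][0]["text"].strip())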
Contains:
How LLM agents “negotiate” on every patient case
Attention-based neural policy networks making smart choices
Live reward feedback and policy updates (REINFORCE)
A real Python code walkthrough (Groq API, Keras 3, Sentence Transformers)
Insights into how AI agents learn to make decisions with multiple advisors
YouTube Video: https://www.youtube.com/watch?v=YkyoJFmnaco&t=1s
Please find the code at: https://www.quantacosmos.com/2025/07/next-gen-ai-multi-agent-llms-and-policy.html
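A stripped-down sketch of the REINFORCE idea used to pick among advisors: a softmax policy over K advisor agents, one sampled choice per case, and a log-probability gradient scaled by the reward. The toy reward function and sizes are stand-ins, not the Groq/Keras pipeline from the post.

# REINFORCE sketch: learn a softmax policy over K "advisor" agents from scalar rewards.
import numpy as np

rng = np.random.default_rng(0)
K = 3                              # number of advisor agents
logits = np.zeros(K)               # policy parameters (one logit per advisor)
lr = 0.5

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reward(action):                # toy environment: advisor 2 is usually best
    return 1.0 if action == 2 else (0.3 if rng.random() < 0.3 else 0.0)

for step in range(200):
    probs = softmax(logits)
    a = rng.choice(K, p=probs)                 # sample an advisor for this case
    r = reward(a)
    grad_log_pi = -probs                       # d log pi(a) / d logits = onehot(a) - probs
    grad_log_pi[a] += 1.0
    logits += lr * r * grad_log_pi             # REINFORCE update (no baseline)

print("learned policy:", np.round(softmax(logits), 3))   # mass concentrates on advisor 2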
Contains:
How semantic, structural, and entropy features reveal AI text
Why LLM meta-classification (self-consistency voting) beats single-method detection
How adversarial tricks try to fool detectors—and how to spot them
YouTube Video Link: https://www.youtube.com/watch?v=V1FsQgVpPqo&t=194s
The working code is available at: https://www.quantacosmos.com/2025/07/ai-vs-human-writing-robust-hybrid-llm.html
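A toy sketch of two of the ideas above: a word-entropy feature and simple majority (self-consistency style) voting across several weak detectors. The thresholds and heuristics are invented for illustration and are far weaker than the detectors discussed in the post.

# Detection sketch: one entropy-style feature plus majority voting over simple detectors.
import math
from collections import Counter

def word_entropy(text: str) -> float:
    words = text.lower().split()
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def detector_entropy(text):      # low lexical entropy -> flag as AI-like (toy threshold)
    return word_entropy(text) < 3.5

def detector_length(text):      # very uniform sentence lengths -> AI-like (toy heuristic)
    lengths = [len(s.split()) for s in text.split(".") if s.strip()]
    return len(lengths) > 1 and max(lengths) - min(lengths) <= 2

def detector_repetition(text):  # many repeated words -> AI-like (toy heuristic)
    words = text.lower().split()
    return len(set(words)) / max(len(words), 1) < 0.6

def meta_classify(text) -> str:
    votes = [detector_entropy(text), detector_length(text), detector_repetition(text)]
    return "AI-generated" if sum(votes) >= 2 else "human-written"

print(meta_classify("The model processes the input. The model returns the output. The model is fast."))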
Contains:
What is MAMBA? Understand the motivation and theory behind Mamba, a selective state-space model (SSM) architecture with multiplicative (gated) branches.
Core Building Blocks: See how Causal 1D Convolution (local mixing), State Evolution (SSM core), and Multiplicative Bias (gating) fit together.
Deep Learning Architecture: Explore how MAMBA layers are stacked to create powerful, scalable models for text, time series, and beyond.
Complete Example: Follow a step-by-step walk-through with a small matrix example, seeing how each input is transformed as it passes through the model.
Part-1: YouTube Video Link: https://youtu.be/lDa1_k7An7A?si=mZKJheGnUIi-O_Qc
Part-2: YouTube Video Link: https://youtu.be/-zpMjbvN_QE?si=XQcuSi0OGKfK9-Je
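A tiny worked example of the three building blocks in order, using small fixed matrices so each step can be checked by hand: a causal 1-D convolution for local mixing, the state-space recurrence h_t = A·h_{t-1} + B·x_t with y_t = C·h_t, and a multiplicative (sigmoid) gate. This is a simplified, non-selective sketch of the mechanism, not a full Mamba layer.

# Mamba-style block sketch: causal conv -> state-space recurrence -> multiplicative gate.
import numpy as np

x = np.array([1.0, 2.0, 0.5, -1.0])          # a toy 1-D input sequence

# 1) Causal 1-D convolution (each position sees only itself and the previous token).
kernel = [0.6, 0.4]                           # [current, previous]
x_conv = np.array([kernel[0] * x[t] + kernel[1] * (x[t - 1] if t > 0 else 0.0)
                   for t in range(len(x))])

# 2) State-space recurrence: h_t = A h_{t-1} + B x_t,  y_t = C h_t  (state size 2).
A = np.array([[0.9, 0.0], [0.1, 0.8]])
B = np.array([1.0, 0.5])
C = np.array([0.7, 0.3])
h = np.zeros(2)
ssm_out = []
for u in x_conv:
    h = A @ h + B * u
    ssm_out.append(C @ h)
ssm_out = np.array(ssm_out)

# 3) Multiplicative bias / gating: blend the SSM output with a sigmoid gate on the raw input.
gate = 1.0 / (1.0 + np.exp(-x))
y = gate * ssm_out
print(np.round(y, 4))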
Contains:
A detailed Mathematical Description
YouTube Video Link: https://www.youtube.com/watch?v=KNoW9E-TDU8
Contains:
Why the Vanilla Transformer Does Not Support Long-Range Extrapolation.
Basics of the Current Approaches.
YouTube Video Link: https://youtu.be/q2otBk4Wcx8?si=8LMVy5j0S9K2vN_9
Contains:
The Attention with Linear Biases (ALiBi) Algorithm.
Discussion & Future Techniques.
YouTube Video Link: https://youtu.be/I04hB_QAjFU?si=9HqMktBBvEucXiUd
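A small sketch of the ALiBi bias: a per-head linear penalty proportional to the query-key distance is added to the attention logits before softmax, with no positional embeddings at all. The geometric slope schedule below follows the paper's recipe for a power-of-two number of heads.

# ALiBi sketch: distance-proportional bias added to attention logits (no positional embeddings).
import numpy as np

def alibi_slopes(n_heads):                     # geometric slopes, assuming n_heads is a power of 2
    return np.array([2 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])

def alibi_bias(seq_len, n_heads):
    pos = np.arange(seq_len)
    dist = pos[None, :] - pos[:, None]         # j - i (negative when looking at past tokens)
    dist = np.minimum(dist, 0)                 # causal: only penalize looking back
    return alibi_slopes(n_heads)[:, None, None] * dist[None, :, :]

bias = alibi_bias(seq_len=5, n_heads=4)        # shape (heads, query, key)
print(np.round(bias[0], 3))
# scores = (Q @ K.T) / sqrt(d) + bias[h]  -> distant keys get a larger negative penalty,
# which is why ALiBi extrapolates to sequences longer than those seen in training.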
Contains:
Key Issues with Traditional Positional Encoding.
The RoFormer (Rotary Position Embedding) Algorithm.
Future Research Directions.
YouTube Video Link: https://youtu.be/5WhQecvWX7U?si=OYF5XBEdjrSLZGA8
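A minimal sketch of rotary position embedding: each (even, odd) pair of dimensions in a query or key vector is rotated by an angle that grows with the token position, so the dot product depends only on the relative offset between positions. The vectors are made up; the base of 10000 follows the RoFormer default.

# RoPE sketch: rotate (even, odd) dimension pairs of a query/key vector by position-dependent angles.
import numpy as np

def rope(vec, position, base=10000.0):
    d = len(vec)
    out = vec.astype(float).copy()
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)    # rotation frequency falls off with dimension index
        c, s = np.cos(theta), np.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out[i]     = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out

q = np.array([1.0, 0.0, 0.5, 0.5])
k = np.array([0.2, 0.8, 0.1, 0.3])

# The score q_m . k_n depends only on the relative offset (m - n), not on m and n separately.
print(np.dot(rope(q, 3), rope(k, 1)))
print(np.dot(rope(q, 12), rope(k, 10)))        # same offset of 2 -> same score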
Contains:
Unlocking the Power of LLMs: Why Quantization Matters and How to Master It Effectively
YouTube Video Link: https://youtu.be/yNNNfFiuKAI?si=idKML0y-6R0KwJf4
Contains:
Learn how to go from input text to query, key, and value vectors and, ultimately, to the final context vector.
YouTube Video Link: https://youtu.be/L_bBglaRPfo?si=N7r_JvBzzz2J388D
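A compact numerical sketch of that path: token embeddings are projected to Q, K, and V, the scaled scores are softmaxed, and the weighted sum of V gives one context vector per token. All matrices are random placeholders.

# Scaled dot-product self-attention sketch: embeddings -> Q, K, V -> context vectors.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

X  = rng.normal(size=(seq_len, d_model))       # token embeddings (stand-ins for real ones)
Wq = rng.normal(size=(d_model, d_k)) * 0.1
Wk = rng.normal(size=(d_model, d_k)) * 0.1
Wv = rng.normal(size=(d_model, d_k)) * 0.1

Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d_k)                # similarity of every query with every key
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
context = weights @ V                          # each row is one token's context vector

print(weights.round(3))                        # attention map, rows sum to 1
print(context.shape)                           # (4, 8)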
Content Summary:
Cross-attention is a mechanism in deep learning, particularly in Transformer models, that allows one sequence of data (query) to attend to another sequence (key-value pairs) dynamically. Unlike self-attention, which models dependencies within the same sequence, cross-attention enables interaction between different modalities, sources, or input types.
YouTube Video Link: https://youtu.be/WfJ8waoakeQ?si=YzfJ75HZDLCcP9iX
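The only structural change from self-attention, sketched below: queries come from one sequence (e.g., decoder states) while keys and values come from another (e.g., encoder states), so the attention map is rectangular. Sizes are arbitrary.

# Cross-attention sketch: queries from sequence A attend to keys/values from sequence B.
import numpy as np

rng = np.random.default_rng(1)
len_a, len_b, d = 3, 5, 8                      # decoder length, encoder length, model width

A = rng.normal(size=(len_a, d))                # e.g. decoder hidden states (queries)
B = rng.normal(size=(len_b, d))                # e.g. encoder hidden states (keys/values)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

Q, K, V = A @ Wq, B @ Wk, B @ Wv
scores = Q @ K.T / np.sqrt(d)                  # shape (len_a, len_b): A attends over B
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
context = weights @ V                          # one context vector per query token in A

print(weights.shape, context.shape)            # (3, 5) (3, 8)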
Content Summary:
Sparse attention is an optimization technique used in Transformer models to reduce computational complexity by selectively attending to a subset of input tokens instead of all tokens. Unlike full self-attention, which scales quadratically with sequence length, sparse attention introduces structured sparsity in the attention matrix, making it more efficient for long sequences.
YouTube Video Link: https://youtu.be/fWto5Ozpjsc?si=YYl-cAIT7P5TnWQ6
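A toy sliding-window variant of the idea: each token may attend only to keys within a fixed window, so most of the score matrix is masked out and cost grows roughly linearly with sequence length. The window size is arbitrary; real sparse-attention schemes add global or strided patterns on top.

# Sparse (sliding-window) attention sketch: mask out scores beyond a local window before softmax.
import numpy as np

rng = np.random.default_rng(2)
seq_len, d, window = 8, 16, 2                  # each token sees at most `window` neighbors per side

Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))

scores = Q @ K.T / np.sqrt(d)
i, j = np.indices((seq_len, seq_len))
mask = np.abs(i - j) > window                  # True where attention is NOT allowed
scores = np.where(mask, -1e9, scores)          # masked positions get ~zero softmax weight

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
context = weights @ V

print((weights > 1e-6).sum(axis=-1))           # each row attends to at most 2*window+1 tokens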
Content Summary:
Gated Self-Attention is a mechanism that blends the original input with the attention output using a learnable gate to control how much influence the attention mechanism has. It's used to prevent misleading information from dominating, which is especially useful for long texts, and it enables dynamic trust control over each feature. This approach is beneficial for tasks like sentiment analysis and multi-modal reasoning.
YouTube Video Link: https://youtu.be/MkdFMvrlyIc?si=sE9S63sIxCvcuRml
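A minimal sketch of the gate described above, assuming a simple formulation: a learnable sigmoid gate computed from the input and the attention output decides, per feature, how much of each to keep. The attention output here is just a stand-in vector.

# Gated self-attention sketch: out = g * attention_output + (1 - g) * x, with g = sigmoid(W [x; attn]).
import numpy as np

rng = np.random.default_rng(3)
d = 6

x        = rng.normal(size=d)                  # original token representation
attn_out = rng.normal(size=d)                  # stand-in for the self-attention output for this token

W_gate = rng.normal(size=(d, 2 * d)) * 0.1     # learnable gate parameters
b_gate = np.zeros(d)

gate = 1.0 / (1.0 + np.exp(-(W_gate @ np.concatenate([x, attn_out]) + b_gate)))
out  = gate * attn_out + (1.0 - gate) * x      # per-feature blend: gate ~1 trusts attention, ~0 keeps x

print(np.round(gate, 3))
print(np.round(out, 3))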
Content Summary:
Want to know what makes the latest AI models faster, smarter, and more efficient? The answer is Mixture of Experts (MoE) — a powerful mechanism that routes each input to specialized mini-networks (experts) instead of using one giant model. In this video, we break down MoE with simple examples, explain how it outperforms traditional attention, and show why it's shaping the future of scalable and modular AI — including Agentic LLMs and intelligent assistants.
YouTube Video Link: https://youtu.be/s0gdPL4wZ4Q?si=oa2_xPTNAEBqkKW0
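A minimal top-2 routing sketch: a gating network scores the experts, the two best experts process the token, and their outputs are combined with renormalized gate weights. The experts are tiny random MLPs; real MoE layers add load-balancing losses and batched routing.

# Mixture-of-Experts sketch: top-2 routing of one token over four tiny expert MLPs.
import numpy as np

rng = np.random.default_rng(4)
d, n_experts, top_k = 8, 4, 2

x = rng.normal(size=d)                                   # one token representation
W_gate = rng.normal(size=(n_experts, d)) * 0.1           # router / gating network
experts = [(rng.normal(size=(d, d)) * 0.1,               # each expert: W1, W2 of a small MLP
            rng.normal(size=(d, d)) * 0.1) for _ in range(n_experts)]

def expert_forward(idx, h):
    W1, W2 = experts[idx]
    return W2 @ np.maximum(W1 @ h, 0.0)                  # ReLU MLP

gate_logits = W_gate @ x
top = np.argsort(gate_logits)[-top_k:]                   # indices of the 2 best experts
top_w = np.exp(gate_logits[top]); top_w /= top_w.sum()   # softmax over the selected experts only

y = sum(w * expert_forward(i, x) for w, i in zip(top_w, top))
print("routed to experts", top.tolist(), "with weights", np.round(top_w, 3))
print(np.round(y, 3))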
Contains:
In this Tutorial, we dive deep into one of the most powerful innovations behind modern large language models (LLMs) — the KV Cache, and its cutting-edge evolutions: Paged KV Cache and KV-AdaQuant.
YouTube Video Link: https://youtu.be/fslTpSJ0Ops?si=KiIh0BG9o_7qs8Eb
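A minimal sketch of why the cache helps: during autoregressive decoding, keys and values of already-processed tokens are stored and reused, so each new step only computes projections for the newest token. Paged KV Cache and KV-AdaQuant then manage and compress exactly this growing cache; they are not shown here.

# KV-cache sketch: reuse cached keys/values so each decode step only processes the new token.
import numpy as np

rng = np.random.default_rng(5)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

k_cache, v_cache = [], []                       # grows by one entry per generated token

def decode_step(new_token_vec):
    q = Wq @ new_token_vec
    k_cache.append(Wk @ new_token_vec)          # only the NEW token's K and V are computed
    v_cache.append(Wv @ new_token_vec)
    K = np.stack(k_cache)                       # (t, d) keys for all tokens so far
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ V                                # context vector for the newest position

for t in range(4):                              # simulate 4 decode steps
    ctx = decode_step(rng.normal(size=d))
    print(f"step {t}: cache length = {len(k_cache)}, context vector shape = {ctx.shape}")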