About Me

Rakshith Mathad

I am a Software Engineer specializing in LLMs, Generative AI, and Data Science, currently working at CVS Health (Aetna Insurance) in New York City. With a strong background in AI, Machine Learning, and Software Engineering, I enjoy building large, scalable AI and LLM systems, and I have a keen interest in inference optimization, GPUs, HPC, and parallelism. I hold an MS in Applied Data Science from the University of Southern California and have extensive experience developing and deploying AI systems that scale to millions of users.

Technical Skills

Languages: Python, SQL, C, C++, CUDA (Intermediate), OpenMP (Intermediate)

Tools & Technologies: Analytics, Deep Learning, Machine Learning, A/B Testing, GCP Vertex AI, AWS, Kubernetes (K8s), Docker, Git, Informatica ETL, Web Scraping/Automation, AI Research, Parallel Programming, NumPy, Pandas, scikit-learn, LangChain, NVIDIA DGX Server, Nsight Profiling, NVIDIA NeMo, LLMOps, RAG, VectorDBs, LLM Training and Inference techniques, HuggingFace, FastAPI, RESTful API, Responsible AI, Linux, MongoDB, Selenium, Hadoop HDFS, PySpark, Power BI/Tableau, LLMs, Generative AI, MapReduce, Azure, Google Cloud Platform, BigQuery, HiveQL, Horovod, ArcGIS, MLOps, Jenkins CI/CD, NLP, Forecasting, Unsupervised ML, Generative Models

My Blogs

Token of Thoughts: Inference Optimization and Serving of LLMs on GPUs

An in-depth exploration of LLM inference optimization techniques and GPU serving strategies for large language models.

Read Article

Token of Thoughts — LLM Serving Techniques in Production

Inside AI data centers: what happens when an LLM request hits a GPU cluster—queuing, batching, distributed execution, and inference across NVIDIA-dominated hardware/software stacks.

Read Article

Experience

Software Engineer - Generative AI

CVS Health, New York City, NY

June 2024 - Present

  • Building a large-scale RAG and rule-based FastAPI chat application on GCP Google AI Platform for Aetna Insurance's Conversational AI customer service team, with high impact on call center customer service
  • Implemented asynchronous, multithreaded, parallel semantic search using OpenAI GPT PTUs and Gemini LLMs, Feature Store, and Matching Engine, improving latency and caching strategy
  • Scaled a production AI system to ~20,000 req/hr, impacting 300k customers daily, using Kubernetes (GKE) and Vertex AI. The system has handled 200+ million requests per year on average
  • Performed complex data ingestion and preprocessing on GCP BigQuery and BigTable, including chunking and embedding strategies
  • Worked with Airflow, K8s, Jenkins CI/CD, LLM Evaluation framework, feature flags, and AI safety guardrails
  • Focus areas: AI scalability, retrieval, inference, ingestion, and preprocessing
  • High-volume, low-latency Generative AI Tier 1 system

Data Engineer Intern

AEG Entertainment Group, Los Angeles, CA

February 2024 - May 2024

  • Built complex Azure Data Factory pipelines for large data processing
  • Implemented PySpark in Databricks for batch API processing
  • Handled Dynamics CRM data migration via OData REST APIs
  • Managed Parquet/Avro for streaming and financial data reporting

Analytics Engineering Intern

CVS Health, New York City, NY

May 2023 - August 2023

  • Optimized Hadoop HDFS big data pipelines, processing 100M+ rows with HiveQL
  • Reduced latency by 40% through optimization techniques
  • Designed scalable ETL workflows and customer campaign strategies
  • Leveraged Tableau, advanced OLAP, data blending, and predictive modeling

AI Software Engineer Intern

AlphaICs Corporation, India

January 2022 - July 2022

  • Developed deep learning applications for Object Detection and Visual Attention
  • Implemented Recommendation Systems using Matrix Factorization and Collaborative Filtering
  • Optimized AI inference on a custom AI processor

AI Research Intern

Samsung R&D Institute, India

November 2020 - June 2021

  • Led a team of 3 on state-of-the-art generative AI research in PyTorch
  • Implemented a conditional GAN with spatially adaptive normalization for image manipulation
  • Performed semantic segmentation with DeepLab-V2 to curate a 20k-image dataset
  • Trained generative models on an NVIDIA DGX cluster

Education

Master of Science in Applied Data Science

University of Southern California, Los Angeles

August 2022 - May 2024

GPA: 3.7/4.0

Coursework: Machine Learning for Data Science and AI, Applications of Data Mining, Predictive Analytics, Fairness, Security and Privacy in AI

Bachelor of Engineering in Computer Science

KLE Technological University, India

August 2018 - June 2022

GPA: 3.95/4.0

Coursework: Data Structures and Algorithms, Data Mining, Machine Learning, Distributed & High-Performance Computing, Cloud Computing

Projects & Certifications

Parallelism for LLM Inference on GPUs

  • Deployed and served a BERT Transformer on an RTX GPU using PyTorch, ONNX, and NVIDIA TensorRT runtimes
  • Profiled performance bottlenecks using NVIDIA Nsight
  • Built a CUDA kernel for optimized inference on a custom neural network
  • Deployed async TinyLlama using Ray Serve and FastAPI
  • Implemented FSDP/DDP training simulations using Ray/DeepSpeed

Simple CUDA + OpenMP Hybrid Vector Addition Project

  • Working on a high-performance CUDA + OpenMP hybrid vector addition project that combines CPU and GPU parallelism
  • The implementation achieves an 8.8x performance improvement over a single CUDA stream by using 4 OpenMP threads, each driving its own CUDA stream, to overlap memory transfers and balance load

RAG-based LLM with Guardrails

  • Built a RAG-based chatbot using LangChain and Llama 2
  • Integrated Pinecone VectorDB and a HuggingFace Sentence Transformer
  • Implemented NVIDIA NeMo Guardrails and LoRA LLM fine-tuning
  • Developed an end-to-end pipeline for LoRA fine-tuning and quantization

Research: Distributed Deep Learning

  • Published a paper, "Performance Analysis of Distributed Deep Learning using Horovod on Image Classification", at IEEE ICICCS
  • Developed privacy-preserving ML techniques for Federated Learning systems
  • Implemented anomaly detection for adversarial client nodes
  • Applied Homomorphic Encryption in distributed learning
View Publication | Google Scholar

Certifications

NVIDIA CUDA Computing

View Certificate

Juniper Networks Certified Associate, Junos (JNCIA-Junos)

View Certificate

Playground

Good Reads

Understanding and Coding the KV Cache in LLMs from Scratch — A practical guide to KV caches for efficient LLM inference. Read Article

Understanding LLM System with 3-layer Abstraction — A three-layer model to reason about large language model systems. Read Article

Building

GPT3 Implementation from Scratch — Implementing core transformer components (multi-head attention, FFN) to deeply understand the architecture. View on GitHub

Contact Me