Rakshith Mathad - Portfolio

About Me

I am a Software Engineer specializing in LLMs, Generative AI, and Data Science, currently working at CVS Health (Aetna Insurance) in New York City. I am deeply curious and enjoy building large, scalable AI and LLM systems, with a keen interest in inference optimization, GPUs, HPC, and parallelism. I have a strong background in AI, Machine Learning, and Software Engineering, and I hold an MS in Applied Data Science from the University of Southern California. I have extensive experience developing and deploying AI systems that reach and scale to millions of users.

Technical Skills

Languages: Python, SQL, C, C++, CUDA (Intermediate), OpenMP (Intermediate)

Tools & Technologies: Analytics, Deep Learning, Machine Learning, A/B Testing, GCP Vertex AI, AWS, Kubernetes (K8s), Docker, Git, Informatica ETL, Web Scraping/Automation, AI Research, Parallel Programming, NumPy, Pandas, scikit-learn, LangChain, NVIDIA DGX Server, Nsight Profiling, NVIDIA NeMo, LLMOps, RAG, VectorDBs, LLM Training and Inference techniques, Hugging Face, FastAPI, RESTful APIs, Responsible AI, Linux, MongoDB, Selenium, Hadoop HDFS, PySpark, Power BI/Tableau, LLMs, Generative AI, MapReduce, Azure, Google Cloud Platform, BigQuery, HiveQL, Horovod, ArcGIS, MLOps, Jenkins CI/CD, NLP, Forecasting, Unsupervised ML, Generative Models

Download Resume

My Blogs

Token of Thoughts: Inference Optimization and Serving of LLMs on GPUs

An in-depth exploration of LLM inference optimization techniques and GPU serving strategies for large language models.

Read Article

Token of Thoughts — LLM Serving Techniques in Production

Inside AI data centers: what happens when an LLM request hits a GPU cluster—queuing, batching, distributed execution, and inference across NVIDIA-dominated hardware/software stacks.

Read Article

Experience

Software Engineer - Generative AI

CVS Health, New York City, NY

June 2024 - Present

Building a large-scale, complex RAG and rule-based FastAPI chat application on GCP (Google AI Platform) for Aetna Insurance's Conversational AI customer service team, with high impact on call center customer service
Implemented async/multithreaded/parallel semantic search using OpenAI GPT PTUs and Gemini LLMs, Feature Store and Matching Engine, improving latency and caching strategy
Scaled production AI system to ~20,000 req/hr, impacting 300k customers daily, using Kubernetes (GKE) and Vertex AI; the system has handled 200+ million calls/requests per year on average
Performed complex data ingestion and preprocessing on GCP BigQuery and Bigtable, including chunking and embedding strategies
Worked with Airflow, K8s, Jenkins CI/CD, LLM Evaluation framework, feature flags, and AI safety guardrails
AI Scalability, Retrieval, Inference, Ingestion and Preprocessing
High-volume, low-latency Generative AI Tier 1 system

Data Engineer Intern

AEG Entertainment Group, Los Angeles, CA

February 2024 - May 2024

Built complex Azure Data Factory pipelines for large data processing
Implemented PySpark in Databricks for batch API processing
Handled Dynamics CRM data migration via OData REST APIs
Managed Parquet/Avro for streaming and financial data reporting

Analytics Engineering Intern

CVS Health, New York City, NY

May 2023 - August 2023

Optimized Hadoop HDFS big data pipelines, processing 100M+ rows with HiveQL
Reduced latency by 40% through optimization techniques
Designed scalable ETL workflows and customer campaign strategies
Leveraged Tableau, advanced OLAP, data blending, and predictive modeling

AI Software Engineer Intern

AlphaICs Corporation, India

January 2022 - July 2022

Developed deep learning applications for Object Detection and Visual Attention
Implemented Recommendation Systems using Matrix Factorization and Collaborative Filtering
Optimized AI Inference on custom AI processor

AI Research Intern

Samsung R&D Institute, India

November 2020 - June 2021

Led a team of 3 on state-of-the-art Generative AI research in PyTorch
Implemented a Conditional GAN with spatially adaptive normalization for image manipulation
Performed semantic segmentation with DeepLab-V2 to curate a 20k-image dataset
Trained generative models on an NVIDIA DGX cluster

Education

Master of Science in Applied Data Science

University of Southern California, Los Angeles

August 2022 - May 2024

GPA: 3.7/4.0

Coursework: Machine Learning for Data Science and AI, Applications of Data Mining, Predictive Analytics, Fairness, Security and Privacy in AI

Bachelor of Engineering in Computer Science

KLE Technological University, India

August 2018 - June 2022

GPA: 3.95/4.0

Coursework: Data Structures and Algorithms, Data Mining, Machine Learning, Distributed & High-Performance Computing, Cloud Computing

Projects & Certifications

Parallelism for LLM Inference on GPUs

Deployed and served BERT Transformer on RTX GPU using PyTorch, ONNX, and NVIDIA TensorRT runtimes
Profiled performance bottlenecks using NVIDIA Nsight
Built a CUDA kernel for optimized inference on a custom neural network
Deployed async TinyLlama using Ray Serve and FastAPI
Implemented FSDP/DDP training simulations using Ray/DeepSpeed

View on GitHub

Simple CUDA + OpenMP Hybrid Vector Addition Project

Working on a high-performance CUDA + OpenMP hybrid vector addition project that combines CPU and GPU parallelism
The implementation achieves an 8.8x performance improvement over a single CUDA stream by using 4 OpenMP threads with parallel CUDA streams for overlapped memory transfers and optimal load balancing

View on GitHub

RAG-based LLM with Guardrails

Built RAG-based chatbot using LangChain and Llama 2
Integrated Pinecone VectorDB and HuggingFace Sentence Transformer
Implemented NVIDIA NeMo Guardrails and LoRA LLM fine-tuning
Developed an end-to-end pipeline for LoRA fine-tuning and quantization

Research: Distributed Deep Learning

Published paper on "Performance Analysis of Distributed Deep Learning using Horovod on Image Classification" at IEEE ICICCS
Developed privacy-preserving ML techniques for Federated Learning systems
Implemented anomaly detection for adversarial client nodes
Applied Homomorphic Encryption in distributed learning

View Publication Google Scholar

Certifications

NVIDIA CUDA Computing

View Certificate

Juniper Networks Certified Associate, Junos (JNCIA-Junos)

View Certificate

Playground

Good Reads

Understanding and Coding the KV Cache in LLMs from Scratch — A practical guide to KV caches for efficient LLM inference. Read Article

Understanding LLM System with 3-layer Abstraction — A three-layer model to reason about large language model systems. Read Article

Building

GPT3 Implementation from Scratch — Implementing core transformer components (multi-head attention, FFN) to deeply understand the architecture. View on GitHub

Contact Me

rakshith1262k@gmail.com mathad@usc.edu (213) 272-7811 LinkedIn GitHub Blog Google Scholar