Research & Projects
My work spans systems for AI, distributed computing, and cloud infrastructure. I focus on building production-grade systems that balance performance, reliability, and research rigor.
These projects range from research prototypes published in top venues to production systems handling real workloads at scale.
If you'd like to discuss any of these projects, reach out!
Heterogeneity-Aware LLM Routing (February 2025)
ILP-guided scheduling with telemetry feedback for mixed GPU fleets.
A simulator and scheduler for heterogeneous LLM inference clusters that reduces P99 latency by 30% across mixed GPU fleets through intelligent request routing.
The system uses Integer Linear Programming (ILP) to guide initial placement decisions and incorporates real-time telemetry feedback loops to dynamically adjust routing as cluster conditions change.
Designed for production environments where GPU clusters contain multiple hardware generations (A100, V100, T4) with varying performance characteristics.
Key features:
- ILP-based optimization for request placement
- Telemetry feedback loops for adaptive routing
- Support for heterogeneous GPU types
- 30% reduction in P99 latency
- Simulator for offline evaluation
Technologies: Python, PyTorch, CVXPY (ILP solver)
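The placement step can be illustrated with a toy model: assign each incoming request to a GPU type so that the worst-case service time (a proxy for tail latency) is minimized, subject to per-type capacity. The sketch below uses brute-force enumeration in place of the real CVXPY ILP solve, and the latency and capacity numbers are illustrative, not measured:

```python
from itertools import product

# Hypothetical per-GPU-type service-time estimates (ms) and concurrency slots.
LATENCY = {"A100": 10.0, "V100": 25.0, "T4": 60.0}
CAPACITY = {"A100": 1, "V100": 2, "T4": 2}

def place(requests):
    """Exhaustively search placements minimizing the worst per-request
    latency while respecting per-type capacity. Stands in for the ILP."""
    gpu_types = list(LATENCY)
    best, best_cost = None, float("inf")
    for assignment in product(gpu_types, repeat=len(requests)):
        # Reject assignments that overload any GPU type.
        if any(assignment.count(g) > CAPACITY[g] for g in gpu_types):
            continue
        cost = max(LATENCY[g] for g in assignment)  # minimize the tail
        if cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost

placement, tail = place(["r1", "r2", "r3"])
```

With these numbers, the search puts one request on the A100 and two on V100s (the A100 has only one slot), yielding a 25 ms tail instead of 60 ms. A real ILP solver makes the same trade-off scale to thousands of requests.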
Distributed MapReduce Framework (January 2025)
Master-worker topology with dynamic partitioning and fault tolerance.
A distributed MapReduce framework implementing the classic master-worker pattern with modern systems design principles. Built in C++ using gRPC for high-performance RPC communication.
The framework orchestrates distributed computation across worker nodes with dynamic partitioning, speculative retries for stragglers, and fault-aware checkpointing to handle multi-GB workloads with predictable latency.
Key features:
- Master-worker orchestration with health monitoring
- Dynamic partitioning for load balancing
- Speculative execution to handle slow workers
- Fault-aware checkpointing and recovery
- Handles multi-GB datasets with predictable performance
Technologies: C++17, gRPC, Protocol Buffers
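The framework itself is C++17, but the speculative-execution idea is compact enough to sketch in Python: if a task misses its deadline, the master launches a duplicate and takes whichever copy finishes first. The task names, timeouts, and simulated straggler below are illustrative:

```python
import concurrent.futures as cf
import time

def run_with_backup(task, timeout, executor):
    """Launch task; if it misses the deadline, launch a speculative
    duplicate and return the first result to arrive."""
    primary = executor.submit(task)
    try:
        return primary.result(timeout=timeout)
    except cf.TimeoutError:
        backup = executor.submit(task)
        done, _ = cf.wait([primary, backup], return_when=cf.FIRST_COMPLETED)
        return done.pop().result()

calls = {"n": 0}
def flaky_map_task():
    """First invocation simulates a straggling worker; retries are fast."""
    calls["n"] += 1
    if calls["n"] == 1:
        time.sleep(0.5)
    return "partition-0: done"

with cf.ThreadPoolExecutor(max_workers=2) as ex:
    result = run_with_backup(flaky_map_task, timeout=0.05, executor=ex)
```

In the real framework the duplicate runs on a different worker node and the master also cancels the loser; both details are omitted here for brevity.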
AutoDNN Early-Exit Integration (October 2024)
Dynamic early-exit integration for adaptive inference serving.
Enhanced AutoDNN, a programmable inference serving system, with dynamic early-exit capabilities to reduce inference costs while maintaining SLA compliance under bursty demand.
Introduced adaptive branching policies that decide when to exit early from neural network computation based on confidence thresholds and current load. Developed compatibility shims to integrate seamlessly with the existing AutoDNN runtime.
The system maintains 99.99% SLA compliance while reducing average inference latency and cost by allowing simpler models to handle easier requests.
Key features:
- Adaptive early-exit branching policies
- Dynamic threshold adjustment based on load
- Compatibility layer for AutoDNN runtime
- Maintains SLA compliance under traffic bursts
- Cost reduction through selective early termination
Technologies: C++, Python, PyTorch
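A minimal sketch of the branching policy, assuming a confidence-threshold rule that relaxes under load so more requests exit at earlier branches; the threshold shape, names, and constants are illustrative, not the production policy:

```python
def exit_threshold(base, load, slope=0.3):
    """Lower the confidence bar as load rises, floored at 0.5 so the
    policy never accepts a worse-than-coin-flip branch."""
    return max(0.5, base - slope * load)

def serve(confidences, base=0.9, load=0.0):
    """Return the index of the first exit branch whose confidence clears
    the adaptive threshold, falling back to the final (full) model."""
    thr = exit_threshold(base, load)
    for i, c in enumerate(confidences):
        if c >= thr:
            return i
    return len(confidences) - 1
```

For example, a request with per-branch confidences [0.6, 0.85, 0.95] runs the full model when the cluster is idle, but exits at the second branch once load pushes the threshold below 0.85, which is exactly where the cost savings come from.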
XFaaS – Cross-Cloud FaaS Orchestration (May 2023)
Cross-cloud orchestration for serverless workflows on hybrid clouds.
XFaaS is a cross-cloud orchestration platform that enables seamless execution of serverless workflows across multiple cloud providers. It delivered a 75% latency reduction and 57% cost savings through adaptive placement with telemetry-informed routing.
The system pairs dynamic workload placement with real-time telemetry to optimize function execution across AWS, Azure, and GCP. Industry partners adopted XFaaS to operationalize hybrid quantum-classical workflows.
Published in IEEE/ACM CCGrid 2023
Key features:
- Adaptive placement engine for multi-cloud FaaS
- Telemetry-informed routing and load balancing
- Support for hybrid quantum-classical workflows
- 75% latency reduction, 57% cost savings
Links: Paper
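The placement idea can be sketched as a scoring function over per-provider telemetry: normalize observed latency and cost, blend them with a tunable weight, and route each invocation to the best-scoring provider. Apart from the provider names, every number and the weighting below is illustrative:

```python
PROVIDERS = {
    # Hypothetical telemetry: observed p50 latency (ms) and relative cost.
    "aws":   {"latency_ms": 120.0, "cost": 0.20},
    "azure": {"latency_ms": 90.0,  "cost": 0.25},
    "gcp":   {"latency_ms": 150.0, "cost": 0.18},
}

def pick_provider(telemetry, latency_weight=0.7):
    """Score each provider by a weighted blend of normalized latency and
    cost, and return the provider with the lowest score."""
    max_lat = max(p["latency_ms"] for p in telemetry.values())
    max_cost = max(p["cost"] for p in telemetry.values())
    def score(p):
        return (latency_weight * p["latency_ms"] / max_lat
                + (1 - latency_weight) * p["cost"] / max_cost)
    return min(telemetry, key=lambda name: score(telemetry[name]))
```

Because the telemetry feeds back continuously, the same function naturally shifts traffic when one provider's latency or pricing drifts; a latency-only or cost-only policy is just a weight of 1.0 or 0.0.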