Research & Projects
My work spans systems for AI, distributed computing, and cloud infrastructure. I focus on building production-grade systems that balance performance, reliability, and research rigor.
These projects range from research prototypes published in top venues to production systems handling real workloads at scale.
If you'd like to discuss any of these projects, reach out!
Heterogeneity-Aware LLM Routing (February 2025)
ILP-guided scheduling with telemetry feedback for mixed GPU fleets.
A simulator and scheduler for heterogeneous LLM inference clusters that reduces P99 latency by 30% across mixed GPU fleets through intelligent request routing.
The system uses Integer Linear Programming (ILP) to guide initial placement decisions and incorporates real-time telemetry feedback loops to dynamically adjust routing as cluster conditions change.
Designed for production environments where GPU clusters contain multiple hardware generations (A100, V100, T4) with varying performance characteristics.
Key features:
- ILP-based optimization for request placement
- Telemetry feedback loops for adaptive routing
- Support for heterogeneous GPU types
- 30% reduction in P99 latency
- Simulator for offline evaluation
Technologies: Python, PyTorch, CVXPY (ILP solver)
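The placement step can be illustrated with a toy model: assign each incoming request to a GPU type so that the worst-case service time (a proxy for tail latency) is minimized, subject to per-type capacity. The sketch below uses brute-force enumeration in place of the real CVXPY ILP solve, and the latency and capacity numbers are illustrative, not measured:

```python
from itertools import product

# Hypothetical per-GPU-type service-time estimates (ms) and concurrency slots.
LATENCY = {"A100": 10.0, "V100": 25.0, "T4": 60.0}
CAPACITY = {"A100": 1, "V100": 2, "T4": 2}

def place(requests):
    """Exhaustively search placements minimizing the worst per-request
    latency while respecting per-type capacity. Stands in for the ILP."""
    gpu_types = list(LATENCY)
    best, best_cost = None, float("inf")
    for assignment in product(gpu_types, repeat=len(requests)):
        # Reject assignments that overload any GPU type.
        if any(assignment.count(g) > CAPACITY[g] for g in gpu_types):
            continue
        cost = max(LATENCY[g] for g in assignment)  # minimize the tail
        if cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost

placement, tail = place(["r1", "r2", "r3"])
```

With these numbers, the search puts one request on the A100 and two on V100s (the A100 has only one slot), yielding a 25 ms tail instead of 60 ms. A real ILP solver makes the same trade-off scale to thousands of requests.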
Distributed MapReduce Framework (January 2025)
Master-worker topology with dynamic partitioning and fault tolerance.
A distributed MapReduce framework implementing the classic master-worker pattern with modern systems design principles. Built in C++ using gRPC for high-performance RPC communication.
The framework orchestrates distributed computation across worker nodes with dynamic partitioning, speculative retries for stragglers, and fault-aware checkpointing to handle multi-GB workloads with predictable latency.
Key features:
- Master-worker orchestration with health monitoring
- Dynamic partitioning for load balancing
- Speculative execution to handle slow workers
- Fault-aware checkpointing and recovery
- Handles multi-GB datasets with predictable performance
Technologies: C++17, gRPC, Protocol Buffers
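The framework itself is C++17, but the speculative-execution idea is compact enough to sketch in Python: if a task misses its deadline, the master launches a duplicate and takes whichever copy finishes first. The task names, timeouts, and simulated straggler below are illustrative:

```python
import concurrent.futures as cf
import time

def run_with_backup(task, timeout, executor):
    """Launch task; if it misses the deadline, launch a speculative
    duplicate and return the first result to arrive."""
    primary = executor.submit(task)
    try:
        return primary.result(timeout=timeout)
    except cf.TimeoutError:
        backup = executor.submit(task)
        done, _ = cf.wait([primary, backup], return_when=cf.FIRST_COMPLETED)
        return done.pop().result()

calls = {"n": 0}
def flaky_map_task():
    """First invocation simulates a straggling worker; retries are fast."""
    calls["n"] += 1
    if calls["n"] == 1:
        time.sleep(0.5)
    return "partition-0: done"

with cf.ThreadPoolExecutor(max_workers=2) as ex:
    result = run_with_backup(flaky_map_task, timeout=0.05, executor=ex)
```

In the real framework the duplicate runs on a different worker node and the master also cancels the loser; both details are omitted here for brevity.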
AutoDNN Early-Exit Integration (October 2024)
Dynamic early-exit integration for adaptive inference serving.
Enhanced AutoDNN, a programmable inference serving system, with dynamic early-exit capabilities to reduce inference costs while maintaining SLA compliance under bursty demand.
Introduced adaptive branching policies that decide when to exit early from neural network computation based on confidence thresholds and current load. Developed compatibility shims to integrate seamlessly with the existing AutoDNN runtime.
The system maintains 99.99% SLA compliance while reducing average inference latency and cost by allowing simpler models to handle easier requests.
Key features:
- Adaptive early-exit branching policies
- Dynamic threshold adjustment based on load
- Compatibility layer for AutoDNN runtime
- Maintains SLA compliance under traffic bursts
- Cost reduction through selective early termination
Technologies: C++, Python, PyTorch
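A minimal sketch of the branching policy, assuming a confidence-threshold rule that relaxes under load so more requests exit at earlier branches; the threshold shape, names, and constants are illustrative, not the production policy:

```python
def exit_threshold(base, load, slope=0.3):
    """Lower the confidence bar as load rises, floored at 0.5 so the
    policy never accepts a worse-than-coin-flip branch."""
    return max(0.5, base - slope * load)

def serve(confidences, base=0.9, load=0.0):
    """Return the index of the first exit branch whose confidence clears
    the adaptive threshold, falling back to the final (full) model."""
    thr = exit_threshold(base, load)
    for i, c in enumerate(confidences):
        if c >= thr:
            return i
    return len(confidences) - 1
```

For example, a request with per-branch confidences [0.6, 0.85, 0.95] runs the full model when the cluster is idle, but exits at the second branch once load pushes the threshold below 0.85, which is exactly where the cost savings come from.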
XFaaS – Cross-Cloud FaaS Orchestration (May 2023)
Cross-cloud orchestration for serverless workflows on hybrid clouds.
XFaaS is a cross-cloud orchestration platform that enables seamless execution of serverless workflows across multiple cloud providers. It delivered a 75% latency reduction and 57% cost savings through adaptive placement with telemetry-informed routing.
The system pairs dynamic workload placement with real-time telemetry to optimize function execution across AWS, Azure, and GCP. Industry partners adopted XFaaS to operationalize hybrid quantum-classical workflows.
Published in IEEE/ACM CCGrid 2023
Key features:
- Adaptive placement engine for multi-cloud FaaS
- Telemetry-informed routing and load balancing
- Support for hybrid quantum-classical workflows
- 75% latency reduction, 57% cost savings
Links: Paper
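The placement idea can be sketched as a scoring function over per-provider telemetry: normalize observed latency and cost, blend them with a tunable weight, and route each invocation to the best-scoring provider. Apart from the provider names, every number and the weighting below is illustrative:

```python
PROVIDERS = {
    # Hypothetical telemetry: observed p50 latency (ms) and relative cost.
    "aws":   {"latency_ms": 120.0, "cost": 0.20},
    "azure": {"latency_ms": 90.0,  "cost": 0.25},
    "gcp":   {"latency_ms": 150.0, "cost": 0.18},
}

def pick_provider(telemetry, latency_weight=0.7):
    """Score each provider by a weighted blend of normalized latency and
    cost, and return the provider with the lowest score."""
    max_lat = max(p["latency_ms"] for p in telemetry.values())
    max_cost = max(p["cost"] for p in telemetry.values())
    def score(p):
        return (latency_weight * p["latency_ms"] / max_lat
                + (1 - latency_weight) * p["cost"] / max_cost)
    return min(telemetry, key=lambda name: score(telemetry[name]))
```

Because the telemetry feeds back continuously, the same function naturally shifts traffic when one provider's latency or pricing drifts; a latency-only or cost-only policy is just a weight of 1.0 or 0.0.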