If you run a foundation model as a product, you eventually hit a familiar request: each customer wants their own version of the model — tuned to their data, serving only their traffic, with their data kept isolated from everyone else's. The obvious way to do that is to fine-tune a separate copy of the model per customer. It works, and it is miserable to serve.
This post walks through the pattern most teams converge on instead: one shared base model plus a tiny per-customer adapter, served together. It's a well-trodden, productized approach, and the serving economics are genuinely good rather than merely tolerable. We'll cover why the naive approach hurts, how Low-Rank Adaptation (LoRA) changes the math, how good LoRA actually is compared to a full fine-tune, and how you can batch requests from different customers together on the same GPU.
The problem with one model per customer
Imagine a model with weights around 12 GB. Fully fine-tuning a separate copy per customer means that every time you switch which customer you're serving, you load a fresh 12 GB of weights onto the GPU. That I/O dominates everything else. In practice it means you can't co-locate customers on the same hardware, so you end up paying for a dedicated, mostly idle replica per customer. The cost scales linearly with your customer count, and most of that cost is idle GPU.
What you actually want is to keep the expensive, shared part of the model resident on the GPU once and swap only the small, customer-specific part. That's exactly what LoRA gives you.
One shared base, many tiny adapters
LoRA fine-tunes a model by freezing the original weights and learning a small low-rank "correction" on top. Instead of updating a weight matrix W directly, you learn two small matrices A and B and add their product as a delta:
output = x·Wᵀ + (x·Aᵀ)·Bᵀ·(α/r)
Here x·Wᵀ is the shared base term and (x·Aᵀ)·Bᵀ·(α/r) is the per-customer delta.
A is [r × d] and B is [d × r], where the rank r is small (typically 8–64) and the hidden dimension d is in the thousands. Each adapted d × d matrix therefore adds only 2·r·d adapter parameters against d² in the base, so the adapter is much smaller than the full weights: tens of MB for a model whose base is gigabytes, depending on rank and which layers you target.
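The formula and shapes above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any framework's implementation: the function name `lora_forward`, the sizes, and the value of α are assumptions chosen for the example, and B is zero-initialised as in the original LoRA paper so that the adapter starts as a no-op.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha):
    """Return x·Wᵀ + (x·Aᵀ)·Bᵀ·(α/r) for a batch of row vectors x."""
    r = A.shape[0]                      # rank is the first dimension of A
    base = x @ W.T                      # shared base term, shape [batch, d]
    delta = (x @ A.T) @ B.T             # low-rank term: [batch, r], then [batch, d]
    return base + delta * (alpha / r)


rng = np.random.default_rng(0)
d, r, batch = 512, 8, 4                 # illustrative sizes, far smaller than a real model
x = rng.standard_normal((batch, d))
W = rng.standard_normal((d, d))         # frozen base weight
A = rng.standard_normal((r, d))         # adapter down-projection, [r x d]
B = np.zeros((d, r))                    # adapter up-projection, [d x r], zero-initialised

y = lora_forward(x, W, A, B, alpha=16.0)
```

Note the order of operations in the delta: multiplying by Aᵀ first keeps the intermediate at width r, so the large d × d product BA is never formed.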
This changes the serving economics completely. You keep the base model resident on the GPU once and stream in only the small per-customer adapter. Loading an adapter is a millisecond-scale operation rather than a multi-second multi-gigabyte transfer.
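To make the size difference concrete, here is a back-of-the-envelope calculation. Every number in it except the 12 GB base is an assumption picked for illustration: a hidden dimension of 4096, rank 16, 32 layers with 4 adapted square matrices each, 16-bit weights, and an effective end-to-end load bandwidth of 4 GB/s.

```python
def adapter_bytes(d, r, n_matrices, bytes_per_param=2):
    """Bytes for A ([r x d]) plus B ([d x r]) across n_matrices adapted d x d weights."""
    params_per_matrix = 2 * r * d
    return params_per_matrix * n_matrices * bytes_per_param


def transfer_seconds(n_bytes, bandwidth_bytes_per_s):
    return n_bytes / bandwidth_bytes_per_s


# Assumed, illustrative configuration (not taken from any specific model):
d = 4096
r = 16
n_matrices = 32 * 4      # 32 layers x 4 adapted projections per layer
bandwidth = 4e9          # assumed effective load bandwidth, 4 GB/s

adapter = adapter_bytes(d, r, n_matrices)
base = 12e9              # the 12 GB base model from the earlier example

print(f"adapter size: {adapter / 2**20:.0f} MiB")                             # adapter size: 32 MiB
print(f"adapter load: {transfer_seconds(adapter, bandwidth) * 1e3:.1f} ms")   # adapter load: 8.4 ms
print(f"base load:    {transfer_seconds(base, bandwidth):.2f} s")             # base load:    3.00 s
```

Under these assumptions the adapter is a few hundred times smaller than the base, and the load time shrinks by the same factor. Real numbers will vary with rank, target layers, and the storage path.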
There's one implementation detail that matters for multi-tenant serving. You can merge an adapter into the base (W' = W + (α/r)·BA), which gives zero inference overhead, but then that GPU is locked to a single customer, which defeats the entire purpose. So for shared serving you keep the base and the adapter separate, compute the two terms independently, and add them. You pay a small overhead for the delta, and in return you get to serve many customers from one resident base.
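The merged and unmerged paths compute the same output; they differ only in where the work happens. The check below is a small sketch with made-up sizes. It folds the α/r scaling from the forward formula into the merge so the two paths match exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 256, 8, 16.0              # illustrative sizes and scaling
x = rng.standard_normal((4, d))
W = rng.standard_normal((d, d))
A = rng.standard_normal((r, d))
B = rng.standard_normal((d, r))
scale = alpha / r

# Unmerged: base and delta are computed separately, then added.
y_unmerged = x @ W.T + (x @ A.T) @ B.T * scale

# Merged: fold the delta into the weight once, then do a single matmul.
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T

same = np.allclose(y_unmerged, y_merged)
```

The merged path pays nothing extra per request but ties `W_merged` to one customer. The unmerged path keeps `W` shareable at the cost of two skinny matmuls per request.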
As a bonus, this gives you clean data isolation almost for free: each customer's adaptation lives entirely inside their own adapter weights and never touches the shared base.
Is this actually done in production? Yes
This is a well-established serving pattern, not research speculation. The most relevant work:
- LoRA — Hu et al., 2021 (arXiv:2106.09685, ICLR 2022). The original method.
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters — Sheng et al., 2023 (arXiv:2311.03285, MLSys 2024). The most directly relevant paper for multi-tenant serving. It serves thousands of adapters against one shared base, keeps adapters in main memory and pages them onto the GPU on demand, and uses custom kernels to batch adapters of different ranks together. It reports serving roughly 2,000 adapters on a single A100 and up to about 4× the throughput of naive LoRA support in vLLM/PEFT.
- Punica: Multi-Tenant LoRA Serving — Chen et al., 2023 (arXiv:2310.18547, MLSys 2024). Introduces the SGMV kernel that makes cross-customer batching efficient (more below), reporting around 12× throughput over prior systems while adding only ~2ms latency per token.
- dLoRA: Dynamically Orchestrating Requests and Adapters — Wu et al., OSDI 2024. Automates the merged-vs-unmerged decision with a credit-based batching algorithm and migrates requests and adapters across replicas to balance load.
On the framework side, this is already productized. vLLM has multi-LoRA support, LoRAX (Predibase) is purpose-built for the multi-tenant pattern, and TensorRT-LLM supports it too. The practical takeaway: you almost certainly should not write your own serving kernels. Start from an existing stack.
How good is LoRA versus a "real" fine-tune?
The honest answer is "it depends on how far the customer is from the base," and there's now good literature mapping that out.
The original LoRA paper showed parity with — sometimes better than — full fine-tuning on adaptation-style tasks. The most useful counterweight is "LoRA Learns Less and Forgets Less" (Biderman et al., TMLR 2024, arXiv:2405.09673). Their finding, in one sentence: in standard low-rank settings LoRA underperforms full fine-tuning most when the target requires learning substantial new knowledge far from pretraining (their hard cases were code and math) — but it forgets the base capabilities much less, acts as a strong regularizer, and keeps generations more diverse. They also observed that full fine-tuning learns weight changes with 10–100× higher rank than a typical LoRA, which helps explain the gap.
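The rank constraint behind that last observation is easy to see numerically: a product BA with inner dimension r can never exceed rank r, whatever values A and B take. In the sketch below the random dense matrix is only a stand-in for an unconstrained update, not a model of what real fine-tuning learns.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 128, 4
A = rng.standard_normal((r, d))
B = rng.standard_normal((d, r))

lora_delta = B @ A                          # [d x d], but rank at most r
dense_delta = rng.standard_normal((d, d))   # stand-in for an unconstrained update

lora_rank = np.linalg.matrix_rank(lora_delta)
dense_rank = np.linalg.matrix_rank(dense_delta)
print(lora_rank, dense_rank)
```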
The practical reading: if each customer is mostly a shift within the same domain your base was trained on, rather than a leap into genuinely new territory, you're in the regime where LoRA tends to do well. A sensible default is "LoRA is good enough," with full fine-tuning kept in your back pocket as a per-customer escape hatch for the rare customer that proves hard.
For customers in between, the knobs to reach for before resorting to a full fine-tune are: raise the rank, target more layers, or use DoRA (Liu et al., ICML 2024 Oral, arXiv:2402.09353), which splits each weight into a magnitude and a direction and applies LoRA to the direction. It closes much of the remaining gap to full fine-tuning at similar parameter cost and, like LoRA, adds no inference overhead once the adapter is merged into the weights.
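As a rough sketch of the DoRA decomposition, the snippet below rebuilds a weight from a per-column magnitude and a LoRA-updated direction, following the form W' = m · (W₀ + BA) / ‖W₀ + BA‖ with a column-wise norm. The function name, the sizes, and the optional `scale` argument are assumptions for illustration. This shows only the forward recombination, not the training procedure.

```python
import numpy as np


def dora_weight(W0, A, B, m, scale=1.0):
    """Recombine a learned magnitude m with a LoRA-updated direction."""
    V = W0 + scale * (B @ A)                              # updated direction, not yet normalised
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)   # one norm per column, shape [1, d]
    return m * (V / col_norm)


rng = np.random.default_rng(0)
d, r = 64, 4
W0 = rng.standard_normal((d, d))
A = rng.standard_normal((r, d))
B = np.zeros((d, r))                                      # zero init: no change at the start
m = np.linalg.norm(W0, axis=0, keepdims=True)             # magnitude starts at the column norms of W0

W_start = dora_weight(W0, A, B, m)                        # equals W0 before any training
```

Because the result is an ordinary weight matrix, it can be merged for a dedicated replica just as a LoRA delta can.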
One caveat worth flagging. Nearly all of the published quality evidence is on language models. If you're applying this to a different modality — time-series, vision, audio — the mechanics transfer fine (LoRA attaches to any linear layer), but the quality numbers above are borrowed from the LLM world and may not transfer cleanly. If you're in a less-studied modality, run a LoRA-versus-full-fine-tune comparison on your own data early rather than assuming the gap is the same.
Can you batch inference across different customers' adapters?
Yes — and this is the part that makes the economics good rather than just acceptable.
The key observation: the expensive, shared part of the computation — attention and the large feed-forward matrix multiplies using the base weights W — is identical for every request in the batch, regardless of which customer sent it. So it runs as one big, efficient matrix multiply and gets the full benefit of large batch sizes. The only thing that differs per request is the small low-rank delta (x·Aᵀ)·Bᵀ, where each request may use a different A, B.
The naive way to handle that divergence — loop over requests and do each adapter's little multiply separately — is exactly what kills throughput, because those tiny per-request operations become memory-bound and waste the GPU. The real solution is custom kernels that batch across different adapters:
- Punica's SGMV (Segmented Gather Matrix-Vector multiplication) groups requests by which adapter they use, gathers the right A/B per group, and does the low-rank multiply as a single batched operation. Overhead stays small relative to the shared base pass. (Punica's original SGMV assumed all adapters share a rank, padding otherwise.)
- S-LoRA's batched kernels do the same thing but also handle adapters of different ranks in one batch, plus the paged memory management for swapping adapters in and out of GPU memory.
So the cost structure lands where you'd hope: the expensive base path is fully shared and batch-efficient, and the per-customer cost is just a small rank-r projection done inside a grouped kernel.
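The grouping idea can be illustrated on the CPU with NumPy. This is a conceptual sketch, not the SGMV kernel or S-LoRA's kernels: it assumes every adapter has the same rank, uses made-up sizes, and uses a Python loop over groups where a real kernel would handle all segments in one launch. The function names are hypothetical.

```python
import numpy as np


def naive_delta(x, adapter_ids, A_stack, B_stack, scale):
    """One tiny matmul pair per request: the slow pattern described above."""
    out = np.empty((x.shape[0], B_stack.shape[1]))
    for i, a in enumerate(adapter_ids):
        out[i] = (x[i] @ A_stack[a].T) @ B_stack[a].T * scale
    return out


def segmented_delta(x, adapter_ids, A_stack, B_stack, scale):
    """Group requests by adapter, then do one matmul pair per group."""
    order = np.argsort(adapter_ids, kind="stable")    # same-adapter requests become contiguous
    ids_sorted = adapter_ids[order]
    uniq, starts = np.unique(ids_sorted, return_index=True)
    ends = np.append(starts[1:], len(ids_sorted))
    out = np.empty((x.shape[0], B_stack.shape[1]))
    for a, s, e in zip(uniq, starts, ends):
        rows = order[s:e]                             # original positions of this segment
        out[rows] = (x[rows] @ A_stack[a].T) @ B_stack[a].T * scale
    return out


rng = np.random.default_rng(0)
d, r, n_adapters, batch = 128, 8, 3, 10
x = rng.standard_normal((batch, d))
W = rng.standard_normal((d, d))
A_stack = rng.standard_normal((n_adapters, r, d))     # one A per customer
B_stack = rng.standard_normal((n_adapters, d, r))     # one B per customer
adapter_ids = rng.integers(0, n_adapters, size=batch)

base = x @ W.T                                        # shared by every request in the batch
y = base + segmented_delta(x, adapter_ids, A_stack, B_stack, scale=2.0)
```

Both functions return the same values. The difference is that the segmented version issues one operation per distinct adapter in the batch instead of one per request.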
On the "MoE-style routing" intuition
It's tempting to think of this like Mixture-of-Experts, and the instinct is apt as a systems pattern — group requests by which weights they need, then do segmented matmuls. But one distinction matters. In MoE, routing is learned and the experts are part of the model; any token can be sent to any expert. In multi-tenant LoRA, the "routing" is trivial and deterministic: you already know which customer sent each request, so there's no router and no load-balancing loss to train. It's just a gather by customer ID. (The two ideas have been combined — "Mixture of LoRA Experts" — but for one-adapter-per-customer serving you want the plain deterministic version, which is simpler and is exactly what SGMV implements.)
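To show how little "routing" is involved, the sketch below turns a list of per-request customer IDs into the request order and segment boundaries that a segmented kernel would consume. The customer names, the `adapter_slot` registry, and the `build_segments` helper are all hypothetical.

```python
from itertools import groupby

# Hypothetical registry: customer ID -> adapter slot on this GPU.
adapter_slot = {"acme": 0, "globex": 1, "initech": 2}

# Customer ID attached to each incoming request, in arrival order.
request_customers = ["globex", "acme", "globex", "initech", "acme"]


def build_segments(customers, slots):
    """Return (request order, [(slot, start, end), ...]) grouped by adapter slot."""
    order = sorted(range(len(customers)), key=lambda i: slots[customers[i]])
    segments = []
    pos = 0
    for slot, group in groupby(order, key=lambda i: slots[customers[i]]):
        n = len(list(group))
        segments.append((slot, pos, pos + n))
        pos += n
    return order, segments


order, segments = build_segments(request_customers, adapter_slot)
print(order)      # [1, 4, 0, 2, 3]
print(segments)   # [(0, 0, 2), (1, 2, 4), (2, 4, 5)]
```

There is nothing to learn here: the mapping is a dictionary lookup followed by a stable sort.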
Practical notes
A few things that follow directly from the above:
- Adapter cold-start is cheap. Loading a customer's adapter is an MB-scale transfer, not a multi-gigabyte base load, so on-demand loading from CPU or disk is fine. S-LoRA-style paging handles eviction when you have more adapters than fit in GPU memory at once.
- Merged versus unmerged is a latency knob, not a one-time decision. A customer with steady, heavy traffic can justify a dedicated merged replica (lowest possible latency). The long tail of low-traffic customers shares the unmerged multi-tenant pool. dLoRA essentially automates this tradeoff at runtime.
- Don't build the serving layer yourself. Start from vLLM, LoRAX, or TensorRT-LLM; these already implement the paging and batched kernels.
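The eviction idea from the first note can be sketched as a least-recently-used cache. This is deliberately much simpler than S-LoRA's paged memory management, and the `AdapterCache` class and its `load_fn` callback are hypothetical names used only for illustration.

```python
from collections import OrderedDict


class AdapterCache:
    """Keep at most `capacity` adapters resident, evicting the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn              # hypothetical loader: customer_id -> adapter weights
        self._resident = OrderedDict()
        self.loads = 0                      # number of cold loads, for illustration

    def get(self, customer_id):
        if customer_id in self._resident:
            self._resident.move_to_end(customer_id)   # mark as most recently used
            return self._resident[customer_id]
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)        # evict the least recently used adapter
        adapter = self.load_fn(customer_id)
        self.loads += 1
        self._resident[customer_id] = adapter
        return adapter

    def resident(self):
        return list(self._resident)


cache = AdapterCache(capacity=2, load_fn=lambda cid: f"weights-for-{cid}")
for cid in ["a", "b", "a", "c", "b"]:
    cache.get(cid)

print(cache.loads)        # 4
print(cache.resident())   # ['c', 'b']
```

In this trace only the second request for "a" is a hit. Because each miss costs an MB-scale load, a modest miss rate is usually tolerable, but that is worth measuring against your own traffic.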
A recommended path
If you're building this, a reasonable default plan looks like: adopt an existing multi-LoRA serving framework; default every customer to a LoRA adapter against the shared base; validate the LoRA-versus-full-fine-tune quality gap on your own data as an early milestone (especially outside the LLM modality); and reserve higher rank, more target layers, DoRA, or a dedicated full fine-tune for the small number of customers that turn out to need it.
The headline result is that "one model per customer" doesn't have to mean "one GPU per customer." With a shared base and per-customer adapters, you can collapse the long tail of customers onto shared hardware, keep each customer's data cleanly isolated, and still hand every customer a model that behaves like it's theirs.
References
- Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 (2021); ICLR 2022.
- Sheng et al. S-LoRA: Serving Thousands of Concurrent LoRA Adapters. arXiv:2311.03285 (2023); MLSys 2024.
- Chen et al. Punica: Multi-Tenant LoRA Serving. arXiv:2310.18547 (2023); MLSys 2024.
- Wu et al. dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving. USENIX OSDI 2024.
- Biderman et al. LoRA Learns Less and Forgets Less. arXiv:2405.09673; TMLR 2024.
- Liu et al. DoRA: Weight-Decomposed Low-Rank Adaptation. arXiv:2402.09353; ICML 2024 (Oral).
Frameworks referenced: vLLM (multi-LoRA), LoRAX (Predibase), NVIDIA TensorRT-LLM.