A remote-office and web-database SaaS vendor based in WDC has completed a ground-up infrastructure modernisation, retiring twenty legacy servers and nearly 400 TB of mixed Dell EMC Isilon and Equinox storage. The legacy 1990s platform, home to both the original monolithic Java application and its 2010s refactor, now runs on just three commodity servers with CephFS-backed OpenNebula/KVM, achieving a 99.9 % SLA and 2–5x faster end-user response times.

Highlights:

70 % operating-cost reduction (power, rack, support, licences).

100 % of CAPEX eliminated via Leaseweb OPEX rentals.

2–5x faster file operations (latency 100–250 ms → 20–50 ms).

4-hour migration window with zero data loss.

480 TB raw capacity (160 TB usable after 3x Ceph replication).

Modern Architecture:

Nodes: 3x Dell R740XD (2x Intel Xeon Silver 4214, 256 GB RAM)

Node Storage: 10×16 TB HDD + 2×8 TB NVMe (RAID-10 journal/DB)

Network: 10GbE internal (VLAN on the Leaseweb internal network) and 10GbE external.

Storage: Ceph Quincy (RBD on a 3x-replicated pool for VM images, CephFS on a separate 3x-replicated pool); a minimal sketch of creating these pools follows this list.

Compute: OpenNebula 6.6.0 with KVM nodes 
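
For readers who want to reproduce the layout, here is a minimal sketch using the standard ceph CLI. Pool names, PG counts, and the filesystem name are illustrative assumptions, not the production values:

```python
# Minimal sketch: one 3x-replicated RBD pool for OpenNebula VM images plus a
# 3x-replicated CephFS. Names and PG counts are illustrative; assumes an
# admin keyring on the host running the script.
import subprocess

def ceph(*args: str) -> None:
    """Run a ceph CLI command, raising if it fails."""
    subprocess.run(["ceph", *args], check=True)

# RBD pool for VM images, replicated 3x.
ceph("osd", "pool", "create", "one-images", "128")
ceph("osd", "pool", "set", "one-images", "size", "3")
ceph("osd", "pool", "application", "enable", "one-images", "rbd")

# CephFS needs separate metadata and data pools, both replicated 3x.
ceph("osd", "pool", "create", "cephfs-metadata", "64")
ceph("osd", "pool", "create", "cephfs-data", "256")
for pool in ("cephfs-metadata", "cephfs-data"):
    ceph("osd", "pool", "set", pool, "size", "3")
ceph("fs", "new", "prodfs", "cephfs-metadata", "cephfs-data")
```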

Migration Timeframe:

Assessment & PoC: 8 weeks. Built a test cluster on our lab workbench and rehearsed the launch on synthetic data.

Synchronisation: 4 weeks. A first naive attempt bogged down in slow storage-metadata processing, so we moved to parallelised rsync plus DB replication (see Q3 below for a sketch).

Cutover: 4 hours. DNS and DB roles flipped, followed by smoke tests (a sketch of such checks follows this list).

Intensive care: 1 week. Email-reputation repair, ticket resolution, and decommissioning of the old setup.
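
The smoke tests themselves are not described in the write-up; a hypothetical sketch of the kind of post-flip check that fits a 4-hour window might look like this (hostnames and IPs are invented):

```python
# Hypothetical post-cutover smoke test: confirm DNS now resolves to the new
# cluster and that key endpoints answer. All names and addresses are invented
# for illustration; the real checklist is the team's own.
import socket
import urllib.request

NEW_CLUSTER_IPS = {"203.0.113.10", "203.0.113.11", "203.0.113.12"}  # example IPs
ENDPOINTS = ["https://app.example.com/health", "https://dav.example.com/"]

def dns_flipped(hostname: str) -> bool:
    """True when every resolved address for the host is a new-cluster IP."""
    addrs = {info[4][0] for info in socket.getaddrinfo(hostname, 443)}
    return addrs <= NEW_CLUSTER_IPS

def endpoint_alive(url: str) -> bool:
    """True when the endpoint answers with a non-5xx status."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status < 500
    except OSError:  # DNS failure, refused connection, or HTTP error
        return False

assert dns_flipped("app.example.com"), "DNS still points at the legacy cluster"
assert all(endpoint_alive(u) for u in ENDPOINTS), "an endpoint failed its check"
print("smoke tests passed")
```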

Why Migrate Now?

End-of-support Isilon contracts (Q4 2023) and the 2023 energy-price spike made the old footprint more expensive to power than to replace. Constrained 100 Mbps links throttled every operation and made replacing a failed storage machine painfully slow; modernising on commodity HCI promised linear scaling and open-source economics.

FREQUENTLY ASKED QUESTIONS

Q1: What compelled a “rip-and-replace” rather than incremental upgrades?
A: Support contracts on five Isilon arrays expired, parts were scarce, and just keeping the lights on cost as much as leasing fresh hardware. Incremental fixes couldn’t solve the 100 Mbps bottleneck or growing support fees.

Q2: Why only three nodes—doesn’t Ceph need more?
A: A three-node cluster is the minimum for safe replicated pools; SSD journals mitigate the HDD penalty, and capacity scales simply by racking a fourth node when growth resumes.

Q3: How did you move ~380 TB without downtime?
A: Data was split into thousands of file-set shards and rsynced in parallel over a month. Lists of freshly changed files were built from metadata in the main database and synced in a final pass (a sketch of the parallel sync follows).
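
A minimal sketch of that approach, assuming the shard list was precomputed from database metadata; hosts, paths, and the worker count are illustrative:

```python
# Illustrative parallel-rsync pass: each precomputed shard is synced by a
# worker thread. Hosts, paths, and counts are invented for the sketch.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SHARDS = [f"/data/shard{i:04d}/" for i in range(4096)]  # precomputed shard paths
DEST = "ceph-gw:/mnt/cephfs"  # hypothetical gateway host in front of CephFS
WORKERS = 16                  # tuned to keep the legacy 100 Mbps links saturated

def sync_shard(shard: str) -> int:
    """rsync one shard; -a preserves metadata, -R recreates the full source
    path under the destination root, --partial survives restarts."""
    return subprocess.run(
        ["rsync", "-aR", "--partial", shard, DEST]
    ).returncode

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    failures = [rc for rc in pool.map(sync_shard, SHARDS) if rc != 0]
print(f"{len(failures)} shards need a retry pass")
```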

Q4: Was there a way to turn back if the new cluster failed smoke tests?
A: The legacy DB became a replica of the new cluster at the moment of cut-over, enabling instant fail-back if needed. Stakeholders accepted the loss of files added to the new cluster after cut-over as a budget-saving risk.
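
Purely as an illustration, and assuming a MySQL-family database (the write-up does not name the engine), a fail-back guard could verify the legacy replica is caught up before any flip back:

```python
# Hypothetical fail-back guard: check the legacy replica's lag behind the new
# primary. Assumes a MySQL-family database; host and CLI access are invented.
import subprocess

def replica_lag_seconds(host: str) -> int | None:
    """Parse the Seconds_Behind_* field from SHOW REPLICA STATUS on host."""
    out = subprocess.run(
        ["mysql", "-h", host, "-e", "SHOW REPLICA STATUS\\G"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if "Seconds_Behind" in line:  # Seconds_Behind_Source / _Master
            value = line.split(":", 1)[1].strip()
            return None if value == "NULL" else int(value)
    return None  # replication not configured

lag = replica_lag_seconds("legacy-db.example.internal")
print("fail-back safe" if lag == 0 else f"not caught up, lag: {lag}")
```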

Q5: Biggest “war-story” moments?
A:
Hidden Equinox array: unknown to our team and declared deprecated on discovery, it turned out to be serving production mid-sync; it was folded into the plan within 24 hours.
WebDAV Basic Auth killed: Microsoft disabled it in its products mid-project; a Digest-Auth proxy (OpenResty + Redis) was stapled together in under a week to save the day (a sketch of the check follows this list).
Pre-fail HDD shipment: one server arrived with multiple SMART errors; the disk swap was completed before go-live, but it was still a huge pain given Ceph's slow recovery performance on HDDs.
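
The production proxy was OpenResty + Redis; purely as an illustration of the check it performs, here is a Python sketch of RFC 2617 Digest validation (MD5, qop="auth") with nonces tracked in Redis. Header parsing and user provisioning are omitted, and all names are invented:

```python
# Illustrative Digest-auth check (RFC 2617, MD5, qop="auth"). The real proxy
# was OpenResty + Redis; this only sketches the validation logic.
import hashlib
import secrets
import redis  # pip install redis

r = redis.Redis()
REALM = "webdav"  # illustrative realm name

def md5(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

def issue_nonce(ttl: int = 300) -> str:
    """Create a server nonce and remember it in Redis for ttl seconds."""
    nonce = secrets.token_hex(16)
    r.setex(f"nonce:{nonce}", ttl, 1)
    return nonce

def check_digest(ha1: str, method: str, uri: str, nonce: str,
                 nc: str, cnonce: str, response: str) -> bool:
    """Validate the client's Digest response against the stored HA1
    (HA1 = MD5(user:realm:password), provisioned per user)."""
    if not r.exists(f"nonce:{nonce}"):  # unknown or expired nonce
        return False
    ha2 = md5(f"{method}:{uri}")
    expected = md5(f"{ha1}:{nonce}:{nc}:{cnonce}:auth:{ha2}")
    return secrets.compare_digest(expected, response)
```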

Q6: Performance tuning secrets?
A: External consultants failed to hit the targets; our in-house admin tuned CephFS read-ahead, rebalanced placement groups aggressively, and ran multiple active metadata servers (a sketch of these knobs follows).
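
The exact values were never published; a hedged sketch of the knobs mentioned, with illustrative numbers and the pool and filesystem names from the earlier sketch, might be:

```python
# Illustrative Ceph tuning pass; values are invented, the real numbers were
# workload-specific. Assumes an admin keyring on the host.
import subprocess

def ceph(*args: str) -> None:
    subprocess.run(["ceph", *args], check=True)

# Run several active MDS daemons so metadata load spreads out.
ceph("fs", "set", "prodfs", "max_mds", "3")

# Raise the data pool's placement-group count to rebalance more evenly.
ceph("osd", "pool", "set", "cephfs-data", "pg_num", "512")

# Read-ahead for CephFS kernel clients is a mount option, e.g. a 16 MiB window:
#   mount -t ceph mon1:/ /mnt/cephfs -o name=prod,rasize=16777216
```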

Q7: What licence model do you pay for now?
A: A 100 % open-source stack (Ceph, OpenNebula, Ansible, Zabbix). The only costs for our customer are node rental and our vendor-agnostic server-management fee.

Q8: Any compromises?
A: CephFS on HDDs is slower than on NVMe, but journaling plus aggressive caching still make it faster than the old storage. POSIX semantics were mandatory given the software architecture; a move to S3-style object storage remains on the roadmap.

Q9: How big was the team?
A: Four core members: DBA, DevOps/SysAdmin, Customer Relations Manager, and Java Developer, plus a short-term Ceph consultant. NK Support provided the DevOps/SysAdmin, the Customer Relations Manager, and 24/7 end-user support.

Q10: What’s next?
A:
– DR cluster in a second Leaseweb region
– Gradual containerisation atop OpenNebula
– CI/CD rollout to shorten release times

Q11: I have old infra too; can you upgrade mine?
A: Sure. Contact us or write to info@nksupport.com for a free consultation and audit.