
Cloud infrastructure economics: COGS and operating costs

Perhaps the most important benefit of adopting cloud services (either from a public provider or internally from your organization) is that their cost can be quantified and attributed to organizational entities. If a cloud service cannot be metered and measured, then it should not be called a cloud service, right?

So, whenever you need to purchase a cloud service, or when you are called to develop one, you are presented with a service catalog and assorted price lists, from which you can budget, plan and compare services. Understanding how the pricing was formulated is not your business, since you are on the consumer side. However, you should care: you need to get what you pay for, and there must be a very good reason for a very expensive or a very cheap cloud service.

In the past, we have developed a few cloud services utilizing our own resources and third-party services. Each and every time, determining whether launching the service commercially would be a sound practice depended on three factors:

  • Would customers pay for the service? If yes, at what price?
  • If a similar service was already on the market, where would our competitors stand?
  • What is the operating cost of the service?

Answering the first two questions is straightforward: Visit a good number of trusted and loyal customers, talk to them, investigate competition. That’s a marketing and sales mini project. But answering the last question can be a hard thing to do.

Let us share some insight into the operating costs and cost of goods for a cloud service, and in particular infrastructure as a service (IaaS). Whether you already run IaaS for your organization or your customers, you are in one of the following states:

  1. Planning to launch IaaS
  2. Already running your datacenter

State (1) is where you have not yet invested anything. You need to work out implementation and operational scenarios (build or buy? hire or rent?) and do a good part of the marketing plan. State (2) is where you have already invested: you have people, processes and technology in place and are delivering services to your internal or external customers. In state (1) you need to develop a cost model; in state (2) you need to calculate your costs and discover your real operating cost.

In both cases, the first thing you need to do before you move on with cost calculation is to guesstimate (state 1) or calculate (state 2) the footprint of your investment and delivered services. From our experience, the following parameters are what you should absolutely take into account in order to properly find out how much your IaaS really costs.

Financial parameters (real money)

  • EPC: Electrical power and hosting cost. How much do (or would) you pay for electricity and hosting? This can be found in your electricity bill, your datacenter provider's monthly invoice, or from your financial controller (just make sure you ask them the right questions, unless you want to get billed with the entire company's overhead costs). EPC is proportional to your infrastructure footprint (i.e. the number of cabinets and amount of hardware).
  • DCOPS: Payroll for the operations team. Calculate here the total human-resource cost for the team that will operate the IaaS services. You may also include marketing & sales overhead costs.
  • CALCLIC: Software licensing and support costs for the entire IaaS computing infrastructure layer. These are software costs associated with the infrastructure (e.g. hypervisor licenses), not license costs for delivered services (e.g. Microsoft SPLA costs).
  • STORLIC: Software licensing and support costs for your entire storage infrastructure. Include data backup software costs here in their entirety.
  • SERVER: Cost of a single computing server. It’s good to standardize on a particular server model (e.g. 2-way or 4-way, rackmount or blade). Include here the cost of a computing server complete with processors but without RAM; the RAM-to-CPU ratio is adjusted according to your expected workloads and plays a substantial role in cost calculation. If you plan to use blade servers, you should factor in the blade chassis as well.
  • MEMORY: Average cost of 1 GB of RAM.
  • STORINFRA: Cost of your storage infrastructure as is, or of the storage infrastructure you plan to purchase. Storage costs are not easy to express in per-GB units, since you have to take into account the SAN, backup infrastructure, array controllers, disk enclosures and individual disks. Of course, we assume you utilize a centralized storage infrastructure, pooled across your entire computing farm.
  • NETINFRA: Cost of data network. As above, include here datacenter LAN, load balancers, routers, even cabling.
  • NETSUPP: Cost of network support (monthly). Include here software licensing, antivirus subscriptions and datacenter network costs.

Operational parameters (Facts and figures)

  • RU: Amount of available rack units in your datacenter. This is the RU number you can use to install equipment (protected with UPS, with dual power feeds, etc.).
  • RU_STOR: Rack units occupied by storage systems
  • RU_CALC: Rack units occupied by computing infrastructure (hypervisors)
  • RU_NET: Rack units occupied by network infrastructure
  • SRV: Virtual machines (already running or how many you plan to have within the next quarter)
  • INTRST: Interest rate (cost of money): Monthly interest rate of credit lines/business loans
  • TOTALMEM: Total amount of virtual memory your SRV occupy
  • TOTALSTOR: Total amount of virtual storage your SRV occupy
  • SRVRAM: Amount of physical memory for each physical server. This is the amount of RAM you install in each computing server. It is one of the most important factors, since it depends on your average workload. A rule of thumb is that for generic workloads, a hardware CPU core can sustain up to 6 virtual computing cores (vCPUs), and each vCPU needs 4 GB of virtual RAM. So, for a 2-socket, 6-core server you need 2 (sockets) x 6 (cores) x 6 (vCPUs) x 4 (GB RAM) = 288 GB RAM. For a 4-way, 8-core server beast with memory-intensive workloads (say 8 GB per vCPU) you need 4 x 8 x 6 x 8 = 1536 GB RAM (1.5 TB).
  • MEMOVERPROV: Memory overprovisioning for virtual workloads. A factor that needs tuning from experience. If you plan conservatively, use a 1:1 overprovisioning factor (1 GB of physical RAM to 1 GB of virtual RAM). If you are more confident and want to save costs, you can use an overprovisioning factor of up to 1.3. Do this if you trust your hypervisor technology and have homogeneous workloads on your servers (for example, an all-Windows ecosystem), so that your hypervisor can take advantage of copy-on-write algorithms and save physical memory.
  • AMORT: Amortization of your infrastructure. This is a logistics & accounting term, but here we mainly use this to calculate the lifespan of our infrastructure. It is expressed in months. A good value is 36 to 60 months (3 to 5 years), depending on your hardware warranty and support terms from your vendor.

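As a sanity check on the SRVRAM and MEMOVERPROV arithmetic above, here is a minimal sketch. The 6 vCPUs per core and 4 GB per vCPU figures are the rules of thumb from the list, and the TOTALMEM value is a hypothetical example, not a recommendation:

```python
import math

def srvram(sockets, cores_per_socket, vcpu_per_core=6, gb_per_vcpu=4):
    """RAM (GB) to install per physical server, per the rule of thumb above."""
    return sockets * cores_per_socket * vcpu_per_core * gb_per_vcpu

# The two examples from the SRVRAM bullet:
print(srvram(2, 6))                   # 2 x 6 x 6 x 4 = 288 GB
print(srvram(4, 8, gb_per_vcpu=8))    # 4 x 8 x 6 x 8 = 1536 GB

# With MEMOVERPROV, the number of servers needed for a given TOTALMEM
# shrinks accordingly (TOTALMEM figure below is hypothetical):
totalmem_gb = 4000
memoverprov = 1.3
servers = math.ceil(totalmem_gb / (srvram(2, 6) * memoverprov))
print(servers)                        # 11 servers
```
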
If you can figure out the above factors, you can proceed with calculating your operating IaaS costs. Keep reading here!

Adding value to SaaS

Software as a service is an entirely different animal from IaaS or PaaS. Implementing the latter two can be done (almost) with off-the-shelf platforms and a few consultants: grab your favorite cloud automation platform (pick any: Eucalyptus, [Elastic|Open|Cloud]stack, Applogic, Abiquo, even HP SCA), throw in volume servers and storage, host in a reliable DC and you are good to go.

On the other hand, SaaS is something you have to:

  1. Conceive. IaaS and PaaS are self-explanatory (infrastructure and platform: virtual computing and database/application engine/analytics for rent); SaaS is… everything: from cloud storage to CRM for MDs.
  2. Implement: SaaS is not sold in shops. You have to develop code. This means finding talented and intelligent humans to write code, and keeping them with you throughout the project lifecycle.
  3. Market: Finding the right market for your SaaS is as important as building it. SaaS is a service; services are tailored to customers and come in different sizes, colors and flavors. One SaaS to rule them all does not work.
  4. Sell: Will you go retail and address directly end customers? Advertising and social media is the road to go. Wholesale? Strike a good revenue sharing deal with somebody that already has customers within your target group, say, a datacenter provider or web hosting.
  5. Add some value to your SaaS. Cloudifying a desktop application brings little value to your SaaS product: It’s as good as running it on the desktop; the sole added value is ubiquitous access over the web. Want some real value? Eliminate the need to do backups. Integrate with conventional desktop software. Do auto-sync. Offer break-away capability (take your app and data and host somewhere else).
Let’s take two hypothetical examples: cloud storage and CRM for doctors.

Cloud storage is a good offering for customers seeking a secure repository, accessible from everywhere. Let’s consider two approaches. The first would be:
  • High end branded storage array with FC and SSD disks
  • 5-minute snapshots, continuous data protection
  • FTP and HTTP interface
  • Disk encryption
  • Secure deletion
The second approach would be:
  • WebDAV interface
  • Data retention
  • Daily replication
  • Auto sync with customer endpoints
  • Integrated content search

What’s wrong with the first approach? It is typical of the IT mindset: offer enterprise IT features, like OLTP/OLAP-capable storage, in the cloud. Potential customers? Enterprises that need high-powered data storage. Well, if you are an enterprise, most likely you’d rather keep your OLTP/OLAP workloads in house, wouldn’t you? Why bother?

The second approach offers services that are not delivered by your enterprise IT machinery: they add value to a cloud storage service and, at the end of the day, they are the kind of services deemed too expensive or complicated to implement in house. Potential customers? Enterprises that have not implemented these services but would seriously consider renting them.

Let’s consider now a cloud CRM for doctors. What would be some value-added features for private MDs, apart from a database with customer names and appointment scheduling? I can think of a few:

  • Brief medical history of patient delivered to the doctor’s smartphone/pad. Can save lives.
  • List of prescribed medicines with direct links to medicare/manufacturer site. Patients can forget or mix up their prescribed drugs; computers never forget.
  • Videochat with patient.
  • Patient residence on Google Maps and directions on how to get there.

VMs, IOPS and spindles

Sizing a virtual “farm” – how difficult can that be? You need to calculate CPU power, RAM, network bandwidth and storage. CPU sizing is easy: go for two-socket 4-core systems (best value for money these days); anything above a 2.5 GHz clock is enough. RAM? Say 6 GB of RAM per core, so for a dual-socket 4-core server, 48 GB of RAM is good to go. Network? Add as many ethernet interfaces as you can; 4 GbE per chassis will give you ~400 MBps (that’s megabytes per second), and a few more cards for iSCSI or FC and you’re OK (assuming your storage I/O throughput does not exceed half a gigabyte per second – which is a lot).
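
A quick back-of-envelope version of the above. The 6 GB per core figure is the rule of thumb from this paragraph, and the ~80% usable-bandwidth factor is an assumption used to land on the ~400 MBps figure:

```python
sockets, cores_per_socket = 2, 4
ram_gb = sockets * cores_per_socket * 6   # 6 GB per core -> 48 GB per host

nics = 4                                  # 4 x 1 GbE per chassis
raw_mb_s = nics * 1000 // 8               # 4 Gbit/s ~ 500 MB/s raw
usable_mb_s = int(raw_mb_s * 0.8)         # ~400 MB/s after protocol overheads
print(ram_gb, usable_mb_s)
```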

The above configuration is derived from my humble experience of running vSphere for two years now. The average footprint of the VMs we use is 2 vCPU and 4 GB of RAM, with network throughput not exceeding 50 Mbps per VM, and storage I/O… well, like this:

We have left something out: Storage IOPS. That is, disk IOPS. To be more precise, disk pool cumulative IOPS.

Disk throughput is something different than IOPS. Throughput is sustainable and depends mostly on platter speed: 15000 RPM enterprise disks (FC or SAS) have twice the throughput of SATA disks (7200 RPM) – it’s physics. Throughput is essential when accessing sequential data (usually big files, like video), but in a VM environment this is extremely rare. In addition, when people talk about throughput, they usually mean read operations, not writes. And here is where trouble begins: virtual hosts do far more write operations than reads, and they are truly random, not sequential, so what matters here is not throughput but disk operations per second.

What is the IOPS profile of a VM? Well, there is some insight in this really, really cool white paper (section 4.3): the write-to-read ratio can be as high as 4:1. I can confirm this. Straight from the NAS box (an Ubuntu NFS server):

administrator@jerry:~$ iostat -m
Linux 2.6.32-28-server (jerry)  05/26/2011    _x86_64_     (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.13    0.11    0.00   99.76

Device:         tps    MB_read/s   MB_wrtn/s    MB_read    MB_wrtn
sda            0.03         0.00        0.00       1318       1571
sdb           27.10         0.02        0.16      89450     619959
sdc           42.63         0.01        0.63      43258    2385091
sdd           23.19         0.00        0.05       6310     200962
dm-0         236.66         0.04        0.84     139017   3206013
dm-1           0.12         0.00        0.00       1316       1571
dm-2           0.00         0.00        0.00          0          0

This is exaggerated by the fact that the NFS server has 8 GB of RAM, a large portion of which is used for caching read operations. Still, write operations are substantially more than read ops.
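
To put a number on that claim, divide the cumulative MB_wrtn by MB_read for the aggregate device dm-0 in the iostat output above:

```python
# dm-0 cumulative counters from the iostat run above
mb_read, mb_written = 139017, 3206013
ratio = mb_written / mb_read
print(f"write:read ~ {ratio:.0f}:1")  # heavily inflated by read caching in RAM
```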

What is the best disk pool design for this kind of workload? Obviously, we need lots of disks to accommodate the VM disk capacity. To begin with, for an environment sustaining up to 50 VMs with thin provisioning, assuming an average of 200 GB allocated capacity per VM and 10% utilization, 1.5 TB is enough (10% x 200 GB = 20 GB net capacity per VM; 20 GB x 50 VMs = 1 TB; add 50% for one year of VMDK inflation). So we need 1.5 TB in some sort of RAID to tolerate disk failure. What kind of RAID should we use?
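
The capacity arithmetic above, as a sketch (all figures are the assumptions stated in the text):

```python
vms = 50
allocated_gb_per_vm = 200   # thin-provisioned allocation per VM
utilization_pct = 10        # ~10% of the allocation actually written
inflation_pct = 50          # one year of VMDK growth headroom

net_gb = vms * allocated_gb_per_vm * utilization_pct // 100  # 50 x 20 = 1000 GB
needed_gb = net_gb * (100 + inflation_pct) // 100            # 1500 GB ~ 1.5 TB usable
print(net_gb, needed_gb)
```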

The problem with RAID and write operations is that a single write op is translated into multiple write ops on the disk drives. In RAID0 (concatenation/striping), a data block is broken into several write operations, each ending up on a single disk, so the data stream is sent as-is to the platters. In RAID1, a single write operation must be written to two drives, and in RAID5 to at least three drives (two data and one parity). This is shown in the following figure:

For the sake of simplicity, we assume that a server on top sends small write operations to the disk controller, with each operation marked with a different color. In RAID0, all write ops are sent to disks: data are not written to more than one disk. In RAID1, the effective bandwidth is halved: with half the number of write requests, the disks are equally saturated. And in RAID5, with the same number of disks, the effective bandwidth is reduced even further. In all three cases the physical disks run at their maximum I/O, but the effective I/O is totally different.
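
Using the simplified per-write penalties from this discussion (one back-end write per front-end write in RAID0, two in RAID1, three in RAID5; note that a real RAID5 small write is usually counted as a 4x penalty because of its read-modify-write cycle), the effective write IOPS of a pool can be sketched as:

```python
def effective_write_iops(disks, iops_per_disk, raid_level):
    """Front-end write IOPS for a pool, using the simplified per-write
    penalties discussed above (illustrative helper, not a vendor formula)."""
    penalty = {"raid0": 1, "raid1": 2, "raid5": 3}[raid_level]
    return disks * iops_per_disk // penalty

# Six 15k RPM disks at a rule-of-thumb ~180 IOPS each:
for level in ("raid0", "raid1", "raid5"):
    print(level, effective_write_iops(6, 180, level))
```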

The best strategy is RAID0. Well, we can’t do that: if a disk dies, it’s all over. The next best configuration is RAID1, and this is the layout we have chosen for our vSphere environment. Our NAS boxes are presently… Ubuntu NFS servers (single quad-core CPU, 8 GB RAM). Disks are mirrored via hardware RAID, and the mirrors are then grouped into large concatenated (RAID0) logical volumes with LVM. Disk hot-sparing is handled by the hardware RAID controller, avoiding cranky LVM code. It has worked perfectly for the past year with more than sufficient performance: disk throughput is more than enough and latency is kept at ~5 ms at all times:

We have chosen to set up two NAS boxes, one with 300 GB SAS 15000 RPM drives and one with 500 GB SATA 7200 RPM disks: a quick ’n’ small disk pool and a slow but big one. All disk volumes are set up as described above, with two hot-spare disks per server. It just works and feels like it’ll run forever…