Wednesday, January 20, 2010

Software efficiency IS part of data center efficiency

A few companies are working to optimize hardware and processes to reduce energy use in the data center. Some have facilities-based projects, others IT-based ones, and a few are doing both. The graphic below, from a white paper at Emerson, depicts the current thinking: the further to the left you can save a watt, the more actual savings you receive (assuming right scaling). Conversely, the further to the right of the graph you save a watt, the less additional savings you receive through this cascade.

Graphic from white paper at  

What is missing from this graph, and not often spoken of in data center efficiency conversations, is software efficiency.  I am not talking about virtualization which is allowing you to make better use of hardware and is thus part of the "server component" in the cascade chart above.  I am talking about actual algorithm optimization. 
There are still code shops out there that attempt to optimize their code, but as most of us know, most code is rarely optimized these days. Code optimization has long since fallen under the Moore's Law knife of accounting: "It's cheaper to buy faster hardware than to pay for developer time". Often time to market pushes aside code optimization (and sometimes even debugging).

If converting an application from, say, Perl to C++, or simply turning on some compiler options, allows a bit of code to finish in 30 minutes instead of 60 and/or to use less RAM, then you likely have a significant, measurable power difference that can then trickle down the cascade. I am not by any means saying that all Perl or Java code should be converted to C or C++ (or any other language), just that if you have a piece of code that takes a significant amount of time to run OR is run a significant number of times, spending some effort to optimize it can result in significant savings.

Here is another example. Say there is a large web application written in some fictional interpreted language (say Perava ;). The bulk of this code is infrequently hit and performs perfectly well. But there is one function that is hit repeatedly for every web page. This function takes 4 seconds to complete, and to meet performance requirements for the number of web customers etc., the company deploys 10 redundant servers. Each server uses 300 watts, for a combined 3 kW for servers or, following the above cascade up, about 8.5 kW in the data center.
If optimizing that one segment of code, or translating it to some fictional compiled language, managed to cut the run time in half (overly simplistic, I know), then only 5 servers would be needed, and thus just 1.5 kW for servers or about 4.25 kW in the data center.
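
The arithmetic in this example can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the cascade multiplier is simply the ratio implied by the 3 kW and 8.5 kW figures above, the function names are made up for illustration, and the server count is assumed to scale linearly with the function's run time, so halving the run time takes the 10 servers down to 5.

```python
import math

WATTS_PER_SERVER = 300
# Facility watts per server watt, implied by the 3 kW -> 8.5 kW figures above.
CASCADE_FACTOR = 8.5 / 3.0


def servers_needed(baseline_servers, baseline_seconds, new_seconds):
    """Scale the server count linearly with run time (a deliberate simplification)."""
    return math.ceil(baseline_servers * new_seconds / baseline_seconds)


def server_kw(servers):
    """Power drawn by the servers alone, in kilowatts."""
    return servers * WATTS_PER_SERVER / 1000.0


def facility_kw(servers):
    """Server power pushed up through the cascade, in kilowatts."""
    return server_kw(servers) * CASCADE_FACTOR


for label, seconds in (("before", 4.0), ("after", 2.0)):
    n = servers_needed(10, 4.0, seconds)
    print(f"{label}: {n} servers, {server_kw(n):.2f} kW servers, "
          f"{facility_kw(n):.2f} kW data center")
```

Real capacity planning is rarely this linear (redundancy, peak load, and idle power all get in the way), but the sketch shows how a saving at the code level is multiplied on its way up to the facility.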

I have seen this same thing in high performance grid computing as well. Just by turning on optimization flags when compiling programs that are run 100k times a day for minutes at a time, we managed to eliminate the need to expand the compute cluster.

We are starting to see some real push towards compiler optimizations particularly around auto-parallelization, which with modern chips is proving hugely successful.  Because of this cascade multiplier effect we can see some real gains on software optimization efforts.

There is another area for optimization, which is between the software and hardware layers. We are used to getting the right hardware to meet the software specs, but what I am thinking about is building the software towards hardware specs. For example, instead of building a single-threaded process that would require a very fast processor, build a multithreaded process that can take advantage of a lower-wattage multi-core processor (yes, this is already the case for commercial software, but in-house software can do the same).

Probably a much better and more direct example is an in-house application that currently, to meet performance requirements, keeps all the data for a particular job in RAM. This might be able to be rewritten to use a combination of much less RAM, multiprocessing, and SSDs (solid state disks) to reach even better performance than the original. I am not talking about just putting SSDs in a server, or using SSDs for swap, and running the application unchanged, but rather about altering the application to use the much faster access times of the SSD. Bioinformatics, geo-imaging, weather simulations, and many other large data set research programs currently use large RAM systems, MPI clusters, or other methods to handle the large data sets.

Here is a great article on how Facebook is drastically increasing performance by cross compiling PHP to C++.

Thursday, January 7, 2010

Resolve to Measure PUE in 2010

Measure PUE in 2010! Why measure PUE after I blogged just last month about it not being perfect?
Well, it is better than nothing. Actually it is more than that: it is the minimum insight into your data center.
Think of driving your car late at night in an unfamiliar location. You have no idea where the next open gas station will be.
How comfortable would you be without a gas gauge?
This auto analogy isn't too far off. Many corporate data centers have some monitoring system to let someone know:
  • when the temperature gets too high
  • when there is a water leak
  • generator status
  • UPS (battery) status
And then on the IT side there is often some monitoring system or other keeping track of most of the critical servers, storage, backups, firewalls, even critical applications. Often there are all kinds of analysis of these systems, with trends and pretty graphs over time showing how much work has been done in the data center.
With all this often real-time monitoring and trending, few corporate IT managers have any idea what their data center's PUE is at any point, much less over time. (PUE is actually more closely related to MPG than to a gas gauge.) PUE gives you a baseline. As an IT or facilities director, hopefully you are looking at reducing costs in this economy for 2010. The data used to calculate PUE can easily be used to calculate the energy cost of the data center.
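
As a rough sketch of how little is involved: PUE is total facility energy divided by IT equipment energy, and the same facility meter reading multiplied by your electricity rate gives the energy cost. The meter readings and the price per kWh below are made-up numbers for illustration only, and the function names are my own.

```python
def pue(total_facility_kwh, it_equipment_kwh):
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def energy_cost(total_facility_kwh, price_per_kwh):
    """Electricity cost for the period covered by the meter reading."""
    return total_facility_kwh * price_per_kwh


# Hypothetical monthly readings, not from any real data center.
facility_kwh = 170_000.0   # utility meter feeding the data center
it_kwh = 85_000.0          # measured at the UPS or PDU output
price = 0.10               # assumed dollars per kWh

print(f"PUE:  {pue(facility_kwh, it_kwh):.2f}")
print(f"Cost: ${energy_cost(facility_kwh, price):,.2f} for the month")
```

Where you take the IT reading (UPS output, PDU, or the servers themselves) changes the result, so pick one point and measure there consistently.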
IT directors are used to showing costs of new projects amortized over time showing costs for:
  • computers
  • storage
  • network
  • cables
  • support
  • replacement hardware
  • helpdesk
  • even data center space
but usually not electricity. Some directors are starting to include anticipated electric costs in the projections, but most still don't consider it because the costs are not in their budget. Getting individual project electric cost projections can be much more difficult than measuring the data center as a whole. Planning a major new project for 2010, maybe some virtualization or new storage? Measure PUE before and after implementation.

Getting back to the gas gauge theme, my first cars were all older than me. They all pretty much had the same instruments in the dash: speedometer, gas gauge, odometer, and check engine light. My dad was a master mechanic from the Navy and he had a term for that check engine light. He called it an "idiot light" because you were an idiot to drive with it on. If you don't measure PUE then you are relying on the "idiot light", which for many data centers comes in the form of high temperature alerts, and by then it is too late.

I have several vehicles now, from my motorcycle to a Prius. Now that Prius puts that gas gauge and "idiot light" to shame. There is a row of lights to tell me what is going on with the car: check engine, change oil, change filter, tire pressure, rotate tires. Right in the middle of the car is a computer display showing instantaneous MPG, MPG since the last reset, a graph of MPG for the last 30 minutes, and a gas gauge.

This is the kind of information you want for your data center. A nice little graph showing how much energy is being used every 5 minutes. If you don't measure the PUE for your data center, then it is worse than driving a 1955 Cadillac with a broken gas gauge (you do not want to guess how much gas you have in a car that gets less than 10 MPG).
Let's get started in 2010 by measuring the PUE.
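
For the five-minute graph described above, the data behind it could be as simple as the sketch below. It assumes you can read two power values every five minutes (total facility kW and IT equipment kW); the sample readings are invented, and `summarize` is just a name I picked, not part of any monitoring product.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=5)


def summarize(samples):
    """Turn (timestamp, facility_kw, it_kw) readings into per-interval PUE and kWh."""
    hours = INTERVAL.total_seconds() / 3600.0
    rows = []
    for ts, facility_kw, it_kw in samples:
        rows.append({
            "time": ts,
            "pue": facility_kw / it_kw,
            # Average power over the interval times its length gives energy.
            "facility_kwh": facility_kw * hours,
        })
    return rows


# Invented readings for illustration only.
start = datetime(2010, 1, 20, 0, 0)
readings = [(120.0, 60.0), (126.0, 60.0), (114.0, 60.0)]
samples = [(start + i * INTERVAL, f, it) for i, (f, it) in enumerate(readings)]

for row in summarize(samples):
    print(f"{row['time']:%H:%M}  PUE {row['pue']:.2f}  {row['facility_kwh']:.1f} kWh")
```

Plot the PUE column over time and you have the data center equivalent of the MPG graph.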