The relationship between capital expenditure and risk in data centre projects can often be a strained one. Data centre culture in Europe treats technical risk as something to be feared and avoided at all costs. After all, concepts like availability and resilience can be very difficult to prove with any degree of certainty.
As technical systems become increasingly complex, this avoidance of risk intensifies, requiring more levels of resilience to be added to the system. This in turn adds more complexity, and so the cycle continues at ever-increasing cost, with arguably less value to the client.
This approach to risk is ubiquitous in data centre system design, and one area where it can have far-reaching consequences is heat in the data hall.
What is Data Centre Heat Rise?
In essence, data hall heat rise is the anticipated increase in supply air temperature onto the server racks within a data hall when cooling is interrupted. Although this is normally seen as a mechanical engineering issue, such interruptions are typically the result of electrical power failure across the facility.
The question usually asked is: what happens if there is a complete failure of the utility power supply?
In such instances, assuming that there is no means of maintaining cooling, temperature within the hall will begin to rise while the UPS supported server racks continue to produce heat. This heat rise will continue until power is reinstated and cooling once again becomes operational.
In some “hyperscale” data centre facilities, the heat gain could be in excess of 12MW in a single data hall. With numbers like this, it is no wonder that some consider heat rise to be an issue that must be avoided at all costs. But this raises the question: what is that cost, and what value does it offer to the client?
For this reason, perhaps it would be better for heat rise and similar technical issues not to be treated as an absolute system failure, but rather as a risk that can be properly managed.
What is the Risk?
Commercial risk is normally defined in textbooks as the probability of a hazard occurring multiplied by the severity of that occurrence. By this definition, a high-impact event such as a tsunami may be given a vanishingly small probability and would therefore be classed as low risk. However, in some locations such an event can occur, as was found in the 2011 Fukushima nuclear power plant disaster. The probability of this occurring was not zero, but it was considered unlikely by the plant designers.
The Fukushima disaster is an extreme example which occurred in an industry familiar with risk assessment. In the design of data centres, assessment of design risks is not so well developed and is often influenced by the subjective views of those designing or constructing the facility. This method is often referred to as qualitative risk assessment, as opposed to quantitative risk assessment, which seeks to apportion numerical values to probability and severity. Quantitative risk assessment is inherently more difficult to achieve; however, severity can be attributed to cost, and probability could be linked to availability figures in data centre design.
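As a minimal sketch of the textbook definition above, the following Python snippet multiplies probability by severity to give an expected annual cost. The function name and all figures are hypothetical and chosen only for illustration; they are not taken from any real project.

```python
def risk_score(probability_per_year: float, severity_cost: float) -> float:
    """Textbook risk: probability of the hazard multiplied by its severity (cost)."""
    if not 0.0 <= probability_per_year <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability_per_year * severity_cost


# Hypothetical figures, for illustration only.
rare_but_severe = risk_score(probability_per_year=0.0001, severity_cost=50_000_000)
frequent_but_minor = risk_score(probability_per_year=0.5, severity_cost=10_000)

print(f"Rare but severe:    {rare_but_severe:,.0f} per year")
print(f"Frequent but minor: {frequent_but_minor:,.0f} per year")
```

With these assumed inputs the two hazards produce roughly the same score, which illustrates why a single number can hide a very large difference in consequence.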
Dealing with risk
The goal of commercial risk management should be to optimise the balance between the severity (commercial cost) of a hazard occurring, the probability of it occurring, and the cost of implementing mitigation measures. It should be noted that in some cases, mitigation measures can themselves introduce further risks in terms of complexity, novelty and maintenance burden.
It would be reasonable to assume that the commercial risk associated with technical design decisions is insufficiently managed in data centre construction. This can be attributed to several factors, including the absence of commercial information in the designer's decision-making process and a lack of client direction in terms of appetite for risk.
However, perhaps a more insidious reason for the lack of commercial risk management is the designers' and contractors' own appetite for risk. For a designer, the goal is to demonstrate “reasonable skill and care” in the execution of their appointment. In many cases this instils an aversion to risk in individual engineers who are not necessarily familiar with commercial considerations. This may also carry over to general contractors appointed under design and build contracts.
In summary, designers and contractors are often incentivised to over-engineer data centre projects to eliminate perceived risks to themselves.
It could be argued that in reality these risks belong to the client and could impact the commercial success of the project.
Heat Rise – What is the big picture?
Data centre cooling is fundamentally a “heat-in” versus “heat-out” equation. Under normal operation the heat exhausted from the server racks will be equal to the heat removed by the cooling equipment. Therefore, a period of time without cooling will result in a surplus of heat energy that must be removed later. The longer that period, the more heat must be dissipated later.
It is critical to understand that if there is a total cooling requirement of 1MW (for example) and a total cooling capacity of 1MW, the cooling capacity exactly matches the server heat load. This is known as “right-sizing”.
However, in this scenario there is no spare capacity available to remove the additional heat gained in the event of a power interruption. Therefore the cooling equipment must be upsized to account for the additional heat dissipation, for example 1MW+10% (1.1MW). Failing to do this limits the system’s ability to recover, and longer interruptions will require more additional capacity to recover quickly.
The second issue relates to the duration of the cooling interruption; longer durations result in higher peak temperatures. The typical solution to this is to use back-up generators to restore power to the cooling system. Because the generator is likely to start in approximately 20 seconds and the cooling system will take a further 15 seconds to resume, a typical interruption in cooling is likely to be 35 seconds. During this time there will be a heat rise in the data hall.
The level of heat rise during this interruption can be difficult to quantify, so data centre designers will often augment the cooling system. This could be by using one of the following options:
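The “heat-in” versus “heat-out” balance can be sketched with some simple arithmetic. The Python snippet below uses the 1MW load, the 1.1MW upsized capacity and the 35 second interruption from the text. It assumes a constant server load, cooling that returns instantly at full capacity, and no heat absorbed by the building or equipment, so it is only a rough illustration; the function names are hypothetical.

```python
def surplus_heat_joules(heat_load_w: float, interruption_s: float) -> float:
    """Heat energy accumulated in the hall while no cooling is available."""
    return heat_load_w * interruption_s


def recovery_time_s(heat_load_w: float, cooling_capacity_w: float,
                    interruption_s: float) -> float:
    """Time for the spare cooling capacity to remove the accumulated heat."""
    spare_w = cooling_capacity_w - heat_load_w
    if spare_w <= 0:
        # Right-sized (or undersized) system: no spare capacity to recover with.
        return float("inf")
    return surplus_heat_joules(heat_load_w, interruption_s) / spare_w


heat_load_w = 1_000_000      # 1MW server heat load
interruption_s = 20 + 15     # generator start plus cooling restart

surplus_j = surplus_heat_joules(heat_load_w, interruption_s)
right_sized = recovery_time_s(heat_load_w, 1_000_000, interruption_s)
upsized = recovery_time_s(heat_load_w, 1_100_000, interruption_s)

print(f"Surplus heat: {surplus_j / 1e6:.0f} MJ")
print(f"Recovery, right-sized: {right_sized} s")
print(f"Recovery, 10% upsized: {upsized:.0f} s")
```

Under these assumptions the 35 second interruption leaves 35 MJ of surplus heat, which 10% of spare capacity would take around 350 seconds to remove, while a right-sized system would never recover.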
- Installation of chilled water buffer vessels, pumps and pipework to maintain cooling water flow. In this scenario, a UPS supporting the pumps, and possibly the fans, maintains cooling.
- UPS support for fans within the data hall, to ensure that warm air is not localised at the servers.
- UPS support for all mechanical cooling equipment.
All these options require additional capital expenditure, and all are intended to eliminate the 35-second power interruption associated with the starting of the generators. The additional cost of installing these measures will reduce the profitability of the facility, which is in itself a commercial risk.
Probability
As discussed previously, the risk of heat rise can be split into two parts: probability and severity.
The probability of a cooling interruption occurring can be attributed to the availability of the utility power supply. Availability is a metric used in the design of data centres to quantify the percentage of time the system is operational and can be expressed on an annual or monthly basis.
Although it can vary by location, the availability of the electrical grid in the UK is approximately 99.997%. Therefore the period of time when utility power is unavailable is 0.003% of the year, or around 15 minutes and 46 seconds per year. This could occur as a single event or as multiple events during the year. In either case, the cooling interruption would typically be around 35 seconds when the system uses backup power generation.
It should be noted that this event is possible, but on average it is unlikely.
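The conversion from an availability percentage to annual downtime can be checked with a short calculation. This Python sketch assumes a 365-day year; the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def annual_downtime_s(availability_percent: float) -> float:
    """Seconds per year for which the supply is expected to be unavailable."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


downtime_s = annual_downtime_s(99.997)
minutes, seconds = divmod(round(downtime_s), 60)
print(f"{minutes} min {seconds} s per year")  # 15 min 46 s per year
```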
Severity
Severity in the context of heat rise relates to the peak temperature achieved and associated damage to the server racks, or possibly financial charges associated with the temperatures being outside of a specified range. Therefore severity may vary between projects depending upon costs incurred.
Effect on Commercial risk
If the financial penalty of heat rise is minimal, for example £10k per incident, and the probability is one incident per year, the anticipated cost associated with this risk would be approximately £150k over the 15-year life of the data centre facility. In this case it is clear that the capital cost of augmenting the cooling system must be less than £150k to offer value to the client.
Further to this, server technology has developed over the past ten years and can now operate at higher temperatures, albeit at a higher air flow rate.
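The break-even reasoning above can be expressed as a short Python sketch. It uses the £10k penalty, one incident per year and 15-year life from the text, and assumes no discounting of future costs and that the mitigation removes the penalty entirely. The function names and the two candidate capital costs are hypothetical.

```python
def lifetime_risk_cost(cost_per_incident: float, incidents_per_year: float,
                       life_years: float) -> float:
    """Anticipated cost of the risk over the life of the facility (undiscounted)."""
    return cost_per_incident * incidents_per_year * life_years


def mitigation_offers_value(capital_cost: float, cost_per_incident: float,
                            incidents_per_year: float, life_years: float) -> bool:
    """True if the mitigation costs less than the risk it is assumed to remove."""
    exposure = lifetime_risk_cost(cost_per_incident, incidents_per_year, life_years)
    return capital_cost < exposure


exposure = lifetime_risk_cost(10_000, 1, 15)
print(f"Lifetime exposure: {exposure:,}")

# Hypothetical capital costs for two mitigation options.
print(mitigation_offers_value(120_000, 10_000, 1, 15))
print(mitigation_offers_value(400_000, 10_000, 1, 15))
```

With these inputs only the cheaper option falls below the £150k exposure. A fuller appraisal would also account for maintenance costs and the time value of money.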
Predicting the level of heat rise
Because of the wide range of variables that need to be considered, predicting peak heat rise is notoriously difficult. The currently accepted method of determining heat rise in a data hall is to use computational fluid dynamics (CFD), which is a complex and often expensive method of determining patterns of air movement. The problem is that to determine heat rise, a “transient” CFD simulation is required.
Transient simulations incorporate time into the simulation, with each second of simulation leading onto the next. This requires a large amount of data processing power and will often require the CFD model to be simplified and parameters removed to ensure the stability of the calculation.
The resulting simulation is therefore a highly simplified representation of reality. This can be observed by using different CFD platforms to simulate the same data hall. Each is likely to offer different results depending upon the model detail and parameters used.
CFD should therefore be treated as an indication of overheating severity rather than a precise prediction.
Key elements that should be included in a CFD model are:
- Air temperatures and densities
- Thermal mass of racks, equipment and where possible building fabric
- Leakage and bypass air positions
It is worth noting that there is a spreadsheet method of calculating heat rise known as the “Well-Mixed Air Method”. This calculation has been seen to offer similar results to the transient CFD calculation at a fraction of the time and cost.
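To give a feel for what a well-mixed calculation involves, the Python sketch below applies a simple energy balance: all of the server heat is assumed to warm a single, uniformly mixed volume of air. This is not the spreadsheet method itself, whose details are not given here, and every input (hall volume, air properties, additional thermal mass) is a hypothetical value chosen for illustration. Treating any additional thermal mass as perfectly coupled to the air is a further simplifying assumption.

```python
AIR_DENSITY = 1.2           # kg/m^3, approximate value near room temperature
AIR_SPECIFIC_HEAT = 1005.0  # J/(kg K), approximate value for dry air


def well_mixed_temperature_rise(heat_load_w: float, interruption_s: float,
                                hall_volume_m3: float,
                                extra_thermal_mass_j_per_k: float = 0.0) -> float:
    """Average air temperature rise (K) assuming perfectly mixed air.

    extra_thermal_mass_j_per_k is an optional, hypothetical allowance for
    racks and equipment that absorb heat alongside the air.
    """
    air_capacity_j_per_k = hall_volume_m3 * AIR_DENSITY * AIR_SPECIFIC_HEAT
    total_capacity = air_capacity_j_per_k + extra_thermal_mass_j_per_k
    return heat_load_w * interruption_s / total_capacity


# Hypothetical hall: 1MW load, 35 second interruption, 5,000 m^3 of air.
rise_air_only = well_mixed_temperature_rise(1_000_000, 35, 5_000)
rise_with_mass = well_mixed_temperature_rise(1_000_000, 35, 5_000,
                                             extra_thermal_mass_j_per_k=6_000_000)

print(f"Air only:          {rise_air_only:.1f} K")
print(f"With thermal mass: {rise_with_mass:.1f} K")
```

Even this crude balance shows how sensitive the result is to the assumed thermal mass, which is one reason the inputs to any heat rise calculation deserve scrutiny.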
JDA Support
JDA have developed industry-leading expertise in reviewing CFD simulations and heat rise issues. For JDA, value to the client is central to the design process, which includes the review and evaluation of possible commercial risks to the client organisation. Few consultancies approach data centre projects in this way, which allows JDA to offer a unique service.