Data Centre Preparation: A Summary
When installing a new system (whether HPC, cloud, or just a collection of servers, disks, and so on), it must be housed somewhere. Certainly this can be done without any specialist environment, especially if one is building a small test cluster; for example, half-a-dozen old but homogeneous systems, each connected to a switch with 100BASE-TX Ethernet. In fact, this is exactly what VPAC used to have as a training cluster (Wayland) for new systems administrators; they would be given an instruction sheet to rebuild the cluster, install a compiler (e.g., GCC) and perhaps a couple of applications, and run a couple of MPI jobs. Once this was done, they had a good grasp of the overall architecture of HPC systems and could work on a production machine with more confidence. All of this is fine, and indeed recommended, for test systems. For a real production system, however, a real data centre is needed. Far too often a small departmental-scale cluster has been discovered in a disused cupboard, having long ago been put together by an enthusiastic post-graduate who has since gone on to better and brighter things. Real clusters and clouds need a real data centre for real use.
Hopefully one is now convinced of the need for a specialised room for such systems - a Data Centre (DC). The term 'redundancy' will come up a lot in any discussion of data centres, and for the very good reason that *continuity* is of critical importance. An HPC cluster can lose a node - a terrible shame, especially for the person running a job on that node at the time - and the system as a whole keeps operating. A virtual machine can crash on a cloud platform, but the IaaS keeps going. Even in these systems, however, there are critical components (e.g., management and head nodes) which, if they go down, can render the whole system unusable. A DC is built around the principle of reducing the chance that such critical components fail, and of having backup routes for when they do. Even if one already has a DC, or is using a third party, it is still extremely important to review this section to ensure that the critical infrastructure requirements are being met. In doing so, involve all parties to their relevant extent: keep the finance people informed of the cost, keep the operations and network people involved in the technical details, and so on.
A DC should be in a secure location, and have redundant and backup power supplies and environmental control systems - air conditioning and fire control are a must. There should also be redundant data and telecommunication connections. Servers should be housed in standard 19-inch/48.26-cm racks [1]. By 'secure location' is meant restricted access governed by an enforced access policy; VPAC, for example, controlled access to its DC through a Quality Assurance policy that was part of its ISO 9001 certification. Access was governed by multiple levels of physical security, staff members were required to accompany any visitors, and visitors had to sign in and out. 'Secure' is also an indication of how readily available support and replacement equipment is to the data centre. Being unable to fix a server for want of an exotic screwdriver is frustrating to say the least. Always have a good supply of screwdrivers, bolts and nuts, network and power cables, and the like.
The racks should include both front and rear mounting posts, to prevent shear stress on the front mounting rails. In some cases, especially in locations with notable seismic activity, it is recommended that the racks themselves be bolted to the floor. Different servers take up a different number of rack units (RUs), and standard racks are available in different heights, although 42 RU is the most common configuration. Each RU is 44.45 mm in height which, by delightful coincidence, is also equal to the former Russian measurement of a *vershok* (вершо́к) or "tip". In planning an installation, the total number of servers and racks required must be considered, along with switches and patch panels, and blank panels to prevent the escape of heat. It is highly recommended that the servers come with mounting rails appropriate to the rack. Cable-management arms on the rear to tidy the cables - which themselves should be of the right length - are also surprisingly useful. There are common-sense reasons for keeping cabling tidy, not least that a poor distribution of cables can block heat exhaust.
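As a rough planning aid, the sketch below totals the rack units required and rounds up to whole racks. The equipment list, the per-rack allowance for switches and blanking panels, and all quantities are purely illustrative assumptions; the real figures must come from the actual equipment being purchased.

```python
# Rough rack-capacity planning sketch; all equipment figures are illustrative.
RACK_CAPACITY_RU = 42        # the most common rack height
RESERVED_PER_RACK_RU = 4     # assumed allowance: switch, patch panel, blanking panels

# Hypothetical equipment list: (description, rack units each, quantity)
equipment = [
    ("compute node, 1U", 1, 64),
    ("management/head node, 2U", 2, 2),
    ("storage server, 4U", 4, 4),
]

total_ru = sum(ru * qty for _, ru, qty in equipment)
usable_per_rack = RACK_CAPACITY_RU - RESERVED_PER_RACK_RU

# Ceiling division: a partially filled rack still occupies floor space.
racks_needed = -(-total_ru // usable_per_rack)

print(f"Total equipment: {total_ru} RU "
      f"({total_ru * 44.45 / 1000:.2f} m of vertical rack space)")
print(f"Racks required ({usable_per_rack} usable RU each): {racks_needed}")
```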
A raised floor - an elevated tiled floor with additional structural support above a solid substrate - provides a region for electrical, mechanical, and cabling infrastructure, as well as for air distribution and circulation. The DC, even a small one, should have multiple panel lifters for access to the underfloor region. The load-bearing capacity of the structural support must be sufficient for the weight of the racks, the servers, and human visitors; ensure that there is a large margin on the side of safety here, and that regular checks are made of the integrity of the support. The underfloor and ceiling areas should also have cable trays for cable organisation and protection from hazards; protective covers and ventilation openings are useful in this context.
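To illustrate the load-bearing check, the short sketch below compares the distributed and point loads of a rack against a floor's rated capacity. Every figure in it - rack weight, footprint, number of feet, and floor ratings - is an assumption for illustration only and must be replaced with the vendor's rated values.

```python
# Rough raised-floor loading check; every figure here is an assumption.
rack_weight_kg = 1200.0       # assumed fully populated 42U rack, incl. PDUs and cables
footprint_m2 = 0.6 * 1.2      # assumed 600 mm x 1200 mm rack footprint
feet = 4                      # levelling feet carrying the weight

floor_rating_kg_m2 = 1500.0   # assumed rated uniform load of the raised floor
tile_point_rating_kg = 450.0  # assumed rated point load per tile

distributed_load = rack_weight_kg / footprint_m2
point_load = rack_weight_kg / feet

print(f"Distributed load: {distributed_load:.0f} kg/m^2 "
      f"(floor rated at {floor_rating_kg_m2:.0f} kg/m^2)")
print(f"Point load per foot: {point_load:.0f} kg "
      f"(tile rated at {tile_point_rating_kg:.0f} kg)")
# Remember to add people, trolleys and spare equipment to the total,
# and keep a generous safety margin below the rated capacity.
```

With these particular assumed figures the distributed load already exceeds the assumed floor rating, which is exactly the kind of result the margin-of-safety check is meant to catch.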
Management and planning of physical space is also important. The layout of the racks obviously must ensure that there is sufficient space for access, for installing new equipment, and for a KVM crash trolley. An additional consideration, however, is the layout of the "white space" so that it has an optimal design for airflow and heat management. Whilst this is a complex science in its own right, using computational fluid dynamics (CFD) for optimal results, the basic principle from which everything can be elaborated is as follows: "The flow rates of the cooling air must meet the cooling requirements of the computer servers" [2].
In nearly all cases cooling air enters a server rack through the front face and hot air exits from the rear face. Not only does this mean there should be dedicated hot aisles and cold aisles, it also implies that hot air can flow over the top of a rack from the rear to the cooler front, requiring additional cooling, which is not optimal [3]. Perforated tiles on a raised floor are extremely useful, allowing airflow to be directed according to the layout of the servers, assuming constant cool air pressure in the underfloor region. The perforated tiles should be at the front of the hottest servers, with the air conditioners supplying cold air at around 13°C and the average temperature of the DC at around 24°C. If at all possible, distribute heat-generating systems across the DC; HPC systems and blades in particular run warm. There should be thermometers at selected locations in the DC, and roaming probes to check the temperature of individual racks and of different locations within a rack (the top will be hotter than the bottom, and the rear hotter than the front).
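As a first-order illustration of the principle from [2], the sketch below estimates the airflow needed to remove an assumed 10 kW rack load, using the supply and room temperatures mentioned above and approximate properties of air. It is an estimate only, not a substitute for proper CFD analysis.

```python
# First-order airflow estimate based on the principle in [2]:
# the cooling air must carry away the heat the servers dissipate,
# Q = rho * c_p * V_dot * dT, so V_dot = Q / (rho * c_p * dT).
RHO_AIR = 1.2      # kg/m^3, approximate density of air
CP_AIR = 1005.0    # J/(kg K), approximate specific heat of air

rack_power_w = 10_000.0   # assumed 10 kW of IT load in one rack
supply_temp_c = 13.0      # cold-air supply temperature (as above)
return_temp_c = 24.0      # approximate room/return temperature (as above)

delta_t = return_temp_c - supply_temp_c
flow_m3_s = rack_power_w / (RHO_AIR * CP_AIR * delta_t)

print(f"Required airflow: {flow_m3_s:.2f} m^3/s "
      f"(~{flow_m3_s * 3600:.0f} m^3/h, ~{flow_m3_s * 2118.88:.0f} CFM)")
```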
Logically enough, the space and location of the computer room air conditioning (CRAC) unit also have to be accounted for. It is essential to ensure that the hot air goes directly to the cooling system without mixing with cold air. An overhead ducted cooling system, for example, can have the return ducts in the hot aisles and the outlets in the cold aisles (or supply via the perforated tiles in the raised floor). Other small but beneficial measures are to turn off systems that are no longer in use and to switch off the lights when the room is unoccupied. The former can make a significant difference, depending on how many systems are generating heat without performing calculations. The latter may save only a few percent of the total heat and electrical load, but it is still sufficiently important to carry out given the minimal effort involved.
Calculation of power needs must also be undertaken [4]. Typically around 50% of a DC's electricity is consumed by the air conditioning, so keeping this number down is of significant energy and monetary importance. Around 35-40% can be expected to be spent on the computer systems themselves (with high-throughput systems at the upper end of the scale), and the rest on UPS charging. The precise quantity can be determined by adding up the requirements of *all* the equipment in the DC over time; typically this is expressed in kilowatt-hours (kWh) - the kilowatts required multiplied by the time in hours (that is, volts multiplied by amps multiplied by the number of hours, divided by 1000). Most manufacturer documentation will specify the watts required by the equipment. Be aware that HPC systems often operate close to maximum utilisation, with a corresponding use of power, and plan for future growth.
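The sketch below is a minimal example of such a calculation. The equipment list and wattages are hypothetical placeholders for the figures in the manufacturers' documentation, and it assumes the IT equipment accounts for roughly 40% of the total facility draw, in line with the proportions quoted above.

```python
# Minimal power-budget sketch, following the approach in [4]; the
# equipment list is hypothetical and the wattages are placeholders.
equipment = [
    # (description, watts each, quantity)
    ("compute node", 450.0, 64),   # HPC nodes often run near peak draw
    ("storage server", 600.0, 4),
    ("network switch", 150.0, 6),
]

it_load_w = sum(watts * qty for _, watts, qty in equipment)

# Assume IT equipment is ~40% of the total facility draw, with ~50%
# for air conditioning and the remainder for UPS charging (as above).
IT_FRACTION = 0.40
facility_w = it_load_w / IT_FRACTION

hours = 24 * 365                  # a year of continuous operation
kwh_per_year = facility_w / 1000.0 * hours

print(f"IT load:        {it_load_w / 1000:.1f} kW")
print(f"Facility total: {facility_w / 1000:.1f} kW")
print(f"Annual energy:  {kwh_per_year:,.0f} kWh")
```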
Fitting out and moving data centres is an expensive process. It is imperative when making an HPC or cloud deployment that a lasting choice is made initially. The review given here is fairly terse in an attempt to pack in as much critical information as possible; access, space, cooling, and power are the key criteria addressed. Even if the data centre is managed by a different group from the one administering the systems, it is imperative that the necessary elements of this review are checked off. Again, it will save an enormous amount of time and money in the future if the right planning and preparation is done early.
[1] IEC 60297-5, Mechanical structures for electronic equipment - Dimensions of mechanical structures of the 482,6 mm (19 in) series.
[2] Suhas V. Patankar, "Airflow and Cooling in a Data Center", Journal of Heat Transfer, 132(7), 073001, April 2010. See also Vanessa López and Hendrik F. Hamann, "Heat transfer modeling in data centers", International Journal of Heat and Mass Transfer, Volume 54, Issues 25-26, December 2011, pp. 5306-5318.
[3] Michael K. Patterson, Robin Steinbrecher, and Steve Montgomery, "Comparing Data Center & Computer Thermal Design", ASHRAE Journal, April 2005, pp. 38-42.
[4] Richard Sawyer, "Calculating Total Power Requirements for Data Centers", American Power Conversion, 2004; and Neil Rasmussen, "Calculating Space and Power Density Requirements for Data Centers", Schneider Electric, undated.