lev_lafayette's blog

Password Praise in the Future Tense

Apropos the previous post, I am coming to the conclusion that University's are very strange places when it comes to password policies. Mind you, it shouldn't really come to much of a surprise - the choice of technologies adopted are often so mind-bogglingly strange one is tempted to conclude that the decisions are more political than technical. Of course, that would never happen in the commercial world. All this aside, consider the password policy of a certain Victorian university.

The constellation is changed, the disposition is the same

Ars Technica has reported of a relatively small GPU-Linux cluster which can crack by brute force standard eight-character MS-Windows passwords in under six hours. There are, of course, a reasons and caveats. Firstly, as online servers will typically block repeat password attempts, is system is most effective against offline password hashes, which then of course can be used for online exploits.

The Danger of Reusing Old Scripts

Did you know you can bring down an entire HPC cluster with an old script? Well, this week I had such an experience. As the systems administrator for a seriously aging cluster with over 800 post-graduate and post-doctoral researchers, "stress" is a normal part of daily life (for future reference: it's probably killing me).

Batchholds, Leap Seconds, and PBS Restarts

It is not unusual for a few jobs to fall into a batchhold state when one is managing a cluster; users often write PBS submissions with errors in them (such as requesting more core than what is actually available). When a sysadmin has the opportunity to do so they should check such scripts, and educate the users on what they have done wrong.

Foreword to Sequential and Parallel Programming with C and Fortran by Dr. John L. Gustafson

It is finally time for a book like this one.

When parallel programming was just getting off the ground in the late 1960s, it started as a battle between starry-eyed academics who envisioned how fast and wonderful it could be, and cynical hard-nosed executives of computer companies who joked that “parallel computing is the wave of the future, and always will be.”

Reviving a Downed Compute Node in TORQUE/MOAB

The following describes a procedure for bringing up a compute node in TORQUE that's marked as 'Down'. Whilst the procedure, once known, is relatively simple, investigation to come to this stage required some research and to save others time this document may help.

1. Determine whether the node is really down.

Following an almighty NFS outage quite a number of compute nodes were marked as "down". However the two standard tools, `mdiag -n | grep "Down"` and `pbsnodes -ln` gave significantly different results.

NFS Cluster Woes

A far too venerable cluster (Scientific Linux release 6.2, 2.6.32 kernel, Opteron 6212 processors) with more than 800 user accounts makes use of NFS-v4 to access storage directories. It is a typical architecture, with a management and login node with a number of compute nodes. The directory /usr/local is on the management node and mounted across to the login and compute nodes. User and project directories are distributed two storage arrays appropriately named storage1 and storage2.


[root@edward-m ~]# cat /etc/fstab

Can processes survive after shutdown?

I had a process in a "uninterruptible sleep" state. Trying to kill it is, unsurprisingly, unhelpful. All the literature on the subject will say that it cannot be killed, and they're right. It's called "uninterruptible" for a reason. An uninterruptable process is in a system call that cannot be interrupted by a signal (such as a SIGKILL, SIGTERM etc).

Deleting "Stuck" Compute Jobs

Often on a cluster a user launches a compute job only to discover that they have some need to delete it (e.g., the data file is corrupt, there was an error in their application commands or PBS script). In TORQUE/PBSPro/OpenPBS etc this can be carried out by the standard PBS command, qdel.


[compute-login ~] qdel job_id

Sometimes however that simply doesn't work. An error message like the following is typical: "qdel: Server could not connect to MOM". I think I've seen this around a hundred times in the past few years.

Pages