Some years ago, I had my first experience of working in a team of thirty-odd people whose task was to produce a major piece of software for a customer. I started out as the Test Manager, and finished up as the Systems Engineer on the project, after all the other systems engineers had jumped ship when it became difficult.
I learned quite a few lessons about how not to write software, and in the years since, those lessons have been valuable whenever I've had to either cut code myself or work with software developers. If the stories below seem bitter and twisted, that's probably fair - but the lessons were real.
Never trust contractors
This is far too much of a generalisation, because I've met many great contractors. But occasionally, it does go wrong.
Since we all like to eat and have a roof over our heads, a contractor's greatest responsibility is to keep themselves in employment. This means having good-sounding things to put on your CV. I found this out when one of ours wrote a particular program as a Windows Service - all very clever, and a great thing on your CV.
But it didn't work. As tester, I had to kill it from time to time when it went wrong, and Windows wouldn't let me - it ran with System privilege, the highest of all, and was hence unkillable. I had to reboot the PC to stop it. When I explained the problem to him, his answer was that he had no idea how to fix it. Soon after, he left for another employer.
We ended up throwing out his code and starting again from the beginning.
Many years later, he was job-hunting again, and I was given his CV and asked if we should employ him. My answer, which was polite, can be better imagined than described. The moral is that the industry in Australia is not very big, and hence it's a bad idea to get people off-side.
Testing too early
At one time, we were under pressure to get finished, but the software was a very long way off working. Because my only other tester and I didn't have much else to do, and to help give the customer the idea that progress was being made, we were told to start testing.
The problem was that the software was way short of being usable by anyone who didn't know its internal workings. It would keep locking up. The senior developer would always fix it by rushing over to our PCs, opening up a command prompt, typing furiously, then announcing that it was fixed and away we would go for a while - which was not how we were going to learn anything about how to fix it ourselves.
This became more and more frustrating. Eventually, I dealt with it by pointing out that she would have to be delivered to the customer along with our software, because it didn't work unless she was there. (This is rather close to what psychologists call RET - Rational Emotive Therapy.)
Things changed quickly after that.
No time to review the code
When I first started on the project, they had been writing code for about a year - after throwing away the first two attempts when managers decided to change the operating system - and it was now on Windows NT4 (which tells you how long ago this was). The customer was very unhappy that no formal code peer reviews (a normal part of professional coding) had ever been done, and threatened to close us down, but offered a get-out solution of one review per developer.
To my horror, the project manager said no, because we were already years late and that would make us even later. I have no idea what he was high on, but to me the trade-off between a few weeks delay and being shut down for months or years was obvious.
As Test Manager, I was responsible for the review policies. I had read the current code review policy, and it was awful - every module (say, twenty or more lines of code) would be reviewed by a committee of four or more for two days. My predecessor, with no practical experience of getting a real job out the door, had written it. How on earth could you spend that much time on reviews? And (in the absence of coding standards) wouldn't it just end up with everyone saying that the author was wrong because they would have done it differently?
So, what I did was to issue a new review policy the night before the audit that would have shut us down. My new policy said there would be no meetings; reviews would consist of notes in logbooks (which already existed).
So we survived the audit, although the customer wasn't too happy.
Database integrity
I received quite a few lectures about the importance of database integrity. That's an abstract sort of comment, so you need to know the background. The idea was to set up all the rules for what was valid in the equipment we were controlling. For example, if you're switching a light on and off, then "on" and "off" are valid instructions, but "fall out of the socket" isn't, because you can't make the light do that. So before sending an instruction to the light, database integrity means checking that the instruction is "on" or "off" and nothing else. Likewise, if the light bulb is smart enough to tell you whether it's on or off, you'd want to reject a reply of "I'm green", because your software may not have been designed to cope with an unexpected answer.
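The light-switch idea above can be sketched in a few lines. This is a minimal illustration (the names and the whitelist are mine, not the project's): validate both outgoing instructions and incoming status replies against the set of values the design actually allows for.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical example: the only values the equipment can accept or report.
static const std::unordered_set<std::string> kValidStates = {"on", "off"};

// Check an outgoing instruction before sending it to the equipment.
bool is_valid_instruction(const std::string& cmd) {
    return kValidStates.count(cmd) != 0;
}

// Check an incoming status reply before the rest of the software trusts it.
bool is_valid_status(const std::string& reply) {
    return kValidStates.count(reply) != 0;  // "I'm green" is rejected here
}
```

The point is simply that both directions are checked at the boundary, so an unexpected value is caught before it can reach logic that was never designed for it.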
Now, this approach may be good software design, but in our case, the problem of what to do after some faulty data arrives wasn't ever thought through properly, which caused us a lot of grief.
Our hardware had what was effectively a design mistake - data transfer had absolutely no error detection or correction. So, sometimes we would generate valid data but our hardware would receive it with some errors (and anything could then go wrong). Other times, the configuration sent by the equipment to us would be received with errors, and our software would have to stop, throw it away, and try again - losing ten or more minutes over the very slow data connection. Occasionally, the hardware would be so corrupted that we could not regain control, and we had to use the manufacturer's own basic software to bring it back.
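Even the crudest error detection would have caught most of this. As a sketch only - this is not the project's protocol, just an illustration of what the link lacked - a sender can append a simple checksum to each message, and the receiver can verify it before trusting the payload:

```cpp
#include <cstdint>
#include <vector>

// A deliberately simple additive checksum, purely for illustration.
// A real link would use a CRC, which catches far more error patterns.
uint16_t checksum(const std::vector<uint8_t>& data) {
    uint32_t sum = 0;
    for (uint8_t b : data) sum += b;
    return static_cast<uint16_t>(sum & 0xFFFF);
}

// Receiver side: reject the message if the checksum doesn't match.
bool verify(const std::vector<uint8_t>& data, uint16_t expected) {
    return checksum(data) == expected;
}
```

With a check like this, a corrupted transfer is detected and retried deliberately, instead of corrupt data being acted on as if it were valid.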
The problem reached its peak when, after the system went into use, there was a firmware upgrade. As the firmware was replaced, all configuration data was supposed to be removed to prevent invalid data. But of course, in the real world, mistakes are made. We first heard about it through reports of hardware faults that couldn't be recovered: users tried to load a new configuration, but the software found the old one was invalid and refused to go any further.
In effect, we were asking the equipment whether something was zero (meaning OFF) or one (meaning ON) and the answer coming back was forty-two, due to the storage of the variables being moved around by the firmware update.
This meant that we had two different problems - the faulty configuration and a design mistake. The design mistake was that our software should not have been checking the validity of the old configuration when the user was trying to overwrite it with a new one.
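The design fix can be sketched as follows - these names and structures are hypothetical, but they show the principle: stored configuration is validated only when it is about to be used, never when the user is about to overwrite it with a fresh one.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical device whose stored state has been corrupted by a
// firmware update (should be 0 for OFF or 1 for ON, but reads as 42).
struct Device {
    uint8_t stored_state = 42;
};

// Reading for use: here validity matters, so reject bad data.
uint8_t read_state_for_use(const Device& d) {
    if (d.stored_state > 1)
        throw std::runtime_error("invalid stored configuration");
    return d.stored_state;
}

// Overwriting: deliberately no check of the *old* value, because it is
// about to be replaced anyway. Only the new value needs to be valid.
void overwrite_config(Device& d, uint8_t new_state) {
    if (new_state > 1)
        throw std::runtime_error("invalid new configuration");
    d.stored_state = new_state;
}
```

With this split, a corrupted device can always be recovered by loading a fresh configuration, instead of being bricked by a validity check on data that was never going to be used.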
It took me a few hours and about ten lines of code to fix it, by writing a little utility which reset just that one variable in the hardware to its correct value. But the problem should never have got that far, and would not if there had been a design for error handling.
An undocumented back door
Some years later, I had the job of porting our software from Windows NT to XP. The general rule for this kind of porting is to just try it out and see what happens, since 32-bit applications should just work. And it did work - except that at one particular place, when you tried to select a value for one variable, everything locked up, with CPU activity going to 100% and the screen no longer updating at all.
It turned out that the graphic designers had asked for what's known as a cascading context menu, with the current value of that one particular variable displayed at the cascade location. At the time, there was no way to do this in the C++ programming language, but the developer found an undocumented back door. The problem was that XP had closed the back door, and locked up instead.
The engineer who had done the graphic design (in PowerPoint) still worked for us, so once I figured out how to fix the problem by placing the current value of the variable elsewhere, I asked his opinion. He was completely happy with that - and nobody had ever asked him about it.
There's an obvious lesson there about software people who try to solve problems regardless, without going back to the designers and asking questions.
One very big file
I knew that there had always been a problem with our source code. One file was very big - something rather over 100MB, more than everything else combined. It played havoc with our configuration management, because it was too big for the configuration control software to deal with, and had to be archived manually each time - quite a bit of extra work, not to mention the risk of one day making a mistake when there are extra manual steps.
When I was doing some very limited software updates, I had to make one change to this file, to change our GUI a little bit. And when I did this, the file size doubled.
So there was the smoking gun, telling me how it had got to be so big in the first place. A few hours with Google led me to find out that it was a known fault in the C++ IDE (integrated development environment), and that there was a simple workaround (remove one variable, save, then put it back and save again) to get the original file back. So I did this, and the file shrank to about 10MB. What is more, the DLL that it compiled into also shrank, from 10MB to 2MB, and the software then ran faster.
The frustrating thing was that the workaround which I found was published the year before we did the original software development. So, apparently nobody had ever searched for a solution, just spent their time documenting a way to cope with it - and spent far more time documenting the workaround than it would have taken to find the fix.