
Trust and our Machines

Over the last few years I’ve seen a number of articles on how, as IT professionals, we can work to build users’ trust in the systems we produce. Clearly this is important, as a system that is not trusted by its targeted users will not be used, or will be used inefficiently.

A system trusted by a user, is one that the user feels safe to use, and trusts to do tasks without secretly executing harmful or unauthorised programs

Wikipedia

This seems an obvious topic of interest to IT professionals.

For instance, if a bank’s customers do not trust the mobile app that lets them interact with their funds to accurately complete the actions they request, it won’t be used.

But there’s a flip side to this trust coin that is not often talked about or studied: how do we design systems so that all-too-trusting humans do not place their trust in them when it is inappropriate or unsafe to do so?

We actually experience this in our everyday lives, often without thinking about what it really means.

One example: compared to Google Maps on my phone, I have lower trust in my car’s navigation system to get me to the destination by the quickest route. As an IT professional, I know that Google Maps has access to real-time traffic information that the built-in system does not, and so I will rely on it more if getting to my destination in a timely manner is important.

My wife, who is not in the IT business, has almost complete trust in the vehicle navigation system to get her where she wants to go without making serious mistakes.

In a case like this, it’s not really of monumental importance which one of us can be accused of misplaced trust in a system. But there are cases where it is very important.

For instance, the “self-driving” systems currently available to the general public are, at most, SAE Level 2 driver assistance, which means they must be continuously monitored by a human who is ready to intervene whenever necessary. If a Tesla’s computer cannot find the lane markings, it notifies the driver and hands back control.

But how many reports have we seen of Tesla drivers who treat the system as though it can handle every situation, leaving them free to engage fully in other activities from which they cannot easily be interrupted?

Tesla Autopilot crash driver ‘was playing video game’

Tesla’s Autopilot lulled driver into a state of ‘inattention’ in 2018 freeway crash

One could say “there will always be stupid people”, but this just sweeps the important problem under the rug: how do we design systems that instill an appropriate level of trust in the user? Clearly the Tesla system in these cases, or the context of the system’s use, instilled too much trust on the part of the user.

A study by the Stanford University School of Engineering addresses this topic in an interesting way and with informative results.

The engineers looked at how people’s moods might affect their trust in autonomous products, such as smart speakers, and discovered a complicated relationship.

Unsurprisingly, the study found that a user’s opinion of the technology is the biggest factor determining their trust in the product. More surprisingly, it also found that users who held either a positive or a negative opinion of the technology tended to show higher levels of trust than those who were neutral.

“An important takeaway from this research is that negative emotions are not always bad for forming trust. We want to keep this in mind because trust is not always good,” said Liao, who is now an assistant professor at the Stevens Institute of Technology in New Jersey and lead author of the paper.

This makes something clear: if we are to design systems that will be trusted appropriately, we must understand that the relationship between a user’s knowledge, mood, and opinion of the system is more complex than we might imagine. We need to account for more than just the level of trust we can instill through the system’s interaction with the human; confounding factors such as age, gender, and education matter as well. How to elicit and use this information in a way that is not intrusive and does not itself generate distrust is not currently clear; more study is needed.

As IT professionals, we must be aware that instilling a proper level of trust in the systems we build is important and focus on how to achieve that.


Aviation Disasters and the IT World

I have a fondness for watching documentaries about aviation disasters.

Now, before you judge me as someone with a psychological disorder (we all slow down when we see an accident on the highway, but planes crashing into each other or the ground?), let me explain why I watch these depressing films and what they have to do with IT work.

I should start by noting that, as a private pilot, I have a direct interest in why aviation accidents happen. Learning from others’ mistakes is an important part of staying safe up there.

Ever see a car accident happen and find yourself compelled to Google what happened? Dr. Mayer says this is also our survival instincts at work. “This acts as a preventive mechanism to give us information on the dangers to avoid and to flee from,” he says.

https://www.nbcnews.com/better/health/science-behind-why-we-can-t-look-away-disasters-ncna804966

But there’s another reason I watch these documentaries, one that has only recently become clear to me: seeing how mistakes are made in a domain where mistakes can kill can be generalized to understand how mistakes can be avoided in other domains where the consequences, while less catastrophic to human life, are still of high concern.

In my case, and likely in anyone’s case who is reading this, that’s the domain of IT work.

The most important fact I take away from these aviation disaster stories is that disasters are rarely the result of a single mistake; they result from a chain of mistakes, any one of which, if caught, would have prevented the negative outcome.

Let me give an example of one such case and see how we, as IT professionals, might learn from it.

On the night of 1 July 2002, Bashkirian Airlines Flight 2937, a Tupolev Tu-154 passenger jet, and DHL Flight 611, a Boeing 757 cargo jet, collided in mid-air over Überlingen, a southern German town on Lake Constance, near the Swiss border. All 69 passengers and crew aboard the Tupolev and both crew members of the Boeing were killed.

On the night of July 1, 2002, two aircraft collided over Überlingen, Germany, resulting in the death of 71 people onboard the two aircraft.

The accident investigation that followed determined that the following chain of events led to the disaster:

  • The Air Traffic Controller in charge of the safety of both planes was overloaded as the result of the temporary departure of another controller in the center.
  • An optical collision warning system was out of service for maintenance but the controller had not been informed of this.
  • A phone system used by controllers to coordinate with other ATC centers had been taken down for service during his shift.
  • A change to TCAS (Traffic Collision Avoidance System) on both aircraft that would have helped, and which had been derived from a similar accident months earlier, had not yet been implemented.
  • The training manuals for both airplanes provided confusing information about whether TCAS or the ATC’s instructions should take priority if they conflicted.
  • Another change to TCAS, which would have informed the controller of the conflict between their instructions and the TCAS instructions, was not yet deployed.

Many issues led to the disaster (all of which, thankfully, have since been resolved), but the important thing to note is that if any one of these issues had not arisen, the accident would likely not have happened.

That being true, what can we learn from this?

I would argue that, in each case, those responsible for air traffic control, airplane systems design, and crew training could have recognized that the issue in front of them might contribute to a disaster and dealt with it in a timely manner. This is true even though each issue, taken by itself, could have been (and probably was) dismissed as being of little importance.

In other words, having a mindset that any single issue should be addressed as soon as possible without detailed analysis of how it could contribute to a negative outcome might have made all the difference here.

And here is where I think we can apply some lessons from this accident, and many others, to our work on IT projects.

We should assume, absent evidence to the contrary, that any single issue arising during a project could have negative implications that are not immediately obvious, and that it should therefore be addressed and remediated as soon as practicable.

The difficult part of implementing this advice is judging whether a single issue really could affect the entire project, and weighing the cost of immediate remediation against the cost of failure. There is no easy answer here. I tend to believe that unless there is a strong argument showing why a single event cannot become part of a failure chain, it should be fixed now. Alternatively, if the cost of immediate remediation clearly exceeds the cost of failure, the issue can be put aside, but not ignored, for the time being.

To put this into perspective in our line of work:

Let’s imagine a system to be delivered that provides web-based consumer access to a catalog of items.

Let’s further imagine that the following are true:

  • The catalog data is loaded into the system database using a CSV export of data from another system of ancient vintage.
  • Some of the data imported goes into text fields.
  • Those text fields are directly used by the services layer.
  • Some of those text fields determine specific execution paths through the service layer code.
  • That service code assumes the execution paths can be completely specified at design time (a minimal sketch of this pattern follows the list).
  • The UI layer is designed assuming that delivery of catalog data for display will be “browser safe”–i.e., no characters that will not display as intended.
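
To make these assumptions concrete, here is a minimal sketch of the kind of service-layer dispatch this scenario implies. All field names and values here are hypothetical, invented purely for illustration: an execution path is chosen directly from a text field that was loaded verbatim from the CSV export.

```python
# Minimal sketch of the assumed design, with hypothetical field names and
# values: the service layer picks an execution path directly from a raw
# text field that was loaded verbatim from the legacy CSV export.

def price_for(item: dict) -> float:
    handlers = {
        "standard": lambda i: i["list_price"],
        "clearance": lambda i: i["list_price"] * 0.5,
        "member-only": lambda i: i["list_price"] * 0.9,
    }
    # Design-time assumption: every value 'pricing_rule' can take is listed above.
    return handlers[item["pricing_rule"]](item)
```

Anything unexpected in pricing_rule, whether trailing whitespace, different casing, or a character mangled somewhere along the export, makes the lookup fail at runtime.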

This is a simple example, and over-constrained, but I think you can see where this is going.

If the source-system data destined for the target system’s text fields contains characters that are not properly handled by the services layer and/or the UI layer, bad outcomes are likely to result.

For instance, some older systems accept text produced in MS Word, which promotes straight single and double quotes to “curly” versions, and store the resulting Unicode data in raw form. Downstream, this might cause a failure within the service layer or improper display in the UI layer.
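
As one illustration of sanitizing this at the point of entry, here is a small sketch of a normalizer that folds the typographic punctuation MS Word tends to introduce back to plain ASCII. The character set here is deliberately minimal; a real normalizer would cover more.

```python
# Fold common "smart" punctuation back to plain ASCII before it reaches
# the database. Deliberately minimal; a real normalizer would cover more.
SMART_PUNCTUATION = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201C": '"',   # left double quotation mark
    "\u201D": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u00A0": " ",   # non-breaking space
})

def normalize_text(value: str) -> str:
    return value.translate(SMART_PUNCTUATION)
```

With a helper like this applied during import, the downstream layers only ever see the characters they were designed to handle.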

Most of us, as experienced IT professionals, would likely never let this happen. We would sanitize the data at some point in the process, and/or provide protections in the service/UI layers to prevent such data from producing unacceptable outcomes.

But, for a moment, I want you to think of this as something more than an argument for “defense in depth” programming. I want you to think of it as taking each step of the process outlined above as a separate item, without knowing how each builds toward the ultimate, undesirable outcome, and deciding to mitigate it simply because it might cause a problem.

For example, if the engineer responsible for coding the CSV import process says “the likelihood of having problems with bad data can be ignored or taken care of in the services layer”, my suggested answer would be “you cannot be sure of that, and if we cannot be sure it won’t happen, you need to code against it”.
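
Here is a sketch of what “coding against it” might look like on the import side. The field names, the allowed-value set, and the normalize_text helper are all assumptions carried over from the earlier sketches: the importer normalizes the text itself and quarantines any row whose path-determining field holds a value it does not recognize, rather than trusting the services layer to cope.

```python
import csv

# Values the service layer is known to handle; anything else is suspect.
ALLOWED_PRICING_RULES = {"standard", "clearance", "member-only"}

def load_catalog(path: str) -> tuple[list[dict], list[dict]]:
    """Read the legacy CSV export, sanitize the text fields, and validate
    the field that later drives execution paths. Returns (good, rejected)."""
    good, rejected = [], []
    # Encoding is an assumption; a legacy export may well be cp1252 instead.
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            clean = {k: normalize_text(v or "") for k, v in row.items()}
            rule = clean.get("pricing_rule", "").strip().lower()
            if rule not in ALLOWED_PRICING_RULES:
                rejected.append(clean)   # quarantine for review; don't guess
                continue
            clean["pricing_rule"] = rule
            good.append(clean)
    return good, rejected
```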

And, I would give the same answer to the services layer engineer who says “the CSV process will deal with any such issues”. You need to code against it.
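
And here is the corresponding sketch on the services side: the dispatch no longer assumes the import step caught everything, so an unknown value is logged and routed to an explicit, known-safe path instead of failing deep inside a request.

```python
import logging

logger = logging.getLogger(__name__)

def price_for(item: dict) -> float:
    """Defensive version of the earlier dispatch: unknown or malformed
    'pricing_rule' values fall back to the safest known path."""
    handlers = {
        "standard": lambda i: i["list_price"],
        "clearance": lambda i: i["list_price"] * 0.5,
        "member-only": lambda i: i["list_price"] * 0.9,
    }
    rule = str(item.get("pricing_rule", "")).strip().lower()
    handler = handlers.get(rule)
    if handler is None:
        logger.warning("Unknown pricing_rule %r; using standard pricing", rule)
        handler = handlers["standard"]
    return handler(item)
```

Neither layer needs to know whether the other one did its job; each removes its own link from the potential failure chain.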

It may sound like I’m simply suggesting that “defensive coding” is a good idea, and it is. But (perhaps the example given is too easy) I would argue that the general idea is this: you need a mindset that removes each and every item in a possible failure chain, even without knowing for certain that it will become a problem.

This suggestion is not without its drawbacks, and I would encourage you to provide your thoughts, pro or con, in the comments section of this blog.

In the meantime, I’ll be over here watching another disaster documentary….