
Aviation Disasters and the IT World

I have a fondness for watching documentaries about aviation disasters.

Now, before you judge me as someone with a psychological disorder (we all slow down when we see an accident on the highway, but planes crashing into each other or into the ground?), let me explain why I watch these depressing films and what they have to do with IT work.

I should start by noting that, as a private pilot, I have a direct interest in why aviation accidents happen. Learning from others’ mistakes is an important part of staying safe up there.

Ever see a car accident happen and find yourself compelled to Google what happened? Dr. Mayer, quoted in the NBC News article linked below, says this is our survival instinct at work. “This acts as a preventive mechanism to give us information on the dangers to avoid and to flee from,” he says.

https://www.nbcnews.com/better/health/science-behind-why-we-can-t-look-away-disasters-ncna804966

But there’s another reason I watch these documentaries, one that has only recently become clear to me: seeing how mistakes are made in a domain where mistakes can kill can be generalized to understand how mistakes might be avoided in other domains where the consequences, while less catastrophic to human life, are still of high concern.

In my case, and likely in anyone’s case who is reading this, that’s the domain of IT work.

The most important lesson I take away from aviation disaster stories is that disasters are rarely the result of a single mistake; they result from a chain of mistakes, any one of which, if caught, would have prevented the negative outcome.

Let me give an example of one such case and see what we, as IT professionals, might learn from it.

On the night of 1 July 2002, Bashkirian Airlines Flight 2937, a Tupolev Tu-154 passenger jet, and DHL Flight 611, a Boeing 757 cargo jet, collided in mid-air over Überlingen, a southern German town on Lake Constance near the Swiss border. All 69 passengers and crew aboard the Tupolev and both crew members of the Boeing, 71 people in total, were killed.

The accident investigation that followed determined that the following chain of events led to the disaster:

  • The Air Traffic Controller in charge of the safety of both planes was overloaded as the result of the temporary departure of another controller in the center.
  • An optical collision warning system was out of service for maintenance but the controller had not been informed of this.
  • A phone system used by controllers to coordinate with other ATC centers had been taken down for service during his shift.
  • A change to the TCAS (Traffic Collision Avoidance System) on both aircraft that would have helped, which was derived from a similar accident months earlier, had not yet been implemented.
  • The training manuals for both airplanes provided confusing information about whether TCAS or the ATC’s instructions should take priority if they conflicted.
  • Another change to TCAS, which would have informed the controller of the conflict between their instructions and the TCAS instructions, was not yet deployed.

Many issues led to the disaster (all of which, thankfully, have since been resolved), but the important thing to note is that if any one of these issues had not arisen, the accident would likely not have happened.

That being true, what can we learn from this?

I would argue that, in each case, those responsible for the “system” of air traffic control, airplane systems design, and crew training could have recognized that each issue, taken individually, could lead to a disaster and should have been dealt with in a timely manner. This is true even though each issue by itself could have been (and probably was) dismissed as being of little importance.

In other words, having a mindset that any single issue should be addressed as soon as possible without detailed analysis of how it could contribute to a negative outcome might have made all the difference here.

And here is where I think we can apply some lessons from this accident, and many others, to our work on IT projects.

We should assume, absent evidence to the contrary, that a single issue during a project could have negative implications that are not immediately obvious, and that it should therefore be addressed and remediated as soon as practicable.

The difficult part of implementing this advice is judging whether a single issue really could affect the entire project, and weighing the cost of immediate remediation against the cost of failure. There is no easy answer to this. I tend to believe that unless there is a strong argument showing why a single event cannot become part of a failure chain, it should be fixed now. Alternatively, if the cost of immediate remediation clearly exceeds the expected cost of failure, then the issue can be safely put aside (but not ignored) for the time being.
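To make that rule of thumb concrete, here is a minimal sketch of the decision I am describing. It is my own framing rather than a formal risk method, and the inputs are hypothetical estimates:

```python
def fix_now(remediation_cost: float,
            failure_cost: float,
            failure_probability: float | None) -> bool:
    """Decide whether an issue should be remediated immediately.

    If we cannot estimate the probability that the issue becomes part of a
    failure chain, that uncertainty is itself the argument for fixing it now.
    """
    if failure_probability is None:
        return True
    return remediation_cost <= failure_probability * failure_cost


# Example: a cheap fix guarding against a plausible, expensive failure gets done now.
print(fix_now(remediation_cost=2_000, failure_cost=500_000, failure_probability=0.05))  # True
```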

To put this into perspective in our line of work:

Let’s imagine a system to be delivered that provides web-based consumer access to a catalog of items.

Let’s further imagine that the following are true:

  • The catalog data is loaded into the system database using a CSV export of data from another system of ancient vintage.
  • Some of the data imported goes into text fields.
  • Those text fields are directly used by the services layer.
  • Some of those text fields determine specific execution paths through the service layer code.
  • That service code assumes the execution paths can be completely specified at design time.
  • The UI layer is designed assuming that delivery of catalog data for display will be “browser safe”–i.e., no characters that will not display as intended.

This is a simple example, and over-constrained, but I think you can see where this is going.

If the data coming from the source system, destined for the target system’s text fields, contains characters that are not properly handled by the services layer and/or the UI layer, bad outcomes are likely to result.

For instance, some older systems accept text produced in MS Word, which promotes plain single- and double-quote characters to “curly” versions, and they store the resulting Unicode data in raw form. Downstream, this might cause a failure within the service layer or improper display in the UI layer.

Most of us, as experienced IT professionals, would likely never let this happen. We would sanitize the data at some point in the process, and/or provide protections in the service/UI layers to prevent such data from producing unacceptable outcomes.
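To make that concrete, here is a minimal sketch of what import-side sanitization might look like. I have chosen Python for illustration; the function names are my own and the character mapping is illustrative rather than exhaustive:

```python
import csv
import unicodedata

# Quote and dash characters that Word-style editors commonly substitute
# for their plain ASCII equivalents (illustrative, not exhaustive).
SMART_PUNCTUATION = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})

def sanitize(text: str) -> str:
    """Normalize a text field from the legacy export before it reaches the database."""
    text = unicodedata.normalize("NFKC", text).translate(SMART_PUNCTUATION)
    # Drop any remaining characters a browser may not render as intended.
    return "".join(ch for ch in text if ch.isprintable() or ch in "\t ")

def load_catalog(csv_path: str) -> list[dict]:
    """Read the legacy CSV export, sanitizing every text field on the way in."""
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as f:
        return [
            {name: sanitize(value) for name, value in row.items()}
            for row in csv.DictReader(f)
        ]
```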

But, for a moment, I want you to think of this as something other than an argument for “defense in depth” programming. I want you to think of it as taking each step of the process outlined above as a separate item, without knowing how each builds toward the ultimate, undesirable outcome, and deciding to mitigate it on the basis of the simple possibility that it might cause a problem.

For example, if the engineer responsible for coding the CSV import process says “the likelihood of having problems with bad data can be ignored or taken care of in the services layer”, my suggested answer would be “you cannot be sure of that, and if we cannot be sure it won’t happen, you need to code against it”.

And, I would give the same answer to the services layer engineer who says “the CSV process will deal with any such issues”. You need to code against it.
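Taken literally, that means the services layer carries its own guard even though the import step “should” have cleaned the data. A small sketch, again with hypothetical path codes of my own choosing:

```python
# Execution paths the service layer was designed for (hypothetical codes).
ALLOWED_PATH_CODES = {"STANDARD", "CLEARANCE", "PREORDER"}

def resolve_execution_path(path_code: str) -> str:
    """Never trust that the CSV import cleaned the field that selects the code path."""
    normalized = path_code.strip().upper()
    if normalized not in ALLOWED_PATH_CODES:
        # Fall back to a known-safe path (or raise and reject the record,
        # depending on the project's requirements).
        return "STANDARD"
    return normalized
```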

It may sound like I’m simply suggesting that “defensive coding” is a good idea (and it is). But, even if the example given is too easy, the general idea I am suggesting is that you need a mindset that removes each and every item in a possible failure chain without knowing, for certain, that it will ever cause a problem.

This suggestion is not without its drawbacks, and I would encourage you to provide your thoughts, pro or con, in the comments section of this blog.

In the meantime, I’ll be over here watching another disaster documentary….


Thoughts on autonomous vehicles

The advent of autonomous vehicles, particularly for personal use, has already had measurable impacts on our societies.

The impacts can be seen in a number of areas:

  • New road construction, and updates to existing road construction, now take into account the need to provide supporting infrastructure for autonomous vehicles.[1]
  • Most US states, and many countries, have put into place and continue to update standards, regulations, and programs to support the use of autonomous vehicles on public roads.[2]

The work in these areas is well covered on a number of websites and posts: see the notes at the end of this blog for some links.

Here, I am interested in sharing some thoughts that may not have been well covered in the literature but that interest me: the positive and negative impacts autonomous vehicles may have on infrastructure and culture, and how we might handle the new legal issues that will arise.

Infrastructure should change

The current roadway infrastructure in most of the world is predicated on the behaviors expected of human drivers and attempts to minimize the opportunity for accidents while maximizing the “throughput” of the system. This is best seen in the age-old problem of intersecting roadways, where traffic control of some kind must be instituted to avoid collisions and to allow pedestrians to cross safely (where appropriate). This is usually accomplished by some combination of stop signs, yield signs, traffic signals, and traffic regulations governing expected behavior at such intersections.

With such an existing infrastructure, massive when considered on a world-wide basis, and the fact that autonomous vehicles must mix with human-driven vehicles, it’s not surprising that most autonomous vehicles are programmed to live within existing infrastructures and rules.

For example, current autonomous vehicles that operate on surface streets are expected to recognize and properly respond to traffic signals and traffic flow control signage. They must, as there are human drivers that flow with them and abide by the same rules.

Further, as current autonomous vehicles generally fall below Level 5 (“full driving automation”, i.e., no steering wheel or manual brakes), this is not likely to change until Level 5 vehicles make up the majority (or the entirety) of on-the-road vehicles.

But let’s imagine for a moment what could (and perhaps should) change if no vehicles other than Level 5 vehicles were allowed on some, most, or all public roadways.

At that point, the limitations that apply when non-autonomous vehicles are in the mix could be eliminated entirely or greatly reduced in location and extent.

For instance, intersections would no longer need infrastructure-based flow control mechanisms; the vehicles themselves could carry that information. In-vehicle street maps would contain the details of an intersection (width, number of lanes, turn vs. straight-ahead lanes, etc.) needed to make stop/go/speed decisions. Additionally, with vehicle-to-vehicle networking, the cars and trucks could decide dynamically who has the right-of-way, at what speed the intersection should be traversed, and so on, to maintain the optimal flow rate of the traffic.

Some interesting work has been done in this area by a team at Cornell University [3] in which they modeled the idea of having “platoons” of vehicles–tightly clustered cars that travel together–that are allowed to pass through an intersection while cross-traffic is held. This differs from human-piloted vehicles in that the cycle time of the stoplights can be faster to provide shorter hold times for the cross-traffic. Throughput increases of up to 138% are possible with this approach.

Let’s go a little further with this idea. What if the following were true?

  • Traffic lights were removed/deprecated at an intersection.
  • All autonomous vehicles approaching an intersection were in constant communication.
  • Common decision algorithms based on dynamic inputs (other vehicles, weather, desired flow rate, existence of approaching pedestrians, etc.) were used by all vehicles.
  • Platoon sizes as low as one vehicle were permitted.

I can foresee–though I have no direct proof through modeling, for instance–that even larger throughput increases could be achieved.
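Purely to make the idea concrete, here is a toy sketch of a first-come-first-served slot reservation for a signal-free intersection. It is my own illustration, written as a centralized manager for simplicity rather than negotiated vehicle-to-vehicle, and it is not the Cornell team’s model:

```python
from dataclasses import dataclass

@dataclass
class CrossingRequest:
    vehicle_id: str
    arrival_time: float         # seconds until the vehicle reaches the intersection
    crossing_time: float = 2.0  # seconds needed to clear the intersection

class IntersectionManager:
    """Toy reservation scheme: entry slots are granted in the order requests arrive."""

    def __init__(self) -> None:
        self._free_at = 0.0  # earliest time the intersection box is clear

    def request_slot(self, req: CrossingRequest) -> float:
        """Return the time at which the requesting vehicle may enter the intersection."""
        entry_time = max(req.arrival_time, self._free_at)
        self._free_at = entry_time + req.crossing_time
        return entry_time

# Two vehicles approaching on crossing roads receive entry times with no traffic light.
manager = IntersectionManager()
print(manager.request_slot(CrossingRequest("car-A", arrival_time=10.0)))  # 10.0
print(manager.request_slot(CrossingRequest("car-B", arrival_time=10.5)))  # 12.0
```

Even this naive scheme never holds a vehicle for a fixed light cycle; smarter shared algorithms of the kind listed above could presumably do better.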

Certainly, one obvious advantage of such an approach would be removing the need to install and maintain traffic control mechanisms: bulbs that need to be replaced, signs that must be replaced after being damaged, and so on. Infrastructure costs could be reduced or, at the least, redirected to more meaningful efforts such as the maintenance of the roadways themselves.

This approach is not without its downsides or costs, of course.

  • There must be industry-wide agreement on decision standards for the dynamic flow control of such vehicles.
  • There cannot be more than a few, if any, non-fully-autonomous vehicles on the roadway.
  • Legal issues around where responsibility would lie for any problems would have to be resolved and codified.

But, all in all, I would argue that converting the flow control systems of public roadways to an in-vehicle, shared system would permit maximal use of public roadways in a world in which human-operated vehicles are the exception rather than the rule.


[1] Many countries are establishing standards for roadway markings, for instance, to meet the needs of autonomous vehicles. These include reflectivity standards for paints, minimum widths for lane markers, etc. See this interesting article for some background.

[2] Ibid

[3] https://news.cornell.edu/stories/2019/12/smart-intersections-could-reduce-autonomous-car-congestion