


From a salt shaker with a built-in speaker to smart water devices that bring clean water to communities with weak infrastructure, connected devices are increasingly advancing into all areas of our lives. But more connectivity brings more possibilities for crippling issues that can impact product development, operations, and maintenance. IoT developers must plan for a firmware architecture that leads to a better, stickier product.

Competition among connected device manufacturers is swelling in every corner of the industry, and users will no longer give clunky products the benefit of the doubt they might have extended in the IoT's nascent days. As users grow more dependent on connected devices, they expect those devices to consistently function well - and securely. There remains, of course, work to be done: a quick Google search reveals stories like the Fitbit firmware update that destroyed the device battery, or the Tesla key fobs that could be overwritten and hijacked until a patch was rolled out.

These stories underscore that the IoT ecosystem’s connected nature requires that hardware developers approach product development differently - and take firmware updates seriously. It used to be that developers could write static firmware for specific device use cases or commoditized products and, once released, have no further interaction or engagement with the product. That system no longer works. To have a successful product, IoT device manufacturers need to invest in design and in firmware development equally.

Whether over BLE to phones, LTE, or Zigbee and other mesh networks, IoT devices are connected, regularly transmitting sensitive and personal data to and from the cloud. The near-limitless reach of modern connected devices across all areas of our lives, paired with the high price point of most IoT devices, underscores that IoT developers must have a plan (and not an after-the-fact reaction) for firmware maintenance. Putting that plan in motion requires three considerations:

Device monitoring

Ubiquitous connectivity brings major challenges, but it also brings opportunities - among them, automated device health monitoring. The typical release process relies on users to report a problem and then physically return the device to be evaluated, repaired, and shipped back. Simply put, this is a huge waste of money and time, and it risks frustrating customers to the point of losing them entirely. Using customers as your testers is a terrible business decision. (Maybe you could get away with it if you were the only game in town, but IoT device makers don't have that luxury anymore.) Automated device monitoring is the solution. By regularly analyzing the health of devices and flagging potential problems immediately, a monitoring system can help device makers catch and fix issues in hours that would otherwise have taken weeks to root-cause. Designing embedded systems with such capabilities gives critical observability into performance, stability, and overall health - whether of a single device or a fleet of millions.
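What follows is a minimal sketch of the device side of such a system. It assumes a hypothetical `transport_send()` hook standing in for whatever MQTT or HTTPS client the product already uses; the idea is simply that the firmware periodically packages a few vital signs and ships them off for fleet-wide analysis.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical transport hook -- replace with your MQTT/HTTPS client. */
extern int transport_send(const void *buf, size_t len);

typedef struct {
    uint32_t uptime_s;        /* seconds since boot */
    uint32_t heap_free_bytes; /* low-water mark of free heap */
    uint16_t reboot_reason;   /* watchdog, brownout, normal, ... */
    uint16_t conn_failures;   /* failed cloud connections since boot */
} health_report_t;

/* Called from a low-priority task, e.g. once per hour. */
void health_report_send(uint32_t uptime_s, uint32_t heap_free,
                        uint16_t reboot_reason, uint16_t conn_failures)
{
    health_report_t r = {
        .uptime_s = uptime_s,
        .heap_free_bytes = heap_free,
        .reboot_reason = reboot_reason,
        .conn_failures = conn_failures,
    };
    /* Fire-and-forget: monitoring must never destabilize the product. */
    (void)transport_send(&r, sizeof(r));
}
```

In a real product you would serialize the struct explicitly rather than sending raw memory, but the shape of the data - uptime, headroom, reboot reason, connectivity failures - is the part that matters.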

Repair

Shipping products that require an update or patch is inevitable for even the most talented and thorough teams. Just ask NASA. While no one can avoid updates entirely, it is possible to detect fleet-wide issues and solve them without burdening users. The key is to roll out updates incrementally, starting with a small number of devices and ramping up over time. This limits the impact of any new issues and insulates most of your users from the churn of getting a few bugfix releases in a row. Another good option is to implement an A/B update system if you have enough flash memory. This lets your device download an update in the background with no user impact and simply prompt the user to reboot once the update is ready. Fast and simple update flows like A/B updates are key to compliance and prevent fragmentation across your fleet. Last but not least, pair regular updates with a monitoring system so you can quickly identify problems with an update and pause or abort the rollout altogether.
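To make the A/B scheme concrete, here is a minimal sketch of boot-time slot selection, assuming a hypothetical flash layout with two firmware slots and a small metadata block; `image_valid()`, `jump_to_image()`, and `meta_save()` are placeholders for platform-specific code:

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { SLOT_A = 0, SLOT_B = 1 } slot_t;

typedef struct {
    slot_t  active;        /* slot we intend to run */
    bool    pending;       /* true if 'active' was just updated */
    uint8_t boot_attempts; /* failed boots of the pending image */
} boot_meta_t;

#define MAX_BOOT_ATTEMPTS 3

/* Hypothetical hooks provided by the platform. */
extern bool image_valid(slot_t s);            /* signature / CRC check */
extern void jump_to_image(slot_t s);          /* never returns         */
extern void meta_save(const boot_meta_t *m);  /* persist to flash      */

void bootloader_select_and_run(boot_meta_t *m)
{
    slot_t fallback = (m->active == SLOT_A) ? SLOT_B : SLOT_A;

    if (m->pending) {
        if (m->boot_attempts >= MAX_BOOT_ATTEMPTS) {
            /* New image keeps failing: roll back automatically. */
            m->active = fallback;
            m->pending = false;
        } else {
            m->boot_attempts++;   /* app clears 'pending' once healthy */
        }
        meta_save(m);
    }

    if (image_valid(m->active)) {
        jump_to_image(m->active);
    }
    jump_to_image(fallback);      /* last resort */
}
```

The application clears the `pending` flag once it has run long enough to be considered healthy; until then, every failed boot counts toward an automatic rollback.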

Building with security in mind

The ubiquity of IoT devices has accelerated customer demands for robust device security, and regulatory bodies are becoming more serious (and punitive) about security requirements and standards in lockstep. For those building smart devices, I would offer these principles as table stakes for security:

  1. Devices must be updateable. 
  2. Trusted boot is no longer optional. You need a chain of trust to control the firmware running on your device (a minimal verification sketch follows this list).
  3. Rotate secrets, and don't use a master secret. Whether that means encryption keys or other secrets a device needs to function, they must be unique per device and dynamically changed, so the compromise of one device does not lead to the compromise of others.
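As an illustration of principle 2, here is a sketch of the first link in such a chain of trust. The signature primitive is a placeholder for whatever your crypto library provides (Ed25519 is a common choice); the essential property is that the public key is immutable and the loader refuses to run anything it didn't sign:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Public key baked into ROM / OTP at manufacture -- the root of trust. */
extern const uint8_t BOOT_PUBKEY[32];

/* Hypothetical signature primitive (e.g. Ed25519) from your crypto lib. */
extern bool sig_verify(const uint8_t pubkey[32],
                       const uint8_t *msg, size_t msg_len,
                       const uint8_t sig[64]);

typedef struct {
    uint32_t size;     /* image length in bytes    */
    uint8_t  sig[64];  /* signature over the image */
} image_header_t;

/* First-stage loader: refuse to run anything the key didn't sign. */
bool boot_image_trusted(const image_header_t *hdr, const uint8_t *image)
{
    return sig_verify(BOOT_PUBKEY, image, hdr->size, hdr->sig);
}
```

Each subsequent stage verifies the next in the same way, extending the chain from ROM to application.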

Software teams have long embraced iterative processes, and IoT device developers can learn much from them. Focusing on a firmware architecture that is responsive, observable, and proactive lets device manufacturers ship a better product and create a happier customer base.

François Baldassari is the Founder and CEO of Memfault, a cloud-based observability platform for hardware devices. Prior to Memfault, François worked on developer infrastructure initiatives at Pebble and Oculus.


Can AI Replace Firmware?

Scott Rosenthal and I go back about a thousand years; we've worked together, helped midwife the embedded field into being, had some amazing sailing adventures, and recently took a jaunt to the Azores just for the heck of it. Our sons are both big data people; their physics PhDs were perfect entrees into that field, and both now work in the field of artificial intelligence.

At lunch recently we were talking about embedded systems and AI, and Scott posed a thought that has been rattling around in my head since. Could AI replace firmware?

Firmware is a huge problem for our industry. It's hideously expensive. Only highly-skilled people can create it, and there are too few of us.

What if an AI engine of some sort could be dumped into a microcontroller and the "software" then created by training that AI? If that were possible - and that's a big "if" - then it might be possible to achieve what was hoped for when COBOL was invented: programmers would no longer be needed as domain experts could do the work. That didn't pan out for COBOL; the industry learned that accountants couldn't code. Though the language was much more friendly than the assembly it replaced, it still required serious development skills.

But with AI, could a domain expert train an inference engine?

Consider a robot: a "home economics" major could create scenarios of stacking dishes from a dishwasher. Maybe these would be in the form of videos, which were then fed to the AI engine as it tuned the weighting coefficients to achieve what the home ec expert deems worthy goals.

My first objection to this idea was that these sorts of systems have physical constraints. With firmware I'd write code to sample limit switches so the motors would turn off at an end-of-motion extreme. During training, an AI-based system would try to drive the motors into all kinds of crazy positions, banging destructively into the stops. But think how a child learns: a parent encourages experimentation but prevents the youngster from self-harm. Maybe that's the role of the future developer training an AI. Or perhaps the training will be done on a simulator of some sort where nothing can go horribly wrong.
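For contrast, the conventional firmware version of that safeguard is a few lines of deterministic logic. A minimal sketch, assuming hypothetical accessors for two end-of-travel switches and a motor driver:

```c
#include <stdbool.h>

/* Hypothetical hardware accessors for a single motor axis. */
extern bool limit_min_hit(void);   /* end-of-travel switch, low end  */
extern bool limit_max_hit(void);   /* end-of-travel switch, high end */
extern void motor_set(int duty);   /* negative = reverse, 0 = stop   */

/* Called from the motor control loop: clamp motion at the stops. */
void motor_drive_safe(int requested_duty)
{
    if ((requested_duty < 0 && limit_min_hit()) ||
        (requested_duty > 0 && limit_max_hit())) {
        motor_set(0);   /* never drive into a hard stop */
        return;
    }
    motor_set(requested_duty);
}
```

A constraint this cheap to state explicitly is hard to entrust to learned weights - which is perhaps exactly where the developer's role survives.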

Taking this further, a domain expert could define the desired inputs and outputs, and a poorly-paid person could do the actual training. CEOs will love that. With that model a strange parallel emerges to computation a century ago: before the computer age, "computers" were people doing simple math to create tables of logs, trig, ballistics, etc. A roomful of them labored at a problem. They weren't particularly skilled and didn't make much, but they did the rote work under the direction of one master. Maybe AI trainers will be somewhat like that.

Just as we outsource clothing manufacturing to Bangladesh, I could see training, basically grunt work, being sent overseas as well.

I'm not wild about this idea, as it means we'd have an IoT of idiots: billions of AI-powered machines where no one really knows how they work. They've been well-trained, but what happens when there's a corner case?

And most of the AI literature I read suggests that inference success rates of 97% or so are the norm. That might be fine for classifying faces, but a 3% failure rate in a safety-critical system is a disaster. The same rate in less-critical systems like factory controllers would be just as unacceptable.

But the idea is intriguing.

Original post can be viewed here




How Good Does Firmware Have to Be?

By Jack Ganssle 

As Good As It Gets 

How good does firmware have to be? How good can it be? Is our search for perfection, or near-perfection, an exercise in futility?

Complex systems are a new thing in this world. Many of us remember the early transistor radios, which sported a half dozen active devices, max. Vacuum tube televisions, common into the 70s, used 15 to 20 tubes, more or less equivalent to about the same number of transistors. The 1940s-era ENIAC computer required 18,000 tubes, so many that technicians wheeled shopping carts of spares through the room, constantly replacing those that burned out. Though that sounds like a lot of active elements, even the 25-year-old Z80 chip used a quarter of that many transistors, in a die smaller than just one of the hundreds of thousands of resistors in the ENIAC.

Now the Pentium 4, merely one component of a computer, has 45 million transistors. A big memory chip might require a third of a billion. Intel predicts that later this decade their processors will have a billion transistors. I'd guess that even the very simplest of embedded systems, like an electronic greeting card, requires thousands of active elements.

Software has grown even faster, especially in embedded applications. In 1975, 10,000 lines of assembly code was considered huge. Given the development tools of the day - paper tape, cassettes for mass storage, and crude teletypes for consoles - working on projects of this size was very difficult. Today 10,000 lines of C - representing perhaps three to five times as much assembly - is a small program. A cell phone might contain a million lines of C or C++, astonishing considering the device's small form factor and minuscule power requirements.

Another measure of software size is memory usage. The 256-byte (that's not a typo) EPROMs of 1975 meant even a measly 4k program used 16 devices. Clearly, even small embedded systems were quite pricey. Today? 128k of flash is nothing, even for a tiny app. The switch from 8- to 16-bit processors, and then from 16- to 32-bitters, is driven more by addressing space requirements than raw horsepower.

In the late 70s Seagate introduced the first small Winchester hard disk, a 5 MB, 10-pound beauty that cost $1500. 5 MB was more disk space than almost anyone needed. Now 20 GB fits into a shirt pocket, is almost free, and fills in the blink of an eye.

So, our systems are growing rapidly in both size and complexity. And, I contend, in failure modes. Are we smart enough to build these huge applications correctly?

It's hard to make even a simple application perfect; big ones may never be faultless. As the software grows it inevitably becomes more intertwined; a change in one area impacts other sections, often profoundly. Sometimes this is due to poor design; often, it's a necessary effect of system growth.

The hardware, too, is certainly a long way from perfect. Even mature processors usually come with an errata sheet, one that can rival the datasheet in size. The infamous Pentium divide bug was just one of many bugs - even today the Pentium 3's errata sheet (renamed "specification update") contains 83 issues. Motorola documents nearly a hundred problems in the MPC555.

I salute the vendors for making these mistakes public. Too many companies frustrate users by burying their mistakes.

What is the current state of the reliability of embedded systems? No one knows. It's an area devoid of research. Yet a lot of raw data is available, some of which suggests we're not doing well.

The Mars Pathfinder mission succeeded beyond anyone's dreams, despite a significant error that crashed the software during the lander's descent. A priority inversion problem - noticed on Earth but attributed to a glitch and ignored - caused numerous crashes. A well-designed watchdog timer recovery strategy saved the mission. This was a very instructive failure as it shows the importance of adding external hardware and/or software to deal with unanticipated software errors.
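Pathfinder's save illustrates a pattern worth copying: feed the watchdog only when the system can demonstrate it is healthy, so a wedged task forces a clean reset and recovery. A minimal sketch, with a hypothetical `wdt_kick()` standing in for the hardware watchdog refresh:

```c
#include <stdbool.h>

#define NUM_TASKS 3

/* Each task sets its flag from its main loop to prove it's alive. */
static volatile bool task_alive[NUM_TASKS];

/* Hypothetical hardware watchdog refresh; a reset fires if starved. */
extern void wdt_kick(void);

/* Run periodically from a timer or supervisor task. */
void watchdog_supervisor(void)
{
    for (int i = 0; i < NUM_TASKS; i++) {
        if (!task_alive[i]) {
            return;  /* someone is wedged: let the watchdog bite */
        }
    }
    for (int i = 0; i < NUM_TASKS; i++) {
        task_alive[i] = false;  /* demand fresh proof next round */
    }
    wdt_kick();
}
```

If any task stops making progress, the supervisor stops kicking and the hardware resets the system - exactly the external safety net that rescued Pathfinder.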

The August 15, 2001 issue of the Journal of the American Medical Association contained a study of recalls of pacemakers and implantable cardioverter-defibrillators. (Since these devices are implanted subcutaneously I can't imagine how a recall works). Surely designers of these devices are on the cutting edge of building the very best software. I hope. Yet between 1990 and 2000 firmware errors accounted for about 40% of the 523,000 devices recalled.

Over the ten years of the study, of course, we've learned a lot about building better code. Tools have improved and the amount of real software engineering that takes place is much greater. Or so I thought. Turns out that the annual number of recalls between 1995 and 2000 increased.

In defense of the pacemaker developers, no doubt they solve very complex problems. Interestingly, heart rhythms can be mathematically chaotic. A slight change in stimulus can cause the heartbeat to burst into quite unexpected randomness. And surely there's a wide distribution of heart behavior in different patients.

Perhaps a QA strategy for these sorts of life-critical devices should change. What if the responsible person were one with heart disease - one who had to use the latest widget before its release to the general public?

A pilot friend tells me the 747 operator's manual is a massive tome that describes everything one needs to know about the aircraft and its systems. He says that fully half of the book documents avionics (read: software) errors and workarounds.

The Space Shuttle's software is a glass half-empty/half-full story. It's probably the best code ever written, with an average error rate of about one per 400,000 lines of code. The cost: $1000 per line. So, it is possible to write great code, but despite paying vast sums perfection is still elusive. Like the 747, though, the stuff works "good enough", which is perhaps all we can ever expect.

Is this as good as it gets?

The Human Factor

 

Let's remember we're not building systems that live in isolation. They're all part of a much more complex interacting web of other systems, not the least of which is the human operator or user. When tools were simple - like a hammer or a screwdriver - there weren't a lot of complex failure modes. That's not true anymore. Do you remember the USS Vincennes? She is a US Navy guided-missile cruiser, equipped with the incredibly sophisticated Aegis radar system. In July 1988 the cruiser shot down an Iranian airliner over the Persian Gulf. Data showing the target wasn't an incoming enemy warplane was available, but it was displayed on a number of terminals that weren't easy to see. So here's a failure where the system worked as designed, but the human element created a terrible failure. Was the software perfect since it met the requirements?

Unfortunately, airliners have become common targets for warplanes. This past October a Ukrainian missile apparently shot down a Sibir Tu-154 commercial jet, killing all 78 passengers and crew. As I write, the cause is unknown, or at least unpublished, but local officials claim the missile had been targeted on a nearby drone. It missed, flying 150 miles before hitting the jet. Software error? Human error?

The war in Afghanistan shows the perils of mixing men and machines. At least one smart bomb missed its target and landed on civilians. US military sources say wrong target data was entered. Maybe that means someone keyed in wrong GPS coordinates. It's easy to blame an individual for mistyping - but doesn't it make more sense to look at the entire system as a whole, including bomb and operator? Bombs have pretty serious safety-critical aspects. Perhaps a better design would accept targeting parameters in a string that includes a checksum, rather like credit card numbers. A mis-keyed entry would be immediately detected by the machine.
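The credit card scheme alluded to is the Luhn check digit: one extra digit appended to the number lets the machine catch any single mistyped digit and most adjacent transpositions before the data is acted on. A minimal validation routine:

```c
#include <stdbool.h>
#include <string.h>

/* Luhn check: returns true if the digit string (check digit included)
 * is internally consistent -- catches any single mistyped digit and
 * most adjacent transpositions before the data is acted on. */
bool luhn_valid(const char *digits)
{
    int sum = 0;
    size_t len = strlen(digits);

    for (size_t i = 0; i < len; i++) {
        int d = digits[len - 1 - i] - '0';
        if (d < 0 || d > 9) {
            return false;          /* reject non-digit input */
        }
        if (i % 2 == 1) {          /* double every second digit */
            d *= 2;
            if (d > 9) d -= 9;
        }
        sum += d;
    }
    return (sum % 10) == 0;
}
```

The canonical test string "79927398713" validates; change any one digit and it won't. The same trick would work for keyed-in targeting coordinates.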

It's well known that airplanes are so automated that on occasion both pilots have slipped off into sleep as the craft flies itself. Actually, that doesn't really bother me much, since the autopilot beeps at the destination, presumably waking the crew. But before leaving, the fliers enter the destination in latitude/longitude format into the computers. What if they make a mistake (as has happened)? Current practice requires pilot and co-pilot to check each other's entries, which will certainly reduce the chance of failure. Why not use checksummed data instead and let the machine validate the entry?

Another US vessel, the Yorktown, is part of the Navy's "Smart Ship" initiative. Hugely automating the engineering (propulsion) department reduces crew needs by 10% and saves some $2.8 million per year on this one ship. Yet the computers create new vulnerabilities. Reports suggest that an operator entered an incorrect parameter, which resulted in a divide-by-zero error. The entire network of Windows NT machines crashed. The Navy claims the ship was dead in the water for about three hours; other sources claim it was towed into port for two days of system maintenance. Users are now trained to check their parameters more carefully. I can't help wondering what happens in the heat of battle, when these young sailors may be terrified, with smoke and fire perhaps raging. How careful will the checks be?
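The firmware-side lesson is old and cheap to apply: validate every operator-entered parameter at the point of entry, and never let an unchecked value reach arithmetic that can't tolerate it. A trivial sketch of that kind of guard (the range struct and divisor flag are illustrative, not from any Navy system):

```c
#include <stdbool.h>

typedef struct {
    double min;  /* lowest physically sensible value  */
    double max;  /* highest physically sensible value */
} param_range_t;

/* Reject operator input outside the physically sensible range --
 * and never let a zero reach code that will divide by it. */
bool param_accept(double value, const param_range_t *range,
                  bool used_as_divisor)
{
    if (value < range->min || value > range->max) {
        return false;
    }
    if (used_as_divisor && value == 0.0) {
        return false;
    }
    return true;
}
```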

Some readers may also shudder at the thought of NT controlling a safety-critical system. I admire the Navy's resolve to use a commercial, off-the-shelf product, but I wonder if Windows, the target of every hacker's wrath, might not itself create other vulnerabilities. Will the next war be won by the nation with the best hackers?

A plane crash in Florida, one in which software did not contribute to the disaster, was a classic demonstration of how difficult it is to put complicated machines in the hands of less-than-perfect people. An instrument lamp burned out. It wasn't an important problem, but both pilots became so obsessed with tapping on the device that they failed to notice the autopilot was off. The plane very gently descended till it crashed, killing everyone.

People will always behave in unpredictable ways, leading to failures and disasters with even the best system designs. As our devices grow more complex their human engineering becomes ever more important. Yet all too often this is neglected in our pursuit of technical solutions.

Solutions?

 

I'm a passionate believer in the value of firmware standards, code inspections, and a number of other activities characteristic of disciplined development. It's my observation that an ad hoc or non-existent process generally leads to crummy products. Smaller systems can succeed through the dedication of a couple of overworked experts, but as things scale up in size, heroics become less and less successful.

Yet it seems an awful lot of us don't know about basic software engineering rules. When talking to groups I usually ask how many participants have (and use) rules about the maximum size of a function. A basic rule of software engineering is to limit routines to a page or less. Yet only rarely does anyone raise their hand. Most admit to huge blocks of code, sometimes thousands of lines. Often this is a result of changes and revisions, of the code evolving over the course of time. Yet it's a practice that inevitably leads to problems.

By and large, methodologies have failed. Most are too big, too complex, or too easy to thwart and subvert. I hold great hopes for UML, which seems to offer a way to build products that integrates hardware and software and is an intrinsic part of development from design to implementation. But UML will fail if management won't pay for quite extensive training, or tosses the approach when panic reigns.

The FDA, FAA, and other agencies are slowly becoming aware of the perils of poor software, and have guidelines that can improve development. Britain's MISRA (Motor Industry Software Reliability Association) has guidelines for the safer use of C. They feel that we need to avoid certain constructs and use others in controlled ways to eliminate potential error sources. I agree. Encouragingly, some tool vendors (notably Tasking) offer compilers that can check code against the MISRA standard. This is a powerful aid to building better code.
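To give a flavor of the kinds of constructs such guidelines target - this is illustrative of the style, not a quotation of any specific MISRA rule - compare a loose idiom with a tightened version using fixed-width types and an exhaustive switch:

```c
#include <stdint.h>

/* Looser style the guidelines discourage:
 *   int flags;                  -- width varies across compilers
 *   if (flags & 0x8000) ...     -- sign and size surprises
 *
 * Tighter style: fixed-width types, explicit constants, and a
 * default arm in every switch so no input falls through a gap. */
typedef enum { MODE_IDLE, MODE_RUN, MODE_FAULT } run_mode_t;

uint8_t mode_to_led(run_mode_t m)
{
    uint8_t led;
    switch (m) {
    case MODE_IDLE:  led = 0x01u; break;
    case MODE_RUN:   led = 0x02u; break;
    case MODE_FAULT: led = 0x04u; break;
    default:         led = 0x00u; break;  /* unknown state: safe output */
    }
    return led;
}
```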

I doubt, though, that any methodology or set of practices can, in the real world of schedule pressures and capricious management, lead to perfect products. The numbers tell the story. The very best use of code inspections, for example, will detect about 70% of the mistakes before testing begins. (However, inspections find those errors very cheaply.) That leaves testing to pick up the other 30%. Yet studies show that testing often exercises only about 50% of the software!

Sure, we can (and must) design better tests. We can, and should, use code coverage tools to ensure every execution path runs. These all lead to much better products, but not to perfection. The fact that all of the code is known to have run doesn't mean that complex interactions between inputs won't lead to bizarre outputs. As the number of decision paths increases - as the code grows - the difficulty of creating comprehensive tests skyrockets.

When time to market dominates development, quality naturally slips. If low cost is the most important parameter, we can expect more problems to slip into the product.

Software is astonishingly fragile. One wrong bit out of a hundred million can bring a massive system down. It's amazing that things work as well as they do!

Perhaps the nature of engineering is that perfection itself is not really a goal. Products are as good as they have to be. Competition is a form of evolution that often does lead to better quality. In the 70s Japanese automakers, who had practically no US market share, started shipping cars that were reliable and cheap. They stunned Detroit, which was used to making a shoddy product which dealers improved and customers tolerated. Now the playing field has leveled, but at an unprecedented level of reliability.

Perfection may elude us, but we must be on a continual quest to find better ways to build our products. Wise developers will spend their entire careers engaged in the search.

Originally posted here.

For novel ideas about building embedded systems (both hardware and firmware), join the 35,000 engineers who subscribe to The Embedded Muse, a free biweekly newsletter. The Muse has no hype and no vendor PR. Click here to subscribe


Guest post by Jeffrey Lee.

In the early days of IoT, updating remote devices often caused intermittent disruption and performance degradation. As IoT platforms have matured, they have embraced a novel way to remotely and reliably update connected devices with little to no disruption: over-the-air (OTA) firmware updates.

Over-the-air firmware updating refers to the practice of remotely updating the code on an embedded device. The embedded hardware must be built with OTA functionality for this mechanism to work.

Why OTA Firmware?

Prior to OTA updates, you had to go out and retrieve the device, take it apart, connect it to your computer, reprogram it, put the device back together, and then return it to the field.

This process is overly burdensome and unscalable for companies with devices out in the field. That hasn't stopped some from trying, though . . .

  • In 2015, Chrysler was criticized for patching a software vulnerability via mailed USB drives. Chrysler’s method put many consumers at risk because the USB drives could be intercepted, modified, and resent.

On the other hand,

  • In 2016, Tesla drivers woke up to find substantial new features in their cars after the company sent out an OTA firmware update. Consumers could now self-park their cars without having to manually update their vehicles.

You tell us which is the better headline.

OTA Firmware Benefits

  • Bugs can be fixed and product behavior continuously improved, even after the device is in the hands of your consumers.
  • Companies can test new features by sending updates to one or multiple devices.
  • Companies can save costs by managing the firmware across their fleet of devices from a seamless, unified interface.
  • Developers can deploy frequently and reliably, knowing that products will stay functional as updates are released.
  • OTA firmware augments scalability by adding new features and infrastructure to products after they are released.

OTA Firmware & Device Management

To send out OTA firmware updates, you need a device management system that can interface with microprocessors and local software on IoT devices. This is complicated to build because few companies have an IoT software and hardware ecosystem that can process OTA firmware updates and manage remote devices.

Implementing OTA firmware updates

There are two options: build your own OTA firmware system or buy a managed one. For the build route, it is imperative that you research, plan, and consult domain experts to help you add OTA functionality to your hardware and software. Implementing industry-standard encryption, finding compatible hardware and software, and finding domain experts who can actually help will be some of your biggest concerns.

However, due to the complexities of transmitting the data and the attendant security concerns, you could instead harness a pre-built managed platform like Particle.

Getting Started with Particle and OTA firmware

Particle is a full-stack IoT platform that offers the hardware and software tools to connect everyday electronics to the internet. Part of this platform, the Particle cloud and console, also allows customers to control fleets of devices and products with wireless firmware updates. Here are some of the benefits of using Particle for OTA firmware updates:

  • Future-proof your products knowing that Particle is taking care of the infrastructure, hardware, and software.
  • OTA firmware updates are sent in chunks so your device won't brick. If your device loses connection during the update process, the transfer simply resumes when the connection comes back online (a sketch of this resume pattern follows the list).
  • Firmware updates are delivered quickly because the update is just sent to the application layer and not the system layer. Particle only pushes parts of the application that have changed to the device.
  • Easily scale OTA firmware updates from 1 to 1,000 devices without hardware or software scalability issues.
  • Test application updates by sending firmware updates to one or a controlled group of devices.
  • Deliver updates securely knowing all communication channels between the device and Particle cloud are fully encrypted and authorized.
  • Document each release thoroughly via Particle console to provide your team a comprehensive picture of what has changed in each version.
  • Devices can be set into safe mode so they don't execute any application code, which can be useful if new application code contains bugs that stop the device from connecting to the cloud.
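The chunk-and-resume behavior described in the list above is the heart of any robust OTA client, whether built or bought. A platform-independent sketch of the receiving side, with hypothetical `fetch_chunk()`, `flash_write()`, and `progress_save()` hooks (this is a generic illustration, not Particle's implementation):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define CHUNK_SIZE 512u

/* Hypothetical platform hooks. */
extern int  fetch_chunk(uint32_t offset, uint8_t *buf, size_t len); /* bytes read, <=0 on error */
extern bool flash_write(uint32_t offset, const uint8_t *buf, size_t len);
extern void progress_save(uint32_t offset);  /* persist resume point */

/* Download an image of 'total' bytes, resuming from a saved offset.
 * Returns true when the whole image is on flash. */
bool ota_download(uint32_t resume_offset, uint32_t total)
{
    uint8_t buf[CHUNK_SIZE];
    uint32_t off = resume_offset;

    while (off < total) {
        size_t want = (total - off < CHUNK_SIZE) ? (total - off) : CHUNK_SIZE;
        int got = fetch_chunk(off, buf, want);
        if (got <= 0) {
            return false;        /* connection lost: retry later from 'off' */
        }
        if (!flash_write(off, buf, (size_t)got)) {
            return false;
        }
        off += (uint32_t)got;
        progress_save(off);      /* survive reboots mid-download */
    }
    return true;                 /* verify the signature before activating */
}
```

Because the resume offset is persisted after every chunk, a dropped connection or even a reboot mid-download costs nothing but time, and the image is verified before it is ever activated.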

All in all

OTA firmware is a critical driver of IoT success because it powers the reliability and scalability of connected devices. Companies must decide whether building their own OTA firmware system is worth the time and potential costs, or whether purchasing a platform with OTA functionality is a more efficient and effective way to update remote wireless devices.

This post originally appeared here.

