Internet of Things: Where Does the Data Go?


The Internet of Things means different things to different people. To vendors, it’s the latest in a slew of large-scale trends to affect their enterprise customers, and the latest marketing bandwagon they have to consider. To enterprise organizations, it’s still a jumble of technical standards, conflicting opinions and big potential. For developers, it’s a big opportunity to put together the right mix of tools and technologies, and probably something they are already doing under another name. Understanding how these technologies work together on a technical level is becoming important, and will provide more opportunities to use software design as part of the overall business.


As Internet of Things projects go from concept to reality, one of the biggest challenges is how the data created by devices will flow through the system. How many devices will be creating information? How will they send that information back? Will you be capturing that data in real time, or in batches? What role will analytics play in the future?


These questions have to be asked in the design phase. In the organizations I have spoken to, this preparation has proved essential to choosing the right tools from the start.


Sending the Data


It is helpful to think about the data created by a device in three stages. Stage one is the initial creation, which takes place on the device, after which the data is sent over the Internet. Stage two is how the central system collects and organizes that data. Stage three is the ongoing use of that data in the future.


For smart devices and sensors, each event creates data. This information can then be sent over the network back to the central application. At this point, you have to decide what format the data will be created in and which protocol will carry it over the network. For delivering this data back, MQTT, HTTP and CoAP are the most common standard protocols used. Each of these has its benefits and use cases.


HTTP offers a well-understood way to move data back and forth between devices and central systems. Originally developed for the client-server computing model, today it supports everything from everyday web browsing to more specialist services for Internet of Things devices. While it meets the functional requirements for sending data, HTTP adds a lot of overhead around each message in its headers. When you are working in low-bandwidth conditions, this can make HTTP less suitable.
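

As a rough illustration, a device could post a reading to a central collection service over HTTP. This is a minimal sketch using Python's requests library; the endpoint URL, device ID and payload fields are placeholders for illustration, not part of any particular platform.

```python
import time
import requests  # third-party HTTP client library

# Hypothetical collection endpoint used for illustration only.
INGEST_URL = "https://example.com/api/v1/readings"

reading = {
    "device_id": "sensor-0042",
    "temperature_c": 21.7,
    "recorded_at": int(time.time()),  # Unix timestamp, used later for ordering
}

# Each POST carries full HTTP headers, which is the overhead cost noted above.
response = requests.post(INGEST_URL, json=reading, timeout=5)
response.raise_for_status()
```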


MQTT was developed as a protocol for machine-to-machine and Internet of Things deployments. It is based on a publish/subscribe model: devices publish messages to a central system that acts as a broker, and the broker delivers them to all of the other systems that subscribe to them. New devices or services can simply connect to the broker as they need messages. MQTT is lighter than HTTP in terms of message size, so it is more useful for implementations where bandwidth is a potential issue. However, it does not include encryption as standard, so transport security has to be considered separately.
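

The publish/subscribe flow can be sketched with the paho-mqtt client in Python. The broker address and topic naming here are assumptions for illustration; in practice the broker connection would typically be wrapped in TLS, since the protocol itself does not mandate encryption.

```python
import json
import time
import paho.mqtt.client as mqtt  # Eclipse Paho MQTT client library

BROKER_HOST = "broker.example.com"          # hypothetical broker address
TOPIC = "sensors/sensor-0042/temperature"   # hypothetical topic scheme

# paho-mqtt 1.x constructor; 2.x also expects an mqtt.CallbackAPIVersion argument.
client = mqtt.Client()
client.connect(BROKER_HOST, port=1883)
client.loop_start()  # background network loop for keep-alives and acknowledgements

payload = json.dumps({"temperature_c": 21.7, "recorded_at": int(time.time())})

# QoS 1 asks the broker to acknowledge receipt; any subscribers to the topic
# (storage writers, analytics services) receive the message from the broker.
client.publish(TOPIC, payload, qos=1)

client.loop_stop()
client.disconnect()
```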


CoAP is another standard developed for low-power, low-bandwidth environments. Rather than relying on a broker as MQTT does, CoAP is aimed at one-to-one connections. It follows REST design principles and maps readily onto HTTP, while still meeting the demands of low-power devices and constrained networks.
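

A CoAP request looks much like the HTTP one, just carried over UDP with far smaller framing. The sketch below assumes the aiocoap library; the gateway URI and payload are placeholders.

```python
import asyncio
from aiocoap import Context, Message, POST  # asyncio-based CoAP library

async def send_reading():
    # The client context manages the underlying UDP transport.
    protocol = await Context.create_client_context()

    # Hypothetical CoAP resource on a collection gateway.
    request = Message(
        code=POST,
        uri="coap://gateway.example.com/readings",
        payload=b'{"device_id": "sensor-0042", "temperature_c": 21.7}',
    )

    response = await protocol.request(request).response
    print("Result:", response.code)

asyncio.run(send_reading())
```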


Each of these protocols supports taking information or updates from the individual device and sending them to a central location. The greater opportunity, however, lies in how that data is then stored and used. There are two main concerns here: how the data is acted upon as it comes into the application, and how it is stored for future use.


Storing the Data


Across the Internet of Things, devices create data that is sent to the main application to be passed on, consumed and used. Depending on the device, the network and power-consumption constraints, data can be sent in real time or in batches at any time. However, the real value is derived from the order in which data points are created.


This time-series data has to be accurate for Internet of Things applications; if it is not, it compromises the very aims of the applications themselves. Take telemetry data from vehicles. If the ordering of the data is not completely accurate, analysis can point to different, misleading results. If a certain part starts to fail in particular conditions – for example, a temperature drop at the same time as a specific level of wear – then those conditions have to be accurately reflected in the data that comes through, or the analysis will produce false results.


Time-series data can be created as events take place around the device and sent immediately. This use of real-time information provides a complete record for each device, as it happens. Alternatively, it can be collated and sent across in batches – the historical record of data will be there, it just isn't available in real time. This is common with devices where battery life is a higher priority than real-time delivery. Either way, the fundamental requirement is that each transaction on each device is recorded with the right timestamp for sorting and alignment. If you are looking at doing this in real time with hundreds of thousands or potentially millions of devices, then write speed at the database level is an essential consideration.


Each write has to be taken as it is received from the device and put into the database. For more traditional relational database technologies, this can be a limiting factor, as write requests can exceed what the database was built to handle, and writes can be delayed or dropped. When you need all the data from devices in order to create accurate and useful information, that potential loss can have a big impact. For the organizations I have spoken to about Internet of Things projects, NoSQL platforms like Cassandra provide a better fit for their requirements.
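

A common way to model this kind of time-series data in Cassandra is to partition readings by device and cluster them by timestamp, so each device's history is stored in time order and can be read back as a contiguous slice. The sketch below uses the DataStax Python driver; the keyspace, table and column names (and the single-node replication settings) are illustrative only.

```python
from cassandra.cluster import Cluster  # DataStax Python driver for Apache Cassandra

cluster = Cluster(["127.0.0.1"])  # contact point; a real cluster would list several nodes
session = cluster.connect()

# Illustrative keyspace; production clusters would use NetworkTopologyStrategy
# and a replication factor greater than one.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS telemetry
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Partition by device_id so writes for different devices spread across the cluster;
# cluster by event_time so each device's readings stay sorted by timestamp.
session.execute("""
    CREATE TABLE IF NOT EXISTS telemetry.readings (
        device_id  text,
        event_time timestamp,
        metric     text,
        value      double,
        PRIMARY KEY ((device_id), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
""")
```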


Part of this is due to the sheer volume of writes that something like Cassandra is capable of; even with millions of devices creating data all the time, the database is designed to ingest that much data as it is created. However, it is also due to how the databases themselves are designed. Traditional databases have a primary-replica arrangement, where the lead database server handles all the transactions and passes them along to other servers if required. This leads to problems in the event of an outage or server failure, as a new primary has to be put into place, which can mean downtime and potential data loss.


For properly configured distributed database systems like Cassandra, there is no ‘primary’ server that is in charge; each node within a cluster can handle transactions as they come in, and the full record is maintained over time. Even if a server fails, or a node is removed due to loss of network connectivity, the rest of the cluster can continue to process data as it comes in. For time-series data, this is especially valuable as it means that there should be no loss of data in the list of transactions over time.
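

In practice this means a client can send each write to any node it can reach, with a tunable consistency level controlling how many replicas must acknowledge it before the write is considered successful. A minimal sketch with the DataStax Python driver, reusing the illustrative table defined above:

```python
from datetime import datetime, timezone

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Several contact points: any reachable node can coordinate the write.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("telemetry")

insert = SimpleStatement(
    "INSERT INTO readings (device_id, event_time, metric, value) "
    "VALUES (%s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,  # a majority of replicas must acknowledge
)

session.execute(insert, ("sensor-0042", datetime.now(timezone.utc), "temperature_c", 21.7))
```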


Analyzing the Data


Once you have this store of time-series data, the next opportunity is to look for trends over time. Analyzing time-series data provides the opportunity to create more value for the owners of the devices involved, or carry out automated tasks based on a certain set of conditions being met. The typical example is the Internet-connected fridge that realizes it is out of milk; however, Internet of Things data is more valuable when linked to larger private or public benefits, and with more complex condition sets that have to be met. Traffic analysis, utility networks and use of power across real estate locations are all concerned with consuming data from multiple devices in order to spot trends and save money or time.


In this environment, it's helpful to think about when the results of the analytics will be required: is there an immediate, near real-time need for analysis, or is this a historical requirement? The popularity of Apache Spark for big data analysis, and of Spark Streaming for processing data in near real time, has continued to grow, and when combined with the likes of Cassandra they give developers the ability to process and analyze large, fast-moving data sets alongside each other.
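

A rough sketch of that combination using PySpark and the open-source spark-cassandra-connector is shown below, reading back the illustrative telemetry table from earlier and computing a simple per-device daily trend. The connector coordinates and version, the connection host, and the table names are assumptions; match them to your own Spark and Cassandra setup.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The connector package version is illustrative; pick the one matching your Spark version.
spark = (SparkSession.builder
         .appName("iot-telemetry-analysis")
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
         .config("spark.cassandra.connection.host", "10.0.0.1")
         .getOrCreate())

# Load the time-series table defined earlier.
readings = (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="telemetry", table="readings")
            .load())

# A simple historical trend: average value per device per day.
daily_avg = (readings
             .withColumn("day", F.to_date("event_time"))
             .groupBy("device_id", "day")
             .agg(F.avg("value").alias("avg_value")))

daily_avg.show()
```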


However, this is not just about what is taking place right now. The value from time-series data can also come over time. As an example, i2O Water in the UK looks at water-pressure information taken from devices embedded in water distribution networks around the world. This data has been gathered over two years and is stored in a Cassandra cluster. The company uses this information for its analytics and to alert customers about where maintenance might be needed.


This data has its own value for the company: it is a ready-made source of modeling and analytics information for customers that can also be used for new products. This is down to the way the company has architected its applications in a modular fashion; when a new module or service is added, the time-series data can be “played” into the system as if it were being created live. This can then be used for analytics and to show how the devices on the water network would have reacted to the variations in pressure or other sensor data during that time.
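

The replay idea itself is straightforward: because each device's history sits in the store ordered by timestamp, a new module can be fed the old readings in sequence as though they were arriving live. The sketch below is a generic illustration of that pattern against the earlier illustrative table, not i2O Water's implementation.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])
session = cluster.connect("telemetry")

def replay(device_id, handler):
    """Feed a device's stored readings, oldest first, into a new module's handler."""
    rows = session.execute(
        "SELECT event_time, metric, value FROM readings "
        "WHERE device_id = %s ORDER BY event_time ASC",
        (device_id,),
    )
    for row in rows:
        # The handler sees readings in the same order they were originally created.
        handler(device_id, row.event_time, row.metric, row.value)

# Example: drive a hypothetical new analytics module with the stored history.
replay("sensor-0042", lambda dev, ts, metric, value: print(dev, ts, metric, value))
```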


For i2O Water, the opportunity here is to add services that demonstrate more value back to the utility companies that are customers. The value of water will only increase as more people need access, which in turn makes accurate and timely data more valuable. This is a good example of how connecting devices and data can improve lives as well as create new opportunities for the companies involved.


The ability to look back at time-series data has the most far-reaching consequences for the Internet of Things as a whole. Whether it’s for private sector gain or public sector good, the design of the application and how that data is stored over time is essential to understand. When designing for the Internet of Things, the role of distributed systems that can keep up with the sheer amount of data being created is also important.


Patrick McFadin is a Chief Evangelist for Apache Cassandra.


