It’s easy to get caught up in the current excitement about Artificial Intelligence and Machine Learning. Every day seems to bring announcements of new use cases, new techniques and better algorithms. Whether playing games, driving cars, diagnosing diseases or understanding human speech, developers are expanding the limits of what we once thought possible.
Among all this buzz it is easy to lose sight of one fundamental truth about AI - a truth that brings this seemingly novel technology right back to basics. Artificial Intelligence is all about data.
When the latest bot carries on a human conversation, or an advanced robot doctor diagnoses a disease, it is only possible because someone has painstakingly prepared, tagged and delivered the right data in the right format for the machines to learn from. And when machine learning fails - by not considering women for technical roles, or by not understanding that a table for four may mean a table at four o’clock, to name just two of many examples - the failure is often the result of incomplete, poor-quality or badly managed data sets used to train the intelligence.
This is not just a rerun of the old saying "Garbage In, Garbage Out". It’s really a reflection of the very modern idea that one person’s garbage is another person’s gold, all depending on what the raw material is to be used for.
Data Is Our Raw Material
For machine learning and advanced analytic technologies this raw material, used for training and testing, comes from the data sets collected or generated by the business. Ultimately, no matter how good the algorithm, the data makes a great difference.
Indeed, when considering some of the more advanced algorithms, data scientists are not even sure in detail how they work. In such cases our confidence comes from observing that good data gives good answers.
So, even with these advanced and innovative projects we need to revisit a classic problem of information management: how to prepare and govern data efficiently for our business purpose. To complicate the picture, in this case the purpose may be constantly changing, and may even change because of the new data the model sees.
Rapid Development and Deployment
In this scenario, both the IT teams provisioning data and the Data Scientists consuming it need tools which can source, shape and deliver data with great agility and assurance. Tools like Discovery Hub® provide these capabilities.
Agility comes from the technique of rapid and iterative data model development. A new version of the data structure can be created incrementally, deployed, loaded with data and sent live, without having to stop and rebuild the system.
Both IT teams and analysts have the assurance that the process is versioned, and can therefore be rolled back. In addition, the schema and ETL processes are documented and audited with comprehensive management tools.
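To make the idea of versioned, incremental deployment with rollback a little more concrete, here is a minimal sketch in Python. It is purely illustrative: the VersionedSchema class and its deploy, rollback and current methods are hypothetical names invented for this example, not part of Discovery Hub® or any TimeXtender API, and a real tool would also version the ETL processes and the data itself.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedSchema:
    """Hypothetical in-memory history of data model versions (illustration only)."""

    versions: list = field(default_factory=list)  # deployed versions, oldest first
    live: int = -1  # index of the version currently live; -1 means none yet

    def deploy(self, columns: dict) -> int:
        """Record a new version and make it live, leaving earlier versions intact."""
        # Drop any versions that were rolled back past, so the history stays linear.
        del self.versions[self.live + 1:]
        self.versions.append(dict(columns))
        self.live = len(self.versions) - 1
        return self.live

    def rollback(self) -> int:
        """Make the previous version live again."""
        if self.live <= 0:
            raise ValueError("no earlier version to roll back to")
        self.live -= 1
        return self.live

    def current(self) -> dict:
        """Return a copy of the live version's column definitions."""
        if self.live < 0:
            raise ValueError("nothing has been deployed yet")
        return dict(self.versions[self.live])


# Assumed example: add a column incrementally, then roll the change back.
schema = VersionedSchema()
schema.deploy({"customer_id": "int"})
schema.deploy({"customer_id": "int", "signup_date": "date"})
schema.rollback()
print(schema.current())
```

The point of the sketch is only that each change is recorded as a new version rather than overwriting the old one, which is what makes a rollback cheap.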
Managing the Data for Machine Learning
The result of these rapid-development techniques is a data infrastructure for modern needs. IT departments can deliver data into a dynamic estate of continuously evolving models. Yet they can govern this system with the confidence and reliability of a more traditional static model.
Data Scientists and consumers can meanwhile iterate over new data sets in both research and production with the understanding that these models can be assured and secured to the highest standards.
You may think of TimeXtender as a data warehousing tool, but it’s really showing the way as a data delivery and data estate tool for the latest and most competitive workloads.
Donald Farmer Thought Leadership Event
TimeXtender hosted a two-hour thought leadership event at our Bellevue office with the brilliant Donald Farmer. In case you were not able to attend, you can find a portion of the invigorating talk on our TimeXtender YouTube channel.