Data wrangling for data science and AI with Discovery Hub®

Data wrangling for data science and AI

What is data wrangling?

Data wrangling is the process of selecting, gathering, and transforming data so that it can be used to answer an AI question.

The need for data wrangling tools

A widely cited 2016 survey of data scientists suggests that data wrangling consumes as much as 80% of their time, leaving only 20% for data modeling and machine learning.

According to a survey of CIOs conducted by IDG’s CIO Research Services, 98% of those surveyed believe that preparing and aggregating large datasets in a timely fashion is a major challenge. In addition, 96% report challenges with data exploration and iterative model training on large datasets.

According to the same survey, nearly 90% of organizations are investing in AI, yet only 1 in 3 projects is successful.

Why is data wrangling such a challenge?

AI tools have focused on automating machine learning and other processes for building models, but many data scientists still must write code to aggregate, cleanse and prepare data for use by these tools. Simplification and automation are needed for the data wrangling process to shorten the time it takes to move from framing a question to finding an answer (time to value).
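To make the hand-coding burden concrete, here is a minimal sketch of the kind of script a data scientist might write to cleanse, join, and aggregate data from two sources before any modeling begins. The datasets, field names, and the `cleanse` and `revenue_by_region` helpers are hypothetical and chosen purely for illustration; they are not part of Discovery Hub.

```python
from collections import defaultdict

# Hypothetical raw extracts from two source systems (illustrative data only).
orders = [
    {"customer_id": "C1", "amount": "120.50"},
    {"customer_id": "C2", "amount": ""},         # missing value
    {"customer_id": "C1", "amount": " 30.00 "},  # stray whitespace
    {"customer_id": "C3", "amount": "75.25"},
]
customers = [
    {"customer_id": "C1", "region": "North"},
    {"customer_id": "C2", "region": "South"},
    {"customer_id": "C3", "region": "North"},
]


def cleanse(rows):
    """Drop rows with a missing amount and convert the rest to float."""
    cleaned = []
    for row in rows:
        raw = row["amount"].strip()
        if not raw:
            continue  # skip rows where the amount is empty
        cleaned.append({"customer_id": row["customer_id"], "amount": float(raw)})
    return cleaned


def revenue_by_region(order_rows, customer_rows):
    """Join orders to customers on customer_id, then sum amounts per region."""
    region_of = {c["customer_id"]: c["region"] for c in customer_rows}
    totals = defaultdict(float)
    for row in cleanse(order_rows):
        region = region_of.get(row["customer_id"])
        if region is None:
            continue  # skip orders with no matching customer
        totals[region] += row["amount"]
    return dict(totals)


result = revenue_by_region(orders, customers)
print(result)
```

Even this toy example needs explicit handling for missing values, stray whitespace, and unmatched keys. Real source systems multiply that effort, which is why preparation tends to dominate project time.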

Discovery Hub improves data wrangling to speed time to value

Benefits of data wrangling with Discovery Hub

Perform data discovery without needing access to source systems

Filter, group, join, and aggregate raw data from multiple sources for easy access by data scientists

Enforce and measure data quality through easy-to-use features

Schedule & maintain incremental refreshes of your data lake from 100+ web apps, databases & file types

Discovery Hub builds your data lake from multiple data sources

Because data discovery and mining are offloaded from source systems, those systems perform better in day-to-day operations, and data scientists can access all data in a single location.

  • Increased security: Manage security through a single connection string instead of having numerous groups manage security for each source system.
  • Increased performance: Since fewer users and analytical tools are querying and accessing source systems, performance of operational systems is not degraded by analytics and AI.