Data science, one of the hottest fields in tech today, has many dimensions and applications. In the natural sciences, we understand features, behavior patterns, and meaningful insights by formulating established, reusable formulas. Similarly, we can uncover behavior patterns and meaningful insights in data through engineering and statistical methods. Hence, data science can also be viewed as data + science, or the science of data.
Data science has been in use across industries for decades, and many popular tools are already in place:
- Alteryx – Consists of a Designer module for designing analytics applications, a Server component for scaling across the organization and an Analytics Gallery for sharing applications with external partners.
- IBM – Provides SPSS Modeler, a tool targeted to users with little or no analytical background. IBM also has SPSS Statistics, which is geared towards more sophisticated analysts.
- KNIME – An open source product commercialized by software vendor KNIME.com that includes an analytics platform and a number of commercial extensions for big data, cluster operations, and collaboration.
- Microsoft Revolution Analytics – Spans two products — Revolution R Open, a free download that’s an enhanced version of the R programming language, and Revolution R Enterprise, which supports the use of R in clustered environments (like Hadoop).
- Oracle Advanced Analytics – Includes Oracle Data Miner, Oracle R Advanced Analytics for Hadoop and Oracle Big Data Discovery, as well as connectors and interfaces for SQL and R.
- RapidMiner – Provides a Studio component for design, a Server component, a Hadoop connector called Radoop and a component for stream processing.
- SAP Predictive Analytics – Comprises two versions, Automated Analytics (for business users without a formal background) and Expert Analytics (targeted to professional data analysts and data scientists).
- SAS Enterprise Miner – Intended to help users quickly develop descriptive and predictive models, including components for predictive modeling and in-database scoring.
- The Teradata Aster Discovery Platform – A framework offered by Teradata with its Aster database, Discovery Portfolio with built-in analytic functions, a graph processing engine, MapReduce and a version of R.
The recent surge in the availability of low-cost technology, such as the Hadoop ecosystem, cloud computing, big data platforms, and open source tools, has led to large-scale adoption of data science across industries, from small businesses to large enterprises.
A few popular open source programming languages offer functionality on par with statistical packages such as SPSS, SAS, and Stata, including:
- K-means Clustering
- Association Rule Mining
- Linear Regression
- Logistic Regression
- Naïve Bayesian Classifiers
- Decision Tree
- Time Series Analysis
- Text Analytics
- Big Data Processing
- Visual Workflows
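To make one item from the list above concrete, here is a minimal sketch of simple linear regression using ordinary least squares, written in plain Python with no third-party libraries. The function name `fit_line` and the spend/sales figures are hypothetical, chosen only for illustration; a real project would more likely rely on an established statistics library.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: the points lie exactly on y = 2x + 1.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(spend, sales)
print(f"sales = {slope:.2f} * spend + {intercept:.2f}")
```

Because the sample points were constructed to lie on a straight line, the fitted slope and intercept recover that line. With noisy real-world data the fit would only approximate the trend.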
Planning the Strategy
Below are a few steps towards building a strategy:
- Defining the business goal -> The most important criterion.
- Sponsorship -> Buy-in from key stakeholders.
- Collaboration -> Among data engineers and data scientists; crucial for project success.
- Build a data lake or repository of valid, meaningful, and useful historical data gathered from different source systems -> Capacity planning must be considered carefully.
- Data quality -> Cleanse the data to meet the required quality norms.
- Feature extraction -> Derive meaningful insights from existing data for the primary variables.
- Prediction models -> Create models, using feature engineering for derived features.
- Testing the models -> Use K-fold cross-validation or a 70:30 train/test split for thorough testing.
- Establish model results -> The accuracy of the results validates the model's effectiveness.
- Fine-tuning -> Continuous improvement of the model with new data.
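The testing step above mentions two evaluation schemes: a 70:30 split and K-fold cross-validation. The sketch below illustrates both in plain Python. All names (`holdout_split`, `k_fold_indices`, `mean_baseline_mae`) and the sample data are hypothetical, and the "model" is deliberately trivial (it always predicts the training mean) so that the focus stays on how the data is partitioned.

```python
import random


def holdout_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into train and test portions (70:30 by default)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs; each row is tested exactly once."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def mean_baseline_mae(train_y, test_y):
    """Mean absolute error of a trivial model that always predicts the training mean."""
    prediction = sum(train_y) / len(train_y)
    return sum(abs(y - prediction) for y in test_y) / len(test_y)


# Hypothetical target values standing in for real historical data.
targets = [float(v) for v in range(1, 21)]

# 70:30 holdout: one train/test partition, one score.
train, test = holdout_split(targets)
holdout_mae = mean_baseline_mae(train, test)

# 5-fold cross-validation: five partitions, scores averaged.
fold_scores = [
    mean_baseline_mae([targets[i] for i in tr], [targets[i] for i in te])
    for tr, te in k_fold_indices(len(targets), k=5)
]
cv_mae = sum(fold_scores) / len(fold_scores)

print(f"holdout MAE: {holdout_mae:.3f}, 5-fold CV MAE: {cv_mae:.3f}")
```

A single 70:30 split is cheaper to run, while K-fold averages over several partitions and so depends less on which rows happen to land in the test set.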