Draft:Data mining and warehousing



Data mining


Introduction

Data mining is the process of extracting useful information from large sets of data. It involves using various techniques from statistics, machine learning, and database systems to identify patterns, relationships, and trends in the data. This information can then be used to make data-driven decisions, solve business problems, and uncover hidden insights.

Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in Databases” in 1989. However, the term “data mining” became more popular in the business and press communities. Today, data mining is used almost everywhere that large amounts of data are stored and processed. For example, banks typically use data mining to identify prospective customers who may be interested in credit cards, personal loans, or insurance. Because banks hold transaction details and detailed profiles of their customers, they can analyze this data for patterns that help predict which customers are likely to be interested in a given product, such as a personal loan.

Definition:
“Data mining is the process of using statistical analysis and machine learning to discover hidden patterns, correlations, and anomalies within large datasets. This information can aid you in decision-making, predictive modeling, and understanding complex phenomena.”

Or


“The process of extracting information to identify patterns, trends, and useful data that would allow the business to take the data-driven decision from huge sets of data is called Data Mining.”

Data Mining as a Whole Process
The whole process of Data Mining consists of three main phases:
1. Data Pre-processing – Data cleaning, integration, selection, and transformation take place
2. Data Extraction – The actual data mining occurs
3. Data Evaluation and Presentation – Analyzing and presenting results

Applications of Data Mining
1. Financial Analysis
2. Biological Analysis
3. Scientific Analysis
4. Intrusion Detection
5. Fraud Detection
6. Research Analysis

KDD Process Steps

The knowledge discovery in databases (KDD) process includes the following steps:

1. Goal identification: Develop an understanding of the application domain and the relevant prior knowledge, and identify the goal of the KDD process from the customer's perspective.
2. Creating a target data set: Select a data set, or focus on a subset of variables or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing: Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time sequence information and known changes.
4. Data reduction and projection: Find useful features to represent the data, depending on the purpose of the task. With dimensionality reduction or transformation methods, the effective number of variables under consideration may be reduced, or invariant representations for the data can be found.
5. Matching process objectives: Match the goals of the KDD process (step 1) to a particular data mining method, for example summarization, classification, regression, or clustering.
6. Modeling, exploratory analysis, and hypothesis selection: Choose the data mining algorithms and select the method or methods to be used to search for data patterns. This step includes deciding which models and parameters may be appropriate and matching a particular data mining method with the overall criteria of the KDD process.
7. Data mining: Search for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data mining method by carrying out the preceding steps properly.
8. Presentation and evaluation: Interpret the mined patterns, possibly returning to any of steps 1 through 7 for further iterations. This step may also involve visualization of the extracted patterns and models, or visualization of the data given the extracted models.
9. Taking action on the discovered knowledge: Use the knowledge directly, incorporate it into another system for further action, or simply document it and report it to stakeholders. This step also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.
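
The steps above can be illustrated with a deliberately small sketch. It is not a definitive implementation: the records, the field layout, and the choice of a single-threshold rule as the "mining" method are all assumptions made for illustration, and the rule is evaluated on the same data it was found on, which a real project would avoid.

```python
from statistics import mean

# Hypothetical raw records: (customer_id, age, monthly_spend, took_loan).
RAW_RECORDS = [
    (1, 25, 300.0, 0),
    (2, 40, None, 1),   # missing monthly_spend
    (3, 35, 900.0, 1),
    (4, 52, 1200.0, 1),
    (5, 23, 250.0, 0),
    (6, 31, 400.0, 0),
]

def clean(records):
    """Steps 2-3: keep only records with no missing fields."""
    return [r for r in records if None not in r]

def project(records):
    """Step 4: reduce each record to (monthly_spend, took_loan)."""
    return [(spend, label) for _, _, spend, label in records]

def accuracy(data, threshold):
    """Step 8: share of records where 'spend >= threshold' matches the label."""
    return mean(1.0 if (spend >= threshold) == bool(label) else 0.0
                for spend, label in data)

def mine_threshold(data):
    """Steps 5-7: search the candidate thresholds for the most accurate rule."""
    candidates = sorted(spend for spend, _ in data)
    return max(candidates, key=lambda t: accuracy(data, t))

data = project(clean(RAW_RECORDS))
threshold = mine_threshold(data)
print(f"rule: spend >= {threshold} -> loan, accuracy = {accuracy(data, threshold):.2f}")
```

Dropping incomplete records is only one of several strategies for missing data; imputing a value is a common alternative when too many records would otherwise be lost.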


Data Mining Techniques:

1. Association
Association is a technique used in data mining to identify the relationships or co-occurrences between items in a dataset. It involves analyzing large datasets to discover patterns or associations between items, such as products purchased together in a supermarket or web pages frequently visited together on a website. Association analysis is based on the idea of finding the most frequent patterns or itemsets in a dataset, where an itemset is a collection of one or more items.
2. Classification
Classification in data mining is a common technique that separates data points into different classes. It allows you to organize data sets of all sorts, including complex and large datasets as well as small and simple ones.
3. Prediction
Prediction is a data mining and machine learning technique that focuses on forecasting future outcomes or values based on patterns and relationships found in historical data. It involves using algorithms to analyze existing data and derive insights that can be used to make educated guesses about what might happen next.
4. Clustering
Cluster analysis, also known as clustering, is a method of data mining that groups similar data points together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data points within each group are more similar to each other than to data points in other groups.
5. Regression
Regression is a statistical modeling method in which previously obtained data is used to predict a continuous quantity for new observations. Because the predicted value is continuous rather than a class label, a regression model is sometimes described as a continuous-value predictor. Common forms include simple linear regression, which uses one predictor variable, and multiple linear regression, which uses several.
6. Artificial Neural network (ANN) Classifier Method
An artificial neural network (ANN), also referred to simply as a “neural network” (NN), is a computational model inspired by biological neural networks. It consists of an interconnected collection of artificial neurons: a set of connected input/output units in which each connection has an associated weight. During the learning phase, the network learns by adjusting these weights so that it can predict the correct class label of the input samples.

7. Outlier Detection
An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a different manner. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as outlier analysis or outlier mining.
8. Genetic Algorithm
Genetic algorithms are adaptive heuristic search algorithms that belong to the larger class of evolutionary algorithms. They are based on the ideas of natural selection and genetics, and they make intelligent use of random search, guided by information from earlier generations, to direct the search toward regions of better performance in the solution space. They are commonly used to generate high-quality solutions for optimization and search problems. Genetic algorithms simulate the process of natural selection, in which the individuals best able to adapt to changes in their environment survive, reproduce, and pass to the next generation. In simple words, they simulate “survival of the fittest” among individuals of consecutive generations in order to solve a problem. Each generation consists of a population of individuals, and each individual represents a point in the search space and a possible solution. Each individual is represented as a string of characters, integers, floats, or bits. This string is analogous to a chromosome.
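
To make the genetic algorithm description concrete, here is a minimal sketch in which each individual is a bit string (its "chromosome"). The toy objective of maximizing the number of 1 bits, the population size, the mutation rate, and the use of truncation selection with one-point crossover are all assumptions chosen for brevity, not the only way to build such an algorithm.

```python
import random

def fitness(bits):
    """Toy objective (an assumption): count the 1 bits in the chromosome."""
    return sum(bits)

def run_ga(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)  # seeded so that runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    initial_best = max(fitness(ind) for ind in population)

    for _ in range(generations):
        # Selection: the fitter half of the population survives.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]

        # Elitism: carry the best individual over unchanged.
        children = [survivors[0][:]]
        while len(children) < pop_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            cut = rng.randint(1, n_bits - 1)          # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children

    return initial_best, max(population, key=fitness)

initial_best, best = run_ga()
print("best fitness in first generation:", initial_best)
print("best fitness in last generation: ", fitness(best))
```

Because the best individual is always copied into the next generation, the best fitness in this sketch can never decrease from one generation to the next.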

Advantages and Disadvantages:

Data mining is a powerful tool that offers many benefits across a wide range of industries. The following are some of the advantages of data mining:
1. Better Decision Making:
Data mining helps to extract useful information from large datasets, which can be used to make informed and accurate decisions. By analyzing patterns and relationships in the data, businesses can identify trends and make predictions that help them make better decisions.
2. Improved Marketing:
Data mining can help businesses identify their target market and develop effective marketing strategies. By analyzing customer data, businesses can identify customer preferences and behavior, which can help them create targeted advertising campaigns and offer personalized products and services.
3. Increased Efficiency:
Data mining can help businesses streamline their operations by identifying inefficiencies and areas for improvement. By analyzing data on production processes, supply chains, and employee performance, businesses can identify bottlenecks and implement solutions that improve efficiency and reduce costs.
4. Fraud Detection:
Data mining can be used to identify fraudulent activities in financial transactions, insurance claims, and other areas. By analyzing patterns and relationships in the data, businesses can identify suspicious behavior and take steps to prevent fraud.
5. Customer Retention:
Data mining can help businesses identify customers who are at risk of leaving and develop strategies to retain them. By analyzing customer data, businesses can identify factors that contribute to customer churn and take steps to address those factors.
6. Competitive Advantage:
Data mining can help businesses gain a competitive advantage by identifying new opportunities and emerging trends. By analyzing data on customer behavior, market trends, and competitor activity, businesses can identify opportunities to innovate and differentiate themselves from their competitors.
7. Improved Healthcare:
Data mining can be used to improve healthcare outcomes by analyzing patient data to identify patterns and relationships. By analyzing medical records and other patient data, healthcare providers can identify risk factors, diagnose diseases earlier, and develop more effective treatment plans.

Disadvantages of Data Mining:

While data mining offers many benefits, there are also some disadvantages and challenges associated with the process. The following are some of the main disadvantages of data mining:
1. Data Quality:
Data mining relies heavily on the quality of the data used for analysis. If the data is incomplete, inaccurate, or inconsistent, the results of the analysis may be unreliable.
2. Data Privacy and Security:
Data mining involves analyzing large amounts of data, which may include sensitive information about individuals or organizations. If this data falls into the wrong hands, it could be used for malicious purposes, such as identity theft or corporate espionage.
3. Ethical Considerations:
Data mining raises ethical questions around privacy, surveillance, and discrimination. For example, the use of data mining to target specific groups of individuals for marketing or political purposes could be seen as discriminatory or manipulative.
4. Technical Complexity:
Data mining requires expertise in various fields, including statistics, computer science, and domain knowledge. The technical complexity of the process can be a barrier to entry for some businesses and organizations.
5. Cost:
Data mining can be expensive, particularly if large datasets need to be analyzed. This may be a barrier to entry for small businesses and organizations.
6. Interpretation of Results:
Data mining algorithms can generate large volumes of output, which can be difficult to interpret. It may be challenging for businesses and organizations to identify which of the reported patterns and relationships are meaningful.
7. Dependence on Technology:
Data mining relies heavily on technology, which can be a source of risk. Technical failures, such as hardware or software crashes, can lead to data loss or corruption.

Problems, Issues and Challenges in Data Mining:
Data mining is not an easy task: the algorithms used can become very complex, and the data is not always available in one place. It often needs to be integrated from various heterogeneous data sources. These factors create issues in the following areas.
• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues

The following diagram describes the major issues.

Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
• Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, data mining needs to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive so that users can focus the search for patterns, providing and refining data mining requests based on the returned results.
• Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but also at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once patterns are discovered, they need to be expressed in high-level languages and visual representations that are easily understandable.
• Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
• Pattern evaluation − The patterns discovered should be interesting. Many discovered patterns are not, either because they represent common knowledge or because they lack novelty, so measures are needed to assess how interesting a pattern is.
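
For association patterns, one common way to score how interesting a rule is uses the support, confidence, and lift measures. The sketch below computes them for a rule of the form "antecedent implies consequent"; the shopping baskets are made-up data, and these three measures are only one possible choice of interestingness criteria.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is overall."""
    return confidence(antecedent, consequent) / support(consequent)

rule = ({"bread"}, {"milk"})
print("support   :", round(support(rule[0] | rule[1]), 3))
print("confidence:", round(confidence(*rule), 3))
print("lift      :", round(lift(*rule), 3))
```

A lift close to 1 suggests the two sides of the rule occur together about as often as chance would predict, which is one sign that a frequent pattern may still be uninteresting.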

Performance Issues:
There can be performance-related issues such as the following −
• Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the entire data set again from scratch.
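
The partition, merge, and incremental-update idea can be sketched with simple item counting. The function names and the sample transactions are hypothetical, and a thread pool is used only to keep the example short; a real system would typically spread partitions across processes or machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Mine one partition: count in how many transactions each item occurs."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts

def mine_partitions(partitions):
    """Process the partitions concurrently, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(count_items, partitions))
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # merge step: add the per-partition counts
    return total

def update_incrementally(total, new_transactions):
    """Fold new transactions into existing counts without re-mining old data."""
    total.update(count_items(new_transactions))
    return total

partitions = [
    [["a", "b"], ["a"]],   # partition 1
    [["b", "c"]],          # partition 2
]
totals = mine_partitions(partitions)
print(totals)
print(update_incrementally(totals, [["a", "c"]]))
```

Counting merges cleanly because per-partition counts simply add up; many mining tasks need a more careful merge step than this.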

Diverse Data Types Issues
• Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems − The data is available from different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.

Challenges:

Data mining, the process of extracting knowledge from data, has become increasingly important as the amount of data generated by individuals, organizations, and machines has grown exponentially. However, data mining is not without its challenges.

1. Data Quality
The quality of data used in data mining is one of the most significant challenges. The accuracy, completeness, and consistency of the data affect the accuracy of the results obtained. The data may contain errors, omissions, duplications, or inconsistencies, which may lead to inaccurate results. Moreover, the data may be incomplete, meaning that some attributes or values are missing, making it challenging to obtain a complete understanding of the data. Data quality issues can arise due to a variety of reasons, including data entry errors, data storage issues, data integration problems, and data transmission errors. To address these challenges, data mining practitioners must apply data cleaning and data preprocessing techniques to improve the quality of the data. Data cleaning involves detecting and correcting errors, while data preprocessing involves transforming the data to make it suitable for data mining.
2. Data Complexity
Data complexity refers to the vast amounts of data generated by various sources, such as sensors, social media, and the internet of things (IoT). The complexity of the data may make it challenging to process, analyze, and understand. In addition, the data may be in different formats, making it challenging to integrate into a single dataset. To address this challenge, data mining practitioners use advanced techniques such as clustering, classification, and association rule mining. These techniques help to identify patterns and relationships in the data, which can then be used to gain insights and make predictions.
3. Data Privacy and Security
Data privacy and security is another significant challenge in data mining. As more data is collected, stored, and analyzed, the risk of data breaches and cyber-attacks increases. The data may contain personal, sensitive, or confidential information that must be protected. Moreover, data privacy regulations such as GDPR, CCPA, and HIPAA impose strict rules on how data can be collected, used, and shared. To address this challenge, data mining practitioners must apply data anonymization and data encryption techniques to protect the privacy and security of the data. Data anonymization involves removing personally identifiable information (PII) from the data, while data encryption involves using algorithms to encode the data to make it unreadable to unauthorized users.

4. Scalability
Data mining algorithms must be scalable to handle large datasets efficiently. As the size of the dataset increases, the time and computational resources required to perform data mining operations also increase. Moreover, the algorithms must be able to handle streaming data, which is generated continuously and must be processed in real-time. To address this challenge, data mining practitioners use distributed computing frameworks such as Hadoop and Spark. These frameworks distribute the data and processing across multiple nodes, making it possible to process large datasets quickly and efficiently.
5. Interpretability
Data mining algorithms can produce complex models that are difficult to interpret. This is because the algorithms use a combination of statistical and mathematical techniques to identify patterns and relationships in the data. Moreover, the models may not be intuitive, making it challenging to understand how the model arrived at a particular conclusion. To address this challenge, data mining practitioners use visualization techniques to represent the data and the models visually. Visualization makes it easier to understand the patterns and relationships in the data and to identify the most important variables.

6. Ethics
Data mining raises ethical concerns related to the collection, use, and dissemination of data. The data may be used to discriminate against certain groups, violate privacy rights, or perpetuate existing biases. Moreover, data mining algorithms may not be transparent, making it challenging to detect biases or discrimination.
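
The anonymization step described under Data Privacy and Security above can be sketched as follows. The field names, the salt, and the choice of a truncated SHA-256 digest are assumptions for illustration. Replacing an identifier with a salted hash is strictly pseudonymization: the remaining fields may still allow re-identification, so this should not be read as a complete privacy solution.

```python
import hashlib

# Hypothetical set of fields treated as personally identifiable information.
PII_FIELDS = {"name", "email", "phone"}

def anonymize(record, salt):
    """Drop PII fields and replace the customer id with a salted hash."""
    cleaned = {key: value for key, value in record.items()
               if key not in PII_FIELDS}
    raw_id = (salt + str(record["customer_id"])).encode("utf-8")
    # A truncated digest keeps records linkable without exposing the raw id.
    cleaned["customer_id"] = hashlib.sha256(raw_id).hexdigest()[:16]
    return cleaned

record = {
    "customer_id": 42,
    "name": "Jane Example",
    "email": "jane@example.com",
    "phone": "555-0100",
    "age": 34,
    "monthly_spend": 420.0,
}
print(anonymize(record, salt="illustrative-salt"))
```

The salt must be kept secret and stable: the same salt gives the same pseudonym for the same customer, which is what lets records be joined after the identifying fields are removed.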

Data Mining Applications:
Here is the list of areas where data mining is widely used −
• Financial Data Analysis
• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection

Financial Data Analysis
Financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some typical cases are as follows −
• Design and construction of data warehouses for multidimensional data analysis and data mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.
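
As an illustration of clustering customers for targeted marketing, the following is a minimal k-means sketch. The customer figures are invented, the two features are an arbitrary choice, and seeding the centroids with the first k points is a simplification; production code would normally use a tested library and scale the features first.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points serve as initial centroids (assumption)."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (monthly card spend, transactions per month).
customers = [(100, 2), (900, 20), (120, 3), (950, 22), (110, 1), (880, 18)]
centroids, clusters = kmeans(customers, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Each resulting group can then be treated as a customer segment, for example low-activity and high-activity card users.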

Retail Industry
Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services. The quantity of data collected can be expected to keep expanding rapidly because of the increasing ease, availability, and popularity of the web. Data mining in the retail industry helps to identify customer buying patterns and trends, which leads to improved quality of customer service and better customer retention and satisfaction. Here is a list of examples of data mining in the retail industry −
• Design and Construction of data warehouses based on the benefits of data mining.
• Multidimensional analysis of sales, customers, products, time and region.
• Analysis of effectiveness of sales campaigns.
• Customer Retention.
• Product recommendation and cross-referencing of items.

Telecommunication Industry
Today the telecommunication industry is one of the fastest-growing industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, and web data transmission. Owing to the development of new computer and communication technologies, the telecommunication industry is expanding rapidly. This is why data mining has become very important for understanding the business. Data mining in the telecommunication industry helps to identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve quality of service. Here is a list of examples in which data mining improves telecommunication services −
• Multidimensional Analysis of Telecommunication data.
• Fraudulent pattern analysis.
• Identification of unusual patterns.
• Multidimensional association and sequential patterns analysis.
• Mobile Telecommunication services.
• Use of visualization tools in telecommunication data analysis.
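
Identifying unusual patterns, as listed above, is often approached with outlier detection. The sketch below flags values whose z-score exceeds a threshold; the call-minute figures and the threshold of 2.0 are assumptions, and the z-score is only one of many outlier tests.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily call minutes for one subscriber.
daily_call_minutes = [30, 32, 29, 31, 30, 28, 33, 300]
print(zscore_outliers(daily_call_minutes))
```

Because an extreme value inflates both the mean and the standard deviation, this test can miss outliers in small samples; median-based measures are a common, more robust alternative.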

Biological Data Analysis
In recent times, there has been tremendous growth in fields of biology such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics. The following are aspects in which data mining contributes to biological data analysis −
• Semantic integration of heterogeneous, distributed genomic and proteomic databases.
• Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein pathways.
• Association and path analysis.
• Visualization tools in genetic data analysis.

Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate. Huge amounts of data have been collected from scientific domains such as geosciences and astronomy. Large data sets are also being generated by fast numerical simulations in fields such as climate and ecosystem modeling, chemical engineering, and fluid dynamics. The following are applications of data mining in scientific fields −
• Data Warehouses and data preprocessing.
• Graph-based mining.
• Visualization and domain specific knowledge.

Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In today's connected world, security has become a major issue. Increased use of the internet, together with the availability of tools and tricks for intruding on and attacking networks, has made intrusion detection a critical component of network administration. Here is a list of areas in which data mining technology may be applied for intrusion detection −
• Development of data mining algorithms for intrusion detection.
• Association and correlation analysis, aggregation to help select and build discriminating attributes.
• Analysis of Stream data.
• Distributed data mining.
• Visualization and query tools.

Conclusion

Data Mining has evolved from a niche statistical practice into a cornerstone of modern Business Intelligence and Scientific Research. By bridging the gap between raw data storage and actionable knowledge, it allows organizations to move beyond mere "record-keeping" to predictive foresight.

Summary of Key Insights

The KDD Synergy: Data mining is not an isolated task but a critical step within the broader Knowledge Discovery in Databases (KDD) framework. While a DBMS manages the data, data mining unearths the "gold" hidden within it through techniques such as classification, clustering, and association.

Methodological Rigor: The transition from Data Pre-processing to Pattern Evaluation ensures that the insights generated are not just statistically significant, but also practically relevant and noise-free.

Versatility Across Domains: From detecting fraudulent credit card transactions to mapping complex genomic sequences in bioinformatics, Data Mining serves as a universal tool for solving high-stakes problems.

Future Outlook & Responsibility

As we move further into the era of Big Data, the challenges of scalability and interpretability remain at the forefront. However, the most critical evolution lies in Ethics and Privacy. As future developers and data scientists, the goal is not just to build more powerful algorithms, but to ensure that data-driven discovery is conducted transparently and fairly.
In essence, Data Mining is the art of turning the "noise" of the digital age into the "signal" of progress. Whether it is improving healthcare outcomes or optimizing global supply chains, the patterns we discover today will define the innovations of tomorrow.
