Research Issues in Data Mining. In this section, we briefly discuss some of the research issues in data mining. Data mining can help business managers find and reach suitable customers, as well as develop business intelligence to improve market share and profits. Here are some applications of data mining. Recent research in DNA data analysis has enabled the discovery of the genetic causes of many diseases, as well as the discovery of new medicines. One of the important search problems in genetic analysis is similarity search and comparison among DNA sequences.
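As a minimal sketch of this kind of similarity search, the following code compares DNA strings by edit (Levenshtein) distance and returns the closest stored sequence. The function names and the toy sequence database are assumptions for illustration, not from the original text.

```python
# Sketch: similarity search over DNA sequences via edit distance.
# Sequence data and function names are illustrative assumptions.
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def most_similar(query: str, database: list) -> str:
    """Return the stored sequence with the smallest edit distance to the query."""
    return min(database, key=lambda s: edit_distance(query, s))
```

A real genetic-analysis system would use alignment algorithms (e.g., Smith-Waterman) and indexing rather than a linear scan, but the nearest-neighbor structure of the problem is the same.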
Data mining techniques can be used to solve these problems. Intrusion Detection and Network Security: this will be discussed further in later chapters. Financial Data Analysis: most financial institutions offer a variety of banking services, such as credit and investment services. Data warehousing techniques can be used to gather the data needed to generate monthly reports.
Data mining techniques can be used to predict loan payments and to analyze customer credit policies. Data Analysis for the Retail Industry: retail is a major application of data mining, since it collects huge amounts of data on sales, shopping history, and service records. Data mining can also be used to analyze the effectiveness of sales campaigns. Several commercial products support this kind of analysis. One provides an application toolkit for neural network algorithms and data visualization, along with multiple data mining algorithms including regression, classification, and statistical analysis packages.
MineSet also provides multiple data mining algorithms and advanced visualization tools; one distinguishing feature of MineSet is its set of robust graphics tools, such as the rule visualizer and the tree visualizer. DBMiner provides an integrated data mining development environment for end users and developers, with multiple data mining algorithms including discovery-driven OLAP analysis, association, classification, and clustering. A distinct feature of DBMiner is its data-cube-based analytical mining.
Interested readers can consult surveys on data warehousing and data mining products. GEMS applications use an integrated database to store fault tickets, assets, and inventory management information. An SLA is a contract between the service provider and a customer (usually an enterprise) on the level of service quality that should be delivered. An SLA can contain metrics such as: 1. Available network bandwidth. 2. Penalty clauses. In addition, the provider must have the ability to drill down to detailed data in response to customer inquiries.
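A minimal sketch of checking measured service metrics against SLA targets might look as follows. The metric names and threshold values are assumptions for illustration, not from any real contract.

```python
# Sketch: flag SLA metrics whose measured values fall below their targets.
# Metric names and targets are hypothetical.
SLA = {"availability_pct": 99.9, "bandwidth_mbps": 100.0}

def sla_violations(measured: dict) -> list:
    """Return the names of metrics that miss their SLA target."""
    return [m for m, target in SLA.items() if measured.get(m, 0.0) < target]
```

In practice such checks would run over aggregated measurement data in the warehouse, with drill-down to the detailed records behind each violation.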
The main reason to separate the decision support data from the operational data is performance. Operational databases are designed for known transaction workloads. Moreover, special data organization and access methods are required for optimizing the report generation process. This project also required data integration and data fusion from many external sources, such as operational databases and flat files.
The main components used in our system are as follows. DataStage reads data from the source information repositories, applying transformations as it loads all data into a repository atomic database. Once the atomic data repository is loaded with all source information, a second level of ETL transformations is applied to various data streams to create one or more Data Marts. Data Marts are a special sub-component of a data warehouse in that they are highly de-normalized to support fast execution of reports.
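The two-level flow described above can be sketched with an in-memory database: source rows are first loaded into an "atomic" table, and a second transformation then builds a de-normalized, pre-aggregated mart table for fast reports. The table and column names are illustrative assumptions, not the actual GEMS schema.

```python
import sqlite3

# Sketch of two-level ETL: source rows -> atomic table -> aggregated mart.
# Schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atomic_sales(region TEXT, amount REAL)")

# First-level load: raw source rows into the atomic repository.
source_rows = [("east", 100.0), ("east", 50.0), ("west", 70.0)]
conn.executemany("INSERT INTO atomic_sales VALUES (?, ?)", source_rows)

# Second-level ETL: pre-aggregate into a de-normalized mart table.
conn.execute("""CREATE TABLE mart_sales_by_region AS
                SELECT region, SUM(amount) AS total
                FROM atomic_sales GROUP BY region""")

report = dict(conn.execute("SELECT region, total FROM mart_sales_by_region"))
```

Reports then read only the small mart table, which is why the de-normalization pays off at query time.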
Some of these Data Marts are created using Star Schemas. Also, the execution of reports does not impact the applications that are using the source databases. The schemas in the DW are optimized using de-normalization and pre-aggregation techniques, which results in much better execution times for reports. Some of the open research problems that we are currently investigating are:
 - The time to refresh the data in the data warehouse was large, and report generation had to be suspended until the changes were propagated into the DW. There was therefore a need to investigate incremental techniques for propagating updates from the source databases.
 - Loading the data into the data warehouse took a long time (10 to 15 hours), and in case of a crash the entire loading process had to be restarted.
 - There was no good support for tracing the data in the DW back to the source information repositories.
This process, which is used to design, deploy, and manage the data marts, is called the ETL (Extract, Transform and Load) process.
There are a number of open research problems in designing the ETL process. Maintenance of Data Consistency: since source data repositories continuously evolve by modifying their content or changing their schema, one research problem is how to incrementally propagate these changes to the central data warehouse. Both re-computation and incremental view maintenance are well understood for centralized relational databases. However, more complex algorithms are required when updates originate from multiple sources and affect multiple views in the Data Warehouse.
The problem is further complicated if the source databases are going through schema evolution. Maintenance of Summary Tables: decision support functions in a data warehouse involve complex queries, and it is not feasible to execute them by scanning the entire data set. Therefore, a data warehouse builds a large number of summary tables to improve performance. As changes occur in the source databases, all summary tables in the data warehouse need to be updated. A critical problem in a data warehouse is how to update these summary tables efficiently and incrementally.
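The incremental alternative to rescanning the base data can be sketched as follows: each batch of inserted rows is folded into running SUM/COUNT totals instead of recomputing the summary from scratch. The summary layout and key names are assumptions for illustration.

```python
# Sketch: incremental maintenance of a SUM/COUNT summary table.
# summary maps a grouping key to [running_sum, running_count].
summary = {}

def apply_inserts(batch):
    """Fold a batch of (key, value) base-table inserts into the summary."""
    for key, value in batch:
        s = summary.setdefault(key, [0.0, 0])
        s[0] += value   # maintain SUM incrementally
        s[1] += 1       # maintain COUNT incrementally

apply_inserts([("east", 100.0), ("west", 70.0)])
apply_inserts([("east", 50.0)])
```

Deletes and updates need compensating logic (subtracting the old value), and some aggregates such as MIN/MAX are not incrementally maintainable under deletion without extra bookkeeping, which is part of why this is a research problem.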
Incremental Resuming of a Failed Loading Process: warehouse creation and maintenance loads typically take hours to run. If the load is interrupted by failures, traditional recovery methods undo the changes. The administrator must then restart the load and hope that it does not fail again. More research is required on algorithms for resuming an incomplete load so as to reduce the total load time.
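One simple resumption strategy, sketched below, commits the load in batches and records the last completed batch in a checkpoint, so that a restart skips work already done instead of undoing it. The checkpoint scheme is an assumption for illustration; a real warehouse loader would make the checkpoint durable and transactional with each batch.

```python
# Sketch: resumable batch loading with a checkpoint.
# In a real system the checkpoint would be persisted atomically per batch.
def load(batches, warehouse, checkpoint):
    """Load batches into the warehouse, skipping batches already checkpointed."""
    for i, batch in enumerate(batches):
        if i <= checkpoint.get("last_done", -1):
            continue                      # already loaded before the crash
        warehouse.extend(batch)           # apply the batch
        checkpoint["last_done"] = i       # record progress

wh, ckpt = [], {"last_done": 0}           # batch 0 completed before a crash
load([[1, 2], [3, 4], [5]], wh, ckpt)     # resumes from batch 1
```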
Tracing the Lineage of Data: given data items in the data warehouse, analysts often want to identify the source items and source databases that produced those data items. Research is required on algorithms to trace the lineage of an item from a view back to the source data items in the multiple sources.
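A coarse-grained form of lineage can be sketched by tagging each warehouse row with the identifier of the source repository and source row that produced it. The field names here are illustrative assumptions; fine-grained lineage through aggregating views is harder and is the subject of the research mentioned above.

```python
# Sketch: coarse-grained lineage via source tags on warehouse rows.
# Field names and sample data are hypothetical.
warehouse = [
    {"value": 42, "src_db": "crm",     "src_row": 7},
    {"value": 99, "src_db": "billing", "src_row": 3},
]

def lineage(predicate):
    """Return (source database, source row) pairs for matching warehouse rows."""
    return [(r["src_db"], r["src_row"]) for r in warehouse if predicate(r)]
```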
Data Reduction Techniques: if the input data is very large, data analysis can take a very long time, making such analysis impractical or infeasible. There is a need for data reduction techniques that reduce the data set so that analysis on the reduced set is more efficient and yet produces the same analytical results. The following are examples of algorithmic techniques that can be used for data reduction.
These operations reduce the amount of data in the DW and also improve the execution time of decision support queries on data in the DW.
 - Dimension Reduction: detect and remove irrelevant attributes that are not required for data analysis.
 - Data Compression: use encoding mechanisms to reduce the data set size.
 - Generalization: values of numeric attributes such as age can be mapped to higher-level concepts such as young, middle age, and senior.
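The generalization of age mentioned above can be sketched directly; the cut-off points below are assumptions for illustration, not from the original text.

```python
# Sketch: discretizing a numeric attribute into higher-level concepts.
# The age boundaries (35, 60) are hypothetical.
def age_concept(age: int) -> str:
    """Map a numeric age to a higher-level concept."""
    if age < 35:
        return "young"
    if age < 60:
        return "middle age"
    return "senior"

# Many distinct ages collapse into a few concept values, reducing the data.
reduced = sorted({age_concept(a) for a in [22, 41, 67, 30, 58]})
```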
Data Integration and Data Cleaning Techniques: generally, a data analysis task includes data integration, which combines data from multiple sources into a coherent data store. These sources may include multiple databases or flat files. A number of problems can arise during data integration. Real-world entities in multiple data sources can be given different names. How does an analyst know that employee-id in one database is the same as employee-number in another? We plan to use meta-data to solve the problem of data integration.
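The meta-data approach can be sketched as a mapping table that records which source attribute names denote the same canonical attribute, so incoming records are renamed to a common schema on load. The mapping entries and canonical names are assumptions for illustration.

```python
# Sketch: metadata-driven attribute matching during data integration.
# The mapping and canonical names are hypothetical.
METADATA = {"employee-id": "emp_id", "employee-number": "emp_id"}

def canonicalize(record: dict) -> dict:
    """Rename source attributes to their canonical warehouse names."""
    return {METADATA.get(k, k): v for k, v in record.items()}
```

A real metadata repository would also record types, units, and value domains, so that conflicts beyond naming (e.g., salaries in different currencies) can be detected.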
Data coming from input sources tends to be incomplete, noisy, and inconsistent. If such data is loaded directly into the DW, it can cause errors during the analysis phase, resulting in incorrect results. Data cleaning methods attempt to smooth out noise, identify outliers, and correct inconsistencies in the data.
We are investigating the following techniques for noise reduction and data smoothing. Clustering can be used to detect outliers: intuitively, values that fall outside the set of clusters may be considered outliers. Regression can be used to find a mathematical equation that fits the data, which helps smooth out the noise.
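A third common smoothing technique in this family is smoothing by bin means, sketched below: sorted values are grouped into equal-size bins and each value is replaced by its bin's mean. The bin size is an assumption for illustration.

```python
# Sketch: noise smoothing by bin means.
# bin_size is a hypothetical parameter choice.
def smooth_by_bin_means(values, bin_size=3):
    """Sort values into equal-size bins and replace each value by its bin mean."""
    data = sorted(values)
    out = []
    for i in range(0, len(data), bin_size):
        bin_ = data[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        out.extend([mean] * len(bin_))
    return out
```

For example, smoothing the sorted prices [4, 8, 15, 21, 21, 24, 25, 28, 34] with bins of three replaces each bin by its mean (9, 22, and 29 respectively).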
Data pre-processing is an important step for data analysis. Detecting data integration problems, rectifying them and reducing the amount of data to be analyzed can result in great benefits during the data analysis phase.
However, the problem of propagating changes in a DW environment is more complicated, for the following reasons: (a) in a DW, data is not refreshed after every modification to the base data; rather, large batch updates to the base data must be considered, which requires new algorithms and techniques; (b) data in the warehouse is derived from the sources through transformations, which may include aggregating or summarizing the data.
Therefore, techniques are required that can deal with both source data changes and schema changes. Lineage tracing enables users to "drill through" from the views in the DW all the way to the source data that was used to create the data in the DW.
However, existing methods lack techniques to deal with historical source data or data from previous source versions. Data warehouses typically use a multidimensional data model to facilitate data analysis, and they are implemented using a three-tier architecture.
The bottom tier is a warehouse database server, the middle tier is an OLAP server, and the top tier is a client containing query and reporting tools. Data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in multiple repositories. Efficient data warehousing and data mining techniques are challenging to design and implement for large data sets.
We have also described some open research problems that need to be solved in order to efficiently extract data from distributed information repositories.
We believe that there are several important research problems that need to be solved to build flexible, powerful and efficient data analysis applications using data warehousing and data mining techniques.