
In the first step, a modeler usually specifies a set of variables for the unknown dependency and, if possible, a general form of this dependency as an initial hypothesis. Several hypotheses may be formulated for a single problem at this stage. This step requires the combined expertise of an application domain and a data-mining model. In practice, this means an in-depth interaction between the data-mining expert and the application expert. In successful data-mining applications, this cooperation does not stop after the initial phase; it continues throughout the entire data-mining process.

The next step concerns how the data are generated and collected. Generally, there are two distinct possibilities. The first is when the data-generation process is under the control of an expert (modeler); this approach is known as a designed experiment. The second is when the expert cannot influence the data-generation process; this is known as the observational approach. An observational setting, namely random data generation, is assumed in most data-mining applications.

Typically, the sampling distribution is completely unknown after the data are collected, or it is partially and implicitly given in the data-collection procedure. It is vital, however, to understand how data collection affects the theoretical distribution of the data, since such prior knowledge is often useful for modeling and, later, for the final interpretation of results. It is also important to make sure that the data used for estimating a model and the data used later for testing and applying it come from the same, unknown, sampling distribution. If this is not the case, the estimated model cannot be used successfully in a final application of the results.

In the observational setting, data are usually “collected” from existing databases, data warehouses, and data marts. Data preprocessing usually includes at least two common tasks.
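As a rough illustration of the requirement that estimation data and testing data come from the same sampling distribution, the sketch below compares two samples of a single numeric variable with a two-sample Kolmogorov-Smirnov statistic. This is only one possible check, not a method prescribed by the text. The function names `ks_statistic` and `same_distribution_plausible`, the coefficient 1.36 (the usual large-sample value for a roughly 5% significance level), and the synthetic Gaussian samples are all assumptions made for the example.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # Both empirical CDFs are step functions that change only at
    # observed values, so checking every observed value is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def same_distribution_plausible(sample_a, sample_b, coefficient=1.36):
    """Return True when the KS statistic does not exceed the large-sample
    critical value coefficient * sqrt((n + m) / (n * m)).
    The default coefficient of 1.36 is an assumed ~5% level."""
    n, m = len(sample_a), len(sample_b)
    critical = coefficient * ((n + m) / (n * m)) ** 0.5
    return ks_statistic(sample_a, sample_b) <= critical


# Hypothetical data: one variable observed at estimation time and later.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(500)]
test_same = [rng.gauss(0.0, 1.0) for _ in range(500)]
test_shifted = [rng.gauss(1.5, 1.0) for _ in range(500)]

print("same source:   ", same_distribution_plausible(train, test_same))
print("shifted source:", same_distribution_plausible(train, test_shifted))
```

A check of this kind looks at one variable at a time and only flags evidence against a common distribution; passing it does not prove that the two data sets were collected in the same way.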

