Preprocess transforms data according to the modeling experiment conditions configured in the project. It runs a script that is either loaded from a template or created visually in the Variables and transformations panel.
Variables and transformations is a panel accessible via the program menu View > Variables and transformations. When enabled, the panel is nested in the Data explorer tab. It is used to specify model inputs, targets (predicted columns) and their transformations.
View by datasets / variables is a list of datasets connected to the project. The panel is used to select variables and then add them to Input variables and Target variables. If you click on the underlined part of the title (variables), the list is rearranged to show variables at the top of the hierarchy and the datasets they come from as nested nodes. In this way you can select and manipulate groups of variables that appear in more than one dataset.
Input variables [n] is a list of input variables and transformations applied to them. 'n' is the number of variables in the list. Each row represents a variable or a set of variables in the following format: VariableName, transformation1, transformation2, …
All and None in the panel title are used to select and deselect all items in the list at once.
The Edit button in the panel title opens the preprocessor script in the text editor for manual editing.
Target variables [n] is a list of variables that will be modeled and predicted. 'n' is the number of variables in the list. Each row represents a variable or a set of variables in the following format: VariableName, transformation1, transformation2, …
Transformations is a panel used to access various transformations available for Input variables and Target variables.
Settings is a button in the panel title used to modify Obligatory transformations applied to the whole dataset.
Obligatory transformations is a dialog accessible via the Settings (underlined text button) in the title of the Transformations panel (View > Variables and transformations). It is used to set preprocessing rules applied to all variables.
Manage text inputs is used to handle input variables with categories (instead of numeric values). Available options are
Stop with error | Skip column | Decompose to binary columns
Manage text targets is used to handle target variables with categories (instead of numeric values). Available options are
Stop with error | Enumerate with 0,1,2,.. | Decompose to binary columns
Limit decomposition to the most frequent labels is used to limit the number of new variables created as a result of decomposition. Only the most frequent labels are decomposed into separate binary variables. All other labels are merged into a single binary variable named '?'. The number of new variables, including '?', does not exceed the selected limit.
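The limit can be illustrated with a minimal Python sketch. It assumes that one slot of the limit is reserved for '?' whenever there are more distinct labels than the limit allows. The function name `decompose` and its parameters are hypothetical and are not part of GMDH Shell.

```python
from collections import Counter

def decompose(values, limit):
    """Turn a categorical column into at most `limit` binary columns."""
    # Labels ordered from most to least frequent.
    labels = [label for label, _ in Counter(values).most_common()]
    overflow = len(labels) > limit
    # Assumption: when labels do not fit, one column is reserved for '?'.
    kept = labels[:limit - 1] if overflow else labels
    result = {name: [1 if v == name else 0 for v in values] for name in kept}
    if overflow:
        # All remaining labels share the catch-all column.
        result["?"] = [0 if v in kept else 1 for v in values]
    return result

columns = decompose(["a", "b", "a", "c", "a", "b", "d"], limit=3)
print(columns)
```

With a limit of 3, the two most frequent labels get their own columns and the rare labels 'c' and 'd' are both flagged in '?'.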
Treat value as missing if it is outside n*sigma is used to remove spikes (i.e. outliers) that deviate by more than a certain number of standard deviations. All values outside n*sigma are replaced with missing values. The way in which missing values are then handled can be set using the Manage missing values control.
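As a rough illustration of the n*sigma rule, the sketch below marks spikes as missing. It assumes that the distance is measured from the column mean and that sigma is the population standard deviation; the program may compute these differently. The name `mask_outliers` is hypothetical.

```python
import statistics

def mask_outliers(values, n):
    """Replace values farther than n*sigma from the mean with None."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    sigma = statistics.pstdev(present)  # assumption: population sigma
    return [
        None if v is not None and abs(v - mean) > n * sigma else v
        for v in values
    ]

cleaned = mask_outliers([10, 11, 9, 10, 12, 10, 100], n=2)
print(cleaned)
```

Here only the value 100 lies more than two standard deviations from the mean, so it is the only one turned into a missing value.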
Manage missing values is used to handle various gaps (NULL values) in the data. Available options are
Stop with error | 0 | Interpolate | Mean | Median | Most frequent.
The first option throws an error when a NULL value is detected. Other options allow replacement of missing values with 0, mean of two neighbouring values (interpolation), arithmetic mean of current variable, median, and the most frequent value.
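The replacement options can be sketched in Python as follows. This is only an illustration of the rules described above: the function name `fill_missing` and the option keys are hypothetical, and the behaviour at the edges of a series (where a gap has no two neighbours) is an assumption.

```python
import statistics

def fill_missing(values, method):
    """Fill None entries using one of the options described above."""
    present = [v for v in values if v is not None]
    if method == "stop":
        if len(present) < len(values):
            raise ValueError("missing value detected")
        return list(values)
    if method == "interpolate":
        filled = list(values)
        for i, v in enumerate(values):
            if v is None:
                inside = 0 < i < len(values) - 1
                # Assumption: a gap without two known neighbours is an error.
                if not inside or values[i - 1] is None or values[i + 1] is None:
                    raise ValueError("no two neighbouring values to average")
                filled[i] = (values[i - 1] + values[i + 1]) / 2
        return filled
    replacement = {
        "zero": lambda: 0,
        "mean": lambda: statistics.mean(present),
        "median": lambda: statistics.median(present),
        "most_frequent": lambda: statistics.mode(present),
    }[method]()
    return [replacement if v is None else v for v in values]

data = [1, None, 3, 3, 9]
print(fill_missing(data, "interpolate"))
print(fill_missing(data, "mean"))
```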
If time series data is collected from a set of files it is possible to ensure that all files contain recent observations and all series are long enough to be preprocessed properly.
For short files that can't be preprocessed properly there are two options: Stop with error and Skip silently. This parameter matters only when a set of files is imported. If a file contains too few observations, you can instruct preprocess to skip the file silently instead of throwing an error.
Non-recent data detection checks whether all series come with the most recent timestamp found in the set of files. There are three options available:
No detection | Stop with error | Skip silently
The last one is used to simply drop non-recent files from the preprocessing results.
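The two file-level checks can be pictured with the following Python sketch. It assumes each file is a list of (timestamp, value) rows sorted by time and that the minimum length is at least 1. The function `filter_series` and its option names are hypothetical.

```python
def filter_series(files, min_length, short="stop", non_recent="none"):
    """Apply the short-file and non-recent checks to a set of series.

    `files` maps a file name to a time-sorted list of (timestamp, value).
    """
    # Most recent timestamp found anywhere in the set of files.
    latest = max(rows[-1][0] for rows in files.values() if rows)
    kept = {}
    for name, rows in files.items():
        if len(rows) < min_length:
            if short == "stop":
                raise ValueError(f"{name}: too few observations")
            continue  # skip silently
        if non_recent != "none" and rows[-1][0] < latest:
            if non_recent == "stop":
                raise ValueError(f"{name}: no recent observations")
            continue  # skip silently
        kept[name] = rows
    return kept

files = {
    "a.csv": [(1, 10), (2, 11), (3, 12)],
    "b.csv": [(1, 5), (2, 6)],
    "c.csv": [(3, 7)],
}
result = filter_series(files, min_length=2, short="skip", non_recent="skip")
print(sorted(result))
```

In this example "c.csv" is dropped because it is too short and "b.csv" is dropped because its last timestamp is older than the latest one in the set.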
GMDH Shell uses different sets of preprocessing features for time series models and for all other models, i.e. classification and regression models. The proper set of features is usually loaded from a template. To change it, open the menu File > Configuration, Modules tab, Preprocess.
Simple forecasting is a panel in the program toolbar used to create time series models.
Horizon is the number of observations to be forecasted.
Holdout is used to withhold a sample of observations and to evaluate forecast accuracy using the out-of-sample data.
Time series forecasting is a panel in the program toolbar used to create time series models.
Horizon is the number of observations to be forecasted. GMDH Shell can perform several simulations with different forecast horizons at once. To use this feature, separate the desired forecast horizons with commas, for example 1,6,12.
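For readers who script around the program, a comma-separated horizon list can be read as shown in this small Python sketch. The helper `parse_horizons` is hypothetical and only mirrors the input format described above.

```python
def parse_horizons(text):
    """Parse a comma-separated list of forecast horizons, e.g. '1,6,12'."""
    horizons = [int(part) for part in text.split(",")]
    if any(h < 1 for h in horizons):
        raise ValueError("a horizon must be a positive number of observations")
    return horizons

horizons = parse_horizons("1,6,12")
print(horizons)
```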
Holdout is used to withhold a sample of observations and to evaluate forecast accuracy using the out-of-sample data.
Repeat is used to perform ex-post forecast of several previous horizons at a time.
Window validation is the number of experiments used for window size optimization. Optimal window size improves forecast accuracy.
Hold-out data panel is used to apply regression and classification models to a subset of observations immediately after the models are obtained. The hold-out sample can be either used to evaluate out-of-sample accuracy or to predict unknown values of target variables.
Hold-out is used to select a subsample to which the model will be applied. Available options are to hold-out last observations | observations uniformly | missing target values
In the next two control elements you can set the exact number of dataset observations or % of dataset to be reserved for the hold-out.
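The three hold-out options can be sketched in Python as index selection. This is an illustration only: how the program spaces observations in the uniform mode is not documented here, so the even spacing below is an assumption, and the name `holdout_indices` is hypothetical.

```python
def holdout_indices(targets, mode, count=None, percent=None):
    """Return the row indices reserved for the hold-out sample."""
    n = len(targets)
    if mode == "missing":
        # Rows whose target value is unknown.
        return [i for i, t in enumerate(targets) if t is None]
    # Size is given either as an exact number or as a % of the dataset.
    k = count if count is not None else round(n * percent / 100)
    k = max(0, min(k, n))
    if mode == "last":
        return list(range(n - k, n))
    if mode == "uniform":
        # Assumption: k rows spread evenly over the dataset.
        return [(i + 1) * n // k - 1 for i in range(k)]
    raise ValueError(f"unknown mode: {mode}")

targets = [1, 2, None, 4, 5, 6, 7, None, 9, 10]
print(holdout_indices(targets, "last", count=3))
print(holdout_indices(targets, "uniform", percent=20))
print(holdout_indices(targets, "missing"))
```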
The script specifies commands that generate input variables, target variables and their transformations. Each preprocess command occupies one line and can refer to a group of variables.
One line commands have the following format:
FileName.VariableName, Transformation1(), Transformation2(), …
If there is only one file connected, the “FileName.” prefix is not required. If a spreadsheet (.xls or .xlsx) contains data in more than one sheet, all variables must be referred to as “FileName:SheetName.VariableName”.
To refer to a group of variables, use the asterisk symbol “*”, for example:
FileName.*
is used to select all variables from a file.
*.VariableName
is used to select variables with certain name from all files.
*.*
is used to select all variables from all files (and all sheets).
*
is used to select all variables if there is only one file connected.
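The wildcard rules above can be mimicked with a short Python sketch. It handles only unquoted names without sheet references, which is a simplifying assumption, and the function `select` is hypothetical rather than part of the program.

```python
def select(selector, datasets):
    """Resolve a selector against {file name: [variable names]}.

    Returns a list of (file, variable) pairs.
    """
    if "." in selector:
        file_part, var_part = selector.split(".", 1)
    else:
        # The file prefix may be omitted only when one file is connected.
        if len(datasets) != 1:
            raise ValueError("file prefix required when several files are connected")
        file_part, var_part = "*", selector
    return [
        (f, v)
        for f, variables in datasets.items()
        for v in variables
        if file_part in ("*", f) and var_part in ("*", v)
    ]

datasets = {"sales": ["date", "price"], "stock": ["date", "qty"]}
print(select("sales.*", datasets))
print(select("*.date", datasets))
```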
If a file or a variable contains spaces or reserved symbols in its name, each of the names must be quoted, for example: "File Name"."Variable Name". If a name contains a quote ("), the quote must be doubled, for example a"b"c ⇒ "a""b""c"
A semicolon symbol “;” allows one-line comments.
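When script lines are generated by another tool, the quoting rule can be applied mechanically. The Python sketch below assumes that the quote character is the plain double quote; the helper `quote_name` is hypothetical.

```python
def quote_name(name):
    """Quote a file or variable name, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'

reference = quote_name("File Name") + "." + quote_name("Variable Name")
print(reference)
print(quote_name('a"b"c'))
```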
Here is the list of available transformations.
Transformation | Notation example | Description |
Square | x, sqr | y=x^2 |
Square root | x, sqrt | y=x^(1/2) |
Cube | x, cube | y=x^3 |
Cube root | x, cubert | y=x^(1/3) |
Exp | x, exp | y=exp(x) |
Logarithm | x, ln | y=ln(|x|), x<>0 |
Sine | x, sin | y=sin(x) |
Cosine | x, cos | y=cos(x) |
Arctangent | x, arctang | y=arctang(x) |
Abs value | x, abs | y=|x| |
Sign | x, sign | y=sign(x) |
Floor | x, floor | y=[x] |
Fractional part | x, frac | y=x-[x] |
Normalization | x, norm(b1,b2) | b1 is lower boundary, b2 is upper boundary |
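The table gives only the boundaries for norm(b1,b2). A common reading is a linear rescaling of the column so that its minimum maps to b1 and its maximum to b2; the Python sketch below shows that reading as an assumption, not as the confirmed formula used by the program.

```python
def norm(values, b1, b2):
    """Linearly rescale values so that min -> b1 and max -> b2 (assumed formula)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant column cannot be rescaled")
    return [b1 + (v - lo) * (b2 - b1) / (hi - lo) for v in values]

print(norm([2, 4, 6], 0, 1))
print(norm([2, 4, 6], -1, 1))
```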
Transformation | Notation example | Description |
Lag | x@a-b:c | a is min lag, b is max lag, c is step. Example: step of 3 applied to 0-12 leaves only 0, 3, 6, 9 and 12. This helps to reduce the number of variables. |
All lags | x@*:a | a is step. Generates lags while dataset length allows. Applies to inputs only. |
Moving average | x,SMA(a) | a is window length; an integer 2..1000 |
Exponential MA | x,EMA(a) | a is quotient; a real [0.01, 1) |
Derivative | x,d | y = x[t] - x[t-1] |
Window size | x,window(a) | a is window size; an integer ≥ 2. Applies to targets only. |
Weighted by time | x,weighted_by_time | Sets higher weights for later observations in proportion: 1, 2, 3, …, n |
Fourier series | x,fourier(a) | a is period; an integer ≥ 2. Fourier series: cos and sin(2πkx/T). Generates multiple variables, does not stack with other generating transformations |
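Some of the time series transformations in the table above can be sketched in Python. The lag expansion and the derivative follow the formulas given in the table. The moving average is shown as a trailing window, which is an assumption about how the window is aligned. All function names are hypothetical.

```python
def expand_lags(a, b, c=1):
    """Lags produced by x@a-b:c, from min lag a to max lag b with step c."""
    return list(range(a, b + 1, c))

def lagged(series, lag):
    """Shift a series by `lag` observations; unavailable values become None."""
    return [series[t - lag] if t - lag >= 0 else None for t in range(len(series))]

def sma(series, a):
    """Moving average over the last `a` observations (trailing window assumed)."""
    return [
        sum(series[t - a + 1:t + 1]) / a if t >= a - 1 else None
        for t in range(len(series))
    ]

def derivative(series):
    """y = x[t] - x[t-1]; the first value has no predecessor."""
    return [None] + [series[t] - series[t - 1] for t in range(1, len(series))]

print(expand_lags(0, 12, 3))
print(lagged([1, 2, 3, 4], 1))
print(sma([1, 2, 3, 4], 2))
print(derivative([1, 4, 9]))
```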
Transformation | Notation example | Description |
Year | x,year | Year number in Gregorian calendar. |
Year fraction | x,year_frac | Year fraction in Gregorian calendar. 0 = Jan 1st 00:00, 1 = Dec 31st 24:00. |
Month | x,month,decompose | Month number in Gregorian calendar. 1 = Jan, 12 = Dec. |
Month fraction | x,month_frac | Month fraction in Gregorian calendar. 0 = 1st 00:00, 1 = 31st 24:00. |
Day of month | x,day,decompose | Day of month in Gregorian calendar (1-31). |
Day of week | x,dayofweek,decompose | Day of week, 1 = Mon, 7 = Sun. |
Day fraction | x,day_frac | Fraction of the day. 0 = 00:00, 1 = 24:00. |
Hour | x,hour,decompose | Hour (0..23). |
Hour fraction | x,hour_frac | Fraction of the hour. 0 = ##:00:00, 1 = ##:60:00. |
Minute | x,minute,decompose | Minute (0..59). |
Second | x,second,decompose | Second (0..59). |
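The date and time features can be reproduced with Python's standard datetime module. The sketch below covers three of them and follows the ranges given in the table; the function names are hypothetical and simply echo the notation.

```python
from datetime import datetime

def day_frac(ts):
    """Fraction of the day: 0 at 00:00, approaching 1 at 24:00."""
    return (ts.hour * 3600 + ts.minute * 60 + ts.second) / 86400

def year_frac(ts):
    """Fraction of the year: 0 at Jan 1st 00:00, approaching 1 at Dec 31st 24:00."""
    start = datetime(ts.year, 1, 1)
    end = datetime(ts.year + 1, 1, 1)
    return (ts - start) / (end - start)

def day_of_week(ts):
    """Day of week, 1 = Mon, 7 = Sun."""
    return ts.isoweekday()

print(day_frac(datetime(2024, 3, 5, 12, 0, 0)))
print(year_frac(datetime(2023, 7, 2, 12, 0, 0)))
print(day_of_week(datetime(2024, 3, 4)))
```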
Transformation | Notation example | Description |
Day off | x,isdayoff | Returns 1 when the date is a Saturday, a Sunday or a holiday of a specific country. Otherwise returns 0. |
Workdays per period | x,workdaysperperiod | Returns the number of workdays (Mo-Fr, except holidays) in a period (week, month, etc.). |
Decomposed holidays | x,holidays | For each holiday in the holiday set, returns 1 if it falls into a period (week, month, etc.). Produces as many variables as holidays in the year. |
Is holiday | x,isholiday | Returns 1 if a holiday falls into a period (week, month, etc.). |
Find non-holiday | x,nonholiday_day | Steps a day(week) back until a non-holiday is found. |
Transformation | Notation example | Description |
Weighted by time | x,weighted_by_time | Sets higher weights for later observations in proportion: 1, 2, 3, …, n. |
Balanced classes | x,balanced_classes | Changes instance weights so that classes will have equal importance for training regardless of their proportion. |
Manual 2-class bias | x,weighted_two_class(0.5) | Manually weighs the upper class among the two. |
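The first two weighting schemes can be illustrated in Python. The time weights follow the proportion 1, 2, 3, …, n stated in the table. For balanced classes, the sketch assumes each instance is weighted by the inverse of its class frequency so that every class has the same total weight; the actual scaling used by the program is not documented here. Function names are hypothetical.

```python
from collections import Counter

def weighted_by_time(n):
    """Weights 1, 2, 3, ..., n: later observations weigh more."""
    return list(range(1, n + 1))

def balanced_classes(labels):
    """Weight each instance by 1 / (size of its class) (assumed scaling)."""
    counts = Counter(labels)
    return [1 / counts[label] for label in labels]

labels = ["a", "a", "a", "b"]
weights = balanced_classes(labels)
print(weighted_by_time(4))
print(weights)
```

With these weights the three instances of class 'a' together count as much as the single instance of class 'b'.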
Transformation | Notation example | Description |
Time | |time | Time variable is simply a counter: 0, 1, 2, 3, … . |
Target | |target | Feeds target variable to input. This variable is useful when solving multiple time series problems, or when making a template. |
ID | |id | Feeds ID or date/time variable of the dataset. |
First column | |firstcolumn(a) | a is step; an integer ≥ 4. Feeds the N'th variable counting from the first one. |
Last column | |lastcolumn(a) | a is step; an integer ≥ 4. Feeds the N'th variable counting from the last one. |
Transformation | Notation example | Description |
Decompose categories | x,decompose | Decomposes categorical column to binary columns. |