Preprocess

Preprocess transforms data according to the modeling experiment conditions configured in the project. Internally, Preprocess runs a script that is either loaded from a template or created visually in the Variables and transformations panel.

Variables and transformations

Variables and transformations is a panel accessible via the program menu View > Variables and transformations. When enabled, the panel is nested in the Data explorer tab. It is used to specify model inputs, targets (predicted columns) and their transformations.

View by datasets / variables is a list of datasets connected to the project. The panel is used to select variables and then add them to Input variables and Target variables. If you click the underlined part of the title (variables), the list is rearranged to show variables at the top of the hierarchy and the datasets they come from as nested nodes. In this way you can select and manipulate groups of variables that appear in more than one dataset.

Input variables [n] is a list of input variables and transformations applied to them. 'n' is the number of variables in the list. Each row represents a variable or a set of variables in the following format: VariableName, transformation1, transformation2, …

All and None in the panel title are used to select and deselect all items in the list at once.

Edit button in the panel title is used to edit the preprocess script manually in a text editor.

Target variables [n] is a list of variables that will be modeled and predicted. 'n' is the number of variables in the list. Each row represents a variable or a set of variables in the following format: VariableName, transformation1, transformation2, …

Transformations is a panel used to access various transformations available for Input variables and Target variables.

Settings is a button in the panel title used to modify Obligatory transformations applied to the whole dataset.

Obligatory transformations

Obligatory transformations is a dialog accessible via the Settings (underlined text button) in the title of the Transformations panel (View > Variables and transformations). It is used to set preprocessing rules applied to all variables.

Text data

Manage text inputs is used to handle input variables with categories (instead of numeric values). Available options are:
Stop with error | Skip column | Decompose to binary columns

Manage text targets is used to handle target variables with categories (instead of numeric values). Available options are:
Stop with error | Enumerate with 0,1,2,.. | Decompose to binary columns

Limit decomposition to the most frequent labels is used to limit the number of new variables created as a result of decomposition. Only the most frequent labels are decomposed into separate binary variables. All other labels are pooled into a single binary variable named '?'. The number of new variables, including '?', does not exceed the selected limit.
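
The limit rule above can be illustrated with a short Python sketch. It is not GMDH Shell code: the function name `decompose_labels`, the tie-breaking order and the choice to create '?' only when some labels do not fit are assumptions made for illustration, and `limit` is assumed to be at least 1.

```python
from collections import Counter


def decompose_labels(values, limit):
    """Turn a categorical column into binary columns, at most `limit` of them."""
    counts = Counter(values)
    if len(counts) <= limit:
        kept = [label for label, _ in counts.most_common()]
        pooled = False
    else:
        # Reserve one slot for '?' so the total never exceeds the limit.
        kept = [label for label, _ in counts.most_common(limit - 1)]
        pooled = True
    columns = {label: [int(v == label) for v in values] for label in kept}
    if pooled:
        kept_set = set(kept)
        columns["?"] = [int(v not in kept_set) for v in values]
    return columns


colors = ["red", "blue", "red", "green", "red", "blue", "pink"]
columns = decompose_labels(colors, limit=3)
# 'red' and 'blue' get their own columns; 'green' and 'pink' share the '?' column.
```

With a limit of 3, the two most frequent labels keep their own columns and the third column is '?'.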

Missing values

Treat value as missing if it is outside n*sigma is used to replace spikes (i.e. outliers) that exceed a certain number of standard deviations. All spikes that exceed n*sigma are replaced with missing values. The way in which missing values are handled can be set using the Manage missing values control.

The first option of Manage missing values throws an error when a NULL value is detected. The other options replace missing values with 0, the mean of the two neighbouring values (interpolation), the arithmetic mean of the current variable, its median, or its most frequent value.
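
As a rough illustration of the two steps described above, the following Python sketch first masks outliers and then fills the gaps. The function names and option keys are hypothetical. Measuring the distance from the mean with the population standard deviation is an assumption, because the documentation does not state how sigma is computed. The series is assumed to contain at least one present value.

```python
import statistics


def mask_outliers(values, n_sigma):
    """Replace values farther than n_sigma standard deviations from the mean with None."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    sigma = statistics.pstdev(present)  # assumption: population standard deviation
    return [None if v is not None and abs(v - mean) > n_sigma * sigma else v
            for v in values]


def fill_missing(values, method):
    """Handle missing values (None) with one of the documented strategies."""
    present = [v for v in values if v is not None]
    if method == "error":
        if len(present) < len(values):
            raise ValueError("missing value detected")
        return list(values)
    if method == "neighbours":
        filled = list(values)
        for i, v in enumerate(values):
            if v is None:
                left = next((x for x in reversed(values[:i]) if x is not None), None)
                right = next((x for x in values[i + 1:] if x is not None), None)
                sides = [x for x in (left, right) if x is not None]
                filled[i] = sum(sides) / len(sides)
        return filled
    fillers = {
        "zero": lambda xs: 0.0,
        "mean": statistics.fmean,
        "median": statistics.median,
        "most_frequent": statistics.mode,
    }
    fill = fillers[method](present)
    return [fill if v is None else v for v in values]


series = [10.0, 11.0, 9.0, 10.0, 500.0, 10.0, 11.0, 9.0]
masked = mask_outliers(series, n_sigma=2)    # the spike 500.0 becomes None
filled = fill_missing(masked, "neighbours")  # the gap becomes the mean of its neighbours
```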

Multiple files

If time series data is collected from a set of files, it is possible to ensure that all files contain recent observations and that all series are long enough to be preprocessed properly.

For short files that can't be preprocessed properly there are two options: Stop with error and Skip silently. This parameter matters only when a set of files is imported. If a file contains too few observations, you can instruct Preprocess to skip the file silently instead of throwing an error.

Non-recent data detection checks whether every series reaches the most recent timestamp found in the set of files. There are three options available:
No detection | Stop with error | Skip silently
The last option simply drops non-recent files from the preprocessing results.
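
The two checks can be sketched in Python as follows. This is only an illustration under stated assumptions: `filter_files`, its option strings and the `min_length` threshold are hypothetical names, and the documentation does not say how GMDH Shell decides that a file is too short.

```python
from datetime import date


def filter_files(files, min_length, short="skip", non_recent="skip"):
    """Apply the short-file and non-recent checks to a set of series.

    `files` maps a file name to a list of (timestamp, value) rows sorted by time.
    `short` is "error" or "skip"; `non_recent` is "none", "error" or "skip".
    """
    latest = max(rows[-1][0] for rows in files.values() if rows)
    kept = {}
    for name, rows in files.items():
        if not rows or len(rows) < min_length:
            if short == "error":
                raise ValueError(f"{name}: too few observations")
            continue
        if non_recent != "none" and rows[-1][0] < latest:
            if non_recent == "error":
                raise ValueError(f"{name}: does not reach {latest}")
            continue
        kept[name] = rows
    return kept


files = {
    "a": [(date(2024, 1, 1), 1.0), (date(2024, 2, 1), 2.0), (date(2024, 3, 1), 3.0)],
    "b": [(date(2023, 12, 1), 4.0), (date(2024, 1, 1), 5.0), (date(2024, 2, 1), 6.0)],
    "c": [(date(2024, 3, 1), 7.0)],
}
kept = filter_files(files, min_length=2)
# Only "a" survives: "b" ends before the latest timestamp and "c" is too short.
```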

Preprocess types

GMDH Shell uses different sets of preprocessing features for time series models and for all other models, i.e. classification and regression models. The proper set of features is usually loaded from a template. To change it, open the menu File > Configuration, Modules tab, Preprocess.

Simple forecasting

Simple forecasting is a panel in the program toolbar used to create time series models.

Horizon is the number of observations to be forecasted.

Holdout is used to withhold a sample of observations and to evaluate forecast accuracy using the out-of-sample data.

Time series forecasting

Time series forecasting is a panel in the program toolbar used to create time series models.

Horizon is the number of observations to be forecasted. GMDH Shell can perform several simulations with different forecast horizons in one run. To use this feature, separate the desired forecast horizons with a comma, for example 1,6,12.

Holdout is used to withhold a sample of observations and to evaluate forecast accuracy using the out-of-sample data.
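
The Horizon and Holdout settings can be pictured with a small Python sketch. The helper names are hypothetical, and the naive "repeat the last value" forecast only stands in for a real model so that the out-of-sample error can be computed.

```python
def parse_horizons(text):
    """Parse a comma-separated Horizon value such as '1,6,12'."""
    return [int(part) for part in text.split(",") if part.strip()]


def split_holdout(series, holdout):
    """Withhold the last `holdout` observations for out-of-sample evaluation."""
    if holdout <= 0:
        return list(series), []
    return list(series[:-holdout]), list(series[-holdout:])


def mean_absolute_error(actual, forecast):
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


horizons = parse_horizons("1,6,12")
train, test = split_holdout([3, 4, 5, 6, 7, 8], holdout=2)
naive = [train[-1]] * len(test)  # placeholder forecast, not a GMDH model
error = mean_absolute_error(test, naive)
```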

Repeat is used to perform ex-post forecasts for several previous horizons at once.

Window validation is the number of experiments used for window size optimization. Optimal window size improves forecast accuracy.

Regression and classification

Hold-out data panel is used to apply regression and classification models to a subset of observations immediately after the models are obtained. The hold-out sample can be either used to evaluate out-of-sample accuracy or to predict unknown values of target variables.

Hold-out is used to select a subsample to which the model will be applied. Available options are to hold out: last observations | observations uniformly | missing target values

The next two controls set the exact number of dataset observations or the percentage of the dataset to be reserved for the hold-out.
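
The three hold-out options can be sketched in Python as index selection. The function `holdout_indices` and its parameters are hypothetical. In particular, the way "uniformly" spreads observations across the dataset is an assumption, since the exact rule is not documented here.

```python
def holdout_indices(targets, mode, count=None, percent=None):
    """Return the row indices reserved for the hold-out sample."""
    n = len(targets)
    if mode == "missing":
        return [i for i, t in enumerate(targets) if t is None]
    size = count if count is not None else round(n * percent / 100)
    size = max(0, min(size, n))
    if mode == "last":
        return list(range(n - size, n))
    if mode == "uniform":
        if size == 0:
            return []
        step = n / size
        # Assumption: take the middle row of each of `size` equal slices.
        return [int(i * step + step / 2) for i in range(size)]
    raise ValueError(f"unknown mode: {mode}")


targets = [1.0, 2.0, None, 4.0, 5.0, 6.0, None, 8.0, 9.0, 10.0]
last = holdout_indices(targets, "last", percent=20)
uniform = holdout_indices(targets, "uniform", count=2)
missing = holdout_indices(targets, "missing")
```

Holding out rows with missing target values corresponds to the prediction use case: the model is fitted on rows with known targets and then applied to the rest.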

Preprocess script

The script is used to specify commands that generate input variables, target variables and various transformations. Each preprocess command occupies one line and can refer to a group of variables.

One-line commands have the following format:

FileName.VariableName, Transformation1(), Transformation2(), …

If there is only one file connected, the "FileName." prefix is not required. If a spreadsheet (.xls or .xlsx) contains data in more than one sheet, all variables must be referred to as "FileName:SheetName.VariableName".

To refer to a group of variables use the asterisk symbol "*", for example:

FileName.* is used to select all variables from a file.
*.VariableName is used to select variables with certain name from all files.
*.* is used to select all variables from all files (and all sheets).
* is used to select all variables if there is only one file connected.
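
The wildcard rules above can be mimicked with a few lines of Python. This sketch is an illustration only: `select` is a hypothetical helper, and it deliberately ignores quoted names, sheet names and file names that contain a dot.

```python
def select(pattern, variables):
    """Expand a 'File.Variable' pattern in which either part may be '*'.

    `variables` is a list of (file, variable) pairs.
    """
    files = {f for f, _ in variables}
    if "." in pattern:
        file_part, var_part = pattern.split(".", 1)
    else:
        # Without a prefix the pattern refers to the only connected file.
        if len(files) != 1:
            raise ValueError("a file prefix is required when several files are connected")
        file_part, var_part = "*", pattern
    return [(f, v) for f, v in variables
            if file_part in ("*", f) and var_part in ("*", v)]


variables = [("sales", "price"), ("sales", "units"), ("stock", "price")]
from_sales = select("sales.*", variables)   # all variables of one file
all_prices = select("*.price", variables)   # one variable name across files
everything = select("*.*", variables)       # all variables of all files
```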

If a file name or a variable name contains spaces or reserved symbols, each such name must be quoted, for example: "File Name"."Variable Name". If a name contains a double quote ", the quote must be doubled, for example a"b"c ⇒ "a""b""c"

A semicolon ";" starts a one-line comment.
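
The quoting rule can be expressed as a small Python helper. The list of reserved symbols is not given in this document, so the sketch assumes that any name containing characters other than letters, digits and underscores needs quotes. `quote_name` is a hypothetical function name.

```python
import re


def quote_name(name):
    """Quote a file or variable name, doubling any embedded double quotes."""
    # Assumption: plain identifiers need no quotes; everything else is quoted.
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        return name
    return '"' + name.replace('"', '""') + '"'


reference = quote_name("File Name") + "." + quote_name("Variable Name")
doubled = quote_name('a"b"c')
```

Here `reference` reproduces the "File Name"."Variable Name" example and `doubled` reproduces the "a""b""c" example.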

Transformations

Here is the list of available transformations.

Elementary functions

Transformation	Notation example	Description
Square	`x, sqr`	`y=x^2`
Square root	`x, sqrt`	`y=x^(1/2)`
Cube	`x, cube`	`y=x^3`
Cube root	`x, cubert`	`y=x^(1/3)`
Exp	`x, exp`	`y=exp(x)`
Logarithm	`x, ln`	`y=ln(|x|), x<>0`
Sine	`x, sin`	`y=sin(x)`
Cosine	`x, cos`	`y=cos(x)`
Arctangent	`x, arctang`	`y=arctang(x)`
Abs value	`x, abs`	`y=|x|`
Sign	`x, sign`	`y=sign(x)`
Floor	`x, floor`	`y=[x]`
Fractional part	`x, frac`	`y=x-[x]`
Normalization	`x, norm(b1,b2)`	`b1` is lower boundary, `b2` is upper boundary
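
For readers who prefer code to formulas, the table can be restated as a Python sketch. Two points are assumptions rather than documented behaviour: `cubert` is taken to return the real cube root for negative inputs, and `norm(b1,b2)` is taken to be linear min-max scaling of the column to the range from `b1` to `b2`.

```python
import math

ELEMENTARY = {
    "sqr": lambda x: x ** 2,
    "sqrt": math.sqrt,
    "cube": lambda x: x ** 3,
    "cubert": lambda x: math.copysign(abs(x) ** (1 / 3), x),  # assumption: real root
    "exp": math.exp,
    "ln": lambda x: math.log(abs(x)),  # undefined for x == 0
    "sin": math.sin,
    "cos": math.cos,
    "arctang": math.atan,
    "abs": abs,
    "sign": lambda x: (x > 0) - (x < 0),
    "floor": math.floor,
    "frac": lambda x: x - math.floor(x),
}


def norm(values, b1, b2):
    """Assumed meaning of norm(b1,b2): min-max scaling to [b1, b2]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [b1 for _ in values]  # a constant column cannot be scaled
    return [b1 + (v - lo) * (b2 - b1) / (hi - lo) for v in values]


scaled = norm([2.0, 4.0, 6.0], 0.0, 1.0)
fraction = ELEMENTARY["frac"](2.75)
```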

Time series

Transformation	Notation example	Description
Lag	`x@a-b:c`	`a` is min lag, `b` is max lag, `c` is step. Example: step of 3 applied to 0-12 leaves only 0, 3, 6, 9 and 12. This helps to reduce the number of variables.
All lags	`x@*:a`	`a` is step. Generates lags while dataset length allows. Applies to inputs only.
Moving average	`x,SMA(a)`	`a` is window length; an integer 2..1000
Exponential MA	`x,EMA(a)`	`a` is quotient; a real [0.01, 1)
Derivative	`x,d`	`y = x[t] - x[t-1]`
Window size	`x,window(a)`	`a` is window size; an integer ≥ 2. Applies to targets only.
Weighted by time	`x,weighted_by_time`	Sets higher weights for later observations in proportion: 1, 2, 3, …, n
Fourier series	`x,fourier(a)`	`a` is period; an integer ≥ 2. Fourier series: cos and sin(2πkx/T). Generates multiple variables, does not stack with other generating transformations
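
Several rows of this table can be illustrated with a Python sketch. The function names are hypothetical. Treating the moving average as a trailing window, and the EMA parameter as the weight given to the newest observation, are assumptions that the table does not confirm.

```python
def lag_range(a, b, c=1):
    """Lags produced by x@a-b:c, from min lag a to max lag b with step c."""
    return list(range(a, b + 1, c))


def lagged(series, lag):
    """Shift a series by `lag` steps; rows without history become None."""
    return [series[t - lag] if t - lag >= 0 else None for t in range(len(series))]


def sma(series, a):
    """Assumed trailing simple moving average over `a` observations."""
    return [sum(series[t - a + 1:t + 1]) / a if t >= a - 1 else None
            for t in range(len(series))]


def ema(series, a):
    """Assumed exponential moving average with weight `a` on the newest value."""
    out = []
    for x in series:
        out.append(x if not out else a * x + (1 - a) * out[-1])
    return out


def derivative(series):
    """y = x[t] - x[t-1]; the first row has no predecessor."""
    return [None if t == 0 else series[t] - series[t - 1] for t in range(len(series))]


series = [1.0, 2.0, 4.0, 7.0]
lags = lag_range(0, 12, 3)  # the step of 3 applied to lags 0-12
```

With the lag example from the table, `lags` holds 0, 3, 6, 9 and 12.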

Date/time

Transformation	Notation example	Description
Year	`x,year`	Year number in Gregorian calendar.
Year fraction	`x,year_frac`	Year fraction in Gregorian calendar. 0 = Jan 1st 00:00, 1 = Dec 31st 24:00.
Month	`x,month,decompose`	Month number in Gregorian calendar. 1 = Jan, 12 = Dec.
Month fraction	`x,month_frac`	Month fraction in Gregorian calendar. 0 = 1st 00:00, 1 = 31st 24:00.
Day of month	`x,day,decompose`	Day of month in Gregorian calendar (1-31).
Day of week	`x,dayofweek,decompose`	Day of week, 1 = Mon, 7 = Sun.
Day fraction	`x,day_frac`	Fraction of the day. 0 = 00:00, 1 = 24:00.
Hour	`x,hour,decompose`	Hour (0..23).
Hour fraction	`x,hour_frac`	Fraction of the hour. 0 = ##:00:00, 1 = ##:60:00.
Minute	`x,minute,decompose`	Minute (0..59).
Second	`x,second,decompose`	Second (0..59).
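
The fraction-style transformations can be reproduced with Python's standard `datetime` module. This is a sketch of the definitions in the table, not GMDH Shell code, and the function names are chosen to mirror the notation.

```python
from datetime import datetime, timedelta


def year_frac(ts):
    """0 at Jan 1st 00:00, approaching 1 at the end of Dec 31st."""
    start = datetime(ts.year, 1, 1)
    end = datetime(ts.year + 1, 1, 1)
    return (ts - start) / (end - start)


def month_frac(ts):
    """0 at the 1st 00:00, approaching 1 at the end of the month's last day."""
    start = datetime(ts.year, ts.month, 1)
    end = datetime(ts.year + (ts.month == 12), ts.month % 12 + 1, 1)
    return (ts - start) / (end - start)


def day_frac(ts):
    midnight = datetime(ts.year, ts.month, ts.day)
    return (ts - midnight) / timedelta(days=1)


def hour_frac(ts):
    return (ts.minute * 60 + ts.second + ts.microsecond / 1e6) / 3600


def day_of_week(ts):
    return ts.isoweekday()  # 1 = Mon, 7 = Sun, as in the table


ts = datetime(2023, 7, 2, 12, 30)
features = {
    "year": ts.year,
    "month": ts.month,
    "day": ts.day,
    "dayofweek": day_of_week(ts),
    "day_frac": day_frac(ts),
    "hour_frac": hour_frac(ts),
}
```

The integer features (month, day, day of week, hour, minute, second) appear in the table together with `decompose`, which turns each distinct value into its own binary column.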

Calendar

Transformation	Notation example	Description
Day off	`x,isdayoff`	Returns 1 when the date is a Saturday, a Sunday or a holiday of a specific country. Otherwise returns 0.
Workdays per period	`x,workdaysperperiod`	Returns the number of workdays (Mon-Fri, except holidays) in a period (week, month, etc.).
Decomposed holidays	`x,holidays`	For each holiday in the holiday set, returns 1 if it falls into a period (week, month, etc.). Produces as many variables as holidays in the year.
Is holiday	`x,isholiday`	Returns 1 if a holiday falls into a period (week, month, etc.).
Find non-holiday	`x,nonholiday_day`	Steps a day (or a week) back until a non-holiday is found.
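
The calendar transformations can be sketched in Python with a hand-written holiday set. The `HOLIDAYS` set below is a made-up example, not the holiday list of any country shipped with GMDH Shell, and the function names are hypothetical. Periods are given by an inclusive start and end date.

```python
from datetime import date, timedelta

HOLIDAYS = {date(2024, 1, 1), date(2024, 12, 25)}  # hypothetical holiday set


def is_day_off(d, holidays=HOLIDAYS):
    """1 for Saturday, Sunday or a holiday; otherwise 0."""
    return int(d.isoweekday() >= 6 or d in holidays)


def workdays_per_period(start, end, holidays=HOLIDAYS):
    """Number of Mon-Fri days that are not holidays in [start, end]."""
    days = (end - start).days + 1
    return sum(1 - is_day_off(start + timedelta(days=i), holidays) for i in range(days))


def is_holiday_in_period(start, end, holidays=HOLIDAYS):
    """1 if any holiday falls into [start, end]; otherwise 0."""
    return int(any(start <= h <= end for h in holidays))


def previous_non_holiday(d, holidays=HOLIDAYS):
    """Step back one day at a time until a non-holiday is found."""
    while d in holidays:
        d -= timedelta(days=1)
    return d


january_workdays = workdays_per_period(date(2024, 1, 1), date(2024, 1, 31))
```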

Weighted instances

Transformation	Notation example	Description
Weighted by time	`x,weighted_by_time`	Sets higher weights for later observations in proportion: 1, 2, 3, …, n.
Balanced classes	`x,balanced_classes`	Changes instance weights so that classes will have equal importance for training regardless of their proportion.
Manual 2-class bias	`x,weighted_two_class(0.5)`	Manually weighs the upper class among the two.
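
Two of the weighting schemes can be written down directly in Python. The time weights follow the 1, 2, 3, …, n proportion stated in the table. The class weights use the common "total / (number of classes × class size)" formula, which is an assumption about how equal class importance is achieved.

```python
from collections import Counter


def weighted_by_time(n):
    """Weights 1, 2, 3, ..., n so that later observations count more."""
    return list(range(1, n + 1))


def balanced_classes(labels):
    """Assumed balancing rule: every class receives the same total weight."""
    counts = Counter(labels)
    return [len(labels) / (len(counts) * counts[y]) for y in labels]


labels = ["a", "a", "a", "b"]
weights = balanced_classes(labels)
# The three 'a' rows together weigh as much as the single 'b' row.
```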

Special variables

Transformation	Notation example	Description
Time	`|time`	Time variable is simply a counter: 0, 1, 2, 3, … .
Target	`|target`	Feeds target variable to input. This variable is useful when solving multiple time series problems, or when making a template.
ID	`|id`	Feeds ID or date/time variable of the dataset.
First column	`|firstcolumn(a)`	`a` is step; an integer ≥ 4. Feeds the N'th variable counting from the first one.
Last column	`|lastcolumn(a)`	`a` is step; an integer ≥ 4. Feeds the N'th variable counting from the last one.

Transformation	Notation example	Description
Decompose categories	`x,decompose`	Decomposes categorical column to binary columns.


GMDH Shell Documentation
