|iris Data Source|
|Edit Data Source|
|Edit Data Source Path|
Despite its spreadsheet feel and list of applied steps, the "Data Source" section has very few options. In fact, the steps we see applied here are ALL of the steps available. We cannot do any data transformation or manipulation in this tab. However, we do have an interesting option at the top of the tab called "Metrics", which shows the following summary statistics for each column.
Max Value: Largest Value in the Column
Min Value: Smallest Value in the Column
Count: Number of Records with Values in this column
Quantile at 50%: A measure of the "central" value in a dataset. If the dataset was sorted, 50% of the values would be equal to or below this value.
Median: Same as Quantile at 50%
Kurtosis: Heaviness of the tails of the distribution, i.e. a measure of how many extreme observations it generates. This is often loosely described as the "steepness" of the distribution.
Quantile at 75%: If the dataset was sorted, 75% of the values would be equal to or below this value.
Number of Missing Values: Number of Records with No Value in this column
Column Data Type: The type of values that appears in the column.
Standard Deviation: Spread of the distribution, i.e. a measure of how far the values in the column typically fall from the mean.
Variance: Spread of the distribution, i.e. a measure of how far the values in the column typically fall from the mean. This is the square of the Standard Deviation.
Quantile at 25%: If the dataset was sorted, 25% of the values would be equal to or below this value.
Is Numeric Column: Whether the values in the column are numbers and have the appropriate data type to match.
Number of NaNs: The number of records that contain values which are not numbers. This does not consider the data type of the column and does not include missing values.
Mean Value: A measure of the "central" value in a dataset. Calculated as the sum of all values in the column, divided by the number of values. Commonly referred to as the "Average".
Unbiased Standard Error of the Mean: A measure of the stability of the "Sample Mean" across samples. If we assume our dataset is a sample of a larger distribution, then that distribution likely has a Mean. However, since our sample is only part of the overall distribution, the mean of the sample will take different values based on which records are included in the sample. Therefore, we can say that the sample mean has its own distribution, known as a "Sampling Distribution". The standard deviation of this distribution is known as the "standard error of the sample mean". It is typically estimated as the sample standard deviation divided by the square root of the number of values.
Skewness: A measure of how NOT symmetric the distribution is, i.e. positive values signify the data has a longer tail of values larger than the mean, negative values signify the data has a longer tail of values smaller than the mean.
Most Common: The most common value in the column. Commonly called the "Mode". This only applies to non-numeric columns.
Count of Most Common: The number of Records that contain the most common value. This only applies to non-numeric columns.
Number of Unique Values: The number of distinct values within the column, i.e. every distinct value is counted only once, regardless of how many times it appears in the column.
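To make these definitions concrete, here is a minimal sketch of how the same metrics could be computed by hand using only the Python standard library. This is not how the tool computes them internally. The function name `column_metrics`, the use of `None` to mark a missing value, the "inclusive" quantile method, and the choice to report population skewness and excess kurtosis are all assumptions made for illustration, and the sketch expects a numeric column to hold at least two non-NaN values. Quantile, skewness, and kurtosis conventions differ between tools, so small differences from the values shown in the Metrics view would not be surprising.

```python
import math
import statistics
from collections import Counter


def column_metrics(values):
    """Summarize one column. None marks a missing value (an assumption of this sketch)."""
    present = [v for v in values if v is not None]
    # A column counts as numeric only if every present value is an int or float.
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    metrics = {
        "count": len(present),                  # records with a value
        "missing": len(values) - len(present),  # records with no value
        "unique": len(set(present)),            # distinct values
        "is_numeric": is_numeric,
    }
    if not is_numeric:
        # Mode and its frequency, reported for non-numeric columns only.
        value, freq = Counter(present).most_common(1)[0]
        metrics["most_common"] = value
        metrics["count_of_most_common"] = freq
        return metrics

    metrics["nan_count"] = sum(math.isnan(v) for v in present)
    nums = [float(v) for v in present if not math.isnan(v)]
    n = len(nums)
    mean = statistics.fmean(nums)
    std = statistics.stdev(nums)  # sample standard deviation (divides by n - 1)
    q25, q50, q75 = statistics.quantiles(nums, n=4, method="inclusive")

    # Central moments, used for population skewness and excess kurtosis.
    m2 = sum((x - mean) ** 2 for x in nums) / n
    m3 = sum((x - mean) ** 3 for x in nums) / n
    m4 = sum((x - mean) ** 4 for x in nums) / n

    metrics.update(
        min=min(nums),
        max=max(nums),
        mean=mean,
        median=q50,  # same as the quantile at 50%
        q25=q25,
        q75=q75,
        std=std,
        variance=statistics.variance(nums),  # square of the standard deviation
        sem=std / math.sqrt(n),              # standard error of the mean
        skewness=m3 / m2 ** 1.5 if m2 > 0 else float("nan"),
        kurtosis=m4 / m2 ** 2 - 3 if m2 > 0 else float("nan"),
    )
    return metrics


# Hypothetical columns, not taken from iris.csv.
numeric_summary = column_metrics([1, 2, None, 3, 4, 10])
text_summary = column_metrics(["setosa", "setosa", "virginica", None])
print(numeric_summary)
print(text_summary)
```

In the numeric example the single large value (10) pulls the mean above the median and produces a positive skewness, which matches the description of Skewness above.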
|Prepare (from Data view)|
|Prepare (from Metrics view)|
Instead of going through the rest of the options available for editing in the Data Sources tab, let's create a new data source to see it from scratch. We've made a copy of the "iris.csv" file on our local machine.
|Add Data Source|
|File Details 1|
|File Details 2|
|File Details 3|
|Data Types 1|
|Data Types 2|
|iris - Copy Data Source|
Senior Analytics Associate - Data Science