Euclidean Distance

Euclidean distance is the metric (distance function) normally used in everyday computations of distance between two points A and B. The feature parameter Metric allows users to specify the distance function to be used in computing each feature's contribution to the distance between two artifacts A and B. The metric can be set in the Features table.

Under Euclidean distance, the distance between two artifacts A and B is the square root of the sum of the squared differences in each of the separate dimensions (features). It is the shortest, or "as the crow flies", distance between two points.
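The definition above can be sketched in a few lines of Python. This is only an illustration, not the program's own implementation: the function name euclidean_distance and the two sample artifacts are hypothetical, and each artifact is assumed to be a plain sequence of numeric feature values.

```python
import math


def euclidean_distance(a, b):
    """Return the Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("artifacts must have the same number of features")
    # Sum the squared per-feature differences, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical artifacts, each described by two numeric features.
artifact_a = (1.0, 2.0)
artifact_b = (4.0, 6.0)

# The differences are 3 and 4, so the distance is sqrt(9 + 16).
print(euclidean_distance(artifact_a, artifact_b))  # 5.0
```

The function accepts any number of features, so the same sketch covers both the two-feature case and the general n-feature case.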

For example, with two features the distance d(A, B) from artifact A to artifact B would be given by the formula:

$$d(A, B) = \sqrt{X^2 + Y^2}$$

where X is the difference in the two artifacts' values for one feature, and Y is the difference in the values for the other feature. In general, if we have n features and two artifacts A and B whose feature values are (a1, a2, a3, ..., an) and (b1, b2, b3, ..., bn) respectively:

$$d(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

The feature parameter Metric can be used to specify that a feature should use the Euclidean distance function in computing distances between artifacts. The alternatives to Euclidean distance are Manhattan distance and Hamming distance. If you use different metrics for different features of your data, and especially if any feature uses Hamming distance, it may be advisable to normalize your data, in particular the features that use Euclidean or Manhattan distance.
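One common way to normalize a numeric feature is min-max scaling, which maps its values onto the range 0 to 1. The sketch below is an assumption for illustration only: the text does not say which normalization method the program uses, and the function name min_max_normalize and the sample values are hypothetical. The motivation, assuming a Hamming feature contributes either 0 or 1, is that an unscaled numeric feature with a wide range could otherwise dominate the total distance.

```python
def min_max_normalize(values):
    """Rescale a list of numeric feature values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature cannot separate artifacts; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature column: one numeric value per artifact.
lengths = [120.0, 80.0, 200.0, 160.0]

# The smallest value maps to 0.0 and the largest to 1.0.
print(min_max_normalize(lengths))
```

After scaling, the largest possible difference in this feature is 1, which puts it on a footing comparable to a feature whose contribution is either 0 or 1.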

One difference between Euclidean distance and Manhattan distance is that Euclidean distance weights large differences disproportionately more than small ones, because each difference is squared before the sum is taken. Using Euclidean distance, the distance between two artifacts that differ by one unit in each of two features (the square root of two, about 1.41) is less than the distance between two artifacts that differ by two units in only one feature (two), whereas both distances would be equal (two) using Manhattan distance.
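The comparison above can be checked with a short Python sketch. The helper names euclidean and manhattan are hypothetical and take the per-feature differences directly; this is an illustration of the arithmetic, not the program's own code.

```python
import math


def euclidean(diffs):
    """Square root of the sum of squared per-feature differences."""
    return math.sqrt(sum(d * d for d in diffs))


def manhattan(diffs):
    """Sum of the absolute per-feature differences."""
    return sum(abs(d) for d in diffs)


spread = (1, 1)        # one unit of difference in each of two features
concentrated = (2, 0)  # two units of difference in a single feature

# Euclidean: about 1.414 for the spread case, exactly 2.0 for the concentrated case.
print(euclidean(spread), euclidean(concentrated))

# Manhattan: 2 in both cases.
print(manhattan(spread), manhattan(concentrated))
```

Both pairs of artifacts differ by a total of two units, but only Euclidean distance treats the concentrated difference as the larger one.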