Setting Parameters

There are four inherently ambiguous situations in the data in which a feature value could be interpreted as absent or present and, at the same time, as a zero or an unknown. In these situations you need to tell OptiPath what to do by setting the feature parameters Earlier, Later, Blanks and Zeroes appropriately. The first two parameters tell OptiPath how to treat items that precede or follow those in your sample. For each feature you can indicate whether the feature should be assumed present, absent or undetermined for an earlier (or later) item. You can also indicate whether the value of the feature for an earlier (or later) item should be considered as zero or as an unknown quantity. For more detail see Earlier And Later.

Blanks and zeroes are also inherently ambiguous. Some people use blanks as shorthand for values of 0, particularly in occurrence seriation, where the values would normally be 0 and 1 and, for simplicity's sake, only the 1's are entered in the data matrix. At other times a blank indicates that a feature is missing or absent for an item (for example, the feature might be the color of an ornamental band on pottery, and the feature might be absent on some artifacts). Alternatively, a blank could signify the absence of data rather than the absence of the feature: we simply don't know anything about this feature for this artifact. Similarly, zeroes could be interpreted variously as zero values, as unknown values, as absence, as presence, or as a combination such as presence and unknown.

Whatever convention we are using, we need to let OptiPath know by setting the feature parameters Earlier, Later, Blanks and Zeroes in the Features table.

The implications of treating earlier, later, blanks and zeroes as values, absences or unknowns are rather complicated. Consider the eleven features A, B, C, ..., K in Table 1 below. Each row represents an item (artifact or assemblage) and each column represents a feature. For simplicity we will consider only data entries of 0 and 1, which is typical of occurrence seriation. Part of the complication is that we do not know a priori whether a 0 represents a measurement of a feature, an indication that the feature is absent, or simply a lack of knowledge about the feature.

Table 1

Feature     A  B  C  D  E  F  G  H  I  J  K
EARLIER
Item 1      1  0  0  1  1  0  0  1  1  0  0
Item 2      1  0  0  1  1  0  0  1  1  0  0
Item 3      1  0  0  1  1  0  0  1  1  0  0
Item 4      1  0  0  1  1  0  0  1  0  0  1
Item 5      1  0  0  1  1  0  1  0  0  1  1
Item 6      1  0  1  1  0  0  1  0  0  1  1
Item 7      1  0  1  1  0  1  1  0  0  1  0
Item 8      1  0  1  0  0  1  1  0  1  0  0
Item 9      1  0  1  0  0  1  0  1  1  0  0
Item 10     1  0  1  0  0  1  0  1  1  0  1
Item 11     0  0  1  0  1  1  0  1  0  0  1
Item 12     0  1  1  0  1  0  0  1  0  1  1
Item 13     0  1  1  0  1  0  1  0  0  1  0
Item 14     0  1  1  0  1  0  1  0  0  1  0
Item 15     0  1  1  1  1  0  1  0  1  0  0
Item 16     0  1  0  1  0  0  1  0  1  0  1
Item 17     0  1  0  1  0  1  0  1  1  0  1
Item 18     0  1  0  1  0  1  0  1  0  0  1
Item 19     0  1  0  1  0  1  0  1  0  1  0
Item 20     0  1  0  1  0  1  0  1  0  1  0
Item 21     0  1  0  1  0  1  0  1  0  1  0
LATER
Score I     1  1  1  2  2  2  2  3  3  3  3
Score II    1  1  2  2  3  3  4  4  5  5  6
Score III   2  2  3  4  5  5  6  7  8  8  9

Typically, in occurrence seriation, the objective is to order, or seriate, the artifacts so that the result is exactly one "string" of consecutive 1's (uninterrupted by 0's) in each column (feature). In the matrix above, features A, B and C have exactly one string of ones. Features D, E, F and G have exactly two strings of ones. Features H, I, J and K have exactly three strings of ones.

The question is, in terms of our objective to create exactly one string of 1's in each column, how would we rank the columns? Is A better or worse than C? Each has exactly one string of ones. There are two complications. The first is whether a 0 indicates a measurement of a feature or an absence of a feature or simply a lack of knowledge of a feature. The second is what is happening before the earliest artifact and after the latest.

If 1 represents the presence of a feature and 0 represents the absence, then (ignoring what's going on earlier and later) there is no reason to prefer the arrangement in column A to column C. In either case the feature appears in the archaeological record, persists uninterruptedly for some time and then disappears. However, if 0 and 1 represent two possible feature values (for example, "red" and "blue"), then presumably column A would be preferable to column C. In this case, feature A is blue for all artifacts for an uninterrupted period of time followed by an extended period where artifacts are red without exception; but feature C is red for a while, then blue, and then red again. For feature C, the reds are interrupted. In other words, column C has an interrupted string of 0's. If 0's and 1's both represent feature values (rather than absences and presences of a single feature) then uninterrupted strings of 0's are just as important as uninterrupted strings of 1's.
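The difference between the two readings can be checked by counting the strings of each symbol separately. The following is a minimal sketch only: the helper name count_strings and the encoding of columns as text are assumptions made for illustration and are not part of OptiPath.

```python
from itertools import groupby

# Columns A and C of Table 1, items 1 to 21, read from top to bottom.
COLUMN_A = "1" * 10 + "0" * 11
COLUMN_C = "0" * 5 + "1" * 10 + "0" * 6

def count_strings(column, symbol):
    """Count the maximal uninterrupted runs of `symbol` in a column."""
    return sum(1 for value, _ in groupby(column) if value == symbol)

for name, column in (("A", COLUMN_A), ("C", COLUMN_C)):
    print(name,
          "strings of 1's:", count_strings(column, "1"),
          "strings of 0's:", count_strings(column, "0"))
# A strings of 1's: 1 strings of 0's: 1
# C strings of 1's: 1 strings of 0's: 2
```

If only the 1's matter, the two columns tie at one string each. If the 0's are feature values too, column C is worse because its 0's fall into two strings.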

If 0's represent unknowns, or the absence of data rather than the absence of a feature, then they are irrelevant. We cannot draw any conclusion in the absence of data and all eleven column patterns are equally desirable.

The situation is further complicated by considering artifacts earlier and later than those in the data. If earlier artifacts are assumed to have a feature value of 0, then column A becomes a string of 1's with 0's on both sides, just like column C. The two columns are then indistinguishable, even under the interpretation above in which 0 and 1 are both feature values and A would otherwise be preferred.
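To see the effect of such an assumption, you can prepend a single assumed earlier item to each column and compare the resulting patterns. This is an illustrative sketch, not OptiPath's behavior: the function pattern and the idea of modelling "earlier" as one extra row are assumptions.

```python
from itertools import groupby

# Columns A and C of Table 1, items 1 to 21, read from top to bottom.
COLUMN_A = "1" * 10 + "0" * 11
COLUMN_C = "0" * 5 + "1" * 10 + "0" * 6

def pattern(column):
    """Collapse every run to one symbol, e.g. '0001100' becomes '010'."""
    return "".join(value for value, _ in groupby(column))

# Assumption: an earlier item with feature value 0 precedes the sample.
with_earlier_a = "0" + COLUMN_A
with_earlier_c = "0" + COLUMN_C

print(pattern(COLUMN_A), pattern(COLUMN_C))              # 10 010
print(pattern(with_earlier_a), pattern(with_earlier_c))  # 010 010
```

Within the sample the two columns have different patterns, but once an earlier 0 is assumed both collapse to the same pattern.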

Ideally, in different situations you would like to be able to differentiate between columns A, B and C in the table above, each of which has one string of 1's. Similarly you might want to differentiate among D, E, F and G which each have two strings of ones, and among H, I, J and K which each have three strings of ones. The last three rows in the table introduce scoring systems which help to discriminate between different types of columns.

The first scoring mechanism (Score I) is simply a count of the number of strings of ones, so it helps in minimizing the number of strings created, but does not help in differentiating among columns that have the same number of strings in different patterns.
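As a check on the Score I row, the count can be reproduced directly from the 0/1 entries of Table 1. The sketch below is for illustration only; the names ROWS and score_i are assumptions, and the code counts strings directly rather than imitating how OptiPath computes its scores.

```python
from itertools import groupby

# Items 1 to 21 of Table 1, one character per feature A to K.
ROWS = """
10011001100 10011001100 10011001100
10011001001 10011010011 10110010011
10110110010 10100110100 10100101100
10100101101 00101101001 01101001011
01101010010 01101010010 01111010100
01010010101 01010101101 01010101001
01010101010 01010101010 01010101010
""".split()

# Transpose the rows so that each feature becomes one column string.
columns = dict(zip("ABCDEFGHIJK", ("".join(col) for col in zip(*ROWS))))

def score_i(column):
    """Score I: the number of uninterrupted strings of 1's."""
    return sum(1 for value, _ in groupby(column) if value == "1")

print([score_i(columns[name]) for name in "ABCDEFGHIJK"])
# [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
```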

The second scoring mechanism (Score II) goes a little further. It differentiates between A or B and C. Unfortunately it fails to differentiate between C and D. This is because Score II is simply a count of how many times in a column there is a transition from 0's to 1's and from 1's to 0's, ignoring what comes earlier or later. This is equivalent to (actually one less than) counting the number of strings (both 0's and 1's) in a column.
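The transition count, and its relation to the number of strings, can be sketched for columns A to D as follows. The function names are assumptions chosen for this example, and the code illustrates the counting rule only.

```python
from itertools import groupby

# Columns A to D of Table 1, items 1 to 21, read from top to bottom.
COLUMNS = {
    "A": "1" * 10 + "0" * 11,
    "B": "0" * 11 + "1" * 10,
    "C": "0" * 5 + "1" * 10 + "0" * 6,
    "D": "1" * 7 + "0" * 7 + "1" * 7,
}

def score_ii(column):
    """Score II: transitions between neighbouring items inside the sample."""
    return sum(a != b for a, b in zip(column, column[1:]))

def count_all_strings(column):
    """Number of strings of either symbol (0's and 1's together)."""
    return sum(1 for _ in groupby(column))

for name, column in COLUMNS.items():
    print(name, score_ii(column), count_all_strings(column))
# A 1 2
# B 1 2
# C 2 3
# D 2 3
```

Columns C and D tie because each changes symbol twice within the sample, even though C has one string of 1's and D has two.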

The third scoring mechanism (Score III) is even better. It differentiates between A or B and C, and between C and D. It still fails to differentiate between A and B, but this is to be expected: columns A and B are symmetric (one is simply the reverse of the other), and seriation simply creates an ordering without telling you whether that ordering runs forward or backward in time (the Earlier and Later feature parameters in OptiPath actually give you some control over this). Score III is simply the sum of Score I and Score II.
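Both the sum and the symmetry between columns A and B can be checked with a few lines. This is an illustrative sketch under the counting definitions given above; the function names are assumptions and do not correspond to anything in OptiPath.

```python
from itertools import groupby

# Columns A to D of Table 1, items 1 to 21, read from top to bottom.
COLUMNS = {
    "A": "1" * 10 + "0" * 11,
    "B": "0" * 11 + "1" * 10,
    "C": "0" * 5 + "1" * 10 + "0" * 6,
    "D": "1" * 7 + "0" * 7 + "1" * 7,
}

def score_i(column):
    """Number of uninterrupted strings of 1's."""
    return sum(1 for value, _ in groupby(column) if value == "1")

def score_ii(column):
    """Number of transitions between neighbouring items."""
    return sum(a != b for a, b in zip(column, column[1:]))

def score_iii(column):
    """Score III as the sum of Score I and Score II."""
    return score_i(column) + score_ii(column)

print({name: score_iii(col) for name, col in COLUMNS.items()})
# {'A': 2, 'B': 2, 'C': 3, 'D': 4}

# Column B read from bottom to top is exactly column A.
print(COLUMNS["B"][::-1] == COLUMNS["A"])  # True
```

Reversing a column changes neither its number of strings nor its number of transitions, so no score built from these two counts alone can separate A from B.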

To accomplish Score I (there may be more than one way to do it) you can set the following feature parameters: Earlier = Absent, Later = Absent, Zeroes = Absent, Transition penalty = 0.5. It does not matter which distance function (Metric) you use.

To accomplish Score II you can set Earlier = Unknown, Later = Unknown, Zeroes = Absent, Transition penalty = 1. It does not matter which distance function (Metric) you use. Alternatively you could set Earlier = Unknown, Later = Unknown, Zeroes = Value, Transition penalty = 0 and use either the Manhattan or Hamming distance function.

To accomplish Score III you can set Earlier = Absent, Later = Absent, Zeroes = Value & Absent, Transition penalty = 0.5 and use either the Manhattan or Hamming distance function.
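One way to see why these settings produce the three scores is a toy cost model. Everything in the sketch below is an assumption made for illustration, since this section does not describe how OptiPath computes its costs internally: each item is reduced to a (present, value) pair, a change between known present and known absent costs the transition penalty, and two known values cost their absolute difference (which for 0/1 data is the same under Manhattan and Hamming distance).

```python
# Toy cost model: an assumption for illustration, not OptiPath's real code.
# A state is (present, value). `present` is True, False or None (unknown);
# `value` is a number, or None when there is no value to compare.

def encode(cell, zeroes):
    """Turn a '0' or '1' entry into a state under the Zeroes setting."""
    if cell == "1":
        return (True, 1)
    return {"Absent": (False, None),
            "Value": (True, 0),
            "Value & Absent": (False, 0)}[zeroes]

BOUNDARY = {"Absent": (False, None), "Unknown": (None, None)}

def step_cost(left, right, penalty):
    cost = 0.0
    # Transition penalty: presence is known on both sides and differs.
    if left[0] is not None and right[0] is not None and left[0] != right[0]:
        cost += penalty
    # Distance between values: only when both sides carry a value.
    if left[1] is not None and right[1] is not None:
        cost += abs(left[1] - right[1])
    return cost

def column_cost(column, earlier, later, zeroes, penalty):
    """Sum the step costs from the Earlier state through to the Later state."""
    states = [BOUNDARY[earlier]]
    states += [encode(cell, zeroes) for cell in column]
    states.append(BOUNDARY[later])
    return sum(step_cost(a, b, penalty) for a, b in zip(states, states[1:]))

# Four of the columns of Table 1, items 1 to 21, read from top to bottom.
COLUMNS = {
    "A": "1" * 10 + "0" * 11,
    "C": "0" * 5 + "1" * 10 + "0" * 6,
    "D": "1" * 7 + "0" * 7 + "1" * 7,
    "K": "000111" * 3 + "000",
}

SETTINGS = {
    "Score I": dict(earlier="Absent", later="Absent",
                    zeroes="Absent", penalty=0.5),
    "Score II": dict(earlier="Unknown", later="Unknown",
                     zeroes="Absent", penalty=1),
    "Score II (alternative)": dict(earlier="Unknown", later="Unknown",
                                   zeroes="Value", penalty=0),
    "Score III": dict(earlier="Absent", later="Absent",
                      zeroes="Value & Absent", penalty=0.5),
}

for label, params in SETTINGS.items():
    costs = {name: column_cost(col, **params) for name, col in COLUMNS.items()}
    print(label, costs)
```

Under this toy model, columns A, C, D and K cost 1, 1, 2 and 3 with the Score I settings, 1, 2, 2 and 6 with either set of Score II settings, and 2, 3, 4 and 9 with the Score III settings. With Earlier and Later both Absent and a penalty of 0.5, every string of 1's pays once on entry and once on exit, which is why each string contributes exactly 1.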

It is up to you to decide which of these scoring mechanisms (or some other) is best suited to your data. It is worth thinking about because, even though the issues can be quite subtle, the effects can be quite significant.