One of the issues I find myself thinking about in the murkier moments of paleoanthropological reflection is the nature of the data available to us. I don’t mean by this the question of “how complete is the fossil record,” but rather “what is the fossil record” with relation to how we use data to test hypotheses. With renewed attention on issues of “access” in paleoanthropology and the increasing importance of data that involves intensive use of technology, it is worth considering the basic question of what paleoanthropological data is and how it affects our treatment of questions in paleoanthropology.
As an example, consider my favorite fossil specimen, the D2600 mandible from Dmanisi, seen here in all its glory:
This is a massive early Homo mandible, with a fantastic amount of dental attrition and a striking number of odd or noteworthy features. Very little reconstruction has gone into this mandible (you can see the slight crack near the coronoid process, for example), meaning this specimen is a pretty accurate representation of what the mandible actually looked like in life. In other words, reconstruction is not a part of the data acquisition process for this specimen and we can consider it primary data. The specimen does lack both gonial regions, the posterior edge of the ascending ramus, and a number of teeth that were lost in vivo. While not complete, the specimen is a pretty nice fossil as far as fossils go, and can therefore be easily converted into the kind of data that we can use to test a hypothesis.
However, there are additional accessory items that are part of the primary data associated with this fossil. This mandible was not found in isolation; it was found at a specific location, within a specific sedimentary layer, surrounded by specific fossils, archaeological materials and other sedimentary elements within the Dmanisi site. I was not yet working at Dmanisi when this mandible was found, but I have been fortunate enough to spend a lot of time digging in the immediate vicinity of where D2600 was found in the years since. This information is important because it gives us a better sense of the context from which the fossil derives. In the absence of this information, information we lack to varying degrees for many fossil finds, uncertainty about these contextual bits of data persists, potentially forever.
The fossil itself, however, is rarely the data we use for analysis. The data we use to test hypotheses can range from detailed anatomical description to metric assessments to 3D imaging analysis. Though the fossil is our primary source of information, the information it contains, untranslated, is too complex to be incorporated into a hypothesis-testing framework. Instead we reduce the complexity into something we can work with, but seldom do we acknowledge that in doing so we are selecting a limited amount of the available information. It then becomes a complex theoretical issue to balance two needs: representing the fossil in a way that lets you incorporate it into a hypothesis-testing framework, and assuring yourself that you are thoroughly and correctly representing the form and variation encompassed by the fossil.
For example, I could take a set of measurements of corpus height off of the D2600 mandible and use that as my representation of the specimen as a whole. The corpus height is certainly an interesting aspect of the specimen and one worth pursuing, so why not? It turns out that if you compare the corpus height of D2600 to the remainder of the Dmanisi sample, the difference is extremely large, greater than you would expect to find in a similar sample of humans, chimps and possibly even gorillas (I observed this in my dissertation; Skinner et al. 2006 observed the same in their analysis). If you take the same approach to looking at the breadth of the mandibular corpus, however, you would reach the opposite conclusion. To understand which measurement is more informative requires a lot of knowledge about how each trait varies within and between taxa, information that is often difficult to come by or requires a whole additional set of assumptions about the nature of variation in an unknown fossil taxon.
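To make the logic of that kind of single-trait comparison concrete, here is a minimal sketch of one common way to frame it: ask how often a random draw from a reference sample shows a max/min ratio as large as the one in the fossil sample. This is not necessarily the method used in my dissertation or in Skinner et al. 2006, and every number below is a made-up placeholder, not a real measurement of any Dmanisi or comparative specimen. The function names are hypothetical as well.

```python
import random

def max_min_ratio(values):
    """Largest value divided by smallest value in a sample."""
    return max(values) / min(values)

def resampled_exceedance(fossil_values, reference_values, n_iter=10000, seed=0):
    """Fraction of random reference subsamples (same size as the fossil
    sample) whose max/min ratio is at least as large as the fossil ratio."""
    rng = random.Random(seed)
    observed = max_min_ratio(fossil_values)
    k = len(fossil_values)
    hits = 0
    for _ in range(n_iter):
        subsample = rng.sample(reference_values, k)  # draw without replacement
        if max_min_ratio(subsample) >= observed:
            hits += 1
    return hits / n_iter

# PLACEHOLDER values for illustration only; these are NOT real measurements.
fossil_heights = [24.0, 26.0, 27.0, 40.0]
fossil_breadths = [18.0, 19.0, 19.5, 20.0]

# Synthetic "reference" samples drawn from arbitrary normal distributions.
data_rng = random.Random(1)
reference_heights = [data_rng.gauss(30.0, 3.0) for _ in range(100)]
reference_breadths = [data_rng.gauss(15.0, 2.0) for _ in range(100)]

height_result = resampled_exceedance(fossil_heights, reference_heights)
breadth_result = resampled_exceedance(fossil_breadths, reference_breadths)
print("corpus height exceedance:", height_result)
print("corpus breadth exceedance:", breadth_result)
```

The point of running the same procedure on two traits is that nothing in the procedure tells you which answer to believe. That judgment still rests on what you know, or assume, about how each trait varies.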
An alternative approach is to look at a large set of measurements (or landmark points or other spatial representations) simultaneously, incorporating multivariate methods to assess patterns of variation. This is great, but produces its own problem with fossils that are fragmentary and therefore limit the number of measurements available across samples for comparisons. For example, if you wanted to compare M1 measurements between D2600 and the other Dmanisi mandibles, a standard metric for comparison of mandibular dentition or corpora, you would be disappointed to discover that the absence of well-preserved M1s in D2600 prevents such comparisons, despite the overall solid preservation of the specimen. Given the rarity of fossils, it is frustrating to throw out a large number, or even some of them, because they lack available or homologous measurements. So the process of transcribing the fossil itself, the primary data, into secondary data that is usable is fraught with challenges and choices.
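The cost of requiring a shared set of measurements is easy to see in a toy example. The sketch below assumes the simplest possible handling of missing data, dropping any specimen that lacks one of the chosen variables, and shows the usable sample shrinking as variables are added. The specimen labels, variable names and values are all hypothetical placeholders, not real fossils or real measurements.

```python
def complete_cases(specimens, variables):
    """Names of specimens that preserve every one of the requested variables."""
    return [
        name
        for name, measurements in specimens.items()
        if all(measurements.get(v) is not None for v in variables)
    ]

# PLACEHOLDER data for illustration only. None marks a measurement that
# cannot be taken because the relevant anatomy is not preserved.
specimens = {
    "specimen_A": {"corpus_height": 30.0, "corpus_breadth": 20.0, "m1_length": None},
    "specimen_B": {"corpus_height": 25.0, "corpus_breadth": 18.0, "m1_length": 12.5},
    "specimen_C": {"corpus_height": 27.0, "corpus_breadth": None, "m1_length": 13.0},
    "specimen_D": {"corpus_height": 26.0, "corpus_breadth": 19.0, "m1_length": 12.8},
}

variable_sets = [
    ["corpus_height"],
    ["corpus_height", "corpus_breadth"],
    ["corpus_height", "corpus_breadth", "m1_length"],
]

for variables in variable_sets:
    kept = complete_cases(specimens, variables)
    print(len(variables), "variable(s):", len(kept), "specimen(s) usable:", kept)
```

With four specimens and three variables, half of the sample drops out of the full multivariate comparison, even though every specimen preserves at least two of the three measurements.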
Given the complex nature of fossil data, in its various constructions, the issue of access to data is also complex. I will get into the challenges of granting “access” to fossils and fossil data in a follow-up post.