DATA REVIEW

"Dyad-years and Data Management"

Paul D. Senese
Assistant Professor
University at Buffalo, SUNY

Journal articles that use the now-familiar dyad-year format in their data analyses have increased markedly in number in recent years. To a large extent, this has been a positive development, as it has allowed us to match our empirical tests more closely to the theory driving them. Fueling the use of the dyad-year design has been the development of the EUGene (Expected Utility Generation and Data Management Program) software by D. Scott Bennett Jr. and Allan C. Stam III. EUGene has greatly improved the ease with which we can generate data sets amenable to the study of interstate conflict processes. Without hesitation, I claim that the program has saved many of us countless hours that used to be spent merging, formatting, and manually adjusting various components and subcomponents of data sets. Further, the documentation accompanying the program, along with the authors' article in International Interactions (Bennett and Stam 2000), provides a wealth of information and guidance about the program and its capabilities (the newest version of the software and documentation is available at www.eugenesoftware.org).

However, as with almost any tool, and regardless of how much documentation developers provide, users of EUGene should be aware of certain nuances associated with some of the data sets that it produces. One such nuance that has received little attention relates to the above-mentioned dyad-year format. In a common use of this format, the user seeks information not only about whether a pair of states engaged in a Militarized Interstate Dispute (MID) or war, but also about other characteristics of the dyad in that year. With this information, we can compare the characteristics of dyads that tend to engage in conflictual behavior to those of dyads that do not. Studies focusing on escalation often examine the probability that a MID will escalate to still higher levels of hostility (including war). Other studies may instead trace the duration of peace periods between bouts of conflict among states. Such studies require information not only about the beginning and ending dates of conflicts, but also about how the values of particular explanatory variables change (i.e., time-varying covariates) during the period under study.

When scholars conducting such studies begin construction of their data sets with dyad-year information produced by EUGene, however, they run the risk of omitting potentially important data. If a dyad-year is characterized by the presence of a MID, EUGene provides the user with information pertinent to the dispute. The potential problem emerges when a pair of states engages on opposite sides of more than one MID in a particular year. When a dyad-year contains multiple disputes, EUGene provides the relevant information for the first dispute to occur (see the EUGene documentation for more details), but information about subsequent disputes in the same year is omitted from the generated output. This raises potential problems for users who do not realize that those cases are absent and then draw conclusions from analyses of these data.
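The omission described above can be illustrated with a minimal Python sketch. It is not EUGene's actual procedure; it only mimics the "first dispute per dyad-year" rule as described here. The `Dispute` record, its field names, the country codes, and all of the values are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Dispute:
    state_a: int     # hypothetical country code
    state_b: int     # hypothetical country code
    start: date      # dispute start date
    hostility: int   # illustrative hostility score; higher = more intense


def split_first_and_later(disputes):
    """Return (kept, omitted): the first dispute in each dyad-year, and the rest."""
    kept = {}
    omitted = []
    for d in sorted(disputes, key=lambda d: d.start):
        # Treat the dyad as unordered so that (A, B) and (B, A) match.
        key = (min(d.state_a, d.state_b), max(d.state_a, d.state_b), d.start.year)
        if key in kept:
            omitted.append(d)   # a later dispute in an already-seen dyad-year
        else:
            kept[key] = d       # the first dispute in this dyad-year
    return list(kept.values()), omitted


# Invented records: dyad 100-200 has two disputes in 1950.
disputes = [
    Dispute(100, 200, date(1950, 2, 1), 2),
    Dispute(200, 100, date(1950, 8, 1), 4),
    Dispute(100, 200, date(1953, 5, 1), 3),
    Dispute(100, 300, date(1950, 6, 1), 3),
]

kept, omitted = split_first_and_later(disputes)
print(len(kept), "kept;", len(omitted), "omitted")
for d in omitted:
    print("omitted:", d)
```

In this invented example the omitted dispute is also the more intense of the two in its dyad-year, which is precisely the kind of case an escalation study would not want to lose.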

For example, analyses of the time elapsed between conflicts may miss peace periods that last only a few months, since there is no record of the second (or third, etc.) dispute in a year. This could lead to the observation of inaccurately long spans of peace between states. Further, examinations of escalation levels will miss the possibility that a subsequent dispute reaches a higher level of intensity (even war) than the first MID in a year. In fact, of the 2040 disputes contained in the MID 2.1 data set, fully 273 are not the first dispute in a dyad-year. Thus, analyses that do not adjust for this reality lose information pertaining to more than 13% of the MIDs that occur between 1816 and 1992. The US-Russia dyad is most susceptible to this problem, with fully twenty-four MIDs that are not the first to occur in a year. The most prominent single dyad-year in this regard is U.K.-Russia in 1918, with six MIDs.
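To make the duration problem concrete, the following Python sketch compares peace spells computed from a complete dispute list with those computed after only the first dispute in each year is retained. The dates are invented for a single hypothetical dyad, and the function names are my own. The last line simply reproduces the share implied by the counts reported above (273 of 2040).

```python
from datetime import date


def peace_spells(disputes):
    """Days of peace between the end of one dispute and the start of the next."""
    ordered = sorted(disputes)
    return [(nxt[0] - prev[1]).days for prev, nxt in zip(ordered, ordered[1:])]


def first_per_year(disputes):
    """Keep only the earliest-starting dispute in each calendar year."""
    seen_years = set()
    kept = []
    for start, end in sorted(disputes):
        if start.year not in seen_years:
            seen_years.add(start.year)
            kept.append((start, end))
    return kept


# Invented (start, end) dates for one hypothetical dyad.
all_disputes = [
    (date(1950, 2, 1), date(1950, 3, 1)),
    (date(1950, 8, 1), date(1950, 9, 1)),   # second dispute in 1950
    (date(1953, 5, 1), date(1953, 6, 1)),
]

full = peace_spells(all_disputes)
truncated = peace_spells(first_per_year(all_disputes))
print("peace spells, all disputes:       ", full)
print("peace spells, first-per-year only:", truncated)

# Share of MIDs that are not the first in their dyad-year, from the counts in the text.
share_omitted = 273 / 2040
print(f"share omitted: {share_omitted:.1%}")
```

With the second 1950 dispute dropped, two separate peace spells collapse into one much longer spell, which is the inflation described above.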

A possible solution to these problems would have the user merge EUGene dyad-year data into a EUGene dyadic dispute set. This could be done with various statistical or database software programs. The result would be a hybrid set, with most rows representing a dyad-year but 273 representing a "partial dyad-year." Putting aside the problem of finding a name for this type of observational unit, such a set would allow for improved consideration of escalation and duration processes. Also, in future iterations of the EUGene program, Bennett and Stam could include a "# of disputes" variable that indicates the number of MIDs beginning in a dyad-year. Finally, they might also consider allowing the user to output information pertaining to the most intense MID, instead of the first, for a dyad-year.
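One way such a hybrid set might be assembled is sketched below in Python. This is an illustration under stated assumptions, not a description of EUGene output: the row layout, the field names (`dyad`, `year`, `start_day`, `hostility`, `contiguous`), and the values are all hypothetical. The sketch also adds the two suggested variables, a dispute count and the highest hostility reached in the dyad-year.

```python
from collections import defaultdict


def build_hybrid(dyad_years, disputes):
    """One row per dyad-year without a dispute; one row per dispute otherwise."""
    by_key = defaultdict(list)
    for d in disputes:
        by_key[(d["dyad"], d["year"])].append(d)

    rows = []
    for dy in dyad_years:
        found = sorted(by_key.get((dy["dyad"], dy["year"]), []),
                       key=lambda d: d["start_day"])
        if not found:
            # Peaceful dyad-year: keep the covariates, leave dispute fields empty.
            rows.append({**dy, "n_disputes": 0, "dispute_id": None,
                         "hostility": None, "max_hostility": None})
            continue
        peak = max(d["hostility"] for d in found)
        for d in found:
            # "Partial dyad-year" rows: covariates repeated for each dispute.
            rows.append({**dy, "n_disputes": len(found), "dispute_id": d["id"],
                         "hostility": d["hostility"], "max_hostility": peak})
    return rows


# Invented inputs: one dyad, two years, two disputes in the first year.
dyad_years = [
    {"dyad": (100, 200), "year": 1950, "contiguous": 1},
    {"dyad": (100, 200), "year": 1951, "contiguous": 1},
]
disputes = [
    {"id": 1, "dyad": (100, 200), "year": 1950, "start_day": 32, "hostility": 2},
    {"id": 2, "dyad": (100, 200), "year": 1950, "start_day": 213, "hostility": 4},
]

hybrid = build_hybrid(dyad_years, disputes)
for row in hybrid:
    print(row)
```

Repeating the dyad-year covariates on every dispute row keeps the set rectangular, at the cost of rows that are no longer independent dyad-years; an analyst would need to account for that in any subsequent estimation.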

Other examples and descriptions could be mentioned here, but the basic point would remain the same: we must be careful to learn about our data before we plow forward with complex statistical analyses. This does not mean that we should spend years poring over the data in our sets; far from it, in fact, as all of the problems mentioned above could be unearthed with little difficulty. Software programs such as EUGene are superb instruments for increasing the efficiency with which we investigate compelling research questions. At the same time, however, even these useful tools must be used with at least a modicum of care and responsibility.

References

Bennett, D. Scott, and Allan Stam. 2000. "EUGene: A Conceptual Manual." International Interactions, 26:179-204.
Website: www.eugenesoftware.org.


Return to the June 2001 CP Newsletter.
