Washington University, St. Louis Jeff Gill
Voice: 314-935-9012
Fax: 314-935-5856

Political Science 582: Quantitative Analysis in Political Science II
  • Course Description: Official language.... "Seminar, 4 hours. Prerequisite: linear models. More advanced topics in the use of statistical methods, with emphasis on political applications. Topics include: properties of least squares estimates, problems in multiple regression, and advanced topics (probit analysis, simultaneous models, time-series analysis, etc.)"

    What this really means.... This course extends what you did in the linear models course by focusing more on nonlinear model forms. These are typically called "generalized linear models," although for historical reasons people in political science call them "maximum likelihood models." The principle we will care about is how to adapt the standard linear model that you know so that a broader class of outcome variables can be accommodated. These include: counts, dichotomous outcomes, bounded variables, and more. There is a strong theoretical basis for the models that we will use. Also, the bulk of the learning in the course will take place outside of the classroom by reading, practicing using statistical software, replicating the work of others, and doing problem sets. Keep in mind that the skills attained in this course are those that the discipline of political science expects of any self-declared data-oriented researcher.

    The second aspect of the course is focused on the statistical package R, which is completely free to download for Mac, Unix, Linux, and that other platform at CRAN, the Comprehensive R Archive Network. R is an implementation of the S language, which is the default computational tool for research statisticians. Quite simply, R is the most powerful, extensively featured, and capable statistical computing tool that has ever existed on this planet. And as mentioned, it's free. We will not use Stata; don't ask.

  • Prerequisite Details: The only official prerequisite for this course is a course on linear models. For political science graduate students, Political Science 581 is adequate. However, each student should be familiar with: basic probability theory, statistical inference, hypothesis testing, and least squares estimation. The course will also assume a working knowledge of calculus and linear algebra at the level of Essential Mathematics for Political and Social Research (Jeff Gill, 2006, Cambridge University Press). Since students come to the course with varying levels of experience with statistical packages like R, some may spend quite a bit of time learning basic programming skills. If you suspect that you are in this group, it will pay to spend some time with a basic text such as An R and S-Plus Companion to Applied Regression (John Fox, 2002, Sage).

  • Course Grade: The final grade will be based on three components: weekly problem sets (40%), a midterm exam (20%), and a replication assignment (40%). The problem sets will be a combination of analytical and computational assignments and will be given in class each week. For the replication assignment, find a published work in your field of interest, obtain the data, and exactly replicate the author's model results. It is usually easier to find an article that uses the readily available datasets in the discipline (COW, ANES, GSS, etc.), but some authors are forthcoming about distributing their data if asked. The relevant model should be one of the nonlinear forms studied in this course. Finally, you should complete the reading before the class listed.

  • Office Hours: by appointment. (email Sue Tuhro to schedule)

  • Incompletes: Absolutely none given.

  • Teaching Assistant: Chris Claasen. Office Hours: Monday 1-2pm, Seigle 275.

  • Homework: assigned each week and due the following week. No late homework accepted. All homework must be LaTeX'd.

  • Required Texts:
    1. Title: Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models.
      Author: Faraway.
      Publisher: Chapman & Hall/CRC.
      Edition: First.
      ISBN: 158488424.
    2. Title: A Guide to Econometrics.
      Author: Kennedy.
      Publisher: MIT Press, 2003.
      Edition: Fifth or Sixth.
      ISBN: 0-262-61183-X.


  • Optional Texts (these are for background; see me before making any purchases):
    1. Title: Generalized Linear Models: A Unified Approach.
      Author: Gill.
      Publisher: Sage, 2001.
      Edition: First.
      ISBN: 0761920552.
    2. Title: Modern Applied Statistics with S.
      Author: Venables and Ripley
      Publisher: Springer-Verlag, 2003.
      Edition: Fourth.
      ISBN: 0387954570.
    3. Title: An Introduction to R: Notes on R: A Programming Environment for Data Analysis and Graphics.
      Author: R Development Core Team
      Available (free) online here
      Version 1.1, June 15, 2000
    4. Title: Linear Models with R.
      Author: Faraway.
      Chapman & Hall/CRC
      Edition: First.
      ISBN: 1-58488-425-8.

  • List of Topics:
    1. August 28. A Review of R Basics and the Linear Model.
      Reading: Homework:

    2. September 11. Models for Dichotomous Outcomes, Part 1.
      Reading:
      • Faraway, Chapter 2.
      Homework:
      1. Faraway, Chapter 2, Exercises 1-7. (In Exercise 2.2, do not use the step function in part (b); use your own intuition.)
      2. Find a dataset with a dichotomous outcome that you are interested in. Run an appropriate glm model in R and submit the output with a paragraph defending the model fit. Due 10/6/07.
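As a rough illustration of the workflow in item 2 above, here is a minimal sketch that fits a logit model to simulated data. The data frame `sim` and the variables `x` and `y` are hypothetical placeholders, not course data, and the deviance comparison is only one of several ways to argue about fit.

```r
# Simulated data; every name here is a hypothetical placeholder.
set.seed(582)
n <- 500
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x))

# Logit model for the dichotomous outcome.
fit.logit <- glm(y ~ x, family = binomial(link = "logit"), data = sim)
print(summary(fit.logit))

# Drop in deviance relative to the intercept-only model, referred to a
# chi-square distribution with df equal to the number of added parameters.
dev.drop <- fit.logit$null.deviance - fit.logit$deviance
p.value <- pchisq(dev.drop,
                  df = fit.logit$df.null - fit.logit$df.residual,
                  lower.tail = FALSE)
```

With real data, the paragraph defending the fit would typically also discuss coefficient signs, standard errors, and residual diagnostics rather than a single test.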

    3. September 18. Conceptualizing Uncertainty and Inference. The Probability Model of Uncertainty.
      Reading:
      • Kennedy, Chapters 1, 2, and 3 (skip the technical notes).
      • TPM (The Political Methodologist) Volume 11, No. 2, articles: (1) Jackman, (2) Anderson, et al., (3) Gill (pages 20-26). Available at: The Society for Political Methodology webpage.
      Homework:

    4. September 25. The Likelihood Model of Inference.
      Reading: Homework:

    5. October 2. How to Handle Missing Data in Models. The EM Algorithm.
      Reading: Homework:
      1. Read in the dataset detroit.dat, and for the first eight variables (columns), plot a histogram on the same page (i.e. using the par(mfrow=c(2,4)) command or some variant), and overlay a density estimate. Here is an example using one of the canned datasets:

        data(faithful)
        par(mfrow=c(2,1),mar=c(4,4,2,2),oma=c(2,2,2,2))
        # freq=FALSE puts the histogram on the density scale so the overlay lines up
        hist(faithful$eruptions,breaks=15,main="",
            col="light blue",xlim=c(0,6),
            freq=FALSE,xlab="Duration of Eruptions",
            ylab="Density")
        lines(density(faithful$eruptions),lwd=2)

      2. This problem deals with missing data in a linear model. First download and graph the data using:

        star98.missing <- read.table("http://artsci.wustl.edu/~jgill/data/star98.missing.dat",header=TRUE)
        par(mfrow=c(1,2),mar=c(3,3,3,3))
        plot(star98.missing$SUBSIDIZED.LUNCH,star98.missing$READING.ABOVE.50,pch="+",col="blue")
        abline(lm(star98.missing$READING.ABOVE.50~star98.missing$SUBSIDIZED.LUNCH),lwd=3)
        mtext(side=1,cex=1.3,line=2.5,"District Percent Receiving Subsidized Lunch")
        mtext(side=2,cex=1.3,line=2.5,"District Percent Above National Reading Median")
        plot(star98.missing$PTRATIO,star98.missing$READING.ABOVE.50,pch="+",col="blue")
        abline(lm(star98.missing$READING.ABOVE.50~star98.missing$PTRATIO),lwd=3)
        mtext(side=1,cex=1.3,line=2.5,"District Pupil/Teacher Ratio")
        mtext(side=2,cex=1.3,line=2.5,"District Percent Above National Reading Median")
        mtext(side=3,cex=1.5,outer=TRUE,line=-1,"California 9th Grade by District, 1998")

        Determine how much missing data there is and whether there is a discernible pattern. Now use mice to run a new model. Also run a model omitting cases with missing data. What differences do you observe? Which is better?

      3. Explain what the following R function does and why you would not want to do this.
        mi <- function(data.mat) {
          for (i in 1:ncol(data.mat)) {
           if (sum(is.na(data.mat[,i])) > 0) {
            print(paste("column",i,"has missing data"))
            mean.col <- mean(data.mat[,i],na.rm=TRUE)
            for (j in 1:nrow(data.mat)) {
             if (is.na(data.mat[j,i]) ==TRUE) data.mat[j,i] <- mean.col
            }
           }
          }
          return(data.mat)
        }
      4. Give citation information on three possible papers for your replication project. Include one sentence on the means for getting the data.
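For problem 2 above, the general workflow might look like the following sketch. It uses simulated data rather than star98.missing, all variable names are hypothetical, and the multiple-imputation step runs only if the mice package happens to be installed.

```r
# Simulated data with values removed at random; all names are hypothetical.
set.seed(582)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n)
toy$x1[sample(n, 40)] <- NA
toy$x2[sample(n, 25)] <- NA

# How much is missing, by column and by joint pattern?
na.by.col <- colSums(is.na(toy))
n.complete <- sum(complete.cases(toy))
pattern <- table(x1.missing = is.na(toy$x1), x2.missing = is.na(toy$x2))
print(na.by.col)
print(pattern)

# Listwise deletion: lm() drops incomplete rows under the default na.action.
fit.cc <- lm(y ~ x1 + x2, data = toy)

# Multiple imputation, pooled across m = 5 completed datasets.
if (requireNamespace("mice", quietly = TRUE)) {
  imp <- mice::mice(toy, m = 5, printFlag = FALSE, seed = 582)
  fit.mi <- with(imp, lm(y ~ x1 + x2))
  print(summary(mice::pool(fit.mi)))
}
```

Comparing the pooled estimates and standard errors with those from `fit.cc` is one way to start answering the "what differences do you observe" question.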

    6. October 9. Models for Dichotomous Outcomes, Part 2. Models for Count Outcomes.
      Reading: Homework:
      1. Load the Supreme Court decision data. The outcome variable, "fluidvot", indicates vote-switching for Supreme Court justices. In fully argued and decided Supreme Court cases, there are two occasions when the justices vote. The first, called the "original vote on the merits," occurs in secret conference one to five days after oral arguments. The second, called the "final vote on the merits," occurs much later as the majority and minority opinions are being written. Therefore Supreme Court justices have the opportunity to change their votes at any time between the conference and the announcement of the final decision. This may happen due to personal reflection, persuasion by other justices, or dissatisfaction with the drafting of an initially favored opinion. The unit of analysis is therefore votes by justices. The potential explanatory variables are: issue distance ("dist2"), coalitions exist ("coalspec"), Chief Justice in the minority ("mwcf"), a complex case ("complexity"), a landmark case ("landmark"), freshman justice ("freshyr1"), the justice is expert in this topic ("expert2"), Chief Justice ("cj"), strategic behavior by the Chief Justice ("cjstrat"), the size of the dissenting group ("dissent"), a unanimous case ("unanimou"), justice in the minority ("minority"). See Maltzman and Wahlbeck (APSR 1996) for additional details. Estimate a dichotomous regression model for these data using each of the appropriate link functions, and report the results. Do you find substantially different results with the three link functions? Explain. Which one would you use to report results? Why? Use the Akaike Information Criterion to select the model that minimizes the negative log-likelihood penalized by the number of parameters. Using the selected model, perform diagnostics on the regression model, reporting any possible violations of underlying assumptions.

      2. Download the data on poverty in Texas using the command texas.fips <- read.table("http://artsci.wustl.edu/~jgill/data/texas.factors.dat",header=TRUE).
        These data are 1989 county-level economic and demographic data for all Texas counties ("ERS Typology"). The dichotomous variable (first column) indicates whether 20% or more of the county's residents live in poverty. Possible explanatory variables include: GVT, a dichotomous factor indicating whether government activities contributed a weighted annual average of 25 percent or more of labor and proprietor income over the three previous years; SVC, a dichotomous factor indicating whether service activities contributed a weighted annual average of 50 percent or more of labor and proprietor income over the three previous years; FED, a dichotomous factor indicating whether federally owned lands make up 30 percent or more of a county's land area; XFR, a dichotomous factor indicating whether income from transfer payments (federal, state, and local) contributed a weighted annual average of 25 percent or more of total personal income over the past three years; POP, the log of the county population total for 1989; BLK, the proportion of Black residents in the county; and LAT, the proportion of Latino residents in the county.

        Produce a model with good fit, demonstrate your results with a standard table as well as graphs and diagnostics. Calculate response, Pearson, working, and deviance residuals for your model and compare. Compare your model with the null model using the AIC and a likelihood ratio test.
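The mechanics behind both problems above (comparing link functions by AIC, extracting the four residual types, and testing against the null model) can be sketched as follows. The data are simulated and the names `toy`, `x1`, `x2`, and `y` are hypothetical, so nothing here reflects the Supreme Court or Texas results.

```r
# Simulated dichotomous outcome; all names are hypothetical.
set.seed(582)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.4))
toy$y <- rbinom(n, 1, plogis(-0.3 + 0.9 * toy$x1 + 0.7 * toy$x2))

# Same linear predictor under the three standard binomial links.
links <- c("logit", "probit", "cloglog")
fits <- lapply(links, function(l)
  glm(y ~ x1 + x2, family = binomial(link = l), data = toy))
names(fits) <- links

# AIC = -2 * log-likelihood + 2 * (number of parameters); smaller is better.
aic <- sapply(fits, AIC)
print(aic)
best <- fits[[which.min(aic)]]

# The four residual types side by side for the selected model.
res <- sapply(c("response", "pearson", "working", "deviance"),
              function(type) residuals(best, type = type))
print(head(res))

# Likelihood ratio test and AIC comparison against the intercept-only model.
null.fit <- glm(y ~ 1, family = best$family, data = toy)
lrt <- anova(null.fit, best, test = "Chisq")
print(lrt)
print(AIC(null.fit, best))
```

Because the three links use the same number of parameters, their AIC ranking is the same as their log-likelihood ranking; coefficients are on different scales across links, so compare fitted probabilities rather than raw estimates.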

    7. October 16. Fall Break: No Class.
      Reading: Homework:
      1. Faraway, Chapter 3, Exercises 1-7. Do not use the step function in part (c) of Exercise 3.5, use your own intuition.
      2. Find a dataset with a count outcome that you are interested in. Run an appropriate glm model in R and submit the output with a paragraph defending the model fit.
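For item 2 above, a minimal sketch of a Poisson regression on simulated counts is given below. The names `toy`, `x`, and `count` are hypothetical, and the dispersion ratio is only a rough screening device, not a formal test.

```r
# Simulated count outcome; all names are hypothetical.
set.seed(582)
n <- 300
toy <- data.frame(x = runif(n, 0, 2))
toy$count <- rpois(n, lambda = exp(0.2 + 0.8 * toy$x))

# Poisson GLM with the canonical log link.
fit.pois <- glm(count ~ x, family = poisson(link = "log"), data = toy)
print(summary(fit.pois))

# Rough overdispersion check: Pearson chi-square divided by residual df
# should be close to 1 if the Poisson variance assumption is reasonable.
dispersion <- sum(residuals(fit.pois, type = "pearson")^2) / fit.pois$df.residual

# Multiplicative change in the expected count for a one-unit increase in x.
rate.ratio <- exp(coef(fit.pois)["x"])
```

A ratio well above 1 on real data would suggest considering a quasi-Poisson or negative binomial specification when defending the fit.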

    8. October 23. Midterm Exam.

    9. October 30. Models for Contingency Tables.
      Reading: Homework:
      1. Faraway, Chapter 4, Exercises 1-7.

    10. November 6. An Introduction to Time Series Analysis.
      Reading:
      • Faraway, Chapter 9.
      • Neal Beck's time series notes.
      Homework:
      1. Work on replication assignment.

    11. November 13. Models For Ordered and Unordered Categorical Data.
      Reading:
      • Kennedy, Chapter 5, Sections 15.3 and 15.4.
      Homework:
      1. Faraway, Chapter 5, Exercises 1-6.
      2. Consider a proportional odds model using the logit link function with only one explanatory variable in addition to the constant. Express the odds ratio (i.e., not logged) for a one-unit change in the explanatory variable. What does this simplify to?

    12. November 20. Models For Ordered and Unordered Categorical Data (continued).
      Reading:
      • Faraway, Chapter 5.
      Homework:
      These data summarize the use of contraception by age for 3165 currently married women in El Salvador (from the final report of the Demographic and Health Survey conducted in El Salvador in 1985, FESAL-1985). The married women are classified by age, grouped in five-year intervals, and by current use of contraception, classified as sterilization, other methods, and no method. The setup in R is as follows:

        # read.table() needs the URL scheme; without it the string is treated as a local file path
        contraception.mat <- as.matrix(read.table("http://jgill.wustl.edu/data/contraception.dat",header=TRUE))
        # one row per (age group, response) cell, with the cell count in Freq
        contraception.df <- data.frame(expand.grid(Response=1:3, "Age"=contraception.mat[,1]),
            "Freq"=as.numeric(t(contraception.mat[,2:4])))
        contraception.df$Response <- factor(contraception.df$Response)
        levels(contraception.df$Response) <- c("Sterilization","Other","None")
        contraception.df$Age <- factor(contraception.df$Age)
        levels(contraception.df$Age) <- c("15-19","20-24","25-29","30-34","35-39","40-44","45-49")
        contrasts(contraception.df$Age) <- contr.sum(7)

      Assume that the outcome categories are ordered as: $Sterilization=1$, $Other=2$, $None=3$. Assignment:
      1. In no more than 7-8 sentences, summarize the findings from the ordered logit and ordered probit models, including the interpretation and reliability of the coefficients.
      2. Give the log-likelihoods for each model.
      3. Compare the AIC for these two models and indicate which one is better.
      4. For the proportional odds logistic model, calculate $P(Y \le 2)$ for age groups: 15-19 and 45-49.
      5. For the ordered probit model, calculate $P(Y=2|X)$ for age group 30-34.
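The assignment above can be approached with `polr()` from the MASS package, which fits both the proportional odds logistic and the ordered probit model to frequency-weighted data. The sketch below uses a small made-up table (the names `toy`, `Group`, and the counts are hypothetical) rather than the contraception data, so it shows the mechanics only.

```r
library(MASS)  # provides polr(); MASS is a recommended package shipped with R

# Hypothetical frequency table: three ordered outcomes by two groups.
toy <- data.frame(
  Response = factor(rep(c("Low", "Mid", "High"), times = 2),
                    levels = c("Low", "Mid", "High"), ordered = TRUE),
  Group = factor(rep(c("A", "B"), each = 3)),
  Freq = c(40, 35, 25, 20, 30, 50)
)

# Ordered logit (proportional odds) and ordered probit, weighted by cell counts.
fit.logit  <- polr(Response ~ Group, weights = Freq, data = toy,
                   method = "logistic", Hess = TRUE)
fit.probit <- polr(Response ~ Group, weights = Freq, data = toy,
                   method = "probit", Hess = TRUE)

# Log-likelihoods and AIC for the two models.
print(logLik(fit.logit))
print(logLik(fit.probit))
print(AIC(fit.logit, fit.probit))

# Fitted category probabilities by group, then P(Y <= 2) as the sum of
# the first two category probabilities.
newdat <- data.frame(Group = factor(c("A", "B")))
probs <- predict(fit.logit, newdata = newdat, type = "probs")
cum.le.2 <- rowSums(probs[, 1:2])
print(cum.le.2)
```

Note that `polr()` parameterizes the model as a cutpoint minus the linear predictor on the link scale, so the sign of a coefficient should be read with that convention in mind.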


    13. December 4. The Exponential Family Form and GLM Theory, Residuals and Deviances, Quasilikelihood.
      Reading: Homework:
      1. Faraway, Chapter 6, Exercises 1-5.

    14. December 11. Turn in Replications.

    15. December 18. Discussion: Replications, Political Methodology.