CDC YRBS: Bias in Censoring Questionnaires


This post is related to the Youth Suicide Rise project.

Evidence indicates that girls whose 2019 YRBS questionnaire did not include the rape question were younger, less likely to be sexually active, and less likely to have ever had sex. Sexual activity and sexual experience are precisely the factors associated with higher risks of sexual violence victimization. The removal of such a group of girls from the random sample can produce substantial error in results, especially for questions related to sex.

As I've explained in CDC Misinformation on Girls and Violence, there were huge jumps in missing answers for all three of the YRBS sexual violence questions: from 2% to 18% for rape, 4% to 25% for sexual violence by anyone, and 7% to 24% for sexual dating violence. A note buried deep in one of the YRBS methods docs indicated this was due to the removal of sexual violence questions from a large portion of YRBS questionnaires.

Data released by CDC does not, inexplicably, allow researchers to differentiate between when a student did not answer a question and when the student was not asked the question. This prevents researchers from deducing precise information about the groups of students excluded from answering certain YRBS questions.

We should be able, however, to calculate reasonable estimates. Since only about 2% of the student records had a missing answer for rape in 2017, the 18% of girls with a 'missing answer' for rape in 2019 should be overwhelmingly girls who were not asked the rape question at all.
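The arithmetic behind that estimate can be sketched in a few lines of Python. This is only a rough illustration, and it rests on an assumption not confirmed by CDC data: that the roughly 2% genuine non-response rate from 2017 also applies to the 2019 girls who were actually asked the question. The function name and parameters are my own, chosen for illustration.

```python
def censored_share(missing_2019, baseline_missing):
    """Estimate how much of the 2019 'missing' group was never asked.

    ASSUMPTION: girls who were asked skipped the question at the
    same baseline rate as in 2017, so that
        missing_2019 = c + (1 - c) * baseline_missing
    where c is the fraction of all girls who were not asked.
    """
    not_asked = (missing_2019 - baseline_missing) / (1.0 - baseline_missing)
    share_of_missing = not_asked / missing_2019
    return not_asked, share_of_missing


# Figures quoted in this post: 18% missing in 2019, about 2% in 2017.
not_asked, share_of_missing = censored_share(0.18, 0.02)
print(f"girls not asked the question: {not_asked:.1%}")
print(f"share of the missing group never asked: {share_of_missing:.1%}")
```

Under that assumption, roughly nine in ten of the girls with a missing rape answer were never asked the question, which is what "overwhelmingly" means here.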

Do the girls whose answers were missing differ substantially from the rest of the girls in the YRBS 2019 sample?

Yes, they do differ substantially, at least per the results of my (admittedly provisional) analysis.

They are considerably younger: 22% of them are below 15 vs. 14% of the rest, while only 6.5% of them are adults vs. 11% of the rest.

They are thus less likely to have had sex (32% vs. 36%) and to be sexually active (23% vs. 28%). Having had sex and being sexually active are in turn associated with higher risks of sexual violence victimization.

The differences for 'censored' versus 'uncensored' girls are even larger, because the above comparison is unable to separate 'censored' girls (whose questionnaire omitted the rape question) from girls with 'genuine' missing answers -- and of the latter group, 58% had had sex and 45% were sexually active in the 2017 YRBS data.
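One way to gauge how much larger is to treat the 2019 missing-answer group as a mixture of censored girls and genuine non-responders, then back out the censored girls' rates. The sketch below does that in Python. It leans on two assumptions of mine: that genuine non-response in 2019 ran at the 2017 rate of about 2% among girls who were asked, and that genuine non-responders in 2019 resembled those of 2017 (58% had had sex, 45% sexually active). The helper name `unmix` is hypothetical.

```python
def unmix(observed, genuine_rate, genuine_share):
    """Back out the rate among censored girls from a two-group mixture.

    observed = genuine_share * genuine_rate
               + (1 - genuine_share) * censored_rate
    """
    return (observed - genuine_share * genuine_rate) / (1.0 - genuine_share)


# ASSUMPTION: 18% missing overall in 2019, 2% genuine non-response
# among girls who were actually asked (the 2017 rate).
missing, baseline = 0.18, 0.02
not_asked = (missing - baseline) / (1.0 - baseline)
genuine_share = 1.0 - not_asked / missing  # genuine part of the missing group

# Observed 2019 rates in the missing group vs. 2017 genuine non-responders.
ever_had_sex = unmix(0.32, 0.58, genuine_share)
sexually_active = unmix(0.23, 0.45, genuine_share)

print(f"censored girls, ever had sex (est.): {ever_had_sex:.1%}")
print(f"censored girls, sexually active (est.): {sexually_active:.1%}")
```

Under these assumptions the censored girls come out at about 29% ever having had sex and about 21% sexually active, versus 36% and 28% for the rest. These are back-of-the-envelope figures, not results from the data.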

The effects of the questionnaire censorship are difficult to estimate, but I have little doubt it could easily be as much as 10% on results for questions strongly dependent on age and sexual behaviors.

This may not seem like much, but we need to keep in mind that CDC last month raised alarms about a supposedly massive wave of violence engulfing girls and yet never even confirmed that its conclusion, presented as unquestionable fact, was based on statistically significant increases.

When we add the above evidence of sampling bias to the usual uncertainties of random sampling, any certitude that there were truly severe increases in sexual violence evaporates.

And yet much of the public, thanks to the amplification of CDC assertions by news media, is now convinced that the very real mental health crisis affecting adolescents was mainly caused by a concomitant wave of adolescent violence -- a wave that, in reality, may be largely or entirely illusory.

CDC needs to admit there are grave doubts about the validity of the sexual violence results and release data that will allow researchers to analyze the potential consequences of censored questionnaires on the results.

CDC also needs to explain why its YRBS questionnaires are being censored: is it due to some administrative decision by CDC itself or is it due to demands by states, school districts, or schools?

If the censorship is entirely due to CDC, then CDC needs to justify it and explain why it failed to be transparent about this matter.

If the censorship is induced from the outside, then CDC needs to admit that this is weakening the validity of YRBS and take steps to find solutions before its invaluable national survey ceases to be trustworthy.

This is a serious problem that needs to be addressed honestly and transparently by CDC officials. 

Note: the censored girls are much more likely to be Black than the rest and much less likely to be Hispanic than the rest, further creating potential problems for the interpretation of results broken down by race or ethnicity.

Note: I did not apply weights to my analysis since we are directly comparing two subgroups of the sample, rather than being concerned about the sample being representative of the U.S. high school population. Furthermore, any more detailed analysis is rather futile given our current inability to separate censored girls from those with genuinely missing answers. Researchers will remain largely blinded on this matter as long as CDC declines to release the relevant data necessary for serious analysis.

Note: I'm parsing an ASCII data file with grep and awk, so I'm not 100% confident I did not make some stupid mistake. I'm unwilling to spend much more time on this, however, given that the analysis would take minutes had CDC released proper data identifying missing questions instead of conflating them with missing answers.
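For anyone who wants to double-check this kind of unweighted subgroup comparison, here is a minimal Python sketch of the same idea. It is not the grep/awk pipeline I actually ran. The column positions, the female code, and the blank-means-missing convention are all hypothetical placeholders that would have to be replaced with values from the CDC's file layout documentation, and the demo records are synthetic.

```python
from collections import Counter

# HYPOTHETICAL 0-based column slices; real positions must come from
# the layout documentation that accompanies the ASCII data file.
AGE_COL = slice(16, 17)
SEX_COL = slice(17, 18)
RAPE_COL = slice(35, 36)


def split_girls_by_rape_answer(lines, female_code="1"):
    """Tally age codes for girls with and without a rape answer.

    ASSUMPTION: a missing answer shows up as a blank field.
    """
    missing, answered = Counter(), Counter()
    for line in lines:
        if line[SEX_COL] != female_code:
            continue
        bucket = missing if line[RAPE_COL].strip() == "" else answered
        bucket[line[AGE_COL]] += 1
    return missing, answered


def shares(counter):
    """Convert raw counts to unweighted proportions."""
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()} if total else {}


def make_record(age, sex, rape):
    """Build a synthetic fixed-width record for the demo below."""
    chars = [" "] * 40
    chars[16], chars[17], chars[35] = age, sex, rape
    return "".join(chars)


demo = [
    make_record("3", "1", " "),  # girl, rape answer blank
    make_record("3", "1", "2"),  # girl, answered
    make_record("5", "1", "1"),  # girl, answered
    make_record("4", "2", " "),  # boy, skipped by the filter
]
missing, answered = split_girls_by_rape_answer(demo)
print("missing:", shares(missing))
print("answered:", shares(answered))
```

With the real file, `demo` would be replaced by the lines of the data file, and the same tallies could be repeated for the sexual behavior columns.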

