The Linux kernel community is in the middle of a situation where a university submitted some bad patches to test how well the kernel patch review process would catch them. I’ve not followed all the details, but it seems that the crux of the problem is that the university did this research without the consent or knowledge of the kernel developers. (See, for example, LWN.)

The methodology of the research, as well as the ethics, can and should be questioned. However, I think the goal of testing the quality of review processes is valid. Here are my current thoughts about how it could be done well.

First of all, this needs to be done with the consent and knowledge of the people reviewing changes. In online discussions, it has been claimed that reviewers knowing they’re being tested invalidates the results. That is not true. Reviewers know that bad changes may arrive at any time, and if knowing they may receive test patches makes them more alert and more careful, that’s good. It’s still useful to measure how many bad patches get through anyway.

Second, this kind of measurement should be an ongoing or repeated activity, not a one-off exercise for a single paper. If papers get produced and published, that’s fine, but the real goal is to develop ways in which fewer bad changes get accepted. Just as you need to measure before optimizing, you need to measure before improving tools and processes.

Thus, I propose the following:

  • the project announces well ahead of time that bad patches will be generated and submitted, and that all reviewers are expected to participate
  • a trusted party collects change identifiers for bad patches before they’re submitted for review
  • the bad patches themselves carry no intentional markers identifying them as part of the review research
  • the trusted party prevents bad patches that reviewers accept from being merged, or at least reverts them afterwards; a sketch of such a merge gate follows this list
  • the researchers gather results and publish a report
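
To make the second and fourth points a bit more concrete, here is a minimal sketch of a merge gate the trusted party might run. It assumes the trusted party keeps a plain-text registry with one commit hash per registered test patch; the file name, the mainline branch name, and the use of plain commit hashes are my assumptions for illustration, not anything any real project uses.

    #!/usr/bin/env python3
    # Minimal sketch of a merge gate for a review experiment.
    #
    # Assumption: the trusted party keeps a plain-text registry with one
    # commit hash per line for each registered test patch. The file name
    # and the mainline branch name are illustrative, not real tooling.

    import subprocess
    import sys

    REGISTRY = "test-patch-registry.txt"   # hypothetical registry file
    MAINLINE = "origin/master"             # hypothetical mainline branch


    def load_registry(path):
        """Read the registered test-patch hashes, ignoring blank lines."""
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}


    def incoming_commits(branch):
        """List the commits the branch would add on top of the mainline."""
        out = subprocess.run(
            ["git", "rev-list", f"{MAINLINE}..{branch}"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.split()


    def main():
        branch = sys.argv[1] if len(sys.argv) > 1 else "HEAD"
        registered = load_registry(REGISTRY)
        caught = [c for c in incoming_commits(branch) if c in registered]
        if caught:
            print("refusing merge: registered test patches present:")
            for c in caught:
                print("  " + c)
            return 1
        return 0


    if __name__ == "__main__":
        sys.exit(main())

In practice the registered identifiers would more likely be the message IDs collected before the patches are even sent, but the principle is the same: the gate only needs a list it can check against, and reviewers never see that list.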

Doing this regularly, or as an ongoing process, should tell the project where the weaknesses are. It should probably be done in all large projects. Small projects probably don’t have the resources for it.

To address those weaknesses, various things may need to be done. Perhaps training for reviewers? Maybe better tooling? I’m reluctant to speculate without data.