# A new approach to grant review assessments: score, then rank | Research Integrity and Peer Review

### Toy Examples

We provide 3 toy examples below which demonstrate the concept of the integrated score, as well as the key advantages of the Mallows-Binomial model in relation to score-only or ranking-only models.

#### Toy Example 1: Tie-Breaking Equally Rated Proposals using Rankings

The first toy example demonstrates how adding rankings may help break ties between equally or similarly rated proposals in a principled manner. Suppose there are 3 proposals and 16 judges, who rate each proposal using a 5-point scale (the integers between 0 and 4) and subsequently rank all proposals. Their ratings and rankings can be found in Table 1.

We see that proposals 1 and 2 have the same mean rating of 0.5, yet all judges prefer proposal 1 to proposal 2. Next, we display what a ratings-only model, a rankings-only model, and the Mallows-Binomial model would output:

1. Ratings-Only Model: $$\{1 = 2\}\prec 3$$ on the basis of the mean ratings. There is no way of distinguishing proposals 1 and 2.

2. Rankings-Only Model: $$1\prec 2\prec 3$$ since all judges provided this same ranking. There is no method of discerning that proposals 1 and 2 are essentially tied.

3. Mallows-Binomial Model: Integrated scores $$p=[0.125, 0.125 + 10^{-8}, 0.750]$$ and induced preference ordering $$1\prec 2\prec 3$$ (Footnote 1). This result allows us to see both a reasonable preference order and that proposals 1 and 2 are essentially tied.

Key Takeaway: Integrated scores estimated by the Mallows-Binomial model break a tie between proposals 1 and 2 by incorporating rankings. Although the preference order provides a local comparison between objects to demarcate their quality ($$1\prec 2\prec 3$$), the integrated scores simultaneously suggest the global comparison that proposals 1 and 2 are essentially tied.
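The tie described in this example can be reproduced with a few lines of Python. The sketch below uses invented data (not the contents of Table 1) chosen only to match the summary given in the text: proposals 1 and 2 share a mean rating of 0.5, and every judge ranks proposal 1 first. The helper `prefers` is a hypothetical name introduced for illustration.

```python
import numpy as np

# Hypothetical data (NOT Table 1): 16 judges, 3 proposals, ratings on 0-4 (0 = best).
ratings = np.array([[0, 0, 3]] * 8 + [[1, 1, 3]] * 8)
# Assumed: every judge submits the same complete ranking, proposal 1, then 2, then 3.
rankings = np.array([[1, 2, 3]] * 16)

# Ratings-only summary: one mean rating per proposal.
mean_ratings = ratings.mean(axis=0)

def prefers(ranking, a, b):
    """Return True if proposal a appears before proposal b in a complete ranking."""
    order = list(ranking)
    return order.index(a) < order.index(b)

# Rankings-only summary: how many judges place proposal 1 above proposal 2.
n_prefer_1_over_2 = sum(prefers(r, 1, 2) for r in rankings)

print("mean ratings:", mean_ratings)
print("ratings tie between 1 and 2:", bool(np.isclose(mean_ratings[0], mean_ratings[1])))
print("judges ranking 1 above 2:", n_prefer_1_over_2, "of", len(rankings))
```

The mean ratings alone cannot separate proposals 1 and 2, while the ranking counts alone say nothing about how close the two are; this is the gap the integrated scores are meant to fill.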

#### Toy Example 2: Improved Decision-Making Even with Partial Rankings

The second toy example demonstrates the practicality of the proposed method: even partial rankings may help distinguish proposals accurately and reliably while only minimally increasing the difficulty of the assessment. Given many research proposals, it can be cognitively challenging to provide a complete ranking. Furthermore, it is usually more important to make accurate distinctions among the best proposals than among the worst. Suppose there are 8 proposals and 16 judges, who rate each proposal using a 5-point scale (the integers between 0 and 4) and subsequently rank their top-3 proposals. Their ratings and rankings can be found in Table 2.

We see that all judges are internally consistent and exhibit a variety of preferences. For many judges, rankings help to break ties between equally-rated proposals. On the basis of all available data, it is clear that proposal 1 is the most-preferred but the preference order of proposals 2, 3, and 4 is unclear. The remaining proposals are clearly in the bottom half and are unlikely to be funded. We now consider what a ratings-only model, rankings-only model, and Mallows-Binomial model would output:

1. Ratings-Only Model: $$1\prec \{2=3=4\}\prec 5\prec 6\prec 7\prec 8$$ on the basis of the mean ratings. There is no way of distinguishing proposals 2, 3, and 4.

2. Rankings-Only Model: $$1\prec 2\prec 3\prec 4\prec \{5,6,7,8\}$$ on the basis of the available rankings. There is no way of distinguishing proposals 5, 6, 7, and 8.

3. Mallows-Binomial Model: Integrated scores $$p=[0.000 , 0.125, 0.125 + 10^{-8} , 0.125 + 2\times 10^{-8}, 0.438, 0.750, 0.875, 0.937]$$ and induced preference ordering $$1\prec 2\prec 3\prec 4\prec 5\prec 6\prec 7\prec 8$$ (Footnote 2). This result allows us to distinguish proposals 2, 3, and 4 while noting that they are essentially tied.

Additionally, we display confidence-based ranking summaries for the Mallows-Binomial model and the Ratings-Only Binomial model. In the table, entries correspond to the estimated probability that each proposal is truly ranked in a given rank place. Results are calculated via the bootstrap and are limited to the first four places and first four proposals (Table 3).

We draw attention to the bootstrap ranking summary for proposal 2, which seems appropriate in the joint model (approximate tie for 2nd or 3rd place) but odd in the ratings-only model (approximate tie between 2nd and 4th place, but little weight for 3rd place). This strange behavior likely stems from the ratings of judges 9-12.

Key Takeaway: Integrated scores estimated by the Mallows-Binomial model and their induced preference ordering draw nuanced distinctions among proposals using both ratings and partial rankings. Specifically, the integrated scores exhibit global comparisons, such as the approximate equivalence in quality between proposals 2, 3, and 4, while the induced preference ordering clarifies the local comparison that $$2\prec 3\prec 4$$. Using partial rankings makes the additional ranking task cognitively easier and still allows for separation of the top proposals, which is normally the most important task for the reviewers. Furthermore, the bootstrap ranking summaries for the joint model are much more sensible since they are “anchored” by the rankings, which distinguish similarly-rated proposals.
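To make the bootstrap ranking summaries more concrete, the following is a minimal sketch of how rank-place probabilities could be estimated for a ratings-only model. It is not the authors' implementation: it assumes that judges are resampled with replacement, that proposals are ordered by their mean rating in each resample (for a Binomial ratings model the maximum likelihood estimate of each score is the mean rating divided by the maximum rating, so this ordering is the same), and that exact ties are broken at random. The function name `bootstrap_rank_probs` and the data are hypothetical.

```python
import numpy as np

def bootstrap_rank_probs(ratings, n_boot=2000, seed=0):
    """Estimate P(proposal j lands in rank place k) by resampling judges.

    ratings: (n_judges, n_proposals) array of complete ratings, lower = better.
    Returns an (n_proposals, n_proposals) matrix: rows are proposals, columns are places.
    """
    rng = np.random.default_rng(seed)
    n_judges, n_props = ratings.shape
    counts = np.zeros((n_props, n_props))
    places = np.arange(n_props)
    for _ in range(n_boot):
        # Resample judges (rows) with replacement.
        resampled = ratings[rng.integers(0, n_judges, size=n_judges)]
        means = resampled.mean(axis=0)
        # Sort by mean rating; exact ties are broken at random (assumption of this sketch).
        order = np.lexsort((rng.random(n_props), means))
        counts[order, places] += 1
    return counts / n_boot

# Hypothetical ratings (not from any table in the paper): proposals 1 and 2 are always tied.
ratings = np.array([[0, 0, 3]] * 8 + [[1, 1, 3]] * 8)
probs = bootstrap_rank_probs(ratings)
print(np.round(probs, 2))
```

With these invented ratings the ratings-only bootstrap splits first place roughly evenly between proposals 1 and 2, because nothing in the ratings separates them. A joint model that also uses rankings has additional information with which to anchor the order.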

#### Toy Example 3: Analyzing Data with Conflicting Ratings and Rankings

The third toy example demonstrates the ability of the model to appropriately capture ratings and rankings even when reviewers provide conflicting information, that is, when the ranking implied by a reviewer's ratings differs from the ranking that reviewer actually provides. In real data collected by the AIBS, we frequently observe such patterns. This example also includes a small minority of judges who provide “outlier” ratings and rankings, which differ from those of the group and heavily influence the mean ratings.

Suppose we have 3 proposals and 16 judges, who rank all proposals and rate each using a 5-point scale (the integers between 0 and 4). Their ratings and rankings can be found in Table 4.

We see that judges 1-14 (the vast majority) give essentially equal ratings to proposals 1 and 2 and rate proposal 3 far below them. However, judges 8-14 are inconsistent in that they each give proposal 1 a rating of 1 and proposal 2 a rating of 0, yet rank $$1\prec 2$$. Judges 15-16, in contrast, think very poorly of proposal 1 and substantially increase (worsen) its mean rating. Next, we display what a ratings-only model, a rankings-only model, and the Mallows-Binomial model would output:

1. Ratings-Only Model: $$2\prec 1\prec 3$$ on the basis of the mean ratings. The small minority of judges who give proposal 1 a rating of 3 heavily skew the mean ratings and thus affect the outcome.

2. Rankings-Only Model: $$1\prec 2\prec 3$$ since 14 of the 16 judges provided this same ranking.

3. Mallows-Binomial Model: Integrated scores $$p=[0.156 , 0.156 +10^{-8} , 0.750]$$ and induced preference ordering $$1\prec 2\prec 3$$ (Footnote 3). The integrated scores suggest that proposals 1 and 2 are essentially tied in the global sense, yet through the induced ordering appropriately suggest locally that $$1\prec 2$$. The outlier judges do not alter the preference ordering.

Additionally, we display confidence-based ranking summaries for the Mallows-Binomial model and the Ratings-Only Binomial model. In the table, entries correspond to the estimated probability that each proposal is truly ranked in a given rank place. Results are calculated via the bootstrap and are limited to the first three places and first three proposals (Table 5).

Key Takeaway: Integrated scores estimated by the Mallows-Binomial model and their induced preference ordering appropriately reconcile internally inconsistent ratings and rankings by recognizing that ratings of 0 and 1 for proposals 1 and 2 are essentially equal, given that 14 of the 16 judges ranked proposal 1 above proposal 2. This holds true even in the presence of two “outlier” judges who distort the mean ratings by rating proposal 1 very poorly. Additionally, the joint model is more confident that $$1\prec 2\prec 3$$, whereas the ratings-only model is less confident and gives much more probability to $$2\prec 1\prec 3$$.

### Case Study: A panel grant review data analysis

#### AIBS Ranking Procedure

In an effort to explore the usefulness of both rating and ranking in real-world funding decisions, the American Institute of Biological Sciences (AIBS) implemented a new procedure in the review of proposals submitted to a biomedical research funding agency. In this annual competition, AIBS reviewed proposals submitted to a 2020 funding announcement describing 2-year awards with budgets of 100-150K. The historic success rates for funding hover around 10 percent. As in previous years, reviewers were recruited based on their level of expertise relative to the proposals, as well as on previous review experience and diversity balance. Reviewers were given access to proposal files and evaluation forms via an online system several weeks before the panel meeting and were required to enter preliminary comments and scores into the system in advance of a teleconference review meeting. Each application was evaluated by two reviewers in advance of the meeting, who were asked to provide a score for the overall scientific merit based on the following application criteria: Impact/Significance, Innovation, Approach, Feasibility, and Investigators/Facilities. The overall scientific merit was scored on a scale from 1 (best) to 5 (worst), with one decimal place allowed in the scores (Table 6).

At the meeting, the assigned reviewers presented their initial critiques to the panel, the panel then discussed the proposal (discussion included all panelists without a conflict), and all panelists entered their final scores in the system once discussion had ended. These procedures have been standard for as long as AIBS has reviewed proposals for this program.

In 2020, AIBS added a ranking procedure to the assessment process. To collect ranking data, at the end of all proposal discussion, reviewers were provided with a link in the scoring system to a list of all the final average panel scores associated with each proposal (reviewers were blinded to any proposals with which they had a conflict of interest). Thus, the list of proposals was different for each reviewer, depending on their conflicts in the review system. Reviewers were then given a link to a Google™ form, allowing them to look at all the proposals on the panel and select the “top six” that they would like to see funded. The question was constrained so that only one proposal could be chosen for each ranking position (e.g., first place) and only 6 choices were allowed. It should be noted that the scoring process was not altered in any way; the ranking process occurred after all proposals were scored and access to the online scoresheets was locked. Only six rankings per reviewer were collected because the focus was on ranking projects that each reviewer deemed worthy of funding if they were allowed to choose, and because it was deemed impractical to rank all of the proposals. The number of ranked proposals was determined by taking the historical success rate for this program (3 proposals for a panel of this size) and doubling it, so that we could examine rankings of both proposals likely to be funded and those slightly farther from the funding threshold.

To create the final proposal priority list, both scores and rankings needed to be considered. As mentioned previously, while scores are important indicators of the global scientific quality relative to the goals of the funding program, the rankings are more valid for indicating local proposal quality relative to the other proposals [35]. While rankings alone can be used to determine funding priorities, as they are zero sum and allow for clear discrimination between proposals, without ratings it is not known whether any of the proposals approach the standard of excellence. If only ratings are used (as is often the case), some scores can be close or identical, making it difficult to determine priority order. In order to combine these two information sources to create a funding priority list, a statistical model was chosen to apply to the data to facilitate interpretation.

### Data Analysis

Panel 1 from the 2020 AIBS program had 12 reviewers and 28 proposals. Of the 12 reviewers, 11 were “full” reviewers and 1 was a “telecon” (TCON) reviewer, meaning that they were asked to rate only 1 proposal and not to rank. Ratings were provided on a 1 to 5 scale in increments of one decimal place, which we have converted to the integers between 0 and 40 (a 41-point scale). Subsequently, reviewers were asked to provide a top-6 ranking.
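The conversion from the 1 to 5 scale to the integers between 0 and 40 can be sketched as below. The exact mapping is an assumption inferred from the stated endpoints and from the correspondence between [3, 30] and [1.3, 4] reported for this panel; the function names `to_integer_scale` and `to_original_scale` are hypothetical.

```python
def to_integer_scale(score, low=1.0, high=5.0, steps_per_unit=10):
    """Map a one-decimal score in [low, high] to an integer in 0..40 (assumed mapping)."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    # round() absorbs floating-point noise such as (1.3 - 1.0) * 10 = 3.0000000000000004.
    return round((score - low) * steps_per_unit)

def to_original_scale(value, low=1.0, steps_per_unit=10):
    """Inverse mapping from the integer scale back to a one-decimal score."""
    return round(low + value / steps_per_unit, 1)

for s in (1.0, 1.3, 4.0, 5.0):
    print(s, "->", to_integer_scale(s))
```

Under this mapping the best possible score (1.0) becomes 0 and the worst (5.0) becomes 40, so lower values remain better on both scales.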

The data have a few intricacies. First, some reviewers had conflicts of interest (COI) with one or more proposals. Specifically, one reviewer had a single COI while two reviewers each had two COI; 23 proposals had no COI while 5 each had one COI. Reviewers were not allowed to rate or rank proposals with which they had a COI. Beyond COI, some ratings and rankings were missing. There were 25 instances of missing ratings and one missing ranking among the “full” reviewers; the “TCON” reviewer provided only one rating and no ranking. In this analysis, we ensure COI missingness does not influence the likelihood of a proposal ranking and treat other missingness as missing completely at random.

Figure 1 displays exploratory plots of the ratings and rankings from this panel. We notice a variety of rating patterns among the proposals. Some proposals have consistent ratings, while others exhibit wide variance. A few proposals clearly have the best ratings, while others can immediately be seen as unlikely to receive funding. Overall, the reviewers did not use the full rating scale, instead limiting themselves to the range [3, 30], which corresponds to the range [1.3, 4] on the original scale. For rankings, we notice that only 11 of the 28 total proposals appeared in any of the 10 provided top-6 rankings. There is no clear consensus in the top few rank places. However, we see that proposals 17, 19, and 25 frequently appear in first, second, and third places; proposal 4 appears in 5th place for over half the reviewers who provided rankings.

We now display integrated scores estimated by applying a Mallows-Binomial (MB) model to the AIBS data. In order to draw attention to the utility of the model, we additionally provide results from a traditional method, which we call the “Mean Ratings” (MR) model. In this model, we simply take the mean rating for each proposal and standardize it to the unit interval. The order of proposals based on their mean ratings is thus the estimated ranking of the proposals. We display results in Table 7 and Fig. 2.
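A minimal sketch of the Mean Ratings model follows. It assumes that "standardize to the unit interval" means dividing each mean rating by the maximum possible rating (40 on the converted scale), and that missing ratings are simply skipped when averaging; both are assumptions of this sketch. The function name `mean_ratings_model` and the data are invented for illustration.

```python
import numpy as np

def mean_ratings_model(ratings, max_score):
    """Return (scores, order) for a ratings-only summary.

    ratings: (n_judges, n_proposals) array, lower = better, NaN = missing or COI.
    scores:  mean rating per proposal divided by max_score (assumed standardization).
    order:   1-based proposal labels sorted from best to worst; ties keep label order.
    """
    means = np.nanmean(ratings, axis=0)  # ignore missing ratings
    scores = means / max_score
    order = np.argsort(scores, kind="stable") + 1
    return scores, order

# Hypothetical ratings on the 0-40 scale: 3 judges, 4 proposals, two missing entries.
ratings = np.array([
    [10.0, 20.0, np.nan, 30.0],
    [12.0, 20.0, 8.0, 30.0],
    [8.0, 20.0, 12.0, np.nan],
])
scores, order = mean_ratings_model(ratings, max_score=40)
print("scores:", scores)
print("order: ", order)
```

In this invented example the first and third proposals receive identical scores, so their relative order is an artifact of the sort rather than of the data. This is the kind of tie that a ratings-only summary cannot resolve.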

We see in Table 7 that the estimates of integrated scores from the MB and MR models are similar. However, the preference ordering deviates in the MB model in a few cases. This distinction is made clear in Fig. 2, in which we can see directly that the MB model breaks ratings ties in 7th/8th place and 11th/12th places. Additionally, we see a reordering among proposals 3, 6, 13, and 16 between the Mean Ratings and Mallows-Binomial models: although proposal 6 receives a slightly better mean rating than proposals 3, 13, and 16, its rankings are sufficiently worse that it receives a worse integrated score in the joint Mallows-Binomial model. We note that proposals 27 and 28 each received the same mean rating and neither was ranked by any judge. As such, neither model is able to break their tie. The MB model is also unable to break a tie in mean ratings between proposals 19 and 25. These proposals received unique rankings among the judges, yet precisely half of the reviewers preferred 19 to 25 while the other half preferred 25 to 19. As a result, the data do not allow for demarcation between these proposals on the basis of ratings or rankings. We turn to uncertainty estimation in order to make funding decisions between these two proposals.
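The kind of even split described for proposals 19 and 25 can be checked by counting pairwise preferences in the top-k rankings. The sketch below uses invented top-3 lists (not the AIBS data) and assumes the common convention that a ranked proposal is preferred to an unranked one; the function name `pairwise_preference` is hypothetical.

```python
def pairwise_preference(top_k, a, b):
    """Return True if the partial ranking prefers a to b, False if b to a,
    and None if neither proposal was ranked (no information)."""
    a_ranked, b_ranked = a in top_k, b in top_k
    if a_ranked and b_ranked:
        return top_k.index(a) < top_k.index(b)
    if a_ranked:
        return True   # assumed: ranked proposals beat unranked ones
    if b_ranked:
        return False
    return None

# Hypothetical top-3 rankings from four reviewers (best first).
rankings = [
    [17, 19, 25],
    [25, 17, 19],
    [19, 25, 4],
    [25, 19, 17],
]

votes = [pairwise_preference(r, 19, 25) for r in rankings]
n_for_19 = sum(v is True for v in votes)
n_for_25 = sum(v is False for v in votes)
print("prefer 19 to 25:", n_for_19, "| prefer 25 to 19:", n_for_25)
```

When the two counts are equal, as in this invented example, the rankings carry no net information about the order of the pair, so any separation has to come from the ratings or from uncertainty estimates.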

Next, we estimate uncertainty in rank place among the proposals in Tables 8 and 9. Table 8 displays the probabilities of proposals entering first, second, third, or fourth place in the Mallows-Binomial and Mean Ratings models. In the Mallows-Binomial model, there is more certainty that proposal 19 is ranked above proposal 25, which may help us break the tie in integrated scores between the two proposals. In comparison to the Mean Ratings model, the Mallows-Binomial model provides more evidence that proposals 17, 19, and 25 are of similarly high quality. Based on the original data, these results seem plausible. For the purpose of making decisions, Table 9 displays the probabilities that proposals should receive funding conditional on the number of proposals the funding agency can support.
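One way to obtain funding probabilities of this kind from rank-place probabilities is to accumulate them across places: if the top K proposals are funded, a proposal is funded whenever it lands in any of the first K places. The sketch below assumes this funding rule and uses an invented probability matrix (not the values in Tables 8 or 9); the function name `funding_probs` is hypothetical.

```python
import numpy as np

def funding_probs(rank_probs):
    """Convert rank-place probabilities into funding probabilities.

    rank_probs[j, k]: estimated probability that proposal j is in place k + 1.
    Returns F with F[j, K - 1] = P(proposal j is among the top K), assuming
    the agency funds exactly the top K proposals.
    """
    return np.cumsum(rank_probs, axis=1)

# Hypothetical rank-place probabilities for 3 proposals (rows) and 3 places (columns).
rank_probs = np.array([
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
])
funded = funding_probs(rank_probs)
print(funded)
```

Each column of the result answers a separate question ("which proposals are funded if K awards are available?"), so the entries of column K sum to K rather than to 1.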