Oncology Live®
Maurie Markman, MD
The term statistically significant is almost certainly beautiful music to the ears of clinical investigators and pharma/biotech companies. This trial-related event likely means the opportunity to publish in a high-impact medical journal and, in the case of an industry-sponsored effort, may lead to regulatory approval or an increase in sales. And why challenge this scenario?
After all, although the world of statistics is a rather foreign place to most clinicians, we all know what the word significant implies. Webster’s New Collegiate Dictionary defines the term as “having meaning.” And who would argue that having meaning is not a good thing?
However, concern develops when one inquires how the most common test of significance, the P value, is used in clinical investigative efforts and whether at times this is more harmful than helpful within the domain of cancer medicine. The P value is a reasonable mathematical test that originated from a simple desire to show the difference between an observed experimental outcome and what might have been expected by chance alone. One might rationally suggest that today this simple test has been hijacked by those who desire a single absolute answer to the question of significance and clinical benefit.
If P <.05, there is a statistically significant difference in outcomes between 2 or more arms of a randomized trial, and if P >.05, this difference is insignificant. How much easier than this can it be?
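To see how fragile that bright line is, consider a minimal sketch (with hypothetical numbers, not drawn from any trial cited here) of a two-proportion z-test. With 100 patients per arm, moving a single patient from nonresponder to responder shifts the two-sided P value from just above .05 to just below it:

```python
from math import sqrt, erfc

def two_proportion_p(a, n1, b, n2):
    """Two-sided P value for a pooled two-proportion z-test.

    a, b: responders in each arm; n1, n2: patients per arm.
    """
    p1, p2 = a / n1, b / n2
    pooled = (a + b) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p2 - p1) / se
    return erfc(z / sqrt(2))  # two-sided normal tail probability

# 30 vs 43 responders out of 100: "not significant"
print(round(two_proportion_p(30, 100, 43, 100), 4))  # ~0.056
# One additional responder (30 vs 44): "significant"
print(round(two_proportion_p(30, 100, 44, 100), 4))  # ~0.040
```

The underlying evidence in the two scenarios is nearly identical; only the label attached to it changes.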
To highlight just 1 aspect of this disquieting, overly simplistic reasoning regarding statistical testing, consider a report in which 522 consulting biostatisticians were asked (390 responded) whether they had received what they considered to be “inappropriate requests to modify/falsify/underreport a statistical analysis to favorably enhance the study outcome.”1 Remarkably, 20% of the respondents reported such concerning requests.
How influential—one might even suggest magical—is this P <.05 number in clinical cancer research? Consider, for a moment, the efforts undertaken in a recent report of a phase III randomized trial that compared the cytotoxic drug combination of carboplatin and paclitaxel with these same 2 agents plus bevacizumab (Avastin) as second-line treatment of ovarian cancer.2 The P value for overall survival (OS) in the group receiving bevacizumab was not statistically significant (P = .056), but by identifying “incorrect treatment-free interval stratification data for 45 patients,” the investigators found it was possible to adjust the P value downward to P = .0447, thus creating a “statistically significant” survival outcome. Remarkably, this paper concluded that this so-called sensitivity analysis “indicates that this [experimental drug combination] might be an important addition to the therapeutic armamentarium in these patients.”
The fundamental question to be asked is: Would the beneficial value of this therapeutic intervention be objectively less certain if P = .056? Why not simply report the survival data (both progression-free and overall) and median and landmark outcomes (1-year, 2-year, etc) and let clinicians and patients decide for themselves whether the results are clinically relevant?
Why would one assume that simply because it is possible to manipulate data to alter a P >.05 result to P <.05, truth is somehow now revealed? A final critical question is this: Does a P value of <.05 or >.05 meaningfully define the potential utility of a strategy for an individual patient or group of patients?
In this discussion, one should acknowledge that members of the academic statistician community have raised their own objections to the so-called P <.05 gold standard, with some suggesting the definition of statistical significance should be made even more stringent (P <.005).3 However, it is relevant to inquire whether making it even more difficult for a novel cancer therapeutic strategy to achieve the lofty goal of statistical significance helps patients with cancer or simply makes the math more impressive.
The basic issue here is the critical distinction between statistical significance and clinical relevance or value. Simply because a given outcome has achieved the statistical gold standard does not guarantee it has clinical value. To appreciate the lack of direct linkage, one need search no further than the remarkable conclusion from a phase III randomized trial that compared erlotinib (Tarceva) plus gemcitabine versus gemcitabine alone in the management of metastatic pancreatic cancer.4 In this study, there was a statistically significant (P = .038) improvement in OS associated with adding erlotinib, but this translated to a median survival difference of only 10 days (6.24 vs 5.91 months) between the study populations.
Conversely, we also have multiple relevant examples of situations where data suggest the potential for impressive clinical utility, but because of limited patient numbers or a subset analysis, the gold standard of P <.05 fails to be achieved. For example, in a trial examining doxorubicin versus intensified doxorubicin plus ifosfamide in first-line management of advanced/metastatic soft-tissue sarcoma, the combination arm achieved a statistically significant improvement in progression-free survival (P = .003) and a near doubling of the objective response rate (26% vs 14%; P <.0006).5 Yet, apparently because OS was not statistically significantly improved (P = .076), the investigators concluded that “our results do not support the use of intensified doxorubicin and ifosfamide for palliation of advanced soft-tissue sarcoma unless the specific goal is tumor shrinkage.” One must assume here that if the magical P <.05 had been attained, the authors’ conclusion would have been very different.
As a final example, in a phase III randomized trial comparing palbociclib (Ibrance) plus fulvestrant (Faslodex) with fulvestrant plus placebo as second-line endocrine therapy for hormone receptor–positive, HER2-negative breast cancer, the combination regimen improved OS by a median of 6.9 months (34.9 vs 28.0 months),6 but presumably because the P value was .09, the investigators concluded, “The differences in OS in the entire trial group were not significant.” Really?