and power analysis questions

Sample size determination is the process of determining the appropriate number of subjects to include in a study. The appropriate sample size is defined as the minimum sample size required to achieve an acceptable chance of achieving a statistical criterion of interest (e.g. statistical significance, maximum interval width) for a proposed study. The methodology used to determine the appropriate sample size varies depending on the type of testing procedure used, the underlying assumptions etc. In addition, most sample size calculations will require pre-study estimates for the outcome of interest and additional nuisance parameters.

It is important to obtain an appropriate sample size estimate for many reasons. Most importantly, without sufficient sample size the chance of a study coming to an incorrect conclusion (positive or negative) and making large errors will be unethically high.

The main goal of inferential statistics is to generalize from a sample to a population and an accurate inference is more likely if the sample size is large since larger random samples more closely approximate the population. However, any study requires the expenditure of resources such as money, subjects and time and thus the sample size cannot be increased without consideration for these constraints. Thus, a balance is required between the statistical validity of a study and the practical constraints of study design.

It is also important to use an appropriate sample size from an ethical point of view. It would be unethical to expose human subjects or lab animals to unnecessary risk if the study has little realistic chance of producing conclusive results due to an insufficient sample size but we also need to weigh the cost of exposing subjects to any risks, such as severe side-effects, that may be present.

For the above reasons, a sample size justification is considered a standard requirement for trial design when seeking approval from regulatory agencies, such as the US Food and Drug Administration (FDA), for confirmatory clinical trials or when publishing study results in high-impact academic journals. Utilizing validated sample size software, such as nQuery, can be useful when seeking quicker approval for your sample size estimate.

To calculate sample size in the setting of a clinical trial, Statsols recommends five essential steps, which should be followed when conducting a sample size determination.

Hypothesis testing is the procedure of assessing whether sample data is consistent or otherwise with statements made about the population of interest. The most common hypothesis testing framework is known as null hypothesis significance testing (NHST), and the terms NHST and hypothesis testing are often used interchangeably. NHST assesses the probability of achieving a given result (or one more extreme) in a study assuming the null hypothesis (including any associated assumptions) is true. The null hypothesis will usually indicate the “failure” state of a study (e.g. no difference between treatment and placebo).

In clinical trials, a typical primary study objective is the evaluation of the superiority of a drug product compared to placebo or a standard treatment (e.g. RLD). In other cases it may be of interest to show that the study drug is as effective as, superior to, or equivalent to an active control agent or a standard therapy. The objective of the study will affect whether you are doing inequality, non-inferiority or equivalence hypothesis testing which are described later.

A null hypothesis is a hypothesis that states there is no difference or relationship between the variables or groups being studied. That is to say, it is usually the hypothesis that the researcher is trying to disprove.

For example, imagine a study attempting to answer the question “Are teens better at math than adults?”. If trying to determine whether age has an effect on mathematical ability, an example of a null and alternative hypothesis of interest would be the following:

Null Hypothesis: The mean difference between the mathematical ability of teens and adults is equal to zero.

Vs.

Alternative Hypothesis: The mean difference between the mathematical ability of teens and adults is not equal to zero.

We reject the null hypothesis, including all its assumptions, when it is inconsistent with the observed data. This inconsistency is typically assessed through statistical analysis and modelling. If a statistical analysis produces a p-value that is below the significance level (α) cut-off value we have set (generally either 0.05 or 0.01), we reject the null hypothesis. If the p-value is above the significance level cut-off value, we fail to reject the null hypothesis.
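
The decision rule above can be sketched in a few lines of Python. This is only an illustration under an assumed setting: the test statistic is taken to follow a standard normal distribution under the null hypothesis and the test is two-sided. The function names `two_sided_p_value` and `decision` are hypothetical and not part of any package.

```python
from statistics import NormalDist

def two_sided_p_value(z_stat: float) -> float:
    """Probability, under H0, of a z statistic at least this extreme in either direction."""
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))

def decision(p_value: float, alpha: float = 0.05) -> str:
    """Apply the cut-off rule: reject H0 only when the p-value falls below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# Two illustrative (made-up) test statistics
for z in (1.5, 2.5):
    p = two_sided_p_value(z)
    print(f"z = {z}: p = {p:.4f} -> {decision(p)}")
```

With a 0.05 cut-off, the first statistic fails to reject the null hypothesis and the second rejects it.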

Note that if the result is not statistically significant this does not prove that the null hypothesis is true. Data can suggest that the null hypothesis is false but just may not be strong enough to make a sufficiently convincing case that the null hypothesis is false. Also note that rejecting the null hypothesis is not the same as showing real-world significance.

The p-value is the probability of finding the observed or more extreme results when the null hypothesis of a study question is true. The definition of ‘extreme’ depends on how the hypothesis is being tested. The p-value is sometimes described in terms of rejecting the null hypothesis when it is actually true; however, it is not a direct probability of this rejection state.

The most commonly conducted statistical analysis is a significance test for inequality, in which the null hypothesis is that there is no difference. In this scenario, the researchers are interested in investigating whether there is a difference or inequality between a study group or intervention and another group(s) or a predefined standard value. Note that this hypothesis is often called a “superiority” hypothesis but should not be confused with the superiority by a margin (a.k.a. supersuperiority) hypothesis.

Non-inferiority testing uses a similar significance testing approach to an inequality test of a one-sided null hypothesis of no difference. The exception is that in non-inferiority testing the null hypothesis (of inferiority) specifies that the difference between interventions is less than a specified non-inferiority limit, rather than the no difference value used in inequality testing. The alternative hypothesis of non-inferiority is that the difference between interventions is greater than the non-inferiority limit.
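
As a rough sketch of how the shifted null hypothesis changes the test, the Python example below uses a normal approximation and assumes that higher outcome values are better, so the non-inferiority limit is the negative of a margin. The function name `non_inferiority_z_test`, the one-sided 0.025 significance level and the numbers are illustrative assumptions only.

```python
from statistics import NormalDist

def non_inferiority_z_test(diff, se, margin, alpha=0.025):
    """One-sided test of H0: true difference <= -margin (higher outcomes are better).

    diff   : estimated difference (new minus standard)
    se     : standard error of that estimate
    margin : non-inferiority margin, given as a positive number
    """
    z_stat = (diff + margin) / se              # distance of the estimate above the limit
    z_crit = NormalDist().inv_cdf(1 - alpha)   # one-sided critical value
    return z_stat, z_stat > z_crit

# The new treatment looks slightly worse (-1.0) but the limit is -3.0
z_ni, non_inferior = non_inferiority_z_test(diff=-1.0, se=0.8, margin=3.0)
print(f"non-inferiority: z = {z_ni:.2f}, H0 rejected: {non_inferior}")

# Setting the margin to zero gives the one-sided no-difference test on the same data
z_sup, superior = non_inferiority_z_test(diff=-1.0, se=0.8, margin=0.0)
print(f"no difference:   z = {z_sup:.2f}, H0 rejected: {superior}")
```

The same data can therefore support non-inferiority without supporting superiority, because only the value the estimate is compared against has changed.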

Note that the inverse of the non-inferiority hypothesis, known as superiority by a margin or supersuperiority, evaluates whether a new treatment is greater than the standard treatment by a specified margin. The null hypothesis is rejected if the difference is sufficiently above the superiority margin. This should not be confused with the common usage of “superiority testing” for the case of testing a no difference (inequality) hypothesis.

In equivalence testing, the test is a composite one which assesses whether the difference between the treatment and control is small enough, in either direction, for the two to be considered equivalent. The alternative hypothesis is that the interventions are equivalent, i.e. the difference lies within the lower and upper equivalence limits, against the null hypothesis of inequivalence in either direction.

As this is a composite hypothesis, it requires the simultaneous testing of two hypotheses. The most common methods for equivalence testing are the two one-sided tests (TOST) approach, where each one-sided test is conducted separately and the null hypothesis is rejected only if both are significant, and the confidence interval approach, where a confidence interval is constructed and equivalence is concluded if it is fully contained within the lower and upper equivalence limits. Note that for a given significance level α, the two-sided confidence interval is constructed at the 100(1 − 2α)% level (e.g. a 0.05 significance level corresponds to a 90% confidence interval).
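
The link between the two approaches can be illustrated with a small Python sketch. It assumes a normally distributed estimate with a known standard error, and the function names `tost` and `ci_equivalence` are hypothetical. Under these assumptions both functions reach the same conclusion, since the 90% interval limits sit exactly one one-sided critical value away from the estimate.

```python
from statistics import NormalDist

def tost(diff, se, lower, upper, alpha=0.05):
    """Two one-sided z-tests: equivalence is concluded only if both reject."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    above_lower = (diff - lower) / se > z_crit   # rejects H0: difference <= lower limit
    below_upper = (upper - diff) / se > z_crit   # rejects H0: difference >= upper limit
    return above_lower and below_upper

def ci_equivalence(diff, se, lower, upper, alpha=0.05):
    """Equivalence if the 100(1 - 2*alpha)% interval lies inside the limits."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return lower < diff - z_crit * se and diff + z_crit * se < upper

# Illustrative estimates with equivalence limits of -3 and +3
for diff in (0.5, 2.0):
    print(diff, tost(diff, 1.0, -3.0, 3.0), ci_equivalence(diff, 1.0, -3.0, 3.0))
```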

The degrees of freedom in a statistical calculation show how many values involved in a calculation have the freedom to vary. The degrees of freedom can be calculated to help ensure the statistical validity of chi-square tests, t-tests and F-tests. These tests are frequently used to compare observed data with data that is expected to be attained according to a specific hypothesis.

In clinical research, to determine an accurate and reliable sample size calculation, an appropriate statistical test for the hypothesis of interest is derived from the study design and questions of interest. In most research we are interested in evaluating the relative effectiveness of a proposed treatment or intervention, or in understanding the effect of some exogenous factor. The sample size determination then relates to achieving an acceptable probability of finding a significant result (i.e. rejecting the null hypothesis, achieving a significant p-value) given that the desired effect really exists. This probability is known as statistical power.

Power is defined as the probability of rejecting the null hypothesis given that it is false. Power requires the specification of an exact alternative hypothesis value. In clinical trials, it is the probability that the trial will be able to detect a true effect of the treatment of a specified size or larger. Practically, it is the likelihood that a test will detect the specified effect or greater when that effect exists to be detected i.e. achieve a significant p-value in the presence of a real effect.

The power of any test of statistical significance is defined as the probability that it will reject a false null hypothesis (H0). Power relates to the specific alternative hypothesis chosen (H1). As a result the calculations involved in calculating power vary depending on the statistical test chosen.

Sample size determinations estimate how many patients are necessary for a study. Power calculations determine how likely you are to avoid a type II error given an assumed design, including the sample size, and study outcome. It can be shown that power will generally increase as sample size increases. As other planning parameters are generally fixed (e.g. significance level) or based on an estimate (e.g. variance) or clinically desired value (e.g. effect size), the sample size is the obvious target to adjust to achieve the appropriate power.
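
To make the relationship between the planning parameters concrete, the sketch below uses the textbook normal-approximation formula for comparing two means with equal group sizes, n = 2σ²(z₁₋α/₂ + z₁₋β)²/δ² per group. It assumes a two-sided test and a known common standard deviation. The function name `n_per_group` and the input values are illustrative, and validated software may use exact methods that give slightly different answers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.90):
    """Per-group sample size for a two-sided two-sample z-test (normal approximation).

    delta : difference in means the study is designed to detect
    sigma : common standard deviation assumed for both groups
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_beta = z.inv_cdf(power)            # quantile matching the target power
    return ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)

# Hypothetical planning values: detect a 5-unit difference when the SD is 10
for target in (0.80, 0.90):
    print(f"power {target:.0%}: {n_per_group(delta=5, sigma=10, power=target)} per group")
```

Holding the significance level, standard deviation and effect size fixed, moving the target power from 80% to 90% raises the required sample size.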

For reference, type I errors, also known as false positives, occur when you reject a true null hypothesis. Type II errors, or false negatives, occur when you fail to reject a false null hypothesis. In statistics, the probability of making a type I error is denoted by the Greek letter alpha (α), and the probability of making a type II error is denoted by the Greek letter beta (β).

As the sample size increases, so does the power of the significance test. This is because a larger sample size constricts the distribution of the test statistic. This means that the standard error of the distribution is reduced and the acceptance region is reduced which in turn increases the level of power. Acceptance region here refers to the range of values in which the test statistic may fall where one fails to reject the Null Hypothesis. As this region becomes smaller, the probability that a false Null hypothesis will be rejected increases.
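
The mechanism described above can be sketched numerically. The example assumes a two-sided two-sample z-test with equal group sizes and a known common standard deviation. The names `acceptance_half_width` and `power` are hypothetical, and the power calculation ignores the negligible chance of rejecting in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def acceptance_half_width(n, sigma, alpha=0.05):
    """Largest absolute observed mean difference that still fails to reject H0."""
    return Z.inv_cdf(1 - alpha / 2) * sigma * sqrt(2 / n)

def power(n, delta, sigma, alpha=0.05):
    """Approximate probability of rejecting H0 when the true difference is delta."""
    se = sigma * sqrt(2 / n)   # standard error shrinks as n grows
    return Z.cdf(delta / se - Z.inv_cdf(1 - alpha / 2))

# Illustrative values: true difference 5, standard deviation 10
for n in (20, 40, 80, 160):
    print(n, round(acceptance_half_width(n, 10), 2), round(power(n, 5, 10), 3))
```

Each doubling of the per-group sample size narrows the acceptance region by a factor of √2, and the power rises accordingly.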

Sample size also strongly influences the P-value of a test. An effect that fails to be significant at a specified level of significance in a small sample can be significant in a larger sample.

The two most common targeted powers in clinical studies are 80% and 90%.

Generally, a level of power equal to 90% is preferred for most clinical trials. This corresponds to having a 10% probability of making a type II error. Note that larger levels of power require larger sample sizes to decrease the probability of making a type II error and thus increase the level of power.

90% is recommended as it provides both an optimism bias adjustment, for example because pre-study effect sizes are often overestimated, and an adjustment for the two successful studies requirement (a standard requirement for confirmatory clinical trials), as the power for two statistically independent studies both being significant will equal 81% (i.e. 100 × 0.9²). When observations are expensive or difficult to obtain, a lower value of 80% power is acceptable.
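
The arithmetic behind the two-study adjustment can be written out as follows, assuming the two studies are statistically independent and each has the same power 1 − β:

```latex
P(\text{both studies significant}) = (1-\beta)^2, \qquad
0.9^2 = 0.81, \qquad
0.8^2 = 0.64
```

So two studies each planned at 90% power have a joint power close to the conventional 80%, while two studies each planned at 80% power have a joint power of only 64%.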

Note that a generally acceptable type II error rate of 0.2 was proposed by Cohen, who postulated that a type I error was more serious than a type II error. Therefore, he estimated the type II error rate at 4 times the type I error rate: 4 × 0.05 = 0.20. This value is arbitrary but the 80% and 90% levels of power have been copied and become the standard levels used by most researchers. Using a level of power which is appropriate for the study in question is an important and useful discussion item to have with study stakeholders.

The effect size is the difference in the primary outcome value that the clinical trial or study is designed to detect. In general, the greater the true effect of the treatment, the easier it is to detect this difference using the sample selected for your trial. As a result, a larger effect size will also increase your power, i.e. you will have more power to detect a larger effect size and less power to detect a smaller effect size. There are two types of effect size: standardized effect sizes and unstandardized effect sizes.

To determine the required sample size to achieve the desired study power, or to determine the expected power obtainable with a proposed sample size, one must specify the difference that is to be detected. Statistical power is affected significantly by the size of the effect as well as the size of the sample used to detect it. In general, bigger effect sizes are easier to detect than smaller effect sizes, while large samples offer greater test sensitivity than small samples.

A standardized effect size measures the magnitude of the treatment effect on a unit free scale. This scale will usually be in magnitudes related to the variance. This allows a more direct and comparable measure of the expected degree of effect across different studies. There are many standardized effect sizes. Some of the more common examples are Cohen’s d, partial Eta-squared, Glass’ delta and Hedges’ g.

Cohen's d is often considered the appropriate effect size measure if both groups have similar standard deviations and are of the same size. Partial Eta-squared can be useful during ANOVA tests. Glass' delta is another measure if each group has a different standard deviation and uses only the standard deviation of the control group. Hedges' g, which provides a measure of effect size weighted according to the relative size of each sample, is an alternative where there are different sample sizes.
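
A short Python sketch of three of these measures is given below. Definitions vary somewhat between textbooks, so the choices here are assumptions: Cohen's d divides by the pooled standard deviation, Glass' delta divides by the control group standard deviation, and Hedges' g applies the commonly used small-sample correction factor 1 − 3/(4(n₁ + n₂) − 9) to Cohen's d. The data are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(treat, control):
    """Mean difference divided by the pooled standard deviation."""
    n1, n2 = len(treat), len(control)
    s1, s2 = stdev(treat), stdev(control)
    pooled = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(treat) - mean(control)) / pooled

def glass_delta(treat, control):
    """Mean difference divided by the control group standard deviation only."""
    return (mean(treat) - mean(control)) / stdev(control)

def hedges_g(treat, control):
    """Cohen's d multiplied by an approximate small-sample bias correction."""
    n_total = len(treat) + len(control)
    return cohens_d(treat, control) * (1 - 3 / (4 * n_total - 9))

treatment = [12, 14, 16, 18, 20]   # illustrative data, not from a real study
control = [10, 11, 12, 13, 14]
print(round(cohens_d(treatment, control), 3))
print(round(glass_delta(treatment, control), 3))
print(round(hedges_g(treatment, control), 3))
```

Because the two groups in this example have quite different standard deviations, Glass' delta differs noticeably from Cohen's d, which is the situation in which the choice of measure matters.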

The unstandardized effects are the raw treatment effects. Usually they are just the difference or ratio between means, rates or proportions. This effect size will be on the same scale as the treatment measurements (e.g. mmHg, median survival). Unstandardized effect sizes are usually preferable for clinical trials.

Many use the following guidance on whether to use a standardized or unstandardized measure of effect size: If the units of measurement are meaningful on a practical level then an unstandardized measure is generally preferred to a standardized measure. This is because the unstandardized effect size makes clear all information and planning parameters being used while planning the study and calculating the sample and also ensures the effect size is on a scale understandable to all stakeholders.

Selecting an appropriate effect size is one of the most important aspects of planning a clinical trial. If the effect size you use in your calculation is smaller than the true difference, a larger sample size than necessary will be required to detect the difference. On the other hand if the effect size used in the calculation is larger than the true effect then the sample size calculated before the trial will not be enough to achieve the target power.
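
The cost of an optimistic effect size can be illustrated with a normal-approximation sketch for a two-sided comparison of two means with equal group sizes. The function names `n_per_group` and `achieved_power` and all input values are hypothetical. The study is planned for a difference of 5 but the true difference is assumed to be 4.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

def n_per_group(delta, sigma, alpha=0.05, power=0.90):
    """Per-group sample size for a two-sided two-sample z-test."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil(2 * (sigma * z_sum / delta) ** 2)

def achieved_power(n, true_delta, sigma, alpha=0.05):
    """Approximate power actually obtained with n per group if the effect is true_delta."""
    return Z.cdf(true_delta / (sigma * sqrt(2 / n)) - Z.inv_cdf(1 - alpha / 2))

planned_n = n_per_group(delta=5, sigma=10)                # planned for a difference of 5
print(planned_n, round(achieved_power(planned_n, 5, 10), 3))
print(round(achieved_power(planned_n, 4, 10), 3))         # power if the true difference is 4
print(n_per_group(delta=4, sigma=10))                     # sample size needed for 4
```

Under these assumptions, a study sized for the larger effect falls well short of the 90% target when the true effect is smaller, and planning for the smaller effect requires substantially more subjects.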

The effect size also lays out the real quantitative objective of the study by putting a numerical value on your study question(s). In most studies, the effect size is the main finding of a quantitative study. While a p-value can inform the reader on whether an effect exists, it does not contain explicit information on the magnitude of the effect. In reporting and interpreting studies, both the substantive significance (effect size) and statistical significance (p-value) are essential results to be reported and this is reflected in the sample size determination process.

There are various approaches, options and interpretations for how to find an appropriate value for the effect size. There are generally two main approaches for specifying your effect size: using a clinically relevant target difference or using the expected difference. There are arguments in favour of both, and ideally the selected effect size would fulfil the requirements of both options, striking the correct balance between being clinically relevant and plausible. However, using a clinically relevant difference is usually preferred in clinical trial planning.

In addition to these, new methods have appeared which deal with the influence of uncertainty around the effect size. Two methods which are covered in nQuery are Bayesian Assurance and Unblinded Sample Size Re-estimation. In Bayesian assurance the effect size is parametrized as a distribution of values rather than a single value. Unblinded sample size re-estimation allows the effect size to be updated during a study based on the data collected during the study up to that point. These two methods focus on minimizing and accounting for uncertainty around the effect size.

Clinical significance pertains to the practical real life importance or benefits of a treatment effect. While much research focuses on statistical significance, in reality clinicians, clinical researchers, governments and other stakeholders are interested in clinically significant effects. A study outcome can be statistically significant, but not clinically significant, and vice‐versa. Unfortunately, clinical significance is often not well defined and is domain-specific and thus many mistakenly conflate statistical significance with clinical relevance.

The recommendation from many experts is to power a study for the minimum difference worth detecting and this can be interpreted as a lower bound for the effect size that would still be considered clinically relevant. Ideally this lower bound would be defined using pre-existing evidence, health economic models and expert opinion.

Clinically relevant changes in outcomes are identified by several similar terms including the “minimal clinically important difference” (MCID), “clinically meaningful difference” (CMD) and “minimally important change” (MIC). In general, these terms all refer to the smallest change in an outcome score that is considered “important” or “worthwhile” by the practitioner or the patient and/or would result in a change in patient management.

There are many formal methods to determine what the expected value for the effect size should be. Examples of the most common methods would be pilot studies, literature review, expert elicitation (including formal frameworks such as SHELF) and standardized effect sizes (small, medium, large).

The pilot study method is popular but ideally should be used in conjunction with other methods as smaller sample sizes are generally used in these studies so often the error in the measurement produced is too large to be depended on solely.

An effect size is a clinically defined measure that should ideally be specific to the study in question which means that interpretations on an effect size may vary. For example, if the odds of a particular trait being present in a group is of interest then researchers may decide that the odds ratio between groups be chosen as the measure of effect size and they may decide on a certain index to facilitate a standardized interpretation of results for this measure of effect size.

These effect size discussions are a good way of “bridging the gap” between the statistician(s) and the research/design team for a study as it allows the exchange of information relating specifically to the study in question that may affect the interpretation of results obtained.

However, in the absence of an agreed measure of effect size a general rule of thumb regarding effect sizes was provided by Cohen (1988) where he suggested that a standardized effect size (Cohen’s d) in the region of 0.8 represents a large effect size. He also suggests that effect sizes of 0.2 and 0.5 be considered ‘small’ and ‘medium’ effect sizes, respectively.

However, Cohen himself warns: "The terms 'small,' 'medium' and 'large' are relative, not only to each other, but to the area of behavioural science or even more particularly to the specific content and research method being employed in any given investigation. In the face of this relativity, there is a certain risk inherent in offering conventional operational definitions for these terms for use in power analysis. This risk is nevertheless accepted in the belief that more is to be gained than lost by supplying a common conventional frame of reference which is recommended for use only when no better basis for estimating the effect size index is available."

A confidence interval is an interval constructed so that it contains a population parameter (e.g. the mean) with a given long-run frequency. This frequency is known as the confidence level.

A confidence level refers to the percentage of all possible samples whose intervals are expected to include the true population parameter. For example, a 95% confidence interval is produced by a procedure that captures the true parameter value in 95% of repeated samples. Under this repeated sampling framework, 95% confidence is akin to expecting that across 10000 samples the true parameter value would fall within roughly 9500 of the resulting intervals.
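
The repeated sampling interpretation can be checked with a small simulation. This sketch assumes normally distributed data with a known standard deviation, so that a z-interval applies. The function name `coverage`, the population values and the random seed are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def coverage(true_mean=50.0, sigma=10.0, n=30, reps=2000, level=0.95, seed=1):
    """Fraction of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sigma / sqrt(n)          # same half-width for every sample
    hits = 0
    for _ in range(reps):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1                         # this interval captured the true mean
    return hits / reps

print(f"Observed coverage: {coverage():.3f}")
```

The observed proportion should be close to, though not exactly, 0.95, since it is itself an estimate from a finite number of simulated samples.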

The confidence interval width is the distance from the lower to the upper limit of the confidence interval. The narrower the confidence interval, the more confidence we will tend to have in our point estimate and the more precise the estimate will be considered to be. The usefulness of the confidence interval depends on its width/precision. The width depends on the chosen confidence level, on the standard deviation of the quantity being estimated and on the sample size.

For a two-sided interval, the width of a confidence interval is defined as the distance between the two interval limits. In the one-sided case, however, the width is often defined as the distance from the parameter estimate to the limit of the interval. This is despite the fact that, technically, a one-sided interval extends from the lowest possible value to the upper limit (for an upper one-sided interval) or from the lower limit to the highest possible value (for a lower one-sided interval).

A larger sample will tend to produce a better estimate of the population parameter, when all other factors are equal. Increasing the sample size decreases the width of confidence intervals, because it decreases the standard error. This can also be phrased as increasing the sample size will increase the precision of the confidence interval.
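
For a mean with a known standard deviation, the two-sided interval width is 2 · z · σ/√n, which can be inverted to find the sample size for a target width. The sketch below assumes this simple z-interval setting, and the function names `ci_width` and `n_for_width` are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def ci_width(n, sigma, level=0.95):
    """Full width of a two-sided z-interval for a mean."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return 2 * z * sigma / sqrt(n)

def n_for_width(width, sigma, level=0.95):
    """Smallest sample size whose interval is no wider than the target width."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return ceil((2 * z * sigma / width) ** 2)

# Illustrative standard deviation of 10
for n in (25, 100, 400):
    print(n, round(ci_width(n, sigma=10), 2))
print(n_for_width(width=4, sigma=10))
```

Because the width scales with 1/√n, quadrupling the sample size is needed to halve the interval width.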

In a study in which the researcher is more interested in the precision of the estimate than in testing a specific hypothesis about the estimate, the confidence interval approach is more informative about the observed results than the significance testing approach. Sample size determination which targets the precision of the estimate uses the confidence interval as a method to define the specific precision of interest to the researcher. Common cases where this may be true include survey design and early-stage research.

Bayesian statistics is a field of statistics based on the Bayesian interpretation of probability, where probability expresses a degree of belief in an event which changes as new information is gathered, rather than a fixed value based upon frequency or propensity. The degree of belief will be based on our prior knowledge about the event, such as the results of previous experiments or personal beliefs about the event, and the data up to that point.

Bayesian analysis is becoming a more and more popular form of statistical analysis for clinical trials. This is because it offers the ability to integrate domain knowledge and prior study data in order to improve the efficiency and accuracy of testing and estimations. There are also arguments that the Bayesian framework better reflects real-world treatment decision making.

Q.

Bayesian probability is regarded as a reasonable expectation representing a state of knowledge or as a quantification of a personal belief regarding some outcome. Bayesian probability contrasts with the commonly used frequentist interpretation of probability, which is based on the frequency or propensity of some outcome. Bayesian probability represents a level of certainty relating to a potential result or idea. This is in comparison to a frequentist probability, which shows the frequency with which a certain result will occur over a large number of trials.

One example of Bayesian probability in use is rolling a die. Frequentist theory holds that, if you throw a fair die many times, a six should come up in about one throw in six. There may be deviations in the short run, but they will average out eventually. Bayesian probability takes a different view. A Bayesian watching a game of dice in a casino will probably begin with the same 1 in 6 chance. However, if they notice that the die is showing sixes more often than expected, they will update their belief accordingly.
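
One simple way to formalize this kind of belief updating is a Beta prior on the probability of rolling a six, updated with the observed counts. The prior Beta(1, 5), which has mean 1/6, and the observed data below are illustrative assumptions, as are the function names `update_beta` and `beta_mean`.

```python
from fractions import Fraction

def update_beta(a, b, sixes, rolls):
    """Conjugate update: add observed sixes to a and non-sixes to b."""
    return a + sixes, b + (rolls - sixes)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution, i.e. the current belief about P(six)."""
    return Fraction(a, a + b)

prior = (1, 5)                                         # prior mean 1/6, weakly held
posterior = update_beta(*prior, sixes=12, rolls=30)    # far more sixes than expected
print(beta_mean(*prior), beta_mean(*posterior))        # prints: 1/6 13/36
```

After seeing 12 sixes in 30 rolls, the belief about the chance of a six has moved from 1/6 towards the observed frequency of 12/30.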

There are two broad areas where Bayesian approaches are being applied to sample size determination.

These are:

- Sample Size for Bayesian Methods: This is when you are planning to use a Bayesian test or statistical method and require a sample size estimate to get the desired “success” probability for this method. Examples of this would be determining the required sample size for a sufficiently high Bayes Factor, sample size for a sufficiently narrow credible interval and sample size methods based on decision-theoretic approaches using approaches such as utility functions.
- Using Bayesian approaches to improve existing sample size methods: These methods integrate Bayesian analysis and thinking when the planned analysis is frequentist in order to add greater context to the frequentist sample size determination or improve upon the characteristics of the current sample size method. Examples of this are Bayesian Assurance, Predictive Power, Posterior Error Methods and the adaptation criteria used in Bayesian adaptive designs.

Bayesian Assurance is the unconditional probability that the trial will yield a positive result (usually a significant p-value) given a prior distribution for one or more planning parameters (commonly the effect size) used in a sample size determination. The assurance equals the expectation for the power averaged over the prior distribution of the unknown parameter(s). For this reason, assurance is often referred to as “Bayesian power”.

The assurance provides a useful estimate of the likely utility of a clinical trial, creates an estimate for power which accounts for intrinsic pre-study uncertainty in our planning parameters (akin to a formal sensitivity analysis) and could provide an alternative method to frequentist power for finding the appropriate sample size for a study.
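
As a minimal sketch of the idea of averaging power over a prior, the example below places a normal prior on the true difference in means and integrates the power curve numerically. It assumes a two-sample z-test with a known standard deviation and counts only significant results in the favourable direction. The function names, the prior and the integration settings are illustrative and do not describe how any particular software implements assurance.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def power(delta, n, sigma, alpha=0.05):
    """Chance of a significant result in the favourable direction for a true effect delta."""
    return Z.cdf(delta / (sigma * sqrt(2 / n)) - Z.inv_cdf(1 - alpha / 2))

def assurance(prior_mean, prior_sd, n, sigma, alpha=0.05, steps=4000):
    """Average the power curve over a normal prior on the effect size (midpoint rule)."""
    prior = NormalDist(prior_mean, prior_sd)
    lo, hi = prior_mean - 8 * prior_sd, prior_mean + 8 * prior_sd
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        delta = lo + (i + 0.5) * h            # midpoint of the i-th slice
        total += power(delta, n, sigma, alpha) * prior.pdf(delta) * h
    return total

# Planned with 85 per group for a difference of 5 (SD 10); prior SD of 2 is assumed
print(round(power(5, n=85, sigma=10), 3))
print(round(assurance(prior_mean=5, prior_sd=2, n=85, sigma=10), 3))
```

In this example the assurance is noticeably lower than the power calculated at the prior mean, reflecting the possibility that the true effect is smaller than the planning value.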

Q. What are the differences between a confidence interval and a credible interval?

Posterior credible intervals are the most commonly used Bayesian method for interval estimation. Credible intervals are seen by many as being superior to confidence intervals as they give the probability that the interval contains the true value of the parameter. This is often seen as the more natural interpretation of what a statistical interval should do.

When confronted with the problem of trying to specify a reasonable statistical interval given an observed sample, the frequentist and Bayesian approaches differ.

A confidence interval is the frequentist solution to this problem. Under this approach, you assume that there is a true, fixed value of the parameter. Given this assumption, you use the sample to get to an estimate of this parameter. An interval is then constructed in such a way that the true value for the parameter is likely to fall in this interval with a given level of confidence (say 95%).

A credible interval, on the other hand, is the Bayesian solution to the above problem. It is defined as an interval that contains the population parameter with a given posterior probability. In this case the true value is, in contrast to the above, treated as a random variable. The uncertainty about the true parameter value is captured by assuming a certain prior distribution for the true value of the parameter. This prior distribution is then combined with the obtained sample to form a posterior distribution. An estimate of the true parameter value is then obtained from this posterior distribution. A credible interval is then formed to contain a given proportion of the posterior probability for the parameter. This can be interpreted as meaning that a given interval has a given probability of containing the true parameter value.
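
The contrast can be sketched for the simplest conjugate case: a normal prior on a mean combined with normally distributed data whose standard deviation is known. The function names and all numerical inputs are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(sample_mean, sigma, n, level=0.95):
    """Frequentist z-interval: uses the sample only."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half = z * sigma / sqrt(n)
    return sample_mean - half, sample_mean + half

def credible_interval(prior_mean, prior_sd, sample_mean, sigma, n, level=0.95):
    """Equal-tailed interval from the normal posterior (prior combined with the sample)."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = n / sigma ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * sample_mean)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half = z * sqrt(post_var)
    return post_mean - half, post_mean + half

ci = confidence_interval(sample_mean=4.0, sigma=10.0, n=25)
cri = credible_interval(prior_mean=0.0, prior_sd=5.0, sample_mean=4.0, sigma=10.0, n=25)
print([round(x, 2) for x in ci])
print([round(x, 2) for x in cri])
```

With an informative prior, the credible interval is pulled towards the prior mean and is narrower than the confidence interval. As the prior becomes very diffuse, the two intervals become numerically similar in this setting even though their interpretations differ.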

There are many different methods for finding the appropriate sample size for the precision of a credible interval, including methods based on work by Adcock (1988) and Joseph and Belisle (1997). These methods focus on integrating uncertainty into the estimation of the variance, and the specific method chosen depends on both the desired selection criterion and the estimation methodology. However, the same basic relationship between increasing sample size and reducing interval width that holds for confidence intervals will also generally hold for credible intervals.

An adaptive trial is any trial where a change or decision is made to the trial while it is still ongoing. Adaptive trials enable continual modification to the trial design based on interim data. This in turn allows you to explore options and treatments that you would otherwise be unable to, which can lead to improvements to your trial as data becomes available. Adaptive designs are generally pre-specified and built into the initial trial design.

Examples of adaptive designs are group sequential designs, sample size re-estimation, enrichment designs, arm selection designs and adaptive allocation designs.

Q.

Adaptive trials are seen by many to be a very valuable addition to the clinical trial design toolkit as they give control to the trialist to improve a trial based on all the information as it becomes available. Adaptive design facilitates these improvements to a trial in a principled and pre-specified framework, and thus changes can be made without impacting trial legitimacy. Conceptually, this should bring our trial closer to the optimal trial we would have designed had the results been known beforehand, and thus give better and potentially more efficient inferences.

As a result, adaptive trials can decrease the costs involved in clinical trials by increasing success rates, allowing greater flexibility in adding analyses and trial arms, and allowing trials to end earlier if the results are either very promising or unpromising. This is of particular importance today, as the success rate of clinical trials in general has fallen and the costs associated with clinical trials, particularly confirmatory Phase III clinical trials, have escalated over the last 30 years.

For a more detailed explanation of what the advantages of adaptive clinical trials are, see the following video:

Q. What are the potential disadvantages of adaptive design?

Adaptive trials may involve statistics and estimates that are more complex than, and different from, those commonly used in clinical trials. They may therefore require specialized software and expertise to implement, which may incur additional up-front costs. For example, most adaptive designs used in clinical trials will require in-depth simulation to evaluate the design’s operating characteristics and expected Type I error rate. This may also mean that results from an adaptive design are not directly comparable to those from a fixed term trial. As a result, certain inter-trial comparisons and meta-analyses may be difficult, which may cause problems from a regulatory point of view or for general understanding.

Adaptive trials may also incur additional logistical costs. Bias and unblinding are major issues: maintaining the blind may incur additional costs, and a broken blind is a major risk to the integrity of the trial. Because more people must view the interim data as the trial progresses in order to make adaptive decisions, there is more scope for changes that could introduce bias into the trial. In large clinical trials, this places more emphasis on working collaboratively with the relevant regulatory agency and the independent data monitoring committee (IDMC).

For a more detailed explanation of what the advantages and disadvantages of adaptive clinical trials are, see the following video:

Q.

Adaptive designs allow clinical trials to be more flexible by utilising results accumulated in the trial to change the trial’s course. Trials with an adaptive design are usually more efficient and informative, and can be more ethical, than trials with a traditional fixed design because they often make better use of resources such as time and money, and may require fewer participants.

However, adaptive designs can be less efficient than a fixed term trial or a simpler adaptive design if they are designed poorly. Examples include sample size re-estimation designs which lead to higher average sample sizes for minimal increases in success rate, and adaptive selection designs which include additional arms with minimal prior chance of succeeding.

Overall, adaptive designs when properly considered and well-planned have considerable scope for increasing trial efficiency but the additional flexibility could mean more opportunities for making poor design decisions.

Group sequential designs are the most widely used type of adaptive trial in confirmatory Phase III clinical trials. Group sequential designs differ from a standard fixed term trial by allowing a trial to end early based on pre-specified interim analyses for efficacy or futility. They achieve this by using an error spending method, which allows a set amount of the total Type I (efficacy) or Type II (futility) error to be spent at each interim analysis. The ability to end the trial early can help reduce costs by creating an opportunity to get early approval for highly effective treatments and to abandon trials which have shown very poor results thus far.
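To make the idea of error spending concrete, the sketch below evaluates two widely used Lan-DeMets spending functions (O'Brien-Fleming-type and Pocock-type) at a set of information fractions. It shows only how much of the one-sided Type I error is made available at each look. Converting these amounts into stopping boundaries requires numerical integration and is not shown. The function names and the choice of looks are assumptions for illustration.

```python
from math import e, log, sqrt
from statistics import NormalDist


def obrien_fleming_spend(t, alpha=0.025):
    """Cumulative alpha spent at information fraction t (O'Brien-Fleming-type)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / sqrt(t)))


def pocock_spend(t, alpha=0.025):
    """Cumulative alpha spent at information fraction t (Pocock-type)."""
    return alpha * log(1 + (e - 1) * t)


def incremental_alpha(spend, fractions, alpha=0.025):
    """Alpha newly made available at each look, given a spending function."""
    cumulative = [spend(t, alpha) for t in fractions]
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]


if __name__ == "__main__":
    looks = [0.25, 0.5, 0.75, 1.0]  # hypothetical equally spaced looks
    for name, fn in [("OBF-type", obrien_fleming_spend), ("Pocock-type", pocock_spend)]:
        print(name, [round(a, 5) for a in incremental_alpha(fn, looks)])
```

Both functions spend the full alpha by the final analysis. The O'Brien-Fleming-type function spends very little early on, which makes early stopping for efficacy hard, while the Pocock-type function spends the error more evenly across looks.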

The term 'futility' refers to the inability of a clinical trial to achieve its aims. A trial may be stopped for futility when the interim results suggest that it is highly unlikely to achieve statistical significance. This can save resources which can then be used in other more promising studies.

Sample size re-estimation (SSR) is a type of adaptive trial where one can change the sample size if required. Sample size determination is a pre-trial process conducted on the basis of inherently uncertain planning parameters (e.g. variance, effect size), so changing the sample size based on improved interim estimates for these parameters is an obvious adaptation target.

SSR can ensure that sufficient power is obtained for promising results in an otherwise underpowered study. It could also ensure that more patients receive the superior treatment, or allow a trial to transition directly from one phase to another. Combined, these may reduce the use of resources and time or improve the likelihood of success of the trial.

There are two primary types of SSR: unblinded SSR and blinded SSR. These designs differ in whether the interim data is blinded and in which planning parameter is targeted for an improved interim estimate.

For a more detailed explanation of what SSR is, see the following video on sample size re-estimation:

A blinded sample size re-estimation (SSR) design is a flexible design whose main purpose is to allow the sample size of a study to be reassessed mid-way through the study to ensure sufficient power without unblinding the interim data, i.e. without allowing the trialist to know which treatment group the interim data is from.

As the effect size will be blinded, blinded SSR will typically target nuisance parameters used in the sample size determination, such as the variance or control proportion. While unblinded interim data could improve the estimates of these nuisance parameters, the improvement over the best blinded estimate is negligible. However, by keeping the blind, the logistical and regulatory barriers to adaptation are significantly lowered, as there is less chance of operational or statistical bias.
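A minimal sketch of a blinded re-estimation for a two-arm comparison of means is given below. It assumes the usual normal-approximation sample size formula, uses the one-sample standard deviation of the pooled (label-free) interim data as the blinded variance estimate, and never reduces the sample size below the planned value. The function names, the optional `max_n` cap and the "no decrease" rule are assumptions for illustration rather than a prescribed procedure.

```python
from math import ceil
from statistics import NormalDist, stdev


def per_group_n(sd, delta, alpha=0.05, power=0.9):
    """Per-group n for a two-sided, two-sample z-test of a mean difference."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_a + z_b) * sd / delta) ** 2)


def blinded_reestimate(pooled_values, delta, planned_n,
                       alpha=0.05, power=0.9, max_n=None):
    """Re-estimate per-group n from interim data without treatment labels."""
    # One-sample SD of all interim observations, ignoring group membership.
    blinded_sd = stdev(pooled_values)
    new_n = max(planned_n, per_group_n(blinded_sd, delta, alpha, power))
    if max_n is not None:
        new_n = min(new_n, max_n)  # hypothetical cap on the increase
    return new_n


if __name__ == "__main__":
    print(per_group_n(sd=10, delta=5))  # planning-stage calculation
    interim = [0.0, 20.0] * 10          # toy blinded interim data
    print(blinded_reestimate(interim, delta=5, planned_n=85))
```

Note that the effect size `delta` stays at its planning value throughout. Only the variance estimate is updated, which is what allows the blind to be kept.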

For a detailed explanation of this topic, see the following video on blinded sample size re-estimation:

Q. What is unblinded sample size re-estimation?

In unblinded sample size re-estimation, the sample size is re-estimated at the interim analysis using unblinded data. As the interim data is unblinded, the interim effect size is known, and this is typically the target for unblinded SSR. Despite results from previous studies, the treatment effect can have a high level of uncertainty at the design stage, and thus the interim effect size is an obvious adaptation metric.

As unblinding creates significant risks of statistical and operational bias, unblinded sample size re-estimation is usually done in a context where an unblinded adaptive design is already planned, for example a group sequential design. Due to this, the most common unblinded SSR designs are extensions of the group sequential design which add the option to increase the sample size in addition to the option for early stopping.

The most common unblinded SSR framework assumes a design which powers initially for a more optimistic effect size but allows sample size increases for interim effect sizes which are less than the optimistic effect size but which are still “promising” for a smaller but still clinically relevant difference. For this reason, these designs are often called “promising zone” designs.

To evaluate whether an interim result is “promising”, conditional power is the most common metric. However, some have suggested alternative metrics such as predictive power due to strong assumptions regarding the “true” effect size in conditional power calculations.

For a detailed explanation of this topic, see the following video on unblinded sample size re-estimation:

Conditional power is the probability that the trial will reject the null hypothesis at a subsequent look given the current test statistic and the assumed “true” parameter values, which are usually assumed to equal their interim estimates or their initial planning values.
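The definition above can be sketched for the simplest setting: a one-sided test with a single remaining (final) analysis and no adjustment of the final critical value for earlier looks. The interim statistic is expressed on the z-scale, `info_fraction` is the proportion of information observed so far, and `drift` is the assumed expected value of the final z-statistic. These names, and the default of using the current trend as the assumed truth, are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def conditional_power(z_interim, info_fraction, drift=None, alpha=0.025):
    """P(final Z exceeds its critical value | interim Z and assumed drift)."""
    t = info_fraction
    if drift is None:
        # "Current trend": assume the interim estimate is the true effect.
        drift = z_interim / sqrt(t)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    b_t = z_interim * sqrt(t)  # interim statistic on the B-value scale
    # The remaining increment is normal with mean drift*(1-t), variance 1-t.
    return 1 - NormalDist().cdf((z_crit - b_t - drift * (1 - t)) / sqrt(1 - t))


if __name__ == "__main__":
    # Same interim result, two different assumed "true" effects.
    print(conditional_power(1.5, 0.5))             # current trend
    print(conditional_power(1.5, 0.5, drift=3.0))  # hypothetical planning value
```

The two calls show how strongly the result depends on the assumed "true" effect. A promising zone design would compare such a value against pre-specified, design-specific thresholds.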

Predictive power (also known as Bayesian Predictive Power) is the conditional power averaged over the posterior distribution of the effect size. It is commonly used to quantify the probability of success of a clinical trial. It has been suggested as a superior alternative to conditional power as it treats the “true” effect size as uncertain rather than fixed.
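As a sketch of this averaging, the code below computes predictive power for a one-sided test with a single final analysis under a flat (non-informative) prior on the effect, which gives a closed form. It then checks that closed form against a direct numerical average of conditional power over the posterior. The flat prior, the function names and the integration settings are assumptions for illustration; an informative prior would change the result.

```python
from math import sqrt
from statistics import NormalDist


def predictive_power(z_interim, info_fraction, alpha=0.025):
    """Closed-form predictive power under a flat prior on the drift."""
    t = info_fraction
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf((z_interim - z_crit * sqrt(t)) / sqrt(1 - t))


def predictive_power_by_averaging(z_interim, info_fraction, alpha=0.025, steps=4000):
    """Average conditional power over the flat-prior posterior of the drift."""
    t = info_fraction
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    b_t = z_interim * sqrt(t)
    # Flat-prior posterior of the drift: mean Z/sqrt(t), SD 1/sqrt(t).
    posterior = NormalDist(mu=z_interim / sqrt(t), sigma=1 / sqrt(t))
    lower = posterior.mean - 8 * posterior.stdev
    width = 16 * posterior.stdev / steps
    total = 0.0
    for i in range(steps):  # midpoint rule over +/- 8 posterior SDs
        drift = lower + (i + 0.5) * width
        cp = 1 - std_normal.cdf((z_crit - b_t - drift * (1 - t)) / sqrt(1 - t))
        total += cp * posterior.pdf(drift) * width
    return total


if __name__ == "__main__":
    print(predictive_power(1.5, 0.5), predictive_power_by_averaging(1.5, 0.5))
```

Because the effect is averaged over rather than fixed, predictive power is typically less extreme than conditional power computed under the current trend for the same interim result.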

A parallel study is a type of clinical study in which two treatments, A and B, are compared such that one group of subjects receives only A while another group receives only B.

A crossover trial is a study where patients receive a sequence of different treatments. The patients cross over from one treatment to another during the course of the study.

A baseline study defines a benchmark against which the progress and achievements of a project can be measured. A baseline study needs to be carried out before or at the beginning of the implementation of an intervention.

Phase I trials are the initial clinical trials on a new compound, usually conducted among healthy volunteers with a view to assessing safety (e.g. finding the Maximum Tolerated Dose (MTD)).

Once a drug has been established as safe in a Phase I study, the next stage is to conduct a clinical trial in patients to determine the optimum dose and to assess the efficacy of the compound. Note that Phase II trials can often be split into Phase IIa and Phase IIb trials, where the objective is proof-of-concept (e.g. biological activity) and treatment selection (e.g. optimal dose-finding) respectively.

Phase III trials are large multi-centre comparative clinical trials conducted to demonstrate the safety and efficacy of the new treatment with respect to the standard treatments available. Two successful Phase III trials are typically required for regulatory approval.

Phase IV trials are studies conducted after a drug is marketed to provide additional details about its safety, efficacy and usage profile.

The purpose of a pilot study is to examine the feasibility of a clinical trial that is intended to be used on a larger scale.

A superiority clinical trial is carried out to show that a new drug is more effective than another drug that it is being compared to.

A non-inferiority trial is a trial with the primary objective of showing that the response to the investigational product is not clinically inferior to a comparative agent (active or placebo control).

An equivalence trial is a trial with the primary objective of showing that the response to two or more treatments differs by an amount which is clinically unimportant. This is usually demonstrated by showing that the true treatment difference is likely to lie between a lower and an upper equivalence margin of clinically acceptable differences.
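One common way to operationalise this is to check whether a confidence interval for the treatment difference lies entirely inside the two margins. With a (1 - 2 x alpha) interval this corresponds to carrying out two one-sided tests at level alpha. The sketch below assumes a normal approximation with a known standard error; the function name and the example numbers are hypothetical.

```python
from statistics import NormalDist


def equivalence_shown(mean_diff, std_error, lower_margin, upper_margin, alpha=0.05):
    """True if the (1 - 2*alpha) confidence interval lies inside the margins."""
    z = NormalDist().inv_cdf(1 - alpha)
    ci_low = mean_diff - z * std_error
    ci_high = mean_diff + z * std_error
    return lower_margin < ci_low and ci_high < upper_margin


if __name__ == "__main__":
    # Hypothetical estimates with equivalence margins of -3 and +3.
    print(equivalence_shown(0.5, 1.0, -3, 3))  # interval inside the margins
    print(equivalence_shown(2.0, 1.0, -3, 3))  # upper limit exceeds the margin
```

A non-inferiority check is the one-sided version of the same idea, where only the limit on the unfavourable side is compared with a single margin.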

Bioequivalence is when two drugs with the same active ingredients, or two different dosage forms of the same drug, have similar bioavailability and produce the same effect.

Sensitivity is the proportion of patients with the condition who have a positive test result.

Specificity refers to the ability of a test to rule out disease. It is the proportion of patients without the condition who have a negative test result.
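Both quantities are simple proportions from a two-by-two table of test result against true condition, as the short sketch below shows. The counts used are made up for illustration.

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of patients WITH the condition who test positive."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Proportion of patients WITHOUT the condition who test negative."""
    return true_negatives / (true_negatives + false_positives)


if __name__ == "__main__":
    # Hypothetical counts: 100 patients with the condition, 100 without.
    print(sensitivity(true_positives=90, false_negatives=10))
    print(specificity(true_negatives=80, false_positives=20))
```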

Precision Medicine is when researchers evaluate a person’s genetics, lifestyle, and environment to create a treatment plan and prescribe the correct medication.


Copyright © Statsols 2018, All Rights Reserved. Privacy Policy