When it comes to research studies, most people are impressed by big numbers. But how do you define ‘big’? Against what?

Welcome to Part 2 of the Science Critique 101 Series.

In the first part of this Science Critique 101 series, I highlighted the need to be critical of research but also warned about dismissing a study entirely because it has shortcomings. This is because, when it comes to designing a research method, every decision that a researcher makes involves a trade-off. Once a research design has been painstakingly finalised around an ideal scenario, its practical undertaking can be limited by unavoidable factors.

Research undertaken in the natural setting of the ‘field’ can be impacted by the unexpected, but even research undertaken in the controlled environment of a laboratory is subject to the unforeseen, such as running out of funding, staff losses, participants changing their minds and equipment failure.

No piece of research is perfect. However, this doesn’t mean that the findings are not useful. It just means that the limitations need to be taken into account when we make sense of those findings.

Most people are familiar with what I am calling the ‘Big Four’ critiques: sample size, research bias, peer-review and funding. In particular, they find it easy to call out a study for having a small sample, being done by biased researchers, being ‘rubber stamped’ or having biased funders.

While all of these concerns are legitimate, how likely are they to be reflected in peer-reviewed research and do they warrant the dismissal of an entire study?

In the current article, I look more closely at sample size.

How do you define big?

First, take a moment to ask yourself – how many horses would you want to know were involved in a study for you to accept the findings as true?

Would you like 100, 200, or 1,000?

It seems that the larger the sample, the more likely people are to believe that the findings are true and accurate. However, a number by itself is not very meaningful. What does matter is the representativeness of the sample, the complexity of the data points, and the significance of the findings.

Moreover, whilst it should not be relevant from the perspective of conducting sound scientific research, the socio-political climate in which research is conducted can also have an impact on how people react to sample size.

Depending on what kind of data are being collected and how they will be analysed, a bigger sample size might not only provide no benefit, it could simply be unmanageable. Shutterstock.

Representativeness of the sample

There are new scientific announcements about horses all the time. We are continually learning something new about their behaviour, their training, their digestive system, their movement and their health.

All of these facts are determined through some form of research.

If the findings seem reasonable to us, we are not likely to ask how many horses were involved, and we often just assume that some of the most influential horse studies have been based on sample sizes of hundreds if not thousands, especially when we are often told that bigger, more, stronger, faster, higher, etc., is better.

Whilst this belief is understandable and makes sense to a certain extent, sample size is only a small part of research critique! In science as in life, bigger is not always better.

Let’s consider an example.

According to one study, ‘Three weeks of daily exposure to, and exercise in, hot and humid ambient conditions resulted in a progressive reduction in thermal and cardiovascular strain’. 

This sounds like a reasonable conclusion to make. Nothing controversial there. But what if you knew that study was based on 6 horses? And what if I said ‘that study was based on only 6 horses’?

Does the importance of the number diminish because I added the word ‘only’ before it?

The Žemaitukas horse from Lithuania is a rare and ancient breed. Image source Wikimedia Commons.

It might be easy to dismiss a genetic study of a horse breed when you learn that DNA samples were taken from only 31 horses. 

But what if the researchers were interested in the DNA of an ancient breed of horses that in 2003 only numbered 124 individual horses?

In this case, the 31 horses represented 25% of this ancient breed’s known population at the time. Each horse may not equally represent the whole, but the researchers based their findings on data taken from one quarter of the population in which they were interested. Within the broader context of the total population of this rare breed, a sample size of 31 is actually rather ‘big’.

At the same time, whilst their findings might be representative of Žemaitukai horses, it would be inappropriate for the authors to generalise their findings to all horses. Not only would their sample lack representativeness, it would be quite ‘small’ for those purposes.
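For readers who like to see the arithmetic, here is a minimal Python sketch of why 31 can be ‘big’ for one population and ‘small’ for another. The sample of 31 and the population of 124 come from the study above. The figure of one million for ‘all horses’ is a hypothetical placeholder, not a real count, and the finite population correction is a standard textbook adjustment that I am using for illustration, not something the study’s authors report using.

```python
import math


def sampling_fraction(sample_size, population_size):
    """Share of the population that was actually sampled."""
    return sample_size / population_size


def finite_population_correction(sample_size, population_size):
    """Textbook factor that shrinks the standard error when the
    sample is a large share of a finite population."""
    return math.sqrt((population_size - sample_size) / (population_size - 1))


sample = 31                      # horses sampled (from the article)
populations = {
    "Zemaitukai breed (2003)": 124,         # from the article
    "all horses (hypothetical)": 1_000_000,  # placeholder, NOT a real count
}

for label, population in populations.items():
    fraction = sampling_fraction(sample, population)
    correction = finite_population_correction(sample, population)
    print(f"{label}: sampled {fraction:.4%} of the population, "
          f"correction factor {correction:.3f}")
```

A correction factor well below 1 means the sample covers enough of the population to reduce sampling error noticeably. A factor close to 1 means the sample is a drop in the ocean. Neither number says anything about how the horses were chosen, which is the other half of representativeness.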

In fact, there is a point of diminishing returns. To debate sample size is to engage with competing theories in mathematical philosophy. Nevertheless, as a rule of thumb, 400 participants is the point at which the margin of error becomes acceptable. Going beyond 400 is not thought to improve the accuracy of the data by much. Some people find this to be the case at 200 or 300. But it depends on the kind of data being collected (short, long, able to be taken at face value or in need of further analysis).
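The rule of thumb can be illustrated with the standard margin-of-error formula for a proportion. This is only a sketch, and it rests on assumptions that I am adding for illustration: a simple random sample, a 95% confidence level (z = 1.96), the most conservative case of a 50/50 split (p = 0.5), and a population large enough to ignore. It suits survey-style data, not in-depth qualitative work.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion.

    Assumes a simple random sample from a large population.
    p=0.5 is the worst case; z=1.96 corresponds to 95% confidence.
    """
    return z * math.sqrt(p * (1 - p) / n)


for n in (100, 200, 300, 400, 800, 1600):
    points = margin_of_error(n) * 100
    print(f"n = {n:>4}: about +/- {points:.1f} percentage points")
```

Under these assumptions, a sample of 400 gives a margin of error a little under 5 percentage points. Because the margin shrinks with the square root of the sample size, quadrupling the sample to 1,600 only halves it, which is where the diminishing returns come from.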

I would not interview more than 40 people for a study based on in-depth interviewing. Some PhD studies are based on detailed interviews with a sample of just 10 people, but the level of analysis might extend to when people paused while talking, and for how long!

Data points

Sometimes, a sample size number hides the number of data points that were taken into consideration when determining the findings of a study.

A 2020 study of fearfulness in horses, for example, was based on a sample of 25 stallions. By itself, a sample size of 25 might appear to be ‘small’. However, that sample size hides the amount of data (facts and statistics) that was produced through a longitudinal research design.

Each of the 25 stallions participated in an experiment on three separate occasions – at the ages of 5 months, 12 months and 42 months.

On each occasion, data was taken in the form of heart rate recordings and behavioural observations. Heart rate was taken according to two variables: average and maximum.

Observations were made according to another five variables: latency, alertness, investigation, sniffing and touching the objects.

So, that’s seven variables for 25 stallions taken on three different occasions, providing 525 summary values (7 × 25 × 3), and considerably more points of raw data behind them, to accommodate in statistical testing.
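The multiplication described above can be checked with a short sketch. It counts one value per variable, per stallion, per occasion, which is my assumption about how the summary data were tabulated. The raw heart rate recordings and observation records behind those values would add to the count.

```python
# Variables named in the article
heart_rate_variables = ["average", "maximum"]
behaviour_variables = ["latency", "alertness", "investigation",
                       "sniffing", "touching"]

stallions = 25   # sample size
occasions = 3    # tested at 5, 12 and 42 months of age

variables = len(heart_rate_variables) + len(behaviour_variables)

# Assumption: one summary value per variable, per stallion, per occasion
summary_values = variables * stallions * occasions

print(f"{variables} variables x {stallions} stallions x "
      f"{occasions} occasions = {summary_values} summary values")
```

The point is not the exact total, but that a headline sample size of 25 says very little about how much information the researchers actually had to work with.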

This case is similar to the aforementioned study of ‘Adaptations to daily exercise in hot and humid ambient conditions in trained Thoroughbred horses’ involving 6 horses and the measurement of multiple variables each day for 22 days.

Incidentally, the authors of the study about the ‘Development and consistency of fearfulness in horses from foal to adult’ concluded that fearfulness remained consistent over time.

Still, the extent to which the findings are generalisable to mares and geldings requires further study. If I were a reviewer of the paper, I would have suggested that the word ‘horses’ be replaced with ‘stallions’ in the title, to reinforce the limitations of the sample.

The novel object test is commonly used to make inferences about horse temperament. Shutterstock.

Significance of the findings

Last year, I co-authored a study about whip use in racing. By comparing the Stewards’ reports from 67 “Hands and Heels” races (where whips are held but not used) with 59 reports from comparable races where whipping was permitted, ‘we found no evidence that whip use improves steering, reduces interference, increases safety or improves finishing times’.

Just days before the 2020 Melbourne Cup, I allowed myself to be drawn into a Twitter debate about the extent to which these findings were true and accurate.

Unsurprisingly, there were comments about the fact that we ‘only’ looked at 67 whipping-free races. Had our critics read the very detailed methods and discussion sections of our publicly available study, they would have realised that we studied 100% of the whipping-free races run between January 2017, when new rules were brought into the series, and the end date of our research in December 2019.

Now, 67 might look like a small number, but you can’t get bigger than 100%, at least statistically speaking.

Socio-political climate

It is nice to think that science exists in a vacuum that is free of the influence of culture, society, politics, bias, etc., but that is more of an ideal than a reality. All research is influenced by the socio-political climate in which it is undertaken, published or received.

Whether or not the findings of research are disputed or critiqued largely depends on how much it suits people to believe the findings – if at all. This is where bias (read this: https://bit.ly/3cgeqIf) and logical fallacies (read this: https://bit.ly/3g5AQNt) also influence the extent to which we might critique sample size.

Imagine that the findings of the DNA testing of 31 Žemaitukai horses were taken to suggest that horses evolved to reach peak happiness when they were in the company of humans. I doubt many people would challenge the findings. In fact, it would probably be the top share on every horse person’s Facebook page.

Now imagine that the findings were taken to prove that horse breeds that were ridden by humans suffered irreparable genetic damage that made them have short and unhappy lives. I’m sure there would be a unanimous cry for the research to be re-done.

Of course, one can’t make these hypothetical inferences from studies of DNA, but the example serves to illustrate that regardless of the numbers, not everyone hears (let alone accepts) the same message.

In the case of our study, the people who challenged our finding that whip use is unrelated to the speed or straightness of Thoroughbred racehorses were reacting to the fact that it challenged their own beliefs (even beliefs about their own experiences of using whips). Dismissing the findings due to a small sample size was simply an easy way to preserve those beliefs.

So, if we take representativeness, data points and socio-political climate into consideration, we can see that bigger is not always better.

Let’s leave that belief for the Clydesdales.

In the next Part of this Series, I take a look at another common science critique: research bias.

I will consider how researchers can sometimes be accused of pushing their own agenda and how to contextualise that critique within the realities of doing research and being a researcher.


Christensen, J. W., Beblein, C., & Malmkvist, J. (2020). Development and consistency of fearfulness in horses from foal to adult. Applied Animal Behaviour Science, 232, 105106.

Geor, R. J., McCutcheon, L. J., & Lindinger, M. I. (1996). Adaptations to daily exercise in hot and humid ambient conditions in trained Thoroughbred horses. Equine Veterinary Journal, 28, 63-68. https://doi.org/10.1111/j.2042-3306.1996.tb05033.x

Juras, R., Cothran, E. G., & Klimas, R. (2003). Genetic analysis of three Lithuanian native horse breeds. Acta Agric Scand (A), 53(4), 180-185.

Thompson, K., McManus, P., Stansall, D., Wilson, B. J., & McGreevy, P. D. (2020). Is whip use important to Thoroughbred racing integrity? What stewards’ reports reveal about fairness to punters, jockeys and horses. Animals, 10(11), 1985. https://doi.org/10.3390/ani10111985

This article was first published in the July-August 2021 edition of Horses and People Magazine. 
