Data Analysis Alert: A Case Study of What Not to Do

Sunita Darnell is a social scientist and longtime reader of genre fiction. She reviews and posts about genre fiction at dearauthor.com, and reports that she blogs idiosyncratically at vacuousminx.wordpress.com and welcomes email at vacuousminx@gmail.com.

The Author Earnings Excel spreadsheets that Hugh Howey posted earlier this year (see authorearnings.com/the-report) clearly aroused controversy. Less clearly but more significantly, they provided object lessons for everybody who wants to avoid misleading themselves, and anybody else, while drawing conclusions from data.

As many of you may remember, some people got very, very excited by the conclusions the authors drew from Author Earnings data about self-publishing. And some people shouted critics down as being pro-publisher, anti-self-publishing, or having various other axes to grind.

I have no dog in this hunt. I’m not a fiction author and very little of my income comes from royalties or other direct remuneration from writing. I have to write and publish in order to earn promotions and salary increases, and to be respected in my field, but that writing is judged on quality, reliability, and academic contribution. No one really cares about sales.

I do, however, care about how data are collected, analyzed, and reported, and this report doesn’t pass my smell test for reliability and validity. As a political scientist trained in both political science and sociological methods, I’ve conducted qualitative and quantitative studies, and I’ve hand-built and analyzed a quantitative dataset comprising more than 3,000 observations. I’ve also taught research design and methods at the university undergraduate and graduate levels.

My concerns aren’t about math or statistics (the authors of this report use very little of either). My concerns are about (1) how the data were selected; (2) inferences from one data point to a trend; and (3) inferences about author behavior that are drawn entirely from data about book sales. I would be less concerned if the authors showed more awareness of the limitations of the data. Instead they go the other way and make claims that the data cannot possibly support.

I could write 10,000 words on this pretty easily (I started to), but I’ll spare you and highlight the most significant problems their analysis exemplifies.

The “sample of convenience” problem.

Having figured out a way to scrape Amazon, the authors of this report used Amazon data only, relying on what we call a sample of convenience. They justify extrapolating from Amazon to the entire bookselling market by saying that Amazon has been “chosen for being the largest book retailer in the world.”

This is analogous to me studying politics in the state of Alaska and then claiming my study provides an accurate representation of politics in the entire USA because Alaska is the biggest state. Or studying politics in China and then claiming my study explains politics all over the world because China is the most populous country. In other words: No.

Amazon is the biggest, but it’s not the only or even the majority bookseller. And it’s not a representative bookseller statistically; there is no reason to believe that Amazon provides a representative snapshot of all book sales. It probably sells the most e-books; it might sell the most titles; and it is where a lot of self-published authors sell their books. But it does not sell the most total units of (print and e-) books, which means comparisons across categories will most likely be skewed and unreliable.

If the authors’ conclusions were limited to Amazon-specific points, I would be less bothered. But they are making claims about general author behavior based on partial, skewed data. No. No. No.
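To see why the largest seller need not represent the market, here is a minimal sketch with three hypothetical retailers. Every name and number in it is invented for illustration; nothing comes from the report or from real sales data. The largest retailer is the biggest single seller without holding a majority, and its mix of self-published books differs from everyone else's.

```python
# Hypothetical retailers: (name, total_units_sold, self_published_units).
# All numbers are invented for illustration; none come from the report.
retailers = [
    ("Retailer A", 400, 160),  # the largest single seller, but not a majority
    ("Retailer B", 300, 30),
    ("Retailer C", 300, 15),
]

def self_pub_share(rows):
    """Self-published units as a fraction of all units in the given rows."""
    total = sum(units for _, units, _ in rows)
    self_pub = sum(sp for _, _, sp in rows)
    return self_pub / total

# Pick the retailer with the most units, as a convenience sample would.
largest = max(retailers, key=lambda row: row[1])
largest_only = self_pub_share([largest])
whole_market = self_pub_share(retailers)

print(f"share at {largest[0]} alone: {largest_only:.1%}")
print(f"share across all retailers: {whole_market:.1%}")
```

Under these made-up numbers, the largest retailer alone suggests a self-published share about twice as high as the share across all three sellers. The gap in the real market could be larger, smaller, or reversed; the point is only that size does not guarantee representativeness.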

The cross-section, snapshot problem.

The Author Earnings report provides one day’s worth of data, one 24-hour period of sales and rankings. We call this a cross-sectional study. Cross-sections are snapshots, which can be very useful to give you an idea of relationships between variables. But they have their own biases, and these biases can be irrelevant (good for your study) or relevant (bad for your study).

In this case, the 24-hour period comprised parts of the last Tuesday and Wednesday of January 2014. I can think of at least two relevant biases: (1) Books for a given month are frequently released on the last Monday or Tuesday of the previous month; and (2) people buy textbooks at the beginning of school semesters (January is the beginning of spring quarter/semester). This latter condition might create substitution effects (people spend their money on nonfiction books), which are not the same across all publishers and categories.

I don’t know if these biases matter, but the point is that the authors don’t even tell us they considered that picking this particular time period might raise bias issues. I mistrust studies where limitations aren’t at least discussed.

The trend problem.

Cross-sections cannot give you trends. A trend needs at least two observations of the same thing at different points in time, and a cross-section gives you only one. If a book is #1 today, that doesn’t mean it was #1 yesterday or will be #1 tomorrow. You cannot infer anything about the past or the future from a single data point in a cross-section.
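A small sketch makes this concrete. The two rank histories below are hypothetical, invented purely for illustration: one book is climbing the list, the other is sliding down it, and they happen to cross on the day the snapshot is taken.

```python
# Hypothetical daily ranks over five days (lower is better); invented numbers.
rising_book = [50, 30, 10, 5, 1]
fading_book = [1, 5, 10, 30, 50]
SNAPSHOT_DAY = 2  # index of the single day a cross-section would capture

def rank_change(ranks):
    """Last rank minus first rank; negative means the book climbed the list."""
    return ranks[-1] - ranks[0]

same_in_snapshot = rising_book[SNAPSHOT_DAY] == fading_book[SNAPSHOT_DAY]
print("identical in the snapshot:", same_in_snapshot)
print("rank change, rising book:", rank_change(rising_book))
print("rank change, fading book:", rank_change(fading_book))
```

On the snapshot day the two books look identical, yet their trajectories point in opposite directions. Only the repeated observations reveal that.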

The inference problems.

Nevertheless, the authors do try to infer from a single data point. In fact, they do a lot of inferring that is analytically indefensible. Let’s take the inferences in turn.

They infer sales numbers from rankings (because Amazon does not publicly report sales), based on their own books and information from other authors. In asking around, I have been told that the sales numbers correspond to rankings fairly well. I’m willing to believe this, but Amazon itself points out that rankings can change without sales figures changing and vice versa. Possibly this slippage between rank and sales is general enough across books that it doesn’t compromise the inferences. But it’s something to keep in mind.

They take the sales numbers for one day (an inference) and combine them with the publisher data to get gross and net sales figures for each book. They then take the author’s net revenue and multiply that number by 365 to get the author’s earnings for that book for the entire year. This is absurd.

According to this “formula,” a book that sells zero copies on January 28/29 nets the author zero dollars for the year. A book that sells 7,000 copies on those days and is published by Amazon nets the author over $4 million.

Not only is this unbelievable (99 percent of books move around the list over the year, unless they’re stuck at 0 sales); it casts doubt on every other data and inference decision the authors make. It is very bad analysis. It is horrifically bad inference.
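The arithmetic behind that projection can be sketched in a few lines. This is not the report's own code; the function name is mine, and the per-copy figure is an assumption backed out from the 7,000-copy and roughly $4 million numbers quoted above rather than taken from the spreadsheet.

```python
DAYS_PER_YEAR = 365

def annualize_one_day(daily_copies, net_per_copy):
    """Naive projection: assume every day of the year looks like the sampled day."""
    return daily_copies * net_per_copy * DAYS_PER_YEAR

# Back out the per-copy net revenue implied by "7,000 copies -> $4 million".
# This is an inferred illustration, not a figure taken from the report.
implied_net_per_copy = 4_000_000 / (7_000 * DAYS_PER_YEAR)

top_seller = annualize_one_day(7_000, implied_net_per_copy)
no_sales_that_day = annualize_one_day(0, implied_net_per_copy)

print(f"implied net per copy: ${implied_net_per_copy:.2f}")
print(f"projected annual, 7,000 copies on sample day: ${top_seller:,.0f}")
print(f"projected annual, 0 copies on sample day: ${no_sales_that_day:,.0f}")
```

Whatever the per-copy figure, the structure of the calculation is the problem: one day's count is the only input that varies, so a book's whole year is decided by where it happened to sit on a single day.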

This criticism doesn’t even take into account the difficulty of estimating author earnings without including advances, but in my estimation it’s a sufficiently disabling criticism on its own.

They infer author behavior from data on books, risking “omitted variable bias,” where the correlations are inaccurate because not everything that matters is present in the dataset. In this case, too much author behavior is simply missing and can’t be inferred in any legitimate way.

Authors make choices about editing, packaging, writing quality, and so on that affect readers’ decisions to purchase books. In addition, anecdotal evidence consistently points to the importance of a backlist for readers, and that backlist can consist of self- or publisher-published books. None of these variables is captured in these data.
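Here is a minimal sketch of how an omitted variable can manufacture a result. The rows are invented, and I have rigged them deliberately: sales depend only on the size of the author's backlist, not on the publishing route, but the self-published authors in this toy dataset mostly have bigger backlists. None of this is drawn from the report's data.

```python
from statistics import mean

# Invented rows: (is_self_published, backlist_titles, daily_sales).
# By construction, sales depend only on backlist size, not on publishing route.
rows = [
    (True, 5, 50), (True, 5, 50), (True, 5, 50), (True, 1, 10),
    (False, 1, 10), (False, 1, 10), (False, 1, 10), (False, 5, 50),
]

def sales_gap(data):
    """Mean daily sales of self-published rows minus mean of publisher rows."""
    self_pub = [sales for is_self, _, sales in data if is_self]
    publisher = [sales for is_self, _, sales in data if not is_self]
    return mean(self_pub) - mean(publisher)

# Comparison that ignores backlist size (the omitted variable).
naive_gap = sales_gap(rows)

# Same comparison, made separately within each backlist size.
gap_by_backlist = {
    size: sales_gap([r for r in rows if r[1] == size])
    for size in sorted({r[1] for r in rows})
}

print("gap ignoring backlist:", naive_gap)
print("gap within each backlist size:", gap_by_backlist)
```

Ignoring backlist, self-publishing appears to be worth 20 extra sales a day. Compare authors with the same backlist size and the gap is zero in both groups. A dataset that never recorded backlist could not tell these two stories apart.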

They often extrapolate from the sales of books with Amazon imprints without considering whether Amazon’s books have any special advantages on Amazon’s own site compared with other releases. The Thomas & Mercer Amazon imprint is the top-selling imprint in this dataset, with a lucky author on course to make $4 million, according to the report. (Or not, depending on which planet and universe you inhabit.)

It makes sense that Amazon does so well at Amazon, since the company has many ways of boosting visibility and naturally uses those techniques to sell its own books. And since the New York Times and USA Today don’t include exclusive-vendor books on their bestseller lists, we can’t see how Amazon books do in other rankings.

It’s a classic problem of comparability: Amazon doesn’t include preorder sales in its rankings, and the Times and USA Today don’t include Amazon-only books in their rankings, so you can look at apples in one and oranges in the other, but you can’t look at apples and oranges together.

In an attempt to bolster their existing data, the authors look at BookScan numbers, but because BookScan doesn’t break down data between e-book and print sales, the information revealed to us in this new report doesn’t help us situate the Amazon data in a larger context.

At best, the BookScan numbers might reveal the proportion of books sold at Amazon relative to the larger marketplace captured by BookScan (although BookScan doesn’t account for all sales). But instead, the authors use BookScan numbers to compare e-book sales to print sales—a completely different issue.

The missing information problem.

The authors provide data in an Excel spreadsheet format so that the rest of us can analyze it. I appreciate this, and I’m happy to work with flat files, although there are a lot of advantages to relational databases (e.g., MySQL and Access). But when I downloaded the files I realized that (a) this is not “raw data” and (b) important information is uncollected or removed.

In particular:

We don’t know the release date. If they had picked up the release date when the data were scraped, we would have known where the book was in its life cycle. Then we could have applied a survival model to estimate a rate of sales decay over time. The date would also have helped us identify whether the price in the cross-section is permanent or temporary. These data aren’t perfect, because publishers can change release dates (and there are different release dates for different editions, including self-published updates). But being able to use even imperfect release date information allows revenue projections to approximate something that isn’t prima facie absurd.
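As a rough illustration of why the life cycle matters, the sketch below compares the flat times-365 projection with one in which daily sales decay over time. To be clear about the assumptions: I am using a simple exponential decay with an invented 30-day half-life, not a fitted survival model, and the function names and numbers are hypothetical.

```python
def flat_projection(daily_copies, days=365):
    """Assume every day matches the observed day (the times-365 approach)."""
    return daily_copies * days

def decaying_projection(daily_copies, half_life_days, days=365):
    """Sum daily sales that halve every `half_life_days` (hypothetical decay curve)."""
    return sum(daily_copies * 0.5 ** (day / half_life_days) for day in range(days))

observed_daily = 7_000   # copies on the snapshot day, as in the example above
assumed_half_life = 30   # invented value; a real one would have to be estimated

flat = flat_projection(observed_daily)
decayed = decaying_projection(observed_daily, assumed_half_life)

print(f"flat projection: {flat:,.0f} copies")
print(f"decaying projection: {decayed:,.0f} copies")
print(f"ratio of flat to decaying: {flat / decayed:.1f}x")
```

The size of the gap depends entirely on the half-life I made up, which is exactly the point: without release dates there is no way to estimate it, so there is no way to know how far off the flat projection is.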

The “author data” sheet in the file combines all of each author’s books into one observation (one row in the spreadsheet) and labels them with one publisher category. This potentially conflates self- and publisher-published books, books across genre categories, and top-selling and lesser-selling books (case in point: the #1 book has 7,000 sales and the #1 author has two books with a combined 7,000 sales, so one of the #1 author’s books must have 0 sales).

I would like to decompose these data, but I can’t. Because the author information has been “anonymized,” as well as the title information, I can’t combine the information provided in the two sheets into one dataset.
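The loss of information from rolling books up to one row per author can be shown with a small sketch. The rows and the function name are hypothetical; the 7,000 figure echoes the example above, but the breakdowns are invented.

```python
from collections import defaultdict

def aggregate_by_author(book_rows):
    """Collapse (author, publisher_type, daily_sales) rows to one total per author."""
    totals = defaultdict(int)
    for author, _publisher_type, daily_sales in book_rows:
        totals[author] += daily_sales
    return dict(totals)

# Two invented book-level situations for the same hypothetical author.
one_hit_one_dud = [("Author 1", "self", 7000), ("Author 1", "publisher", 0)]
even_split = [("Author 1", "self", 3500), ("Author 1", "publisher", 3500)]

print(aggregate_by_author(one_hit_one_dud))
print(aggregate_by_author(even_split))
```

Both situations collapse to the same author-level row, and the publisher type of each book disappears along the way. Without a shared key linking the author sheet to the book sheet, the aggregate cannot be unpacked again.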

This is apart from the main problem, of course, which is that there is very little point to running more rigorous statistical analysis because the underlying data have essential reliability and validity problems.

Suggestions About “Our Data Suggests . . . ”

A sentence in the report has been making the rounds: “Our data suggests that even stellar manuscripts are better off self-published.”

No. That conclusion is writing a check that the data can’t cash.

As an empirical researcher who respects the limits inherent in all data collection and analysis, I strongly advise readers of the Author Earnings report to read it as they would any interesting tidbit about the publishing industry. Treat it as entertainment, not information.

And if you’re interested in using data analysis more generally, think of this as a stellar example of mistakes to avoid.

© Independent Book Publishers Association