These interactive dashboards are powered by data we collected to study the availability of books for elending to public libraries in Australia, the US, Canada, the UK and New Zealand. We included only English language books to facilitate title level comparison across countries.
The Focused Australian Study shows availability and terms of access for a sample of 546 books across the five main Australian library ebook aggregators (Overdrive, James Bennett, Wheelers, Bibliotheca and Bolinda). This case study provides an overall snapshot of the competitive playing field in a single market, including similarities and differences across aggregators on a title level.
The Focused International Study shows comparative availability and terms for those same books across a single platform in Australia, the US, Canada, the UK and New Zealand, providing new insights into international similarities and differences.
The Large-scale International Study shows comparative availability and terms for a much larger sample of almost 100,000 books across those same five jurisdictions.
For all studies, the data powering the dashboards was collected during the week of 17 July 2017.
This page describes how we gathered the data that powers the dashboards, and used it to calculate results.
The Focused Australian Study captures comprehensive availability, pricing and licence data from all five key aggregators operating in the Australian market: Overdrive, James Bennett (Axis360), Wheelers (ePlatform), Bibliotheca (cloudLibrary) and Bolinda (BorrowBox). This acts as a case study to demonstrate intra-jurisdictional differences in availability, terms and pricing - and unearthed some very surprising results.
In some cases, the contracts between aggregators and libraries included confidentiality restrictions which prevented our partners from simply providing this data directly to the research team. That meant we had to secure the agreement of all five aggregators to obtain the required data. We ultimately secured full participation on the basis that:
We constructed a sample of 546 books using various proxies for quality and demand to identify books significant to English language libraries and readers (particularly in Australia). As noted above, we were constrained in the number of titles we could include. Since the data came from different sources, that data also had to be manually ‘cleaned’ to enable accurate cross-platform comparison.
This sample was not intended to be representative of all the books in the market, but rather to include a range of titles of likely interest (see composition breakdown below). It captures a range of current books as well as older titles, and books intended for different audiences across a range of genres (though limited to the English language). We included more newer books than older ones to reflect their higher weighting in library collections.
We constructed the focused study sample from bestselling, most held, and award-winning titles as set out below (removing duplicates and inapposite titles such as colouring, sticker and multimedia books):
The complete list of sampled titles can be viewed by downloading the public dataset from the interactive dashboard.
In the study, we categorised licences according to two dominant lending models: ‘OC/OU’ or ‘Metered Access’. We use OC/OU (‘One Copy/One User’) to refer to a perpetual licence that is not restricted by either time or number of loans. We use ‘Metered Access’ to refer to licences that are restricted by time, number of loans or both. We then further break down the forms of metered access by type (eg ‘26 loans’, ‘26 loans/12 months’ etc). None of the licences included in this study permitted simultaneous loans. We also had no books that were made available on a flat ‘price per loan’ licence, an option which began to emerge in the Australian market soon after we collected this data.
Note that there may be a difference in lending model without that difference necessarily being of much significance. Lending model differences were common in the Focused Australian Study, in which different aggregators offered the sampled titles with different lending models in slightly more than 40% of cases. Sometimes we found these differences to be significant: for example, where a title was offered as OC/OU on one platform, and with a form of metered access on another. Often, however, we found the differences to be insignificant. For example, titles that were offered on 24 month licences on platforms 2 and 3 were often offered on 24 month/36 loan licences on platforms 1, 4 and 5. Since it is unlikely that a title would be circulated more than 36 times in two years anyway, this distinction likely made very little difference. You can explore these differences and make your own decisions about their significance by scrolling down to the charts containing the breakdowns of lending model and varieties of metered access offered.
Each of the five aggregators either provided us with availability, pricing and licence information for the sampled titles (current to the week of 17 July 2017), or gave us permission to access their library-facing software to collect it ourselves. Prices are for titles only and do not include separate platform/hosting fees.
We have included the data as provided to us by aggregators which, we have determined, contains sporadic errors. Cross-checking and rigorous control procedures make us confident that the error rate is small, but users should be aware of this limitation and bear it in mind when drawing conclusions from the data.
Physical availability and pricing data for those titles was gathered at first instance by our partner Yarra Plenty Public Library (YPPL) from their suppliers. Where YPPL could not obtain a physical copy of a book via that source, the research team checked availability on the Australian book selling website Booktopia (booktopia.com.au). Where a book was not available via either the library supplier or Booktopia, it was recorded as not available. Physical pricing and availability data was collected during the same week as the platform data.
Prices for physical books are listed at the recommended retail price (RRP) because confidentiality requirements prevent our partners from sharing the actual prices offered to them by their suppliers. Although higher than the prices typically paid by libraries, the RRP can be seen as a good proxy for actual cost once the additional processing costs involved in getting a book catalogued and onto shelves are taken into account.
A member of the research team did a line-by-line check to manually verify that the data provided by aggregators correctly matched each title in each jurisdiction.
Occasionally a platform offered multiple prices and/or licence types for a single title. However, comparison across jurisdictions required us to reduce to a single licence for each title per platform. To achieve this (in the small number of cases where multiple licence options were available) we had to choose a single ‘best’ licence for inclusion. We did this by comparing the ‘price per loan’ for the available options. Comparison was made more difficult by the fact that many licences were of the ‘exploding’ variety: they may last for 36 checkouts or 2 years, whichever comes first, so the cost per loan depends greatly on how many checkouts occur before the title expires. Comparison was further complicated by the fact that perpetual OC/OU licences have no limit on the number of loans. We therefore made several assumptions in order to make comparison possible between the different licences. As described below, these assumptions were made conservatively, so that the conclusions we draw are those most likely to be valid. We developed them in consultation with our partners and acknowledge that, because there is no such thing as an ‘average’ book, no assumptions are likely to be truly satisfactory. However, these are the ones we used for the purpose of choosing between multiple offered licences in those few cases where it was necessary to do so:
In those rare cases where both OC/OU and MA licences were available from a single platform or jurisdiction for the same title, we calculated the price per loan for MA licences first applying (1) and (2). Then we compared the cheapest price per loan with the price of OC/OU based on the rule in (3) to calculate the single ‘best’ licence.
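The selection procedure above can be sketched in code. Note that the study's actual assumptions (1)–(3) are not reproduced in this excerpt, so the circulation-rate and lifetime-loans figures below (`loans_per_month`, `ocou_assumed_loans`) are purely hypothetical placeholders, not the study's values:

```python
def expected_loans(loan_cap=None, months=None, loans_per_month=1.5):
    """Estimate loans before an 'exploding' licence expires: the licence
    ends at its loan cap or its time limit, whichever comes first.
    loans_per_month is a hypothetical circulation-rate assumption."""
    if loan_cap is None and months is None:
        raise ValueError("a metered licence must be limited by loans, time, or both")
    by_time = months * loans_per_month if months is not None else float("inf")
    by_cap = loan_cap if loan_cap is not None else float("inf")
    return min(by_cap, by_time)

def price_per_loan(price, loan_cap=None, months=None):
    """Price per loan for a metered access licence."""
    return price / expected_loans(loan_cap, months)

def best_licence(options, ocou_assumed_loans=52):
    """Pick the option with the lowest price per loan.

    options: list of dicts with 'price'; metered options carry 'loan_cap'
    and/or 'months', OC/OU options carry 'perpetual': True.
    ocou_assumed_loans is a hypothetical lifetime-loans assumption used to
    give a perpetual licence a comparable per-loan cost."""
    def ppl(opt):
        if opt.get("perpetual"):
            return opt["price"] / ocou_assumed_loans
        return price_per_loan(opt["price"], opt.get("loan_cap"), opt.get("months"))
    return min(options, key=ppl)
```

For instance, under these placeholder assumptions a $30 licence metered at 26 loans beats an $80 perpetual OC/OU licence on price per loan ($1.15 versus $1.54).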
We were interested in identifying the e-book publishers for each title in order to detect patterns in how different publishers (and publisher types) manage e-lending.
We obtained the Australian publisher data from a single aggregator for all of the sampled titles they had available. For the remaining titles, a member of the research team manually collected the details of the e-book publisher from the information listed on Amazon’s Australian site. Where we could not identify a local e-book publisher for a title we marked the publisher as ‘Unknown’.
We have sought to group publishers/imprints with their parent companies to help illuminate broader differences in licensing and pricing strategies. The groupings are intended to reflect the publishing landscape as of 17 July 2017, ie the week the data was collected.
We have taken all possible care to be accurate in these groupings, but there may still be errors (especially given the ever-changing publishing landscape). There may also be differences of opinion as to whether a publisher should be included in a parent group. For example, it may be that a parent company owns a significant stake in a publishing house, but not the whole. Users may download the dataset from the relevant Dashboard and re-analyse with different groupings should they wish to do so.
We were interested in understanding whether and how book prices varied between platforms. To determine price difference, we looked at whether the price was the same across multiple platforms for each title. Where there was a difference, we looked at its size. To show the results, we created a dashboard chart with price difference categories: “0% -> 1%”, “2% -> 5%”, “5% -> 10%”, “10% -> 20%”, “20% -> 30%”, “30% -> 40%”, “40% -> 50%” and “more than 50%”. The image below shows one of the charts depicting such price differences.
We calculated the size of the price difference by looking at the percentage of variation between the mean price and the price furthest from the mean. The following shows in more detail how we transformed a list of prices for a title into the associated price difference percentage category.
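The transformation can be sketched as follows. The category labels are taken from the dashboard chart described above; the exact bucket boundaries are inferred from those labels and are therefore an assumption:

```python
def price_difference_category(prices):
    """Categorise the spread of a title's prices across platforms as the
    percentage deviation between the mean price and the price furthest
    from the mean."""
    mean = sum(prices) / len(prices)
    if mean == 0:
        return "0% -> 1%"
    # Largest absolute deviation from the mean, as a percentage of the mean.
    diff = max(abs(p - mean) for p in prices) / mean * 100
    buckets = [(1, "0% -> 1%"), (5, "2% -> 5%"), (10, "5% -> 10%"),
               (20, "10% -> 20%"), (30, "20% -> 30%"), (40, "30% -> 40%"),
               (50, "40% -> 50%")]
    for upper, label in buckets:
        if diff <= upper:
            return label
    return "more than 50%"
```

For example, a title priced at $10 on one platform and $12 on another has a mean of $11 and a largest deviation of $1, ie roughly 9%, landing in the "5% -> 10%" category.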
We were interested in exploring how the age of books interacted with availability, terms and pricing. The research team collected the initial year of publication manually, using Goodreads data at first instance, and then cross-referencing with other sources where indicated.
Our second investigation, the Focused International Study, looks at the comparative availability and terms of access from a single aggregator across five territories: Australia, NZ, the UK, the US and Canada. Unlike the Focused Australian Study, it does not provide comprehensive availability data in any jurisdiction (since a title that is not available from that single aggregator could potentially be available from another one that is not reflected in this data). Instead, its primary use is to uncover international differences in availability, licensing and pricing for the sampled titles to contrast with the intra-jurisdictional differences identified in the Focused Australian Study.
In many respects the methods for the Focused International Study replicated those used for the Focused Australian Study. We sum up those similarities here, before describing the differences below.
The most significant difference between the Focused Australian Study and the Focused International Study is in how the availability and pricing data was collected.
The Focused Australian Study sometimes had to rely on data provided by aggregators and, as noted above, we became aware of some sporadic errors in that data. By contrast, the data for the Focused International Study was collected by members of the research team using the aggregator’s library interface, and cross-checked using various quality control procedures. This method eliminated the possibility of aggregator data entry errors, though there is still a possibility of minor inaccuracies in the event that a research team member missed detecting a title during a search (which could happen, for example, if both the author name and title name had incorrect spellings).
Different books can have different copyright holders in different territories. We identified US, UK and Canadian e-book publishers by using the same method as for identifying the Australian e-book publisher, ie using the local aggregator data where it was available, and otherwise data from the relevant local Amazon site. NZ does not have a local Amazon site and we were unable to reliably collect the data from an alternative source. Therefore, we marked the publisher as ‘unknown’ for all books which were unavailable from the aggregator in NZ.
The sheer scale of this study (almost 100,000 titles) meant that we could not use any of the 'manual' processes we had relied upon for the Focused Studies. Instead, as explained below, we developed a number of innovative methods to automate those tasks. However, there were still some methodological intersections with the focused studies. We sum up those similarities below, and then describe the differences in more detail.
Before describing in detail the major differences in how this study was put together, here are the minor ones:
We constructed a database containing all ebook checkouts from Western Australia, South Australia, Tasmania and the Australian Capital Territory and used it to identify the most borrowed authors of ebooks in Australia. The sample then consisted of all titles by these authors available in any of the five jurisdictions. The nature of the sampling methodology means that books by homonymous authors (ie two people with the same name spelt exactly the same way) will also appear.
To compare titles, we have to link the records to ensure the right books match up across platforms and jurisdictions. We did this manually for the two Focused Studies, but that wasn't feasible for the sample of almost 100,000 titles. Instead, we developed an algorithm that applied a detailed set of rules to match records by title and author. The full rules are available in the source code published on Github, but some of the main ones for titles included ignoring case (eg ‘a’ was treated as identical to ‘A’), equalising US and UK spellings, and removing special characters such as brackets, slashes, dots and hyphens. For titles that weren’t matched after that, we looked for misplaced subtitles and removed stop words (‘the’, ‘a’, ‘an’). To be sure that titles really were identical, we required all listed authors to match in each jurisdiction (which means that if two authors were listed in one record because one wrote a foreword, it would not be matched to another record unless it had the same two authors, even if the title was the same). Where there was no match, we also reduced author names to their first initial and last name and looked to match that name and title pairing in other jurisdictions.
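A simplified sketch of these matching rules follows. The full rule set is the one published on Github; the spelling map below is a tiny illustrative sample rather than the study's actual table:

```python
import re

# Illustrative fragments only; the study's full normalisation tables are
# in the published source code.
UK_US_SPELLINGS = {"colour": "color", "favourite": "favorite", "organise": "organize"}
STOP_WORDS = {"the", "a", "an"}

def normalise_title(title, strip_stop_words=False):
    """Normalise a title: lowercase, strip special characters, equalise
    UK/US spellings, and (in the fallback pass) remove stop words."""
    title = title.lower()
    title = re.sub(r"[()\[\]/.\-]", " ", title)  # brackets, slashes, dots, hyphens
    words = [UK_US_SPELLINGS.get(w, w) for w in title.split()]
    if strip_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return " ".join(words)

def authors_match(authors_a, authors_b):
    """All listed authors must match; as a fallback, compare
    first-initial + last-name forms."""
    def key(name, initial_only=False):
        parts = name.lower().split()
        first = parts[0][0] if initial_only else parts[0]
        return (first, parts[-1])
    if {key(a) for a in authors_a} == {key(b) for b in authors_b}:
        return True
    return ({key(a, True) for a in authors_a} ==
            {key(b, True) for b in authors_b})
```

Under these rules ‘The Colour (of Magic)’ normalises to ‘the color of magic’, and ‘Terry Pratchett’ matches ‘T. Pratchett’ via the first-initial fallback, while a record listing two authors never matches one listing only one of them.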
To assess the accuracy of the linking process we randomly selected 25 records. A record is identified by its title, its authors, and jurisdiction. We then tasked an independent researcher with linking those records to the same title in other countries, or indicating no link where it was not available in a country. That resulted in 100 accurate, human-constructed links. We then checked those links against the algorithm-linked database, assessing which links had been correctly identified (true positives), which were correctly identified as missing (true negatives), which were found where they should not have been (false positives) and which were not found but should have been (false negatives). Out of the 100 tested links, 95 were identified as present by our researcher and 5 as absent. Our algorithm reached identical results: 0 errors out of a possible 100 (ie no false negatives or false positives, and the same true positives and true negatives). While this does not definitively show that the algorithm made no matching errors on the full dataset of almost 100,000 books, it bounds its estimated error rate below 0.5%, calculated using Laplace smoothing with α = 0.5. This gives high confidence in the validity of the results we present.
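The sub-0.5% bound follows directly from the smoothed estimate: with Laplace smoothing, zero observed errors in 100 trials gives an estimated error rate of (0 + α) / (100 + 2α):

```python
def laplace_error_rate(errors, trials, alpha=0.5):
    """Laplace-smoothed error-rate estimate: (errors + alpha) / (trials + 2*alpha)."""
    return (errors + alpha) / (trials + 2 * alpha)

# 0 observed errors across the 100 checked links, with alpha = 0.5:
rate = laplace_error_rate(0, 100)  # 0.5 / 101, ie roughly 0.495%
```

This is what places the estimated error rate just under the 0.5% figure quoted above.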
The original publication year for each title was important because we wanted to understand the interrelationships between books’ age and availability, terms and price. That data was not included in the data we extracted from the aggregator, and given our sample included almost 100,000 titles, it was not feasible to manually collect it. Accordingly, we developed a method to estimate it from available sources. This was a challenging exercise not only because of the sheer number of books in existence, but because identical titles often refer to multiple different works, and the same work can have been repeatedly re-published over many different years.
Our automatic estimation method utilises the Goodreads Application Programming Interface (API). Our first request is by book title and first author name. Where the Goodreads API returned multiple responses, we computed a similarity score based on the Levenshtein distance: the arithmetic average of an author similarity score and a book similarity score. The author similarity score was first computed between the first author name of the query and the first author name of each result; if it was lower than 0.7, we instead took the maximum similarity score between all the available author names of the query (where there were several) and the first author name of each result. The book similarity score was computed between the book title of the query and the title of each result. We then extracted the publication year of the result with the highest average similarity score for which Goodreads gave a publication date.
Where the first Goodreads request on book title and first author name gave no results, we performed a second request by book title only. In that case the similarity score corresponds to the author similarity score. Where this second request also gave no result, we performed a third request by the name of the first author only. The similarity score was then again equal to the author similarity score.
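The scoring described above can be sketched as follows. The text does not specify how the raw Levenshtein distance was normalised into a [0, 1] similarity, so the length-normalised form below is an assumption:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Similarity in [0, 1] derived from Levenshtein distance
    (normalisation by the longer string is an assumption)."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def combined_score(query_title, query_authors, result_title, result_author):
    """Average of author and title similarity, with the fallback described
    above: if the first-author score is below 0.7, take the best score
    across all of the query's author names."""
    author = similarity(query_authors[0], result_author)
    if author < 0.7:
        author = max(similarity(q, result_author) for q in query_authors)
    title = similarity(query_title, result_title)
    return (author + title) / 2
```

The result with the highest combined score (and a publication date) then supplies the estimated publication year, subject to the 0.5 confidence threshold described below.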
Where the similarity score was less than 0.5, we had insufficient confidence in the result and declined to estimate a publication year. Ultimately, we declined to estimate a publication year in 8% of cases. The non-estimated titles are charted as having an estimated publication year of 2020 (designating that we declined to estimate their initial year of publication). In addition to cases where the similarity score was too low, we also declined to make an estimate in two other main cases: where there was only one request result, but it did not contain a publication date, and where the title could not be matched at all via the Goodreads API. In all analyses involving estimated publication year, we have excluded the non-estimated books from the sample.
We evaluated and tested this method on a random subsample of 100 books for which a researcher independently and manually assessed the original publication year. For this subsample, our algorithm declined to estimate a year for 6 titles, and achieved a perfect estimation for 73. Overall, our accuracy rate was 77.7%, and estimations were within ± 5 years in 91.5% of cases. The scatter plot below depicts the year predicted by our estimation method on the horizontal axis versus the reference publication date on the vertical axis. As shown, where the publication year was estimated incorrectly, it was almost always underestimated. We found the method did not make robust estimations for titles that were subsequent editions of previously issued books with additional authors added. For example, our sample included the graphic novel edition of The Mancini Marriage Bargain, co-authored by Trish Morey and Ayumu Aso and published in 2015. Because our method primarily relies on the first author’s name, it estimated a publication date of 2005, when the original novel (authored by Trish Morey alone) was published. Nonetheless, we obtained better results with this method than when we attempted to match all author names (which introduced fresh difficulties).
If we update the datasets or visualisations in response to feedback, we'll list the updates here. There are no updates yet recorded.
This work is part of an Australian Research Council Linkage Project (LP160100387) led by Associate Professor Rebecca Giblin (University of Melbourne). The other Chief Investigators are Professor Kimberlee Weatherall (University of Sydney), Professor Julian Thomas (RMIT) and Dr François Petitjean (Monash University). Our research team has also included postdoctoral fellows Dr Jenny Kennedy (RMIT) and Dr Charlotte Pelletier (Monash University), Master’s student in Data Science Woratana Ngarmtrakulchol, and research assistants Dan Gilbert, Jacob Flynn and Emily van der Nagel.
The Linkage Project is supported by formal partnerships with:
Further invaluable international cooperation has been contributed by:
We are grateful to all our partners for their contributions of expertise, time and other resources, and to aggregators Overdrive, Baker & Taylor, Bibliotheca, Wheelers and Bolinda for their cooperation and support.
These interactive dashboards were produced by Woratana Ngarmtrakulchol (aka 'Perth'). The tools used were D3.js, Crossfilter.js, DC.js, Bootstrap, and FontAwesome. The dashboard project was completed in 2019.
Datasets and dashboards credit: Rebecca Giblin; Woratana Ngarmtrakulchol; Jenny Kennedy, Kimberlee Weatherall.