Building a Product Analytics Pipeline from Web-Scraped Book Data


Introduction

Week 13 of the DataraFlow programme marks another important milestone after two weeks of intense work on group presentations and pitching our ideas to a panel of investors.

This week, however, we have moved up! We’re scraping a webpage using Python and a few essential libraries. In this article, I’ll walk you through how I scraped bookstore data from Books to Scrape, explored the dataset, and performed some basic exploratory analysis to extract meaningful insights.

The objective of the Week 13 take-home assessment was to demonstrate our ability to:

  • Scrape data from the web using Python

  • Convert HTML content into a structured dataset

  • Perform basic exploratory data analysis

  • Communicate insights clearly

This project was a particularly valuable learning opportunity for me because it exposed me to one of the many forms data can arrive in and showed that each form needs its own handling. Let’s get right into it.

2.0. Data Collection Methodology

As I stated earlier, I chose to scrape data from a bookstore site. If you’re asking why, it’s because I love reading books and thought that would be an interesting place to begin. After settling on the site, I inspected it to understand the structure of the page, since that structure largely determines how the scraper should be written.

2.1. HTML Structure of the Scraped Webpage

The target website (books.toscrape.com) has a consistent and well-structured HTML layout, making it suitable for automated data extraction. Each book on a catalogue page is contained within an:

<article class="product_pod">

This <article> element acts as the primary container for all book-related information. Inside each product_pod, the relevant data points are located in predictable sub-elements:

  • Book title is stored as an attribute (title) of the <a> tag inside an <h3> element:

      <h3><a title="Book Title"></a></h3>
    
  • Price is found in a paragraph tag with class price_color.

  • Availability status appears in a <p> tag with classes instock availability, with the actual text nested inside.

  • Rating is encoded as a CSS class (e.g., star-rating Three), rather than as visible text.

  • The book detail link is a relative URL inside the <a> tag, requiring concatenation with a base URL.

  • The thumbnail image is located within an <img> tag whose src attribute points to a relative image path.

This hierarchical and class-based structure allows the scraper to locate and extract information reliably using BeautifulSoup’s find() and find_all() methods without relying on fragile positional indexing.
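As a rough illustration of the structure described above, the sketch below parses one hand-written product_pod fragment with BeautifulSoup. The sample HTML, the parse_book helper, and the dictionary keys are assumptions made for this example; the real catalogue pages carry more markup, and the original scraper may be organised differently.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics one catalogue entry (illustrative only).
SAMPLE_HTML = """
<article class="product_pod">
  <div class="image_container">
    <a href="a-sample-book_1/index.html">
      <img src="../media/cache/sample.jpg" alt="A Sample Book" class="thumbnail">
    </a>
  </div>
  <p class="star-rating Three"></p>
  <h3><a href="a-sample-book_1/index.html" title="A Sample Book">A Sample ...</a></h3>
  <div class="product_price">
    <p class="price_color">£51.77</p>
    <p class="instock availability">
      In stock
    </p>
  </div>
</article>
"""


def parse_book(pod):
    """Extract the raw fields from a single product_pod element."""
    link = pod.find("h3").find("a")
    rating_tag = pod.find("p", class_="star-rating")
    # The rating word is the class that is not "star-rating".
    rating_word = [c for c in rating_tag["class"] if c != "star-rating"][0]
    return {
        "title": link["title"],  # full title, not the truncated link text
        "price": pod.find("p", class_="price_color").get_text(strip=True),
        "availability": pod.find("p", class_="availability").get_text(strip=True),
        "rating": rating_word,
        "link": link["href"],  # still relative at this point
        "thumbnail": pod.find("img")["src"],  # still relative at this point
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
books = [parse_book(pod) for pod in soup.find_all("article", class_="product_pod")]
```

Each field is looked up by tag and class rather than by position, so a reordering of the elements inside the container should not break the extraction.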

2.2. Data Gathered

The scraping process collects both textual and media-related data for each book. Specifically, the following fields are extracted:

  • Title: The full book title, captured from the HTML attribute to avoid truncation.

  • Price: The listed price as displayed on the catalogue page.

  • Availability: Stock status (e.g., “In stock”).

  • Rating: Converted from a textual CSS class (e.g., “Three”) into a numerical scale (1–5) for easier analysis.

  • Product Link: A fully qualified URL pointing to the book’s detail page.

  • Thumbnail URL: The original image source link.

  • Thumbnail File Name: A locally saved image file, with sanitized filenames to prevent filesystem errors.

The data from multiple pages is aggregated into a single structured dataset and exported as a CSV file, making it suitable for downstream analysis, visualization, or machine learning tasks. Additionally, book cover images are downloaded and stored locally, enabling use cases such as visual exploration, image-based analysis, or dataset enrichment.

2.3. Web Scraping Strategy and Justification

a. Pagination Handling

The website uses page-based pagination, with URLs following a clear and predictable pattern:

https://books.toscrape.com/catalogue/page-{n}.html

This allowed the scraper to iterate programmatically through multiple pages using a simple loop, ensuring scalability while maintaining control over the scraping scope (limited to the first 10 pages for efficiency and politeness).
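The loop could look something like the sketch below. The function names, the fetch parameter, and the one-second pause are assumptions for illustration, not the exact code used in the project.

```python
import time

BASE_PAGE_URL = "https://books.toscrape.com/catalogue/page-{n}.html"


def page_urls(max_pages=10):
    """Yield catalogue page URLs for pages 1..max_pages."""
    for n in range(1, max_pages + 1):
        yield BASE_PAGE_URL.format(n=n)


def crawl(fetch, max_pages=10, delay_seconds=1.0):
    """Fetch each page in turn, pausing between requests to stay polite.

    `fetch` is any callable that takes a URL and returns the page HTML.
    """
    pages = []
    for url in page_urls(max_pages):
        pages.append(fetch(url))
        time.sleep(delay_seconds)  # assumed delay; tune to the target site
    return pages
```

With Requests, fetch could be as simple as `lambda url: requests.get(url, timeout=10).text`. Passing the fetcher in as an argument also makes the loop easy to test without touching the network.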

b. Choice of Libraries

  • Requests was used for HTTP requests due to its simplicity and reliability for static web pages.

  • BeautifulSoup was chosen for HTML parsing because the site’s structure is static and does not require JavaScript rendering.

  • Pandas was used to organize the scraped data into a tabular format and export it cleanly to CSV.

  • OS utilities ensured proper directory management for structured data and image storage.

This combination provides a lightweight, readable, and maintainable solution without unnecessary complexity.

c. Robustness and Data Quality

  • Ratings were converted from text labels to numeric values to support quantitative analysis.

  • Relative URLs were normalized into absolute URLs to prevent broken links.

  • Image filenames were sanitized to avoid invalid characters and ensure cross-platform compatibility.
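The three safeguards above can be sketched with the standard library alone. The helper names, the base URL constant, and the choice of allowed filename characters are assumptions made for this example.

```python
import re
from urllib.parse import urljoin

# Assumed base: the catalogue pages live under this path.
CATALOGUE_BASE = "https://books.toscrape.com/catalogue/"
RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(word):
    """Map a star-rating class word to 1-5; returns None for unknown labels."""
    return RATING_MAP.get(word)


def absolute_url(relative, base=CATALOGUE_BASE):
    """Resolve a relative href or src against the page it came from."""
    return urljoin(base, relative)


def safe_filename(title, extension=".jpg", max_length=100):
    """Build a filename from a title using only letters, digits, _ and -."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", title).strip("_")
    return (cleaned[:max_length] or "untitled") + extension
```

Using urljoin rather than plain string concatenation means paths that start with `../` resolve correctly without special handling.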

d. Ethical and Practical Considerations

The script limits the number of pages scraped, avoids excessive request frequency, and targets a publicly available educational website designed specifically for scraping practice. This ensures the approach is both ethical and responsible.

3.0. Data Cleaning and Normalization

Following data collection, a series of cleaning and normalization steps was applied to improve consistency, usability, and analytical readiness of the dataset.

3.1. Column Name Standardization

All column names were converted to lowercase, and spaces were replaced with underscores. This enforces a consistent naming convention that improves readability, reduces the risk of case-sensitive errors, and aligns with common data science and database best practices. Standardized column names also simplify downstream operations such as filtering, aggregation, and feature engineering.
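In pandas this is a one-line transformation. The column names below are hypothetical stand-ins for whatever the scraper produced.

```python
import pandas as pd

# Column names as they might come out of the scraper (hypothetical).
df = pd.DataFrame(columns=["Title", "Price", "Product Link", "Thumbnail URL"])

# Trim stray whitespace, lowercase, and swap spaces for underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
```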

3.2. Price Cleaning and Type Conversion

The price field was originally extracted as a string containing currency symbols and encoding artifacts. These non-numeric characters were removed to isolate the numeric component of the price. After cleaning, the column was explicitly cast to a floating-point data type.

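# Assumes df is the pandas DataFrame built from the scraped rows.
df["price"] = (
    df["price"]
    .str.replace("£", "", regex=False)
    .str.replace("Â", "", regex=False)  # handles encoding artifacts
    .astype(float)
)
# Ratings were already mapped to 1-5 during scraping; enforce an integer dtype.
df["rating"] = df["rating"].astype(int)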

This step is critical because numerical price values are required for statistical analysis, comparisons, and derived metrics such as averages or value-for-money scores. Addressing encoding artifacts early also improves dataset robustness and portability across systems.
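For context, the stray "Â" is what appears when UTF-8 bytes for "£" are decoded with a single-byte codec such as Latin-1. The sketch below reproduces the artifact and shows an alternative, regex-based way to pull out the number. The parse_price helper is hypothetical and is not the method used above.

```python
import re

# UTF-8 bytes for "£51.77" decoded with the wrong codec produce the stray "Â".
raw_bytes = "£51.77".encode("utf-8")
garbled = raw_bytes.decode("latin-1")  # wrong codec
correct = raw_bytes.decode("utf-8")    # right codec


def parse_price(text):
    """Pull the first decimal number out of a price string as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no numeric price found in {text!r}")
    return float(match.group())
```

When fetching with Requests, setting `response.encoding = "utf-8"` before reading `response.text` will usually prevent the artifact at the source, assuming the page really is served as UTF-8.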

3.3. Rating Normalization

Book ratings were converted to integer values, ensuring they could be treated as an ordinal numerical variable. This enables straightforward computation of distributions, summary statistics, and correlation analysis with other numerical features such as price.

3.4. Stock Availability Encoding

A new binary variable (in_stock) was created from the availability text field. By converting a descriptive string into a Boolean indicator, stock status becomes easier to filter, count, and model. This transformation also preserves the original availability text while providing a machine-friendly representation suitable for analytical and predictive tasks.
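One way to derive the flag is sketched below. The sample rows, including the "Out of stock" value, are made up for illustration, since every scraped book was in stock.

```python
import pandas as pd

# Made-up availability strings, including whitespace as scraped text often has.
df = pd.DataFrame({"availability": ["In stock", "  In stock\n", "Out of stock"]})

# Keep the original text and add a Boolean flag alongside it.
df["in_stock"] = (
    df["availability"].str.strip().str.lower().str.startswith("in stock")
)
```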

The cleaning strategy prioritizes minimal transformation with analytical value. Rather than discarding raw fields, the approach standardizes formats, enforces appropriate data types, and derives compact features that improve interpretability. These steps ensure the dataset is both human-readable and optimized for exploratory analysis, visualization, and potential machine learning applications.

4.0. Exploratory Data Analysis (EDA)

This section presents insights derived from exploratory analysis of the scraped book catalog across five strategic dimensions:

  • Pricing Distribution

  • Product Quality

  • Pricing Strategy

  • Inventory Management

  • Customer Value Scoring

The objective of this analysis is to inform business decision-making, enhance customer experience, and support data-driven product and pricing strategies.

4.1. Pricing Distribution

We analyzed the distribution of book prices to understand price range, concentration, and the prevalence of premium offerings. This provides insight into which price segments dominate the catalog and whether pricing patterns align with perceived quality (ratings).

Findings:

  • Book prices range from approximately £10 to £60, indicating a broad and diverse pricing landscape.

  • The majority of titles are concentrated in the mid-range (£20–£45), representing the core market segment and likely accounting for the largest share of customer purchases.

  • Smaller clusters at lower (£15–£25) and higher (£50–£55) price points reflect the presence of budget and premium offerings, though these segments are comparatively limited.

  • Premium books are relatively rare, with few titles priced above £50, suggesting that high-priced offerings occupy a niche position in the catalog.

  • Price–rating analysis shows no consistent relationship between price and quality; highly rated books are distributed across all price ranges. This indicates that pricing is likely influenced by other factors such as format, subject area, or publisher strategy.
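Price bands like the ones discussed above can be produced with pandas. The sketch uses made-up prices, and the band edges are assumptions loosely based on the ranges quoted in the findings.

```python
import pandas as pd

# Made-up prices, for illustration only.
prices = pd.Series([12.5, 22.0, 35.4, 44.9, 51.8, 58.1], name="price")

# Band edges are assumptions; intervals are closed on the right by default.
bands = pd.cut(
    prices,
    bins=[0, 20, 45, 100],
    labels=["budget", "mid-range", "premium"],
)
band_counts = bands.value_counts().sort_index()
```

A histogram of the raw prices gives the same picture visually, while the banded counts are easier to quote in a report.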

Business Implications & Next Steps:

  • Prioritize marketing and promotions in the mid-range segment to maximize reach and revenue potential.

  • Monitor premium offerings for opportunities in targeted promotions, bundling strategies, or curated collections.

  • Conduct deeper analysis of price drivers (e.g., genre, format, publisher) to refine pricing strategy and improve alignment with customer expectations.

4.2. Product Quality and Ratings

We analyzed the distribution of book ratings (1–5 scale) to assess the overall quality profile of the book catalog and identify the proportion of high-performing titles. Ratings serve as a proxy for customer satisfaction and perceived value.

Findings:

  • Books rated 1 star form the largest single group, representing nearly 50 titles.

  • Titles rated 2, 4, and 5 stars each cluster between approximately 35–40 books, while 3-star titles are slightly less frequent.

  • 37.5% of the catalog consists of high-rated books (4–5 stars), indicating that just over one-third of offerings are strongly appreciated by readers.

  • The presence of a significant number of low-rated books suggests high variability in product quality, potential content inconsistencies, or subjective bias in user reviews.
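The counts per star level and the share of high-rated titles can be computed as in this sketch, which uses made-up ratings rather than the scraped data.

```python
import pandas as pd

# Made-up ratings, for illustration only.
ratings = pd.Series([1, 1, 1, 2, 2, 3, 4, 4, 5, 5], name="rating")

counts = ratings.value_counts().sort_index()    # titles per star level
high_rated_share = (ratings >= 4).mean() * 100  # percent with 4-5 stars
```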

Business Implications:

  • Prioritize high-rated titles in promotions and recommendations to improve customer satisfaction and trust.

  • Review lower-rated books for potential curation, quality improvement, repositioning, or replacement.

  • Integrate rating-based prioritization into recommendation engines to enhance user engagement and perceived catalog quality.

4.3. Pricing Strategy

To evaluate whether customers pay more for higher-quality books, we measured the statistical correlation between book price and rating.

Findings:

  • The correlation coefficient between price and rating is 0.017, effectively zero.

  • This indicates that higher-rated books do not consistently command higher prices.

  • Prices are broadly distributed across all rating levels, suggesting that pricing decisions are driven by non-quality factors, such as format, publisher pricing models, genre classification, or production costs.
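A correlation of this kind can be computed as sketched below, using made-up values. The article does not say which coefficient was used; Pearson is the pandas default, and because ratings are ordinal, a rank-based (Spearman) coefficient is shown as a possible cross-check.

```python
import pandas as pd

# Made-up values, for illustration only.
df = pd.DataFrame(
    {
        "price": [51.77, 53.74, 50.10, 47.82, 54.23, 22.65],
        "rating": [3, 1, 1, 4, 5, 1],
    }
)

# Pearson: strength of the linear association.
pearson_r = df["price"].corr(df["rating"])

# Spearman: Pearson computed on ranks, which suits ordinal ratings.
spearman_r = df["price"].rank().corr(df["rating"].rank())
```

If both coefficients stay close to zero on the real data, that strengthens the reading that price and rating are unrelated in this catalogue.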

Business Implications:

  • Pricing strategy should not assume rating-based price premiums.

  • There is strong potential to market high-quality, lower-priced books as high-value offerings.

  • Further analysis should explore non-rating price determinants to support optimized pricing and segmentation strategies.

4.4. Inventory Management

We assessed stock availability to understand catalog coverage and potential supply constraints.

Findings:

  • Based on available data, 100% of books are reported as in stock.

  • While this suggests excellent inventory coverage, the result requires careful interpretation. It is possible that out-of-stock books are not listed on the website, which would produce the same outcome while implying a different operational reality.

Business Implications:

  • Conduct validation checks to confirm the reliability of stock availability data.

  • If confirmed, maintain strong stock levels for high-demand and high-value titles to avoid lost sales opportunities.

  • Implement longitudinal stock monitoring to detect emerging supply constraints and demand pressures early.

4.5. Customer Value Scoring

Customer Value Scoring identifies books that deliver the strongest balance between quality and price. A higher score represents stronger “value for money” from a customer perspective.

Findings:

  • The metric identified a top-10 group of high-value books that combine strong ratings with moderate pricing.

  • Notable examples include:

    • The Third Wave

    • Princess Between Worlds

    • Princess Jellyfish 2-in-1 Omnibus

These titles represent high-impact value offerings with strong appeal to value-conscious customers.
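The article does not spell out the scoring formula, so the sketch below uses one plausible definition, rating divided by price, purely as an assumption. The rows are made up as well.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame(
    {
        "title": ["Book A", "Book B", "Book C"],
        "price": [10.0, 40.0, 20.0],
        "rating": [5, 5, 2],
    }
)

# Hypothetical score: rating points per pound; higher means better value.
df["value_score"] = df["rating"] / df["price"]
top_value = df.sort_values("value_score", ascending=False).head(10)
```

A ratio like this rewards cheap books heavily, so a weighted or normalised variant may be worth considering if the ranking looks dominated by the lowest prices.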

Business Implications & Next Steps:

  • Curate a “Best Deals” or “Best Value” section within the platform.

  • Integrate the Customer Value Score into recommendation engines to prioritize high-value suggestions.

  • Design targeted promotions around high-value books to increase engagement, conversion, and customer satisfaction.

4.6. General and Strategic Recommendations

The EDA reveals several cross-cutting insights:

  • Pricing Distribution: The catalog is dominated by mid-range titles (£20–£45), with premium books forming a small niche segment. Pricing does not correlate strongly with customer ratings.

  • Product Quality: Approximately 37.5% of books are high-rated, while a significant portion of the catalog shows lower ratings, indicating quality variability.

  • Pricing Strategy: Price and rating are essentially uncorrelated (correlation coefficient of 0.017), indicating that higher quality does not systematically command higher prices.

  • Inventory: All books appear in stock, but this finding requires validation to ensure operational accuracy.

  • Customer Value: Value scoring highlights clear opportunities for promotions, curated collections, and recommendation optimization.

4.7. Recommendations:

  1. Marketing & Promotions
    Focus on mid-range and high-value books to maximize sales and customer satisfaction. Highlight high-rated, high-value, and under-promoted titles in curated sections and recommendation engines.

  2. Catalog Curation
    Systematically review low-rated books for improvement, repositioning, or removal to improve overall catalog quality.

  3. Inventory Verification
    Validate stock data accuracy and implement monitoring systems for high-demand titles.

  4. Data Enrichment
    Enhance the dataset with:

  • Genre/category classification

  • Publisher and author metadata

  • Sales volume or popularity indicators

  • Format (paperback, hardcover, eBook)

  • Publication date

  5. Advanced Analytics
    Adopt more sophisticated approaches such as review text analysis, demand forecasting, dynamic pricing models, and recommendation optimization to support long-term growth and competitive positioning.

5.0. Limitations and Technical Considerations

While the scraping and exploratory analysis provide useful insights into the structure and characteristics of the book catalog, several limitations should be noted when interpreting the results.

5.1. Data Source and Coverage Limitations

The dataset is derived exclusively from the publicly available catalogue pages of the website. As a result, the analysis is limited to books that are actively listed and visible at the time of scraping. Products that are temporarily unavailable, discontinued, or excluded from the catalogue are not captured, which may bias conclusions related to inventory coverage and product diversity.

Additionally, the scraping scope was intentionally restricted to the first ten pages of the catalogue. While sufficient for exploratory analysis, this subset may not fully represent the complete inventory or long-tail characteristics of the full dataset.

5.2. Inventory and Availability Representation

All scraped items were reported as "in stock". This may reflect a presentation-layer design choice rather than true inventory status. If out-of-stock items are removed from public listings, the binary stock indicator will overestimate availability. Consequently, inventory-related insights should be interpreted cautiously and validated against internal inventory systems where possible.

5.3. Rating Quality and Bias

Book ratings are treated as objective indicators of product quality; however, they may be influenced by user review bias, small sample sizes, or platform-specific rating behaviors. The analysis does not account for the number of reviews per book, which limits the ability to distinguish between consistently high-performing titles and those with inflated ratings based on limited feedback.

5.4. Feature Sparsity and Omitted Variables

The dataset lacks several attributes that are known to influence pricing and customer behavior, such as genre, author reputation, publication date, format, and sales volume. The absence of these variables limits the explanatory power of correlation and value-based analyses. For example, the observed lack of correlation between price and rating may be partially explained by unobserved confounding factors.

5.5. Static Snapshot and Temporal Effects

The data represents a single snapshot in time and does not capture temporal dynamics such as price changes, stock fluctuations, or evolving customer preferences. As a result, the analysis cannot support trend detection, seasonality analysis, or demand forecasting without repeated data collection over time.

5.6. Scraping and Technical Constraints

The scraper relies on the current HTML structure and class naming conventions of the website. Any future changes to page layout, class names, or URL patterns would require updates to the extraction logic. While the approach is robust for static pages, it is not designed to handle dynamically rendered content or JavaScript-dependent elements.

6.0. Conclusion

This end-to-end data pipeline encompasses web scraping, data cleaning, normalization, and exploratory data analysis of an online book catalogue. Using a rule-based HTML parsing approach, structured product-level data was successfully extracted from a static website, transformed into analysis-ready formats, and evaluated across multiple quantitative dimensions.

The exploratory analysis revealed a catalog dominated by mid-range pricing, substantial variability in product quality as measured by ratings, and a near-zero statistical relationship between price and perceived quality. These findings suggest that pricing decisions are influenced by structural or contextual factors not captured in the current dataset, rather than customer ratings alone. Additionally, the construction of a Customer Value Score demonstrated how simple derived metrics can surface actionable insights from otherwise independent variables.

From a technical perspective, the project highlights the effectiveness of lightweight scraping tools such as Requests and BeautifulSoup for static web environments, as well as the importance of early-stage data normalization for reliable analysis. The workflow emphasizes reproducibility, interpretability, and modular design, allowing individual components (scraping, cleaning, feature engineering, and analysis) to be extended or replaced as system complexity grows.

While the analysis is constrained by data coverage, feature sparsity, and the static nature of the scrape, the resulting dataset and methodology provide a strong foundation for further development. With additional data enrichment, temporal collection, and validation against ground-truth inventory and sales systems, this pipeline could support more advanced analytical tasks such as demand forecasting, recommendation optimization, and pricing strategy evaluation.

Overall, this project demonstrates how structured data acquisition combined with principled exploratory analysis can convert unstructured web content into meaningful technical and business insights, while clearly articulating the assumptions and limitations inherent in the process.