Before you start, consider that not all scraping is legal. However, sometimes data that should be accessible are protected by anti-scraping mechanisms. These can be bypassed. For example, you can emulate requests coming from a real browser. Python packages such as this one will help you generate fake “user agents” (strings describing the make and build of your browser) to put in the headers of your HTTP requests. You might also want to implement a so-called “debounce” to limit how fast you scrape a website: many web servers will block your IP address if they detect an unusually high volume of traffic. If you need to scrape something speedily (for example, when you have a deadline to meet), try randomising your IP address with this package.
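The packages mentioned above are Python tools; as a rough equivalent in R (to match the examples later in this newsletter), the httr package lets you send a browser-like user agent and pause between requests. The URLs and the user-agent string below are placeholders, not anything the author recommends specifically.

```r
# A rough R sketch of the ideas above: send a browser-like User-Agent
# header and pause between requests so you don't hammer the server.
library(httr)

urls <- c("https://www.example.com/page1", "https://www.example.com/page2")

browser_ua <- user_agent(
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
)

pages <- lapply(urls, function(u) {
  Sys.sleep(runif(1, 1, 3))   # wait one to three seconds between requests
  GET(u, browser_ua)          # request the page with the spoofed user agent
})
```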
The Economist’s China data journalist
The network panel is the first thing to check when you want to scrape a website. Open up the developer tools (Mac users can press command-option-I in Chrome) on your chosen website and switch to the “Network” tab. Here you’ll find every file the page loaded, which often includes the data file you’re looking for. Use the filter to find “.csv” or “.json” files; alternatively, the “Fetch/XHR” filter shows the files the page draws upon after it initially loads. Click on each file to see its contents and find the one that holds the data you’re after.
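Once the Network tab has revealed the underlying file, you can often load it directly rather than scraping the page itself. A minimal sketch in R, assuming a made-up endpoint copied from the Network tab:

```r
# Load a data file spotted in the Network tab directly.
# The URLs here are placeholders for whatever the tab reveals.
library(jsonlite)

data_url <- "https://www.example.com/api/figures.json"  # copied from the Network tab
figures  <- fromJSON(data_url)

# For a .csv file found the same way:
# figures <- readr::read_csv("https://www.example.com/data/figures.csv")
```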
Evan Hensleigh
Interactive data journalist
Using code to scrape data is great, but sometimes you just want to grab a single column of data from a PDF or Word document. If copying leaves the whole table in your clipboard, that doesn’t mean you need to get into a scrape. Instead, try holding the Alt or Option key as you click and drag: often you can select just what you need.
Alex Selby-Boothroyd
Head of data journalism
Regular readers of this newsletter will know how much I think users of R—especially me—owe a debt of gratitude to Hadley Wickham, a Kiwi-born data scientist. Scraping is no exception. Mr Wickham created an R package called Rvest (get the pun?) to make scraping pages super smooth. With the package you can read HTML pages and pick out particular elements such as links, images, tables or text. I reckon about 95% of all the web scraping I have done over the past decade makes use of Rvest. The package also knits together with all the other libraries in the Tidyverse, which makes scraping, say, 1,000 webpages of tables nothing but a joy.
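For the curious, here is a minimal sketch of the sort of Rvest workflow described above. The URL pattern and the plain “table” selector are placeholders; swap in whatever pages and elements you actually need.

```r
# Scrape the first table from each of many pages with rvest.
# The URL pattern and CSS selector below are placeholders.
library(rvest)
library(purrr)   # part of the Tidyverse

urls <- paste0("https://www.example.com/results?page=", 1:1000)

tables <- map(urls, function(u) {
  # Consider adding Sys.sleep() here when scraping many pages
  u |>
    read_html() |>
    html_element("table") |>
    html_table()
})
```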
James Fransham
Britain data journalist
Most of the time, scraping doesn’t need to be covert. It’s good practice to ask for permission first, if it’s practical to do so. With the polite package in R, you can also ask for permission virtually. The package works with Rvest and allows you to give a “reverential bow” to websites before you scrape them. The bow() function reads any rules set by the website’s owners (in its robots.txt file) and leaves a virtual calling card, so they can identify you. The nod() function points the session at the specific page you intend to scrape and checks it against those rules. Together, they reduce the risk that you break a website’s terms of service or overload its servers.
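A short sketch of that workflow, with placeholder addresses and contact details:

```r
# The polite workflow: bow to the site, nod towards a page, then scrape.
library(polite)
library(rvest)

# bow() reads the site's robots.txt and sets a crawl delay; the
# user_agent string is your calling card (placeholder details here).
session <- bow("https://www.example.com",
               user_agent = "Jane Doe, data journalist (jane@example.com)")

# nod() points the session at the page you want and checks it is allowed...
session <- nod(session, path = "results/2025")

# ...and scrape() fetches it, respecting the rules bow() found.
page   <- scrape(session)
tables <- html_table(page)
```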
Owen Winter
Political data journalist
Always check whether the website offers an API (Application Programming Interface), which is a much better way to access data. APIs are typically straightforward to use and return clean, structured data, such as a JSON file. If no API is available, inspect whether the page content is dynamically loaded (see Evan’s advice above). If so, Python packages such as selenium can automate a browser to mimic human actions. If you need to scroll to a particular place on the page and click a button to load more content, selenium lets you do that with commands for how far to scroll down and where to click. If the website has a search function that doesn’t give you precise enough results, you can take advantage of Google’s powerful search capabilities instead: build a web crawler that sends queries to Google restricted to that specific website (for example, with a “site:” filter) along with custom keywords. That way you are essentially outsourcing the site search to Google’s engine, and you can then scrape the results.
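The tip above names Python’s selenium; to stay consistent with the R examples elsewhere in this newsletter, here is a comparable sketch using R’s RSelenium package (a substitution on my part). It assumes a Selenium server running locally and a made-up “load more” button.

```r
# Drive a real browser with RSelenium: scroll, click a button that loads
# more content, then hand the rendered page to rvest. The URL and the
# ".load-more" selector are placeholders.
library(RSelenium)

remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L,
                      browserName = "firefox")
remDr$open()
remDr$navigate("https://www.example.com/search?q=keyword")

# Scroll to the bottom of the page, then click the button that loads more results.
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);", args = list())
button <- remDr$findElement(using = "css selector", ".load-more")
button$clickElement()

# Parse the rendered page with rvest for the actual scraping.
html <- rvest::read_html(remDr$getPageSource()[[1]])

remDr$close()
```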
Dolly Setton
Data journalist
This edition of Off the Charts was first sent to The Economist subscribers on July 1st 2025.