Media Monitoring With Scraping and Proxies
Your brand defines your business — it’s much more than just a logo, a font, and some colors. More than anything else, your brand is your customers’ perception of you. Though you can’t put a price on it, there are few things more valuable to your business than your brand reputation. Many businesses spend a fortune on superficial aspects of branding but fail to give much attention to the aspects of branding that have the most influence on their customers.
Almost all customers — 97% — check product reviews before they make a purchasing decision. Media monitoring can give you a deeper understanding of how your brand is perceived, help you get on top of negative impressions before they go viral, and provide an opportunity to make favorable impressions in front of a large audience. If you already have a good grasp of media monitoring, or you're just here to find out how to build a web scraper, feel free to skip ahead to the sections most relevant to you.
What Is Media Monitoring?
Media monitoring is the process of collecting data about all mentions of your brand across different platforms, including news, social media, forums, review sites, and blogs. Understanding where your brand is talked about, as well as the positive and negative mentions and responses, provides invaluable insight into what your customers think of your business.
Media monitoring is most effectively done with web scraping. By using a web scraper, you can automatically search for mentions of your business and export them into a readable format, turning raw mentions into actionable data. Tasks that would take an employee hundreds of hours to complete can be done effortlessly with an automated script.
Why Media Monitoring Is Important
Even if your brand doesn’t have an online presence (although since 97% of businesses do, that’s likely not the case), monitoring mentions of your company is vital to its growth. We have become a data-driven society, with over 80% of companies using data as a driver of their business strategies. Companies are missing out if they’re not taking advantage of the insights to be gleaned from a focused data-gathering campaign. While there are a lot of advantages to developing a comprehensive corporate data strategy, we’re going to do a deep dive into how media monitoring can benefit your business.
Know what you’re doing right
Tracking mentions of your brand lets you see what you’re doing right. In today’s heavily online-focused culture, your customers are likely to share their positive experiences with your company. Your customers may be raving about a specific feature of your product that you considered more of an afterthought. Customer reviews may let you know that Madison in your sales department is going above and beyond whenever she interacts with customers — leading you to put her in charge of training other associates.
See where you can improve
Alternatively, you'll also probably discover areas where you need to improve. A few complaints or negative reviews might just be unrealistic expectations or someone having a grumpy day. Repeated complaints about the same issue, however, are a strong sign that you need to investigate further — whether it's a quality issue or a service issue.
Put out little fires before they spread
Related to showing where you can improve, monitoring your media mentions will allow you to put out small fires and turn unhappy customers into your biggest fans. Social media has changed the way we communicate on all levels. Your customers expect you to have an online presence and quickly respond to their questions and concerns — regardless of where they express them.
By staying on top of who’s talking about your company, you can quickly provide more information to undecided customers or fix any issues an unhappy customer may have. A huge benefit of rapidly and generously responding to your customers is that you’re doing it in front of a large audience. Not only are you winning over your grumpy customers but you’re also creating a favorable impression in front of all of the people who see the interaction.
Delete inaccurate information
If you find false or inaccurate information about your brand or products, you may be able to take steps to have it removed, or at least provide a public rebuttal. You don't have to let an untrue negative review or remark sit there unchallenged. Many review sites have a procedure for reporting false reviews or will at least let you respond to them.
Better understand your customers
To effectively market to your desired audience, you need to be able to meet your customers’ felt needs and communicate with them in a way that reflects their values and priorities. Media monitoring allows you a peek into your customers’ minds that is hard to get any other way.
You can learn:
- What matters to them
- How they communicate what matters
- How you can most effectively relate to them
Find out what your customers want
Tracking your brand’s online mentions can serve as a valuable source of market research. Your customers may mention that they love your product, but that they wish that it had a functionality that your design team hadn’t even considered. Tweaking your product based on customer feedback is one of the easiest, no-risk methods of product design. The best part of doing it via media monitoring is that you’re getting market research, user experience testing, and customer feedback without having to hire a UX team or beg your customers to complete yet another survey.
Encourage hesitant customers to buy
Marketers tout the "Rule of 7" when it comes to explaining a customer's purchasing decisions. This rule states that a customer needs an average of seven positive interactions with your brand before they'll make a purchase. While you can certainly use targeted marketing and SEO strategies to cover some of the seven interactions, personal engagement will go a long way toward encouraging a buyer who is on the fence. If your media monitoring uncovers customers who are trying to decide whether they need your product, or who are debating between yours and a competitor's, engaging with them and offering to answer questions or provide additional information can tip the scales in your favor.
What Is Media Monitoring and Analysis?
With so much data available through media monitoring and so many potential benefits, it can be overwhelming to decide where to start. Before you can implement a data-collection strategy, you need to decide what data is most valuable to your company now. When you're starting fresh, you may not know exactly what you need to know — and that's okay. The data you start collecting will help point you in the direction you need to go. As you start analyzing your results, you'll become more equipped to sort out what your priorities should be.
To start, it’s a good idea to sit down and figure out what questions you’d like answered and where you can get those answers. Some questions that will benefit most businesses include:
- What are people saying about our business?
- Are our customers happy with our service?
- How are people using our products?
- Where are our customers spending time online?
- What do our customers value?
- What kind of reviews are we getting?
- What specific language do our customers use to talk about our products?
The next step is to think about where you can find the answers to these questions. If you’re trying to find out ways to reflect your customers’ values to them in your marketing campaigns, you want to find out where they are discussing what matters to them. Generally, you can engage in “social listening” on these types of websites:
- All social media channels (Different platforms are more popular with different demographics so be sure you understand where your customers are.)
- Forums
- Review sites
- Review sections of the online retailers where your products are sold
- Online news media, especially the comment sections of articles related to your industry
- Blogs of influencers that are relevant to your brand
Once you know what questions you have and where you can get the answers, you’re ready to put your plan into action.
How to Do Media Monitoring
You could put your social media intern in charge of scouring the web for mentions of your company and then entering them into a spreadsheet for analysis. However, you likely don’t have enough interns to do as thorough a job as you want — and you can probably find a better use of their time. A scraping robot does the same job — and in addition to being faster and more efficient than a human, is also much cheaper.
A web scraper is a program that automatically gathers publicly available data from a website and exports it into an easy-to-read format so that you can analyze it and put it to good use. Web scrapers accomplish this in much the same way a human would: they go to a website, find the relevant data, note it, and move on to the next page. However, they do it much, much faster than a human can, often processing a page in a matter of seconds.
Scrapers will work tirelessly and continuously for you with nary a complaint — right up until the moment they get banned. Most websites don’t particularly want you to scrape their data. Sometimes their reasons are valid, such as wanting to avoid overloading their servers or trying to shut down malicious bots. Sometimes they just don’t want their competition having access to data that can benefit them. Regardless, what it means for you is that you’ll have to take measures to avoid getting your IP address banned since nothing will shut down your project faster.
Before you worry about avoiding anti-scraper protocols, however, you need to have a successful scraper. There are many readily available free and paid scraper options. However, it’s also fairly easy to build a simple one yourself.
How to Build a Media Tracking Web Scraper
You don’t have to be a programmer to build a simple media monitoring web scraper. Understanding how scrapers are built and run can help you figure out exactly what you need and let you troubleshoot if something goes wrong. To start with the very basics, there are five steps that we want our web scraper to perform:
- Request the source code of a webpage.
- Download the content that’s returned.
- Identify the data we want.
- Extract the data.
- Convert it to a usable format such as CSV.
Setting up your web scraper
We'll be using Python to build our web scraper. You can build a web scraper in any language, but Python is a solid choice because it's powerful, intuitive to use, and easy to learn, and it has an extensive collection of support libraries. You'll need Python installed on your computer along with a text editor or IDE such as Visual Studio Code (feel free to use a different editor if you prefer). You'll also need to download two libraries for your web scraper, Beautiful Soup and Requests. Requests is a library that makes it easy to send HTTP requests by reducing the code needed and making it easier to debug. Beautiful Soup is a library that works with a parser to extract data from HTML and convert it into a readable format.
Install Beautiful Soup and Requests using the following commands:
- pip install beautifulsoup4
- pip install requests
Understanding HTML structure
Before you can build a web scraper, you need to understand how HTML is structured so that you can tell the web scraper what to look for. Chrome's developer tools are a good way to examine HTML code in action. Click on the three dots in the top-right corner of Chrome, scroll down to "More Tools," and then click "Developer Tools" at the bottom of the pop-out menu. This will open a new panel that shows the code for the website you're viewing. If you want to deep dive into HTML, a good resource to check out is HTML: The Living Standard.
To give a brief run-down, however, HTML wraps data in tags such as <h1> through <h6> for headings or <p> for paragraphs. There are opening and closing tags for each element; the closing tags look the same as the opening tags except they include a forward slash (for example, </p> closes <p>). "Child" elements are nested within their "parent" tags.
An easy way to find out exactly what the HTML structure is of the element you’re interested in is to right-click on it and click “inspect.” This will open a panel similar to the “Developer Tools” panel you used earlier, only the HTML tag for the element you clicked on will be highlighted in blue. You don’t need to understand how to code an HTML website, but you do need to be able to figure out what the relevant tags are for the information you want to scrape. You’ll ask for the source code of the webpage using a GET request, and the entire page will be downloaded. You need to know where the data you’re interested in is located so that you can extract it.
If you’re having trouble figuring out what HTML tag or attributes you need, try right-clicking/inspecting several results and see what they have in common. For example, if you’re scraping a search engine site and you want to find all mentions of your brand, right-click and inspect the first few elements where the information you want is located. Finding what they have in common will tell you what attributes you need to use in your scraper.
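To make this concrete, here's a minimal sketch of the idea. The HTML fragment and class names below are invented for illustration; the point is that elements sharing a class attribute can all be grabbed with one find_all() call:

```python
from bs4 import BeautifulSoup

# A contrived fragment of a results page; real sites use their own markup.
html = """
<div class="result"><h3>Acme Widgets review</h3></div>
<div class="result"><h3>Acme Widgets vs. competitors</h3></div>
<div class="sidebar"><h3>Unrelated content</h3></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Both results share the "result" class, so that's the attribute to target.
for tag in soup.find_all("div", class_="result"):
    print(tag.h3.text)
```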
Building your web scraper
To get started building your web scraper, import the libraries you downloaded to your project with the following commands:
- from bs4 import BeautifulSoup
- import requests
Next, you’ll need to set a variable for your URL. Just copy and paste the URL of the website you’re scraping:
- url = "url"
Create another variable for page and send a GET request using the following command:
- page = requests.get(url)
At this point, you can run the program to check that the request worked using:
- print(page)
If everything went well with your request, you should see this response:
- <Response [200]>
If you see another code, you can find out what it indicates and troubleshoot by looking it up on the MDN Web Docs HTTP response status codes page. Once you've checked, delete the print(page) line so that it isn't cluttering up your code.
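As a quick illustration, you can also read the code directly from the response object, or have Requests raise an error for you (the URL here is a placeholder):

```python
import requests

url = "url"  # placeholder: the page you're scraping
page = requests.get(url, timeout=10)

print(page.status_code)  # 200 means the request succeeded

# Alternatively, raise an exception on any 4xx/5xx response:
page.raise_for_status()
```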
Now you need to create an object using Beautiful Soup with two parameters:
- soup = BeautifulSoup(page.content, 'html.parser')
At this point, if you tested print(soup), you would have the entire HTML content of the webpage. However, that’s not what you want. So you’re going to tell Beautiful Soup exactly what part of the page to retrieve for you. This is why you need to know the HTML tags and attributes where your desired data is located.
This will vary depending on the website you’re scraping, but much of the data you want will likely be in the same section. Use the inspect tool to find this information. For instance, if you want to export all of the reviews for your product from a retailer, you may see a parent section or div that includes the number of stars, the title of the review, and the content of the review. You’ll set this up as a list using the class label you found with the inspection tool. Since you want to find all of these, use find_all:
- lists = soup.find_all('section', class_="stars-title-review")
The class_ value is the class attribute you copied from the parent section.
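As an aside, Beautiful Soup also understands CSS selectors through its select() method, which some people find more readable. This line (using the same made-up class name) is equivalent to the find_all() call above:

- lists = soup.select('section.stars-title-review')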
This list now contains all of the sections you want to extract. Next, you’ll need to set up variables for each item. For the number of stars, you might use:
- stars = item.find('a', class_="number-of-stars")
Again, the class_ value will be what you copied when you inspected the number of stars (a made-up class name is used here). You don't need find_all since there is only one star rating per review. But where does item come from? You need to loop through lists so that each review is searched in turn. To do this, go above the previous command and add the following line (and indent the find lines beneath it):
- for item in lists:
This will tell your scraper to search through each review to find the number of stars. Repeat this for all of the variables you need:
- title = item.find('a', class_="title")
- review = item.find('a', class_="review")
Make sure you check the tag as well as the class, since the tag may differ from element to element. Once you have all of the variables listed, it's a good time to check whether you're on the right track. Create a new variable containing all of the variables you just created and use print to see what you get:
- info = [stars, title, review]
- print(info)
At this point, you should see all of the data you requested, but it will still be in HTML format with all of the tags. Since you only want the text inside the tags, strip them out by adding .text to the end of each find call:
- stars = item.find('a', class_="number-of-stars").text
- title = item.find('a', class_="title").text
- review = item.find('a', class_="review").text
Run print(info) again, and you should have a much cleaner output. You may need to clean it up a little more, which you can do with Python's string methods such as strip() and replace(). If some extraneous text remains, you can replace it with an empty string to get rid of it.
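For example, strip() removes surrounding whitespace and replace() drops unwanted substrings. The raw string below is invented, but the pattern applies to any scraped field:

```python
# Hypothetical raw output from .text, with stray whitespace and a label
stars = "\n 4.5 out of 5 stars "

stars = stars.strip()                         # "4.5 out of 5 stars"
stars = stars.replace(" out of 5 stars", "")  # "4.5"
print(stars)
```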
Your scraper should look something like this by now:

```python
from bs4 import BeautifulSoup
import requests

url = "url"  # paste in the URL of the page you're scraping
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

lists = soup.find_all('section', class_="stars-title-review")

for item in lists:
    stars = item.find('a', class_="number-of-stars").text
    title = item.find('a', class_="title").text
    review = item.find('a', class_="review").text
    info = [stars, title, review]
```
When you run print(info), you should see the raw data. To make the data easier to read, export it to a CSV file using the writer from Python's built-in csv module. After "import requests" and before your URL variable, import the CSV writer:
- from csv import writer
Above the beginning of the data you want extracted and displayed (which starts at "for item in lists:"), choose a name for your CSV file and add:
```python
with open('storereviews.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Stars', 'Title', 'Review']
    thewriter.writerow(header)
```
The "w" mode overwrites the file each time the script runs; use "a" instead if you want to append to an existing file. This will give you your headers. Now go to the bottom of the script (under the info variable) to write the rest of the rows, indented so that it runs once per review:
- thewriter.writerow(info)
All together, you should have something similar to this:

```python
from bs4 import BeautifulSoup
import requests
from csv import writer

url = "url"  # paste in the URL of the page you're scraping
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

lists = soup.find_all('section', class_="stars-title-review")

with open('storereviews.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Stars', 'Title', 'Review']
    thewriter.writerow(header)

    for item in lists:
        stars = item.find('a', class_="number-of-stars").text
        title = item.find('a', class_="title").text
        review = item.find('a', class_="review").text
        info = [stars, title, review]
        thewriter.writerow(info)
```
When you run this, you should see a CSV file added to your folder. When you click on it, it should open and all of your data will be neatly displayed in a spreadsheet. Your scraper will need to be adjusted for each website you use it on, but the basic process will remain the same. This is a basic web scraper that works with static pages.
If you're ready to level up and move on to scraping content from dynamic sites, you'll need another Python library, Selenium, which drives a real browser so that JavaScript-rendered content has a chance to load before you scrape it.
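As a taste of what that looks like, here's a minimal sketch using Selenium 4. It assumes Chrome is installed (Selenium Manager downloads a matching driver automatically), and the URL and class name are placeholders; real dynamic pages usually also need explicit waits before the content appears:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("url")  # placeholder: the dynamic page you want to scrape

# Collect the text of every element with a (made-up) "review" class
for element in driver.find_elements(By.CLASS_NAME, "review"):
    print(element.text)

driver.quit()
```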
Pitfalls in Media Monitoring
Unfortunately, having a scraper won’t get you very far without using proxies. Many websites attempt to block any type of bot activity, which is what a web scraper is. To be fair, there are valid reasons that websites have for blocking scrapers, such as to thwart malicious actors who are engaging in unethical scraping. However, for you, this means that if you run your web scraper with your normal IP address, most websites will quickly identify it as a bot, and you’ll get blocked.
Once your IP address is banned, your scraping project will be shut down. To avoid this, you need to use proxies to help your scraper avoid being identified as a bot. A proxy IP address hides your real address from the sites you visit. Just using a different IP address won’t be enough to avoid bans and downtime, however, because your new IP address will likely get banned just as quickly as your original did. The best proxy solution for web scraping projects is a rotating pool of IP addresses.
By using multiple IP addresses, you can use a different IP address for every request your web scraper makes. This makes your scraper's traffic look like that of many ordinary users rather than a single bot, since no normal human makes thousands of requests per minute from one address. Rotating proxies also mitigate the effect of bans by automatically switching IP addresses if one does get banned.
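A bare-bones version of this idea with Requests might look like the sketch below. The proxy addresses and credentials are placeholders; in practice, many commercial providers instead give you a single gateway endpoint that rotates IPs for you server-side:

```python
import random
import requests

# Placeholder proxy endpoints; substitute the list your provider gives you.
proxy_pool = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

def fetch(url):
    # Route each request through a different, randomly chosen proxy.
    proxy = random.choice(proxy_pool)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

page = fetch("url")  # placeholder URL
print(page.status_code)
```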
Choosing a Proxy for Your Media Monitoring Tool
You have a lot of options when it comes to proxies. The best type of proxy for you depends on factors such as what you’re using it for and your budget. While there’s no one-size-fits-all proxy solution, there are definitely options that are a better choice for specific use cases. First, let’s talk about the types of proxies you can use.
Data center proxies
These proxy IP addresses originate in a data center, so it's apparent to websites that they aren't coming from ordinary users. When you use a data center proxy, it acts as an intermediary between you and the website you're scraping: you send your request to the proxy server, which assigns it a new IP address and forwards it to the website. The website responds to the proxy, and the proxy passes the response back to you.
Data center proxies are good for some use cases, such as gaming, that may rely on speed and where mimicking human behavior isn’t crucial. These are the most common and usually the cheapest type of proxies.
Advantages of data center proxies include:
- Price, since data center proxies are almost always the cheapest option
- Speed, since data center proxies are faster than residential proxies
- Volume, since data center IPs are plentiful (which you'll need, because they're frequently banned and must be replaced)
- Anonymity
There are some downsides to data center proxies, however.
Disadvantages of data center proxies include:
- Frequent bans that often include entire subnets
- Blocked entirely by some sites
Mobile proxies
Mobile proxies are IP addresses linked to mobile devices such as your phone or tablet. These are the most expensive type of proxies. They can be valuable for use cases such as mobile user experience testing, ad verification, or other situations where you only need mobile-based data. However, they’re not the best option for web scraping because of their price and limitations.
Residential proxies
Residential proxies are IP addresses that are issued by internet service providers and linked to a specific physical address. This is the type of IP address that you probably have at your home or business. When a website sees a residential IP address, it assumes the traffic is coming from a real user unless it starts acting like a robot. Residential proxies are the best option for web scraping projects because they are the type of IP address most people use; anti-scraping protocols can't flag them as bots based on the IP address alone.
That isn't to say that residential IP addresses won't get banned, however. If you just swap a single residential proxy in for your regular IP address, your project will still get banned quickly. The solution is a proxy pool: rotating through a pool of residential proxies reduces the chances that you'll get banned or encounter a CAPTCHA, and if one IP address does get banned, another quickly replaces it.
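Building on the hypothetical fetch() helper sketched earlier, ban mitigation can be as simple as retrying with a fresh IP whenever a response looks like a block. The status codes checked here (403 and 429) are common ban and rate-limit signals, though sites vary:

```python
def fetch_with_retries(url, attempts=3):
    # Each call to fetch() picks a new random proxy, so a retry
    # automatically arrives from a different IP address.
    for _ in range(attempts):
        response = fetch(url)
        if response.status_code not in (403, 429):
            break
    return response
```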
The advantages of rotating residential proxies are:
- Authenticity because they’re real IP addresses
- Less likely to get banned
- Reliability
- Optimized for web scraping
The main disadvantage of rotating residential proxies is their price. They cost more than data center proxies, but they’re worth the investment if your company depends on the valuable insights that media monitoring can provide.
Public proxies
Public proxies are freely available for anyone to use. While that sounds fantastic, there are some significant disadvantages to public proxies that make them unsuitable for almost all uses. Public proxies are susceptible to hackers. Since anyone can access public proxies, you may be compromising your data by using them. Additionally, there have been cases of public proxies being offered by bad actors — specifically for hacking.
Even if security weren't an issue, public proxies simply don't perform well. Because they're so widely available, they're often overcrowded, which means they're too slow to be useful. They're also prone to bans since you have no control over who uses them and for what purpose. If someone using the same IP address as you gets banned from a website, you're banned as well. While it would be nice to use a free solution, when it comes to proxies, you definitely get what you pay for.
Shared private proxies
Shared private proxies are a proxy pool owned by a private company and accessible only to authorized users. Shared proxies can be subject to the same problems and limitations as public proxies, but not on the same scale: because users have to be authorized by the owning company, there are far fewer of them. However, shared pools may still be overloaded and slow.
Depending on how well the company vets its users, some people may still be using them unethically, so security can be an issue. The "bad neighbor" effect (getting banned because of someone else's activity) can still be a problem as well. The main appeal of shared private proxies is their lower price; however, they're not an effective option for businesses that are using proxies to perform media monitoring and analysis.
Semi-dedicated proxies
These are shared proxies too, but access is usually far more limited: a semi-dedicated proxy may be shared among just a few other people. This can be a good option if you have an ethical, reliable proxy provider. Because fewer people are using them, semi-dedicated proxy pools are less likely to be as sluggish as standard shared ones. If your proxy provider doesn't do their due diligence, however, you can still run into some of the same problems that shared proxies present.
Private proxies
Private proxies are more expensive than semi-dedicated or shared proxies, but they are usually the best option for businesses and organizations that take data collection seriously. As long as you’re buying your private proxies from a trustworthy company, you have much more control and security when using them. Because you are the only one who has access to your private proxy pool, you don’t have to worry about the bad neighbor effect. Additionally, private proxies are much faster since you’re the only one using them. Your web scraping projects will be less likely to get stuck or slowed down.
Private proxies offer the highest degree of security and anonymity. Purchasing private proxies from a reputable company helps ensure your data won’t be stolen or misused. Reliability is another reason to choose private proxies. The fact that they’re faster and less likely to get banned means that you can focus on analyzing your data to drive better business practices — instead of trying to get your web scraper up and going again.
Choosing the Right Proxy Provider
There are several considerations to address when choosing a proxy provider. One of the most important factors when using residential proxies is making sure they’re ethically sourced. In addition to being the right thing to do, using ethically sourced residential proxies protects your business from the negative associations that come from dealing with untrustworthy vendors.
If your proxy provider isn't completely transparent about how they source their residential proxies, consider that a red flag. Many companies bury their terms of service in small type at the bottom of a long agreement, and their end users may not even realize they've agreed to let their IP addresses be used. It's more expensive and complicated to source residential IP addresses with integrity, but it's essential to protecting your business brand.
You also need a proxy provider who is reliable and offers good customer support. Web scraping can be complicated and confusing. If your proxy provider isn’t there to help you troubleshoot issues, you may be faced with extensive downtime — even after you’ve spent more on high-end proxies.
Rayobyte Proxy Solutions
At Rayobyte, we offer the best proxy solutions for your business. Rayobyte brings the same unfaltering dedication to sourcing residential proxies as we do to all aspects of our business. We make sure all of our residential IP addresses come from people who knowingly and intentionally agree to their use. Our end users are fairly compensated and clearly understand the terms of their agreement. We also allow them to control how their IP addresses are used, and we give them the freedom to opt out at any time. Our end users know who to contact if they have any issues.
We’re not only committed to ethical sourcing practices but we also make sure our customers are engaging in ethical use cases. We don’t allow our proxies to be used for black-hat practices. We’re dedicated to protecting your reputation — and ours as well. We want to partner with you to help you find data solutions that work for you. Reach out to our world-class customer service team today to find out how we can help you reach your goals.
Final Thoughts
Media monitoring is one of the most important things you can do to drive the success of your business. Staying on top of your brand reputation, understanding what your customers want, and engaging with them regularly will give you insights that will improve every aspect of your company’s operations, from customer service to product development. Get started with our rotating residential proxies — the best choice for web scraping — to ensure you’re able to make the most of your media monitoring strategy.