Kudzu.com Data Scraping: May 2015

Friday, May 29, 2015

Data Scraping Services - Web Scraping Video Tutorial Collection for All Programming Language

Web scraping is a mechanism in which request made to website URL to get HTML Document text and that text then parsed to extract data from the HTML codes. Website scraping for data is a generalize approach and can be implemented in any programming language like PHP, Java, C#, Python and many other.

There are many Web scraping software available in market using which you can extract data with no coding knowledge. In many case the scraping doesn’t help due to custom crawling flow for data scraping and in that case you have to make your own web scraping application in one of the programming language you know. In this post I have collected scraping video tutorials for all programming language.

I mostly familiar with web scraping using PHP, C# and some other scraping tools and providing web scraping service. If you have any scraping requirement send me your requirements and I will get back with sample data scrape and best price.

Web Scraping Using PHP

You can do web scraping in PHP using CURL library and Simple HTML DOM parsing library. PHP function file_get_content() can also be useful for making web request. One drawback of scraping using PHP is it can’t parse JavaScript so ajax based scraping can’t be possible using PHP.

Web Scraping Using C#

There are many library available in .Net for HTML parsing and data scraping. I have used Web Browser control and HTML Agility Pack for data extraction in .Net using C#

I have didn’t done web scraping in Java, PERL and Python. I had learned web scraping in node.js using Casper.JS and Phantom.JS library. But I thought below tutorial will be helpful for some one who are Java and Python based.

Web Scraping Using Jsoup in Java

Scraping Stock Data Using Python

Develop Web Crawler Using PERL

Web Scraping Using Node.Js

If you find any other good web scraping video tutorial then you can share the link in comment so other readesr get benefit form that.

Source: http://webdata-scraping.com/web-scraping-video-tutorial-collection-programming-language/

Tuesday, May 26, 2015

Web Scraping Services - Extracting Business Data You Need

Would you like to have someone collect, extract, find or scrap contact details, stats, list, extract data, or information from websites, online stores, directories, and more?

"Hi-Tech BPO Services offers 100% risk-free, quick, accurate and affordable web scraping, data scraping, screen scraping, data collection, data extraction, and website scraping services to worldwide organizations ranging from medium-sized business firms to Fortune 500 companies."

At Hi-Tech BPO Services we are helping global businesses build their own database, mailing list, generate leads, and get access to vast resources of unstructured data available on World Wide Web.

We scrape data from various sources such as websites, blogs, podcasts, and online directories; and convert them into structured formats such as excel, csv, access, text, My SQL using automated and manual scraping technologies. Through our web data scraping services, we crawl through websites and gather sales leads, competitor’s product details, new offers, pricing methodologies, and various other types of information from the web.

Our web scraping services scrape data such as name, email, phone number, address, country, state, city, product, and pricing details among others.

Areas of Expertise in Web Scraping:

•    Contact Details
•    Statistics data from websites
•    Classifieds
•    Real estate portals
•    Social networking sites
•    Government portals
•    Entertainment sites
•    Auction portals
•    Business directories
•    Job portals
•    Email ids and Profiles
•    URLs in an excel spreadsheet
•    Market place portals
•    Search engine and SEO
•    Accessories portals
•    News portals
•    Online shopping portals
•    Hotels and restaurant
•    Event portals
•    Lead generation

Industries we Serve:

Our web scraping services are suitable for industries including real estate, information technology, university, hospital, medicine, property, restaurant, hotels, banking, finance, insurance, media/entertainment, automobiles, marketing, human resources, manufacturing, healthcare, academics, travel, telecommunication and many more.

Why Hi-Tech BPO Services for Web Scraping?

•    Skilled and committed scraping experts
•    Accurate solutions
•    Highly cost-effective pricing strategies
•    Presence of satisfied clients worldwide
•    Using latest and effectual web scraping technologies
•    Ensures timely delivery
•    Round the clock customer support and technical assistance

Get Quick Cost and Time Estimate

Source: http://www.hitechbposervices.com/web-scraping.php

Monday, May 25, 2015

Which language is the most flexible for scraping websites?

3 down vote favorite

I'm new to programming. I know a little python and a little objective c, and I've been going through tutorials for each. Then it occurred to me, I need to know which language is more flexible (python, obj c, something else) for screen scraping a website for content.

What do I mean by "flexible"?

Well, ideally, I need something that will be easy to refactor and tweak for similar projects. I'm trying to avoid doing a lot of re-writing (well, re-coding) if I wanted to switch some of the variables in the program (i.e., the website to be scraped, the content to fetch, etc).

Anyways, if you could please give me your opinion, that would be great. Oh, and if you know any existing frameworks for the language you recommend, please share. (I know a little about Selenium and BeautifulSoup for python already).

4 Answers

I recently wrote a relatively complex web scraper to harvest a TON of data. It had to do some relatively complex parsing, I needed it to stuff it into a database, etc. I'm C# programmer now and formerly a Perl guy.

I wrote my original scraper using Python. I started on a Thursday and by Sunday morning I was harvesting over about a million scores from a show horse site. I used Python and SQLlite because they were fast.

HOWEVER, as I started putting together programs to regularly keep the data updated and to populate the SQL Server that would backend my MVC3 application, I kept hitting snags and gaps in my Python knowledge.

In the end, I completely rewrote the scraper/parser in C# using the HtmlAgilityPack and it works better than before (and just about as fast).

Because I KNEW THE LANGUAGE and the environment so much better I was able to add better database support, better logging, better error handling, etc. etc.

So... short answer.. Python was the fastest to market with a "good enough for now" solution, but the language I know best (C#) was the best long-term solution.

EDIT: I used BeautifulSoup for my original crawler written in Python.

5 down vote

The most flexible is the one that you're most familiar with.

Personally, I use Python for almost all of my utilities. For scraping, I find that its functionality specific to parsing and string manipulation requires little code, is fast and there are a ton of examples out there (strong community). Chances are that someone's already written whatever you're trying to do already, or there's at least something along the same lines that needs very little refactoring.

1 down vote

I think its safe to say that Python is a better place to start than Objective C. Honestly, just about any language meets the "flexible" requirement. All you need is well thought out configuration parameters. Also, a dynamic language like Python can go a long way in increasing flexibility, provided that you account for runtime type errors.

1 down vote

I recently wrote a very simple web-scraper; I chose Common Lisp as I'm learning the language.

On the basis of my experience - both of the language and the availability of help from experienced Lispers - I recommend investigating Common Lisp for your purpose.

There are excellent XML-parsing libraries available for CL, as well as libraries for parsing invalid HTML, which you'll need unless the sites you're parsing consist solely of valid XHTML.

Also, Common Lisp is a good language in which to implement DSLs; a DSL for web-scraping may be a solution to your requirement for flexibility & re-use.

Source: http://programmers.stackexchange.com/questions/74998/which-language-is-the-most-flexible-for-scraping-websites/75006#75006

Saturday, May 23, 2015

Scraping Data: Site-specific Extractors vs. Generic Extractors

Scraping is becoming a rather mundane job with every other organization getting its feet wet with it for their own data gathering needs. There have been enough number of crawlers built – some open-sourced and others internal to organizations for in-house utilities. Although crawling might seem like a simple technique at the onset, doing this at a large-scale is the real deal. You need to have a distributed stack set up to take care of handling huge volumes of data, to provide data in a low-latency model and also to deal with fail-overs. This still is achievable after crossing the initial tech barrier and via continuous optimizations. (P.S. Not under-estimating this part because it still needs a team of Engineers monitoring the stats and scratching their heads at times).

Social Media Scraping

Focused crawls on a predefined list of sites

However, you bump into a completely new land if your goal is to generate clean and usable data sets from these crawls i.e. “extract” data in a format that your DB can process and aid in generating insights. There are 2 ways of tackling this:

a. site-specific extractors which give desired results

b. generic extractors that result in few surprises

Assuming you still do focused crawls on a predefined list of sites, let’s go over specific scenarios when you have to pick between the two-

1. Mass-scale crawls; high-level meta data – Use generic extractors when you have a large-scale crawling requirement on a continuous basis. Large-scale would mean having to crawl sites in the range of hundreds of thousands. Since the web is a jungle and no two sites share the same template, it would be impossible to write an extractor for each. However, you have to settle in with just the document-level information from such crawls like the URL, meta keywords, blog or news titles, author, date and article content which is still enough information to be happy with if your requirement is analyzing sentiment of the data.

cb1c0_one-size

A generic extractor case

Generic extractors don’t yield accurate results and often mess up the datasets deeming it unusable. Reason being

programatically distinguishing relevant data from irrelevant datasets is a challenge. For example, how would the extractor know to skip pages that have a list of blogs and only extract the ones with the complete article. Or delineating article content from the title on a blog page is not easy either.

To summarize, below is what to expect of a generic extractor.

Pros-

•    minimal manual intervention
•    low on effort and time
•    can work on any scale

Cons-

•    Data quality compromised
•    inaccurate and incomplete datasets
•    lesser details suited only for high-level analyses
•    Suited for gathering- blogs, forums, news
•    Uses- Sentiment Analysis, Brand Monitoring, Competitor Analysis, Social Media Monitoring.

2. Low/Mid scale crawls; detailed datasets – If precise extraction is the mandate, there’s no going away from site-specific extractors. But realistically this is do-able only if your scope of work is limited i.e. few hundred sites or less. Using site-specific extractors, you could extract as many number of fields from any nook or corner of the web pages. Most of the times, most pages on a website share similar templates. If not, they can still be accommodated for using site-specific extractors.

cutlery

Designing extractor for each website

Pros-

•    High data quality
•    Better data coverage on the site

Cons-

High on effort and time

Site structures keep changing from time to time and maintaining these requires a lot of monitoring and manual intervention

Only for limited scale

Suited for gathering – any data from any domain on any site be it product specifications and price details, reviews, blogs, forums, directories, ticket inventories, etc.

Uses- Data Analytics for E-commerce, Business Intelligence, Market Research, Sentiment Analysis

Conclusion

Quite obviously you need both such extractors handy to take care of various use cases. The only way generic extractors can work for detailed datasets is if everyone employs standard data formats on the web (Read our post on standard data formats here). However, given the internet penetration to the masses and the variety of things folks like to do on the web, this is being overly futuristic.

So while site-specific extractors are going to be around for quite some time, the challenge now is to tweak the generic ones to work better. At PromptCloud, we have added ML components to make them smarter and they have been working well for us so far.

What have your challenges been? Do drop in your comments.

Source: https://www.promptcloud.com/blog/scraping-data-site-specific-extractors-vs-generic-extractors/

Wednesday, May 20, 2015

The Features of the "Holographic Meridian Scraping Therapy"

1. Systematic nature: Brief introduction to the knowledge of viscera, meridians and points in traditional Chinese medicine, theory of holographic diagnosis and treatment; preliminary discussion of the treatment and health care mechanism of scraping therapy; systematic introduction to the concrete methods of the holographic meridian scraping therapy; enumerating a host of therapeutic methods of scraping for disorders in both Chinese and Western medicine to embody a combination of disease differentiation and syndrome differentiation; and summarizing the health care scraping methods. It is a practical handbook of gua sha.

2. Scientific: Applying the theories of Chinese and Western medicine to explain the health care and treatment mechanism and clinical applications of scraping therapy; introducing in detail the practical manipulations, items for attention, and indications and contraindications of the scraping therapy. Here are introduced representative diseases in different clinical departments, for which scraping therapy has a better curative effect and the therapeutic methods of scraping for these diseases. Stress is placed on disease differentiation in Western medicine and syndrome differentiation in Chinese medicine, which should be combined in practical application.

Although there are more than 140,000 kinds of disease known to modem medicine, all diseases are related to dysfunction of the 14 meridians and internal organs, according to traditional Chinese medicine. The object of scraping therapy is to correct the disharmony in the meridians and internal organs to recover the normal bodily functions. Thus, the scraping of a set of meridian points can be used to treat many diseases. In the section on clinical application only about 100 kinds of common diseases are discussed, although the actual number is much more than that. For easy reference the "Index of Diseases and Symptoms" is appended at the back of the book.

3. Practical: Using simple language and plenty of pictures and diagrams to guarantee that readers can easily leam, memorize and apply the principles of scraping therapy. As long as they master the methods explained in Chapter Three, readers without any medical knowledge can apply scraping therapy to themselves or others, with reference to the pictures in Chapters Four and Five. Besides scraping therapy, herbal treatment for each disease or syndrome is explained and may be used in combination with the scraping techniques.

Referring to the Holographic Meridian Hand Diagnosis and pictures at the back of the book will enhance accuracy of diagnosis and increase the effectiveness of scraping therapy.

Since the first publication and distribution of the Chinese edition of the book in July 1995, it has been welcomed by both medical specialists and lay people. In March 1996 this book was republished and adopted as a textbook by the School for Advanced Studies of Traditional Chinese Medicine affiliated to the Institute of the Acupuncture and Moxibustion of the China Academy of Traditional Chinese Medicine.

In order to bring this health care method to more and more people and to make traditional Chinese medicine better appreciated They have modified and replenished this book in the spirit of constant improvement. They hope that they may make a contribution to the health care of mankind with this natural therapy which has no side-effects and causes no pollution.

They hope that the Holographic Meridian Scraping Therapy can help the health and happiness of more and more families in the world.

Source: http://ezinearticles.com/?The-Features-of-the-Holographic-Meridian-Scraping-Therapy&id=5005031

Wednesday, May 13, 2015

Web Scraping: Startups, Services & Market

I got recently interested in startups using web scraping in a way or another and since I find the topic very interesting I wanted to share with you some thoughts. [Note that I’m not an expert. To correct me / share your knowledge please use the comment section]

Web scraping is everything but a new technique. However with more and more data shared on internet (from user generated content like social networks & review websites to public/government data and the growing number of online services) the amount of data collected and the use cases possible are increasing at an incredible pace.

We’ve entered the age of “Big Data” and web scraping is one of the sources to feed big data engines with fresh new data, let it be for predictive analytics, competition monitoring or simply to steal data.

From what I could see the startups and services which are using “web scraping” at their core can be divided into three categories:

•    the shovel sellers (a.k.a we sell you the technology to do web scraping)

•    the shovel users (a.k.a we use web scraping to extract gold and sell it to our users)

•    the shovel police (a.k.a the security services which are here to protect website owners from these bots)

The shovel sellers

From a technology point of view efficient web scraping is quite complicated. It exists a number of open source projects (like Beautiful Soup) which enable anyone to get up and running a web scraper by himself. However it’s a whole different story when it has to be the core of your business and that you need not only to maintain your scrapers but also to scale them and to extract smartly the data you need.

This is the reason why more and more services are selling “web scraping” as a service. Their job is to take care about the technical aspects so you can get the data you need without any technical knowledge. Here some examples of such services:

    Grepsr
    Krakio
    import.io
    promptcloud
    80legs
    Proxymesh (funny service: it provides a proxy rotator for web scraping. A shovel seller for shovel seller in a way)
    scrapingHub
    mozanda

The shovel users

It’s the layer above. Web scraping is the technical layer. What is interesting is to make sense of the data you collect. The number of business applications for web scraping is only increasing and some startups are really using it in a truly innovative way to provide a lot of value to their customers.

Basically these startups take care of collecting data then extract the value out of it to sell it to their customers. Here some examples:

Sales intelligence. The scrapers screen marketplaces, competitors, data from public markets, online directories (and more) to find leads. Datanyze, for example, track websites which add or drop javascript tags from your competitors so you can contact them as qualified leads.

Marketing. Web scraping can be used to monitor how your competitors are performing. From reviews they get on marketplaces to press coverage and financial published data you can learn a lot. Concerning marketing there is even a growth hacking class on udemy that teaches you how to leverage scraping for marketing purposes.

Price Intelligence. A very common use case is price monitoring. Whether it’s in the travel, e-commerce or real-estate industry monitoring your competitors’ prices and adjusting yours accordingly is often key. These services not only monitor prices but with their predictive algorithms they can give you advice on where the puck will be. Ex: WisePricer, Pricing Assistant.

Economic intelligence, Finance intelligence etc. with more and more economical, financial and political data available online a new breed of services, which collect and make sense of it, are rising. Ex: connotate.

The shovel police

Web scraping lies in a gray area. Depending on the country or the terms of service of each website, automatically collecting data via robots can be illegal. Whatever the laws say it becomes crucial for some services to try to block these crawlers to protect themselves. The IT security industry has understood it and some startups are starting to tackle this problem. Here are 3 services which claim to provide solutions to stop bots from crawling your website:

•    Distil
•    ScrapeSentry
•    Fireblade

From a market point of view

A couple of points on the market to conclude:

•    It’s hard to assess how big the “web scraping economy” is since it is at the intersection of several big industries (billion dollars): IT security, sales, marketing & finance intelligence. This technique is of course a small component of these industries but is likely to grow in the years to come.

•    A whole underground economy also exists since a lot of web scraping is done through “botnets” (networks of infected computers)

•    It’s a safe bet to say that more and more SaaS (like Datanyze pr Pricing Assistant) will find innovative applications for web scraping. And more and more startups will tackle web scraping from the security point of view.

•    Since these startups are often entering big markets through a niche product / approach (web scraping is a not the solution to everything, there are more a feature) they are likely to be acquired by bigger players (in the security, marketing or sales tools industries). The technological barrier are there.

Source: http://clementvouillon.com/article/web-scraping-startups-services-market/

Sunday, May 3, 2015

Web Data Scraping - Scrape Business Data in no time

The Internet has evolved as one of the largest repositories of information for your business. You can design intelligent business processes to access a whole host of relevant information sources that will help you strategize, implement and deliver effective business objectives. Leveraging the benefits and usefulness of Web Scraping Tools is one such methodology that most businesses have adopted. Let us take a look at some of the ways it helps you easily scrape data relevant for your business.

Scraping for Business Information

Web Data Scraping is a technique, employed by most organizations. It involves the implementation of tools that help businesses extract unstructured data and convert them into usable business information. The focus of most scraping initiatives revolves around the organization’s need to glean the following information:

•   Competitor analysis to structure and strategist effectively

•   Price comparisons to price their products competitively

•   Customer feedbacks to enhance their product portfolio and provide customers with better brand experience   Market dynamics to help them identify areas of opportunities and threats

Using Scraping Tools

The abundance of information available on the Internet that helps you build up a productive business strategy can be easily extracted and leveraged to benefit your business. Tools have been designed with intuitive interface and intelligent algorithms which help in furthering this end.

Website Data Scraping tools are equipped for compatibility with a wide variety of applications so as to be able to explore a huge range of information sources. These tools are fully automated and display the drag and drop facility ensuring users get to leverage the benefits of speed and convenience.

Data extraction tools are not only adept at extracting data, but are also equally well-equipped to combine relevant statistics from several social media platforms like YouTube, Twitter, and Google Analytics and so on. This helps businesses to analyse trends and plan strategies accordingly.

Challenges of the Data Scraping Process

Just as there is no dearth of data to be collected from the Web, there is also an abundance of web scraping tools to execute the data collection process. However, the capability of the tool to help you collect the appropriate data needs to be assured before you can proceed with its implementation. Some of the challenges faced by most businesses owing to their wrong choice of tools include the following:

•   Run-of-the-mill extraction tools are unable to scale up sufficiently in order to capture large volumes of data

•   Some tools are also unable to establish compatibility with most data sources and therefore do not provide a holistic data collection approach

•   Some tools are also not equipped to conduct an automatic detection of updates made to a data source and therefore end up providing inaccurate data.

In the light of all this it is essential that you identify the right tool for your need and select one that is embedded with an updated technology to help you achieve the following:

•   Ensure that you are able to access the appropriate data that you want

•   Help you structure it in the format you want

•   Provide quick and easy access to all available data sources no matter how complex

•   Run accurately and is a reliable source to help you churn out usable information.

Source: http://scraping-solutions.blogspot.in/2014_07_01_archive.html