Massively Scalable

If you need ongoing content from 100,000 websites – or a million items from one site – Connotate delivers. Our platform easily handles hundreds of thousands of websites and terabytes of data a day. Our website extraction agents combine blazing speed with precise targeting to deliver relevant, high-quality content on a massive scale.

We also radically reduce the cost of creating agents. With our patented visual approach, non-technical users simply tag the content they want using our built-in web browser. Connotate then takes care of all the underlying complexity. Creating a Connotate agent can take just minutes – and requires no expensive coding skills.

Easily Handle Dynamic Content

Today, more than 60% of websites generate dynamic content. Extracting this content is incredibly difficult with traditional techniques. Most web scrapers just don’t find it, and custom scripts that collect dynamic content are extremely complex and break easily.

Connotate understands dynamic content. Our advanced machine-learning algorithms know how to extract content from sites that use JavaScript and Ajax, giving you access to the hidden web. Best of all, we do this automatically – so there’s no need to analyze site code when building agents.

Built-In Change Detection

Web content changes all the time – whether that’s updated newsfeeds or changing prices on competitor sites. However, looking for changes manually is hugely expensive and error-prone – and handling change detection during post-processing puts an unsustainable load on your downstream systems.

There’s a better way. Connotate analyzes content for changes before it arrives at your downstream processing systems. Simply click to enable change monitoring, and Connotate will alert you whenever there’s a change – identifying deltas right down to the character level.
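
To make the idea of character-level deltas concrete, here is a minimal sketch in Python using the standard library's difflib module. It illustrates the general technique only and is not Connotate's implementation; the function name char_deltas and the sample strings are assumptions made for this example.

```python
from difflib import SequenceMatcher


def char_deltas(old: str, new: str) -> list[tuple[str, str, str]]:
    """Return (operation, old_text, new_text) for every changed span.

    Hypothetical helper for illustration; operations are the difflib
    opcodes 'replace', 'delete', and 'insert'.
    """
    # autojunk=False keeps the comparison exact on longer texts.
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [
        (op, old[i1:i2], new[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


if __name__ == "__main__":
    previous = "Widget price: $19.99"   # content from the last extraction
    current = "Widget price: $21.49"    # content from the latest extraction
    for op, before, after in char_deltas(previous, current):
        print(f"{op}: {before!r} -> {after!r}")
```

An empty result means the two extractions are identical, so nothing needs to be passed downstream.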

Powerful Data Preprocessing

Connotate has powerful data manipulation capabilities that dramatically simplify downstream processing. By preprocessing content as soon as they extract it, our agents shield your downstream systems from site-specific content structure, and eliminate laborious, error-prone content formatting.

When creating a Connotate agent, users can easily manipulate content using a simple point-and-click interface. This includes normalizing content across multiple websites. Connotate can also link content automatically to its associated metadata, making it much simpler to join data across sources, resolve common entities, and manage content rights.
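
As a rough illustration of what normalizing content across multiple websites can involve, the Python sketch below maps site-specific field names onto one common schema and attaches source metadata to each record. Connotate does this through its point-and-click interface; the site names, field names, and the normalize function here are all hypothetical.

```python
from decimal import Decimal

# Hypothetical mappings: site-specific field name -> common field name.
FIELD_MAPS = {
    "site_a": {"title": "name", "cost": "price"},
    "site_b": {"product_name": "name", "price_usd": "price"},
}


def normalize(site: str, record: dict) -> dict:
    """Convert one extracted record into the common schema."""
    mapping = FIELD_MAPS[site]
    out = {common: record[raw] for raw, common in mapping.items() if raw in record}
    if "price" in out:
        # Strip currency symbols and thousands separators before parsing.
        cleaned = str(out["price"]).replace("$", "").replace(",", "").strip()
        out["price"] = Decimal(cleaned)
    # Keep the origin alongside the content as simple metadata.
    out["source"] = site
    return out


if __name__ == "__main__":
    print(normalize("site_a", {"title": "Widget", "cost": "$1,299.00"}))
    print(normalize("site_b", {"product_name": "Widget", "price_usd": 1299}))
```

Because both records end up with the same keys and a numeric price, downstream systems can join or compare them without knowing how each site structures its pages.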

Dramatically Lower Maintenance Costs

Website format changes are a daily fact of life. Unfortunately, when a format changes, traditional approaches – such as custom scripts – almost inevitably break, even if the format change is small. That’s because scripts look at the detailed HTML on webpages, so the slightest change creates havoc. It’s often difficult to detect when a script breaks – and fixing it is a major effort. Clients tell us that 200 scripts per programmer is often the limit – after that, they spend the vast majority of their time on maintenance.

Connotate has a different approach. Instead of going through HTML in excruciating detail, our agents use adaptable visual rules to see websites the way you do. This approach is far more resilient – in fact, our agents survive the majority of website format changes. Typically, two non-technical staff can maintain 10,000 Connotate agents – an order of magnitude better than with custom scripts.

We also use advanced pattern recognition techniques to assess the quality of data and to detect when our agents do break. We automatically alert you when this happens, and give you the data you need to resolve the problem quickly.

Automated and Integrated

Connotate automates your entire data supply chain, including extraction, transformation, normalization and content delivery. Our powerful scheduling capabilities allow you to schedule individual agents and groups of similar agents, giving you fine-grained control over content extraction and delivery. This lets you extract based on how often a site is updated and how critical the data is, while maintaining a balanced flow of content into your downstream systems.

We support a wide range of delivery formats that integrate directly into your downstream processing, including XML, HTML, email, CSV and XLS. We can also deliver data directly into your MongoDB or SQL databases, and our rich Web Services API allows tight integration into your existing systems and workflows.
