Article Extractor

Pricing

Our pricing is based on usage and the level of support required. Please share an estimate of the number of articles you will parse each month, and we will use it to prepare a quote.

Introduction

The Article Extractor is an API that takes a URL and returns a JSON object containing the parsed elements of the article. The parser only works with websites in an article format, such as newspapers, blogs or magazines. Successful extraction of every element cannot be guaranteed because websites can be formatted in so many different ways, but we find that the parser works for the majority of articles.

Try it out

Select an example article below. A real request is made to the API, but the choice is limited to three articles to prevent abuse. If you want to use the API yourself, follow the Usage instructions below.

JSON

{
      "title": "Air traffic controllers release timelapse tour of UK airspace",
      "author": "",
      "published": null,
      "url": "http://www.bbc.co.uk/news/uk-30108947",
      "image": "http://news.bbcimg.co.uk/media/images/79117000/jpg/_79117585_79117513.jpg",
      "videos": [],
      "keywords": [
        "ukclaire",
        "airspace",
        "nats",
        "showing",
        "timelapse",
        "air",
        "controllers",
        "tour",
        "traffic",
        "uk",
        "release",
        "video",
        "thousands",
        "skies",
        "typical"
      ],
      "summary": "A timelapse video showing thousands of planes coming in and out of the UK has been released by NATS.",
      "body": "A timelapse video showing thousands of planes flying in and out of the UK has been released by NATS (National Air Traffic Services).\n\nAround 6,000 flights take off and land across the region during an average 24-hour period.\n\nAir traffic experts collected radar data to create an overview of a typical day in the skies above the UK.\n\nClaire Brennan reports.\n\nFootage courtesy of NATS."
    }

JSON spec

Field	Type	Details
`title`	String	Article title from social media meta tags, falling back to the `<title>` tag
`author`	String	Author of the article
`published`	Date	Date of publication
`url`	URL	Article URL
`image`	URL	Main image
`videos`	Array	Array of embedded video URLs
`keywords`	Array	Array of keywords that appear in the article
`summary`	String	Summary of the article
`body`	String	Main body of the article
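
Because extraction is not guaranteed, some fields can come back empty, as `author`, `published` and `videos` do in the example response. The following is a minimal Python sketch of one way to load a response and list the fields that came back empty. The `Article` class and the `parse_article` and `missing_fields` helpers are hypothetical names used for illustration, not part of the API, and `published` is kept as a plain string because its date format is not specified here.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Article:
    """Hypothetical container mirroring the fields in the JSON spec."""
    title: str = ""
    author: str = ""
    published: Optional[str] = None  # assumption: date left unparsed
    url: str = ""
    image: Optional[str] = None
    videos: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    summary: str = ""
    body: str = ""


def parse_article(raw: str) -> Article:
    """Build an Article from a JSON response, ignoring unknown or null fields."""
    data = json.loads(raw)
    known = {
        name: data[name]
        for name in Article.__dataclass_fields__
        if data.get(name) is not None
    }
    return Article(**known)


def missing_fields(article: Article) -> List[str]:
    """Return the names of fields that are empty strings, null or empty lists."""
    return [
        name
        for name in Article.__dataclass_fields__
        if getattr(article, name) in ("", None, [])
    ]
```

Checking `missing_fields` before using a response lets a caller decide whether a partially extracted article is still good enough for its purpose.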

Usage

You can use the following cURL command to parse an article:

Note: you’ll need to replace the text API_KEY with your API key.

curl --request GET \
  --url 'https://document-parser-api.lateral.io/?url=http://www.bbc.com/news/31047780' \
  --header 'content-type: application/json' \
  --header 'subscription-key: API_KEY'

The command above parses the article at http://www.bbc.com/news/31047780. Before you can run it, you will need to get an API key.

To pretty-print the response, pipe the output of the cURL command to Python's `json.tool` module:

curl --request GET \
  --url 'https://document-parser-api.lateral.io/?url=http://www.bbc.com/news/31047780' \
  --header 'content-type: application/json' \
  --header 'subscription-key: API_KEY' | python -m json.tool

The output of the cURL command is a JSON object as specified above. Piping it through Python's `json.tool` module (available in Python 2.6+), as in the command above, pretty-prints the returned JSON, which is useful for testing.
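
If the response is already in a Python script rather than a shell pipeline, a similar result can be had with the standard `json` module. This is a small sketch; the `pretty` helper is a hypothetical name, and the 4-space indent is chosen to match the default of `json.tool`.

```python
import json


def pretty(raw: str) -> str:
    """Re-serialize a JSON string with a 4-space indent for easier reading."""
    return json.dumps(json.loads(raw), indent=4)


print(pretty('{"title": "Example", "videos": []}'))
```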

To call the API in your programming language of choice, see the API specification, which includes code samples.
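
As an illustration, the cURL command can be sketched in Python using only the standard library. This is not an official client: the `build_request` and `extract_article` helpers are hypothetical names, the endpoint and headers are copied from the cURL example, and percent-encoding the `url` query parameter is an assumption about what the service accepts.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the cURL example above.
API_ENDPOINT = "https://document-parser-api.lateral.io/"


def build_request(article_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request equivalent to the cURL command."""
    # Assumption: the service accepts a percent-encoded `url` parameter.
    query = urllib.parse.urlencode({"url": article_url})
    return urllib.request.Request(
        f"{API_ENDPOINT}?{query}",
        headers={
            "content-type": "application/json",
            "subscription-key": api_key,
        },
        method="GET",
    )


def extract_article(article_url: str, api_key: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON object described in the JSON spec."""
    request = build_request(article_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `extract_article("http://www.bbc.com/news/31047780", "API_KEY")` with a real key in place of API_KEY would return the parsed response as a dictionary. Keeping `build_request` separate makes it possible to inspect the URL and headers without making a network call.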