The Article Extractor is an API that takes a URL and returns a JSON object containing the parsed elements of the article. The parser only works with websites in an article format, such as newspapers, blogs or magazines. Successful extraction of every element cannot be guaranteed because websites can be formatted in so many ways, but we find that the parser works for the majority of examples.
Select an example article below. A real request is made to the API, but the choice is limited to three articles to prevent abuse. If you want to use the API, follow the Usage instructions below.
{
"title": "Air traffic controllers release timelapse tour of UK airspace",
"author": "",
"published": null,
"url": "http://www.bbc.co.uk/news/uk-30108947",
"image": "http://news.bbcimg.co.uk/media/images/79117000/jpg/_79117585_79117513.jpg",
"videos": [],
"keywords": [
"ukclaire",
"airspace",
"nats",
"showing",
"timelapse",
"air",
"controllers",
"tour",
"traffic",
"uk",
"release",
"video",
"thousands",
"skies",
"typical"
],
"summary": "A timelapse video showing thousands of planes coming in and out of the UK has been released by NATS.",
"body": "A timelapse video showing thousands of planes flying in and out of the UK has been released by NATS (National Air Traffic Services).\n\nAround 6,000 flights take off and land across the region during an average 24-hour period.\n\nAir traffic experts collected radar data to create an overview of a typical day in the skies above the UK.\n\nClaire Brennan reports.\n\nFootage courtesy of NATS."
}
Field | Type | Details |
---|---|---|
title | String | Article title from social media meta tags, falling back to the <title> tag |
author | String | Author of the article |
published | Date | Date of publication |
url | URL | Article URL |
image | URL | Main image |
videos | Array | Array of embedded video URLs |
keywords | Array | Array of keywords that appear in the article |
summary | String | Summary of the article |
body | String | Main body of the article |
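The fields in the table can be mapped onto a small typed structure on the client side. The following is a minimal Python sketch, not part of the API itself: the `Article` class and `parse_article` helper are hypothetical names introduced for illustration, and `published` is kept as a raw value because the date format is not specified here.

```python
import json
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class Article:
    """Hypothetical container mirroring the documented response fields."""
    title: str = ""
    author: str = ""
    published: Optional[str] = None  # assumption: kept raw, date format is not documented
    url: str = ""
    image: Optional[str] = None
    videos: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    summary: str = ""
    body: str = ""


def parse_article(raw_json: str) -> Article:
    """Decode a response string, falling back to defaults for missing or null fields."""
    data = json.loads(raw_json)
    names = [f.name for f in fields(Article)]
    # Keep only documented fields that are present and not null.
    present = {name: data[name] for name in names if data.get(name) is not None}
    return Article(**present)


# Abbreviated copy of the example response shown earlier.
sample = """
{
    "title": "Air traffic controllers release timelapse tour of UK airspace",
    "author": "",
    "published": null,
    "url": "http://www.bbc.co.uk/news/uk-30108947",
    "videos": [],
    "keywords": ["airspace", "nats"]
}
"""

article = parse_article(sample)
print(article.title)
print(article.keywords)
```

Because extraction of individual parts is not guaranteed, the sketch gives every field a default, so code that reads `article.author` or `article.published` does not fail when the parser could not find a value.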
Use the following cURL command to parse a document, replacing the text API_KEY with your API key:
curl --request GET \
--url 'https://document-parser-api.lateral.io/?url=http://www.bbc.com/news/31047780' \
--header 'content-type: application/json' \
--header 'subscription-key: API_KEY'
The example above parses the URL http://www.bbc.com/news/31047780. Before you can run it, you will need to get an API key.
Pipe the output of the cURL command to Python's json.tool module:
curl --request GET \
--url 'https://document-parser-api.lateral.io/?url=http://www.bbc.com/news/31047780' \
--header 'content-type: application/json' \
--header 'subscription-key: API_KEY' | python -m json.tool
The output of the cURL command is a JSON object as specified above. To pretty-print the returned JSON for testing, pipe the output of the cURL command to Python's json.tool module as shown above (this requires Python 2.6 or later).
To call the API in your programming language of choice, check out the API specification, which includes code samples.
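As one illustration, the cURL request above might be written in Python using only the standard library. This is a rough sketch rather than an official client: the endpoint and the `subscription-key` header are taken from the cURL example, while the function names `build_request` and `extract_article`, the timeout value, and the percent-encoding of the `url` parameter are assumptions.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the cURL example.
API_ENDPOINT = "https://document-parser-api.lateral.io/"


def build_request(article_url: str, api_key: str) -> urllib.request.Request:
    """Build the GET request, passing the article URL as the `url` query parameter."""
    # Assumption: the API accepts a percent-encoded `url` value.
    query = urllib.parse.urlencode({"url": article_url})
    return urllib.request.Request(
        f"{API_ENDPOINT}?{query}",
        headers={
            "content-type": "application/json",
            "subscription-key": api_key,
        },
        method="GET",
    )


def extract_article(article_url: str, api_key: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body into a dict."""
    request = build_request(article_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `extract_article("http://www.bbc.com/news/31047780", "API_KEY")` with a real key in place of API_KEY should return the parsed article as a dictionary, and `json.dumps(result, indent=4)` gives the same pretty-printed view as piping to json.tool.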
Simply enter your details below and we'll email your API key to you!