v/programming: How can one collect the details of articles from websites?


1 25 May 2018 15:11 by u/Plant_Boy

When you post a link or video to Voat, it has the ability to collect the title of the media or document and label it as the title for the post.

My understanding is that it collects the metadata from the host website and extracts the title.

I do not understand the process that well, though. Does anyone know how to extract the title, author/publisher, date and other data from a link?

Is there an HTML command? I want to pass the received data through some Python logic to apply filters, but that's my future work.

17 comments

0 u/RustyEquipment 25 May 2018 15:16

DISCLAIMER: Not an expert, kinda rusty at HTML.

HTML is a markup language, similar to XML, that is written out with tags: <html> </html> <title> <body>

In order to grab the title or such, you would need to know the tag, and then you should be able to parse it out. As far as how to do that... not an expert, but that should give you a bit more understanding, I think. You would be looking for some sort of HTML/XML parse function to grab the title tag....

corrections?

0 u/Plant_Boy [OP] 25 May 2018 15:40

Gives me a direction to start looking!

0 u/Reddit_traitor 25 May 2018 15:23

cURL

https://en.wikipedia.org/wiki/CURL
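
Since the OP mentions Python: a rough equivalent of "cURL the page" using only the standard library might look like the sketch below. The function name fetch_html, the User-Agent string, and the 10 second timeout are all made up for illustration, and the data: URL is just an offline stand-in for a real http(s) link.

```python
import urllib.request
from urllib.parse import quote


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (illustrative helper)."""
    # Some sites reject requests without a User-Agent; this value is a placeholder.
    req = urllib.request.Request(url, headers={"User-Agent": "title-fetcher-sketch/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # Use the charset declared by the server if there is one, else assume UTF-8.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Offline stand-in for a real link: a data: URL carrying a tiny HTML page.
demo_url = "data:text/html;charset=utf-8," + quote("<html><head><title>Hi</title></head></html>")
page = fetch_html(demo_url)
print(page)
```

The returned string is what you would then hand to a parser to pull out the title.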

0 u/Plant_Boy [OP] 25 May 2018 15:40

Ta, I'll have a look there and see what I can glean!

0 u/Reddit_traitor 25 May 2018 16:28

I figured the wiki would get you a good start

0 u/Veridic 25 May 2018 16:57

Beautiful Soup, Python.

0 u/Plant_Boy [OP] 25 May 2018 17:13

This looks like something I'd like to bodge into my code! Thanks!

0 u/qwop 25 May 2018 23:22

Extracting the title of a web page is a very simple process.

The HTML specification says each web page should have a <title> element in the <head> of the page document. This <title> is what you see at the top of your web browser as the name of the web page in whatever tab you have open.

All Voat is doing when you post a link is loading the target web page, then looking at the HTML code and extracting the text in the <title> tag.

Here's an example in Python, using lxml (or you could use beautifulsoup too):

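import urllib.request
import lxml.html

url = "https://example.com/"  # placeholder; substitute the link you want to inspect
# Load and parse the target web page (urlopen handles https, which lxml's own URL loader may not)
doc = lxml.html.parse(urllib.request.urlopen(url))
# Search for the <title> tag, and extract the text within + print the result
print(doc.find(".//title").text)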

0 u/TheOmniscientOne 27 May 2018 05:52

https://stackoverflow.com/questions/4348912/get-title-of-website-via-link


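<?php
function get_title($url){
  $str = @file_get_contents($url); // returns false if the page could not be loaded
  if($str !== false && strlen($str)>0){
    $str = trim(preg_replace('/\s+/', ' ', $str)); // supports line breaks inside <title>
    // ignore case, allow attributes on <title>, and stop at the first </title>
    if(preg_match('/<title[^>]*>(.*?)<\/title>/i',$str,$title)){
      return trim($title[1]);
    }
  }
  return null; // no title found
}
//Example:
echo get_title("http://f4ct.co/");
?>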

0 u/TheOmniscientOne 27 May 2018 06:12

I would try a method that only grabbed headers, but it would be difficult to implement across all the various server types. What I am saying is that it would be great if you could just grab the header without the rest of the document. Some coders would pass the response directly into a buffer and close the connection as soon as they saw the </head> tag, but I think that would be rude. [And likely to get your bot blacklisted]

Here's some background on headers:

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers
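
For what it's worth, the "stop once you see </head>" idea could be sketched in Python roughly like this. Note that </head> closes the HTML head section, which is a different thing from the HTTP headers in the link above. The function name read_until_head_end and the chunk and size limits are invented for the example, and the BytesIO object is only a stand-in for a network response; the object returned by urllib.request.urlopen can be read in chunks the same way. The caveat about this being rude to servers still applies.

```python
import io


def read_until_head_end(stream, chunk_size=1024, max_bytes=65536):
    """Read from a binary file-like object until </head> appears or max_bytes is reached."""
    buf = b""
    while len(buf) < max_bytes:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream
            break
        buf += chunk
        end = buf.lower().find(b"</head>")
        if end != -1:
            # Keep everything up to and including the closing </head> tag.
            return buf[: end + len(b"</head>")]
    return buf


# Stand-in for a network response with a large body after the head.
fake_response = io.BytesIO(
    b"<html><head><title>Hi</title></head><body>" + b"x" * 10000 + b"</body></html>"
)
head = read_until_head_end(fake_response, chunk_size=16)
print(head.decode("utf-8"))
```

The max_bytes cap is there so a page with no </head> at all does not get read in full.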

0 u/Plant_Boy [OP] 27 May 2018 10:42

PHP may be something I have to look more into.

I'm attempting to learn programming in my spare time and Python has been the dominant language in that time.

0 u/TheOmniscientOne 27 May 2018 17:45

Stick with Python rather than PHP. I only use PHP because at the time I developed some of my first big apps it had more modules and portability. Python is better.

0 u/Plant_Boy [OP] 27 May 2018 18:37

I suppose it's the Age vs Potential thing. PHP has a lot of developed features due to it being used more frequently in the past, but Python has more potential, it just needs to be unlocked?

0 u/TheOmniscientOne 27 May 2018 20:49

Everything else being equal, Python is more elegant.

I am sticking with my language right now because my codebase uses lots of tricks like case statements [switch in PHP], which actually don't even evaluate/parse past the matching case and are therefore an easy way to segregate functions for extra security. I also misuse the heck out of MySQL variables, although that is pretty portable as long as I can get DB hooks for the same generation of MySQL.

Python seems plenty mature now if I were to start over. I have used it for small web interfaces and it's great.

0 u/infamousEB 27 May 2018 11:24

Check out BeautifulSoup

0 u/Plant_Boy [OP] 27 May 2018 11:52

Someone else mentioned it and I think it has relevant features!

0 u/psimonster 30 May 2018 00:53

For general HTML parsing with Python: https://docs.python.org/3/library/html.parser.html
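
As a minimal sketch of how that module could cover the OP's question (title, author, date), something like the following might work. It assumes the page exposes its author in a <meta name="author"> tag and its date in a <meta property="article:published_time"> tag, which many sites do but by no means all. The names MetaExtractor and extract_details and the sample page are made up for the example.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the <title> text and <meta> name/property -> content pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # <meta> tags are keyed by either a "name" or a "property" attribute.
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self.title += data


def extract_details(html_text):
    """Return title, author and publication date found in an HTML string."""
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()
    return {
        "title": parser.title.strip(),
        "author": parser.meta.get("author"),  # None if the page has no such tag
        "published": parser.meta.get("article:published_time"),
    }


# Hypothetical sample page, not taken from any real site.
SAMPLE = """<html><head>
<title>Example Article</title>
<meta name="author" content="Jane Doe">
<meta property="article:published_time" content="2018-05-25T15:11:00Z">
</head><body><p>Text</p></body></html>"""

details = extract_details(SAMPLE)
print(details)
```

Fields the page does not provide come back as None, so any later filtering logic can check for that.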