Post 6. Processing Words

With the design work progressing, I focused more on the technical side of sourcing related media. The first experiment used a dataset of a million headlines and a string similarity library to see how much similarity I needed to get articles that were actually related, not just ones that share the same words.
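
As a rough sketch of that kind of comparison: the snippet below scores two headlines with a Dice coefficient over character bigrams and keeps only candidates above a threshold. The metric, the function names and the 0.5 threshold are all assumptions for illustration, not necessarily what the library I used does.

```python
from collections import Counter


def bigrams(text):
    """Count the character bigrams of each word, ignoring case."""
    counts = Counter()
    for word in text.lower().split():
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += 1
    return counts


def dice_similarity(a, b):
    """Dice coefficient: 2 * shared bigrams / total bigrams, from 0.0 to 1.0."""
    first, second = bigrams(a), bigrams(b)
    total = sum(first.values()) + sum(second.values())
    if total == 0:
        return 0.0
    overlap = sum((first & second).values())  # multiset intersection
    return 2 * overlap / total


def related_headlines(target, candidates, threshold=0.5):
    """Return (score, headline) pairs at or above the threshold, best first."""
    scored = [(dice_similarity(target, c), c) for c in candidates]
    return sorted([pair for pair in scored if pair[0] >= threshold], reverse=True)
```

Raising the threshold returns near-duplicate headlines; lowering it lets in headlines that merely share a few common words, which is the trade-off the experiment was probing.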

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/89cd5dd3-5575-47ab-b46c-bf3845307e03/Screen_Recording_2020-04-25_at_16.48.16.mp4

I didn’t want to limit this project to just media with imagery, as this would violate the goals, so I began thinking about how I could supplement the imagery. The article above explains a process of using word recognition to suggest images, but I fail to see how this could work for more tense subjects beyond wine and harvesting, as given in the examples.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/bc4b0938-3f69-44a0-af95-414fb93e1735/Untitled1.mp4

Next I experimented with the popularity of words used in headlines, this time using NewsAPI. The tool searches the news for the word you select and only highlights news terms. I was hoping to find a pattern of repeated and new words. It kind of works as a visualisation tool, but the headlines continued to be unique and never led me back around in a loop. It is excellent for discovering other stuff, but not related stuff.
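
The repeated-versus-new idea can be sketched without any API at all. The snippet below counts how many headlines each word has appeared in, then splits a fresh headline into words already seen and words that are new. The function names and the tokenising rule are my own assumptions here, and the headlines would come from NewsAPI in the real experiment.

```python
import re
from collections import Counter


def tokenize(headline):
    """Lower-case a headline and split it into simple word tokens."""
    return re.findall(r"[a-z']+", headline.lower())


def word_popularity(headlines):
    """Count the number of headlines each word appears in."""
    counts = Counter()
    for headline in headlines:
        counts.update(set(tokenize(headline)))  # one vote per headline
    return counts


def split_repeated_and_new(headline, seen_counts):
    """Separate a headline's words into already-seen and never-seen lists."""
    words = tokenize(headline)
    repeated = [w for w in words if seen_counts[w] > 0]
    new = [w for w in words if seen_counts[w] == 0]
    return repeated, new
```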

https://dull-glittery-gambler.glitch.me

At this point I moved from the rate-limited APIs to the Google News RSS feed, which isn’t rate limited and gives 100 articles from a much larger range of publishers. In the above image I am just loading related articles based on the top stories from Google News. So that’s about 4,000 headlines in this example.
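
Reading headlines out of an RSS feed needs nothing beyond a standard XML parser. A minimal sketch, assuming the usual RSS 2.0 layout of `item` elements with a `title` and `link`; the sample feed below is hand-written for illustration and is not real Google News output.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a downloaded feed (illustrative only).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Floods hit city centre</title>
      <link>https://example.com/floods</link>
    </item>
    <item>
      <title>Council meets over flood defences</title>
      <link>https://example.com/council</link>
    </item>
  </channel>
</rss>"""


def parse_headlines(rss_text):
    """Return a list of {'title', 'link'} dicts, one per RSS item."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        if title:  # skip items that have no headline
            items.append({"title": title, "link": link})
    return items
```

Running the same parser over one search feed per top story is how a few dozen stories turn into thousands of headlines.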

https://fantastic-gem-cat.glitch.me/

Then I searched for the Open Graph og:image of each of these articles to look for patterns. The biggest issue here is the excessive use of both stock imagery and the same licensed images. This is an understandable constraint, but not one the public has on social media.
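
For reference, pulling the og:image out of a page and spotting reused images could look roughly like this. It is a sketch that assumes the article HTML has already been downloaded; the class and function names are made up for the example.

```python
from collections import Counter
from html.parser import HTMLParser


class OpenGraphImageParser(HTMLParser):
    """Remember the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.image_url is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:image":
            self.image_url = attributes.get("content")


def extract_og_image(html):
    """Return the og:image URL of a page, or None if it has none."""
    parser = OpenGraphImageParser()
    parser.feed(html)
    return parser.image_url


def shared_images(pages):
    """Map each image URL used by more than one page to its usage count."""
    urls = (extract_og_image(page) for page in pages)
    counts = Counter(url for url in urls if url)
    return {url: n for url, n in counts.items() if n > 1}
```

This only catches exact URL matches. The same licensed photo served from different publishers' own servers would need an image comparison instead.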

The standard news APIs tended to serve me either very PC, uninformative articles or a very broad range. I found that the r/worldnews/controversial page had much better content, partially due to the human curation. I began using this as my API endpoint for getting news stories instead.
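
Reddit listing pages can also be requested as JSON, with the posts nested under `data.children`. A minimal sketch of reading stories out of such a listing is below. The sample data is invented, and the exact URL and request headers I used are not shown, so treat any endpoint details as assumptions.

```python
# Invented sample in the general shape of a Reddit listing response.
SAMPLE_LISTING = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "Floods hit city centre",
                                    "url": "https://example.com/floods"}},
            {"kind": "t3", "data": {"title": "Council meets over flood defences",
                                    "url": "https://example.com/council"}},
        ]
    },
}


def stories_from_listing(listing):
    """Return (title, url) pairs for every post in a listing dict."""
    stories = []
    for child in listing.get("data", {}).get("children", []):
        post = child.get("data", {})
        if post.get("title"):  # ignore entries without a headline
            stories.append((post["title"], post.get("url", "")))
    return stories
```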

https://lime-spotless-shoulder.glitch.me/

To search for more stories I need terms related to the story. In the page above I compare keywords generated by the keyword-extractor module from the headline and body with the keywords from the Open Graph page data, then use these to search for related tweets.
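
The comparison can be approximated with a plain stop-word filter. This is a stand-in written for illustration, not the keyword-extractor module itself: the stop-word list is deliberately tiny and the function names are assumptions.

```python
import re

# A tiny illustrative stop-word list; a real one would be far longer.
STOP_WORDS = {"a", "an", "and", "as", "at", "for", "in", "of", "on", "the", "to"}


def extract_keywords(text):
    """Return the unique non-stop-words of a text, in order of appearance."""
    keywords = []
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        if word not in STOP_WORDS and word not in keywords:
            keywords.append(word)
    return keywords


def compare_keywords(headline, page_keywords):
    """Compare headline keywords with a comma-separated page keyword string."""
    from_headline = extract_keywords(headline)
    from_page = [k.strip().lower() for k in page_keywords.split(",") if k.strip()]
    shared = [k for k in from_headline if k in from_page]
    return {"headline": from_headline, "page": from_page, "shared": shared}
```

The `shared` list is a reasonable first guess at search terms, since both the headline writer and whoever tagged the page agreed on those words.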

https://twitter.com/tesseralis/status/1260421410192650241?s=21

Another issue with programmatic word recognition is its complete lack of understanding. This is from a project to remove hate speech.

So far the best way to aggregate quality content is to use the aggregation made by people.