Tim Wintle's Blog

Tim works at Team Rubber, where he uses Python, large computers, and some clever maths to look at the web in new ways. In his free time he codes various other bits of software, and web apps.

Tue, 08 Jan 2008

Matt Cutts Ranks for Viagra

When Wikia Search (the search engine from Wikia, the company co-founded by Wikipedia's Jimmy Wales) launched a couple of days ago, Google's anti-spam man Matt Cutts was quick to welcome them into the search space with an article on his blog.
With a screenshot from what I assume is the backend of Wikia (it was a screenshot of the Apache project's Nutch search engine; I have not found where he got it), he commented that SEOs (Search Engine Optimisers) would be quick off the mark to try to work out how knowing that your page had a field co-norm of 0.2187 would mean that it would rank highly for the term "viagra".
Unfortunately for him, his site is held in such regard by search engines that the mere mention of the word "viagra" sent his face straight to the top of the Wikia rankings for that term.
I find this interesting for several reasons. Firstly, Wikia obviously has some fairly good anti-spam systems: think of all the sites out there that have been trying to rank for that term. The fact that they picked up on mattcutts.com (a well-respected website) rather than any of the spammy sites is a really good sign.
Secondly, his name shot to the top very fast. Wikia seems to have launched with the same speed of crawling as Google now has - this is going to be a tighter battle than we may have thought.
I asked Matt if he thought that the openness of Wikia (which is completely open source and allows webmasters to see the internal workings of their search engine) may make his job of detecting web-spam harder. He was unsure if Wikia would create a new generation of SEOs who actually understood the inner workings of a search engine, or if this would make a difference, but he was sure that "some SEOs are brushing up on phrases like IDF." (Note: IDF = Inverse Document Frequency.)
Ironically, this may (at least initially) work in his favour. Not so long ago the talk on the SEO forums was about how to increase your term frequencies (which is often also one of the best measures of spam - although most SEOs ignored that minor point). Now perhaps people will start data mining a bit more to find out what words are actually important. Of course this, too, is something that can be picked up on by Matt Cutts's team at the Googleplex, but at least the average SEO may get a better view of the system they are trying to work with.
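For anyone brushing up on those phrases, here is a minimal sketch of term frequency and inverse document frequency. The corpus, the function names and the plain log(N/df) form of IDF are assumptions made for illustration; this is the common textbook variant, not a description of what Wikia, Nutch or Google actually compute.

```python
import math
from collections import Counter

def term_frequency(term, tokens):
    """Fraction of the tokens in one document that are `term`."""
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)

def inverse_document_frequency(term, corpus):
    """log(N / df): terms that appear in fewer documents score higher."""
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        return 0.0  # term never seen; avoid dividing by zero
    return math.log(len(corpus) / df)

def tf_idf(term, tokens, corpus):
    """Weight of `term` in one document, relative to the whole corpus."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

# Hypothetical toy corpus, already split into words.
corpus = [
    "cheap pills cheap pills cheap pills".split(),
    "search engine spam and cheap tricks".split(),
    "open source search engine launched".split(),
]

for term in ("cheap", "launched"):
    print(term, [round(tf_idf(term, doc, corpus), 3) for doc in corpus])
```

Stuffing a page with a word pushes its term frequency up, which is exactly the pattern a spam filter can look for. The IDF half depends on the whole corpus, so no single page can change it.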
Of course, if you are looking at the new search engines, why not try the viral video search engine I have been working on, which is focused on discovery of new video content? You can also sign up to get custom video feeds for your site, or, if you are a webmaster looking for income from viral video advertisements, check out the new ad system I have also helped with.


TrackBack ping me at:
http://www.timwintle.co.uk/blog.pl/Search/matt-cutts-ranks-viagra.trackback

Wed, 21 Nov 2007

What's on God's iPod?

If you even need to ask what is on God's iPod - why, Led Zeppelin of course.


TrackBack ping me at:
http://www.timwintle.co.uk/blog.pl/Search/whats-on-gods-ipos.trackback

Tue, 16 Oct 2007

Viral Sauce - pre-alpha

So, I have posted a bit about my work on the Viral Content Network (Developer Blog), but now you can actually give it a test-drive.
Viral Sauce is a video search engine using the technology I have been working on, which tries to match you up with what it hopes will be interesting content for you.
Here is where Viral Sauce is different: rather than searching for a particular video that you happen to know the title of, Viral Sauce is designed to let you enter a search term based on what kind of content you would like to see (e.g. search for "cats"). The site will try to find content that is relevant to cats, while trying to keep the results fresh and interesting.
In order to try to work out what you would like, we took a different approach from the other video search engines, developing vRank (one of the ingredients in our "Secret Sauce"), which is our measure of how popular a video is to the appropriate viewer. Rather than simply measuring the "Most Viewed" videos, which gives content that fits the lowest common denominator, vRank helps Viral Sauce to find content that you probably have not seen before and that, in our tests, is probably more relevant to your search.
Please note the index is still growing fast, so there are still some search phrases that do not bring up any results, but that will decrease as we crawl more and more videos.


TrackBack ping me at:
http://www.timwintle.co.uk/blog.pl/Search/viral-sauce-video-search-engine.trackback

Fri, 18 May 2007

New Look Google, but no Surprises

The changes to Google announced on Wednesday at the Searchology day constitute the single largest update to the search giant ever, but was any of it that surprising?

Let us recap. The main new announcements (which went live a few hours later) were the introduction of a subtle new navigation menu in the top left, the launch of Google Experimental (for all the Google obsessed), the news that Google Video will now index videos from all over the web (and not just Google Video and YouTube), and (most notably) the introduction of Google Universal Search. Google Universal Search means that results from all the Google search services (Web, News, Images, Video, Scholar, Foreign Language search ...) will now all be ranked against each other for how well they suit your query.

This finally explains why Google have been focusing on all of these separate search engines for years, while not even placing an obvious link to most of them - they have been beta testing in preparation for their next step towards their goal of "indexing the world's knowledge". It also explains the attention that Googlers, from Larry Page to Matt Cutts and Vanessa Fox, have been giving to consolidation of the different aspects of Google services when asked what to look out for from the Googleplex this year.

But for those keeping up with search, none of this came as a surprise. I have been using the new menu system for about a month thanks to Google Operating System, who posted a way to force your browser to display the new version. The experimentation with the one box (which is now replaced by the results shown inside the main web results) has been fairly blatant, too.

For Phill Midwinter (a fellow Brit), it was even less of a shock, though, having suggested in an article a few months ago that search companies should work on nearly all of the improvements announced. While he is keen to point out that he is not suggesting that Google copied his ideas, he is clearly disappointed by the lack of drastic action taken by Google.

While I agree with the expectation that Google Experimental should have some truly groundbreaking work on there (rather than two of the "Experiments" simply putting the navigation bar on the left and on the right respectively), I think that Google have handled this very well. For a company that became popular largely due to its simplified look and ease of use, any change was going to be hotly talked about. Google have done what a few years ago would have been unthinkable - added more things to their pages - but they have done it in a way that is still simple to view and to use.

I am also very impressed by the relative ranking of the results from multiple search indexes. For a start, for most searches you now do on Google, your search is being done not through one set of data (technically it was two, the Google index and the Supplemental index), but through a whole selection of different databases. Secondly, they are managing to compare these results to each other. I have not yet stumbled upon any publication/patent that gives me a precise idea of how they are doing it, but I would guess there has been enough interesting research on diffeomorphisms and co-ordinate transforms to keep aspirin supplies flowing into the Googleplex for the past few months as engineers try not to let their heads burn out.
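To make the difficulty concrete, here is a sketch of about the simplest possible approach: rescale each index's raw scores to a common range (min-max normalisation) and then merge. This is not Google's method, which is not known here; the index names, documents, scores and function names are all invented for the example.

```python
def normalise(results):
    """Rescale one index's raw scores to the range [0, 1] (min-max)."""
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    if high == low:
        # Every result scored the same; treat them all as top results.
        return [(doc, 1.0) for doc, _ in results]
    return [(doc, (score - low) / (high - low)) for doc, score in results]

def merge(indexes):
    """Combine per-index result lists into one list, best score first."""
    combined = []
    for name, results in indexes.items():
        if not results:
            continue  # an index with no hits contributes nothing
        for doc, score in normalise(results):
            combined.append((score, name, doc))
    # sorted() is stable, so ties keep the order the indexes were given in.
    return sorted(combined, key=lambda item: item[0], reverse=True)

# Hypothetical raw scores on very different scales.
indexes = {
    "web": [("page-a", 12.0), ("page-b", 9.0), ("page-c", 3.0)],
    "video": [("clip-a", 0.9), ("clip-b", 0.4)],
}

merged = merge(indexes)
for score, name, doc in merged:
    print(f"{score:.2f}  {name:5}  {doc}")
```

The weakness is obvious: the worst hit in every index is forced to 0 and the best to 1, however good or bad that index is for the query. Any serious system needs a mapping between score spaces that respects how relevant each index really is, which is where the hard maths comes in.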



TrackBack ping me at:
http://www.timwintle.co.uk/blog.pl/Search/new-google-search.trackback

Sun, 08 Apr 2007

Why tagging can take us further away from the semantic web?

There is a lot of talk these days about the "Semantic Web", and some purists suggesting that we tag all data. I believe that, while tagging may be widespread now due to its relative ease of implementation, it is likely to hurt the long-term aims of the semantic web.

Firstly, let us define semantics. Here is the Wikipedia definition:

Semantics ... refers to the aspects of meaning that are expressed in a language, code, or other form of representation. Semantics is contrasted with two other aspects of meaningful expression, namely, syntax, the construction of complex signs from simpler signs, and pragmatics, the practical use of signs by agents or communities of interpretation in particular circumstances and contexts. ...semantics may also denote the theoretical study of meaning in systems of signs.

So, the idea is that every item on the web can be uniquely categorised by some series of symbols, which occur within an alphabet. In everyday linguistics, we would take the symbols to be words, and the alphabet to be all the words in the dictionary. Notice that this is separate from the order of the words and punctuation.

Regarding natural language, there are two possibilities:

  • Language is fully capable of describing the entire concept of a document
  • Language can only describe a subset of concepts
Most people (including myself) would fall into the first category. We can then separate this into two further groups:
  • The semantics of natural language (i.e. words used) are fully capable of describing the entire concept of a document
  • Language can only describe all concepts when it includes the syntax and pragmatics
Here I would fall into the second category. It seems that an unstructured list of words cannot describe a document uniquely in its entirety. For a (very basic) example of the problem, "Suits Black Cat" and "Black Cat Suits" are two very different concepts, but they are semantically identical. (Note to replies - I know there is not technically an isomorphism here, but I do not want to get too deep into the maths/philosophy of this here. If someone has evidence against this please comment).
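The "Black Cat" example can be checked in a couple of lines. This is only a sketch: it treats "semantics" as a plain bag of lower-cased words, which is the simplification used in this post, and the function name is made up.

```python
from collections import Counter

def bag_of_words(text):
    """Lower-case the text and count each word, discarding word order."""
    return Counter(text.lower().split())

first = "Suits Black Cat"
second = "Black Cat Suits"

# Same words, same counts: the two bags are indistinguishable...
same_bag = bag_of_words(first) == bag_of_words(second)
# ...even though the ordered word sequences differ.
same_order = first.lower().split() == second.lower().split()

print(same_bag, same_order)
```

Once the order has been thrown away there is nothing left in the representation to tell the two phrases apart.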

Now for some comments on tagging:

  1. Tags tend to be taken from a smaller alphabet than words used in articles / web pages / full transcripts (in the case of video/audio). Basically, in the full text, an author will probably have used more than one synonym, whereas in selecting tags, people are more likely to choose the most commonly used word.
  2. Tagging removes punctuation. This is not technically removing any semantics from those used in the text; however, it is perfectly possible to create semantics describing a page which relate to the grammar and linguistics. This is an ability to effectively increase the alphabet size that is missed by tagging.
  3. Tagging only uses one occurrence of each tag - this removes the ability to make use of the density of a word. Imagine you are putting up some new shelves. You measure your wall to see how long you want them, but your tape measure only has two marks, 0 and 1. Your wall is nearer 1, so you go to Ikea to get your shelves (which are also marked 0 or 1), and just have to hope that they fit.
Clearly, then, tagging effectively provides a lower number of available semantics that can be used for classification than natural language. This reduces the number of unique items that can be described using semantics directly derived from tags.
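The tape measure with only two marks can be sketched like this. The two example pages and the function names are invented; the point is only that a tag set records presence (0 or 1) where the full text records how often each word occurs.

```python
from collections import Counter

def word_counts(text):
    """Full-text view: how many times each word occurs."""
    return Counter(text.lower().split())

def tags(text):
    """Tag view: each word is either present or absent."""
    return set(text.lower().split())

# Hypothetical pages: one mostly about cats, one mostly about dogs.
page_about_cats = "cat cat cat cat dog"
page_about_dogs = "dog dog dog dog cat"

# The tag sets cannot tell the two pages apart...
tags_match = tags(page_about_cats) == tags(page_about_dogs)
# ...but the word counts can.
counts_match = word_counts(page_about_cats) == word_counts(page_about_dogs)

def density(word, text):
    """Share of the words on the page that are `word`."""
    counts = word_counts(text)
    return counts[word] / sum(counts.values())

print(tags_match, counts_match)
print(density("cat", page_about_cats), density("cat", page_about_dogs))
```

With tags alone both pages are simply "cat, dog"; the density that says which page is really about cats has been measured with a 0/1 ruler and lost.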

But how can this harm the semantic web? I hear you ask. Well, the more that tagging gets used, the more that we change the distribution of these words in our overall semantic, making it harder for people to fairly extract semantic data in the future.
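A toy sketch of that shift, assuming the tags end up mixed into the same text that a future algorithm mines. The pages and tags are invented, and every tagger is assumed to pick the most common synonym, as in point 1 of the list of comments on tagging.

```python
from collections import Counter

def relative_frequencies(documents):
    """Share of all words in the collection taken up by each word."""
    counts = Counter(
        word for doc in documents for word in doc.lower().split()
    )
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical pages using several synonyms for the same animal.
pages = ["my kitten chased a moggy", "the feline slept", "a cat sat"]
# Hypothetical tags: everyone picks the most commonly used word.
page_tags = ["cat", "cat", "cat"]

before = relative_frequencies(pages)
after = relative_frequencies(
    page + " " + tag for page, tag in zip(pages, page_tags)
)

for word in ("cat", "kitten", "feline"):
    print(word, round(before[word], 3), "->", round(after[word], 3))
```

The common word's share grows while every synonym's share shrinks, so statistics gathered from the tagged web no longer reflect how authors actually write. Keeping tags clearly separated from the body text lets a future algorithm leave them out of the count.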

In conclusion, if you are designing a site with tagging, that is all very well for usability, and for the semantic web in the stage we are at. All this tagging may, however, have a detrimental effect on the growth of the true semantic web, so please try to separate them off, and make it clear they are tags, as this will make it much easier for future algorithms, and the evolution of the web.



TrackBack ping me at:
http://www.timwintle.co.uk/blog.pl/Search/tagging-semantic-web.trackback