Difference between revisions of "Cweb"

From ProgClub

Revision as of 11:02, 8 August 2011

Cweb is a Blackbrick project hosted at ProgClub. It will be licensed under the GPL. "Cweb" stands for "Collaborative Web"; essentially, the software is a distributed search engine implemented on a 64-bit LAMP platform.

The site will be implemented by a distributed set of providers. To become a provider, a user will need to register their system with ProgClub/Blackbrick. They will get a host entry in the cweb.blackbrick.com DNS zone, so for example my cweb provider site would be jj5.cweb.blackbrick.com. I will then need to set up my 64-bit LAMP server to host /blackbrick-cweb, and maybe set up an appropriate NAT rule on my home router pointing to my LAMP box. I'm not sure yet how I'm going to manage HTTPS and certificates. HTTPS would be nice, but maybe we'll make that a v2 feature. Ideally Blackbrick would be able to issue certificates for hosts in the cweb.blackbrick.com zone, but I'm not sure what would be involved in becoming a CA like that.

There will be a front-end for cweb on all provider sites in /blackbrick-cweb/. From this front-end the user will be able to submit queries, and also submit URLs they find useful, or not useful, for particular queries. Users will be able to go to their own, or others', cweb sites. Users will need to be logged in to their cweb site in order to submit useful/useless URLs. Votes for useful/useless URLs will be per cweb site, not per user.

Initially we won't be indexing the entire web. We'll start with HTML only, and have a list of domains that we support. As we grow we can enable the indexing of more domains. We'll start with useful sites such as en.wikipedia.org. Also, initially we will only be supporting English, because I don't know anything about other languages. To the extent that I can, I will design the system so that other languages can be incorporated as the project matures.

There will be a 'master' cweb site, available at master.cweb.blackbrick.com. I might speak to ProgSoc about getting a virtual machine on Morpheus to use as the cweb master. As the project matures there might be multiple IP addresses on master.cweb.blackbrick.com. The cweb master is responsible for:

  • Nominating and distributing the blacklist
  • Nominating and distributing Cweb IDs
  • Nominating and distributing Domain IDs
  • Nominating and distributing URL IDs
  • Nominating and distributing Cweb Providers
  • Coordinating the Word database

Cweb will need to be able to function in an untrusted environment, full of liars and spammers. So, provision will need to be made to facilitate data integrity. Essentially all cweb sites will record the Cweb ID of the site that provided them with particular data, and if that Cweb ID ever makes it onto the blacklist then all data from that site will be deleted.
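As a rough illustration of the provenance rule above, here is a minimal Python sketch (Python is used for brevity; the project itself targets PHP). The `Record` class, the `purge_blacklisted` function, the use of plain integers for Cweb IDs and the example URLs are all assumptions made for the example, not part of the design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One piece of index data plus the Cweb ID of the site that supplied it."""
    source_cweb_id: int  # assumption: Cweb IDs are modelled as plain integers
    payload: str


def purge_blacklisted(records, blacklist):
    """Return only the records whose source site is not on the blacklist."""
    return [r for r in records if r.source_cweb_id not in blacklist]


# Hypothetical data: site 1 is trusted, site 2 has just been blacklisted.
records = [
    Record(1, "http://en.wikipedia.org/wiki/Search_engine"),
    Record(2, "http://spam.example/"),
    Record(1, "http://en.wikipedia.org/wiki/Web_crawler"),
]
kept = purge_blacklisted(records, blacklist={2})
```

Because every record carries the Cweb ID of its source, removing a blacklisted site's data is a single filter over the stored records.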

Cweb will be designed to facilitate anonymous queries. This will work by having cweb sites forward queries at random. When a cweb site receives a query, it will pick a random integer from 1 to 10. If the number is less than 5 (i.e. 1, 2, 3 or 4) the site will handle the query itself. Otherwise (i.e. 5, 6, 7, 8, 9 or 10) it will forward the query to another cweb site for handling. This means that any query a cweb site receives has most likely been forwarded from another site, so the intent is that no-one on the network will be able to track the originator of a query.
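The forwarding rule can be sketched in Python as follows (a sketch only; the project itself targets PHP). The function names and the fixed seed are assumptions for the example, and the simulation ignores how the next site is chosen.

```python
import random

HANDLE_BELOW = 5  # rolls of 1-4 are handled locally, 5-10 are forwarded


def should_handle(rng):
    """Roll an integer from 1 to 10 and report whether this site handles the query."""
    return rng.randint(1, 10) < HANDLE_BELOW


def sites_visited(rng):
    """Count how many sites a query passes through before one handles it."""
    visited = 1
    while not should_handle(rng):
        visited += 1
    return visited


rng = random.Random(2011)  # fixed seed so the run is repeatable
trials = [sites_visited(rng) for _ in range(100_000)]
average_sites = sum(trials) / len(trials)
forwarded_share = sum(1 for t in trials if t > 1) / len(trials)
```

Each site handles a query with probability 4/10, so a query visits 1 / 0.4 = 2.5 sites on average and about 60% of queries are forwarded at least once. The simulated `average_sites` and `forwarded_share` should land close to those figures.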

Cweb will have an HTTP client, an HTML parser, a CSS parser and a JavaScript parser. It will create the HTML DOM, apply the CSS and run the JavaScript. It will then establish what is visible text, and record that for indexing. Runs of whitespace will be converted to a single space character for the index data. There are a few open issues, such as what to do with HTML metadata or image alt text. My first impression is that metadata should be ignored, and alt text should be included in the index. Our HTML environment will be implemented in PHP, and to the extent that we can we will make our facilities compliant with web standards (rather than with particular user agents, e.g. Internet Explorer or Firefox).
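The whitespace rule is simple enough to show directly. This is a minimal Python sketch (the real implementation would be in PHP); the function name `collapse_whitespace` is an assumption, and it treats any run matched by the regular expression `\s+` as "whitespace".

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text)


index_text = collapse_whitespace("Cweb \n\t is  a   search\nengine")
# index_text is "Cweb is a search engine"
```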

Link names are the text between HTML anchor tags. Some facility will be made for recording and distributing link names, as they are probably trustworthy metadata about a URL. If I link to http://www.progclub.org/wiki/Cweb and call my link "Cweb", then there's a good chance that the URL I've linked to is related to "Cweb". Also, if the text "Cweb" appears in the URL, there's a good chance that the URL is related to "Cweb". Provisions should be made to incorporate this type of metadata into the index.
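To make the idea of a link name concrete, here is a small Python sketch that collects (URL, link name) pairs using the standard library's `html.parser` module. It stands in for the PHP parser the project plans to write; the class name `LinkNameCollector` is an assumption, and anchors without an `href` are simply skipped.

```python
from html.parser import HTMLParser


class LinkNameCollector(HTMLParser):
    """Collect (href, link name) pairs from the anchor tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, name) pairs
        self._href = None  # href of the anchor currently open, if any
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the fragments and collapse runs of whitespace in the name.
            name = " ".join("".join(self._text).split())
            self.links.append((self._href, name))
            self._href = None


collector = LinkNameCollector()
collector.feed('<p>See <a href="http://www.progclub.org/wiki/Cweb">Cweb</a> for details.</p>')
collector.close()
links = collector.links
```

Here `links` holds the single pair linking http://www.progclub.org/wiki/Cweb to the name "Cweb", which is the kind of metadata that could be fed into the index.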

We will use UTF-8 as the encoding for our content. Initially we will only be supporting sites that provide their content in UTF-8. As the project matures our support will widen, and content that is not in UTF-8 will be converted to UTF-8 prior to processing.
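For the conversion step, a minimal Python sketch might look like this (the project itself targets PHP). It assumes the source charset is already known, for example from an HTTP header; the function name `to_utf8` is made up for the example and charset detection is not covered.

```python
def to_utf8(raw, declared_charset):
    """Decode bytes using the declared charset and re-encode them as UTF-8."""
    return raw.decode(declared_charset).encode("utf-8")


# Hypothetical input: the word "café" served as ISO-8859-1.
latin1_bytes = "café".encode("iso-8859-1")
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")
```

Content that is already UTF-8 passes through unchanged when the declared charset is "utf-8", so the same function can sit in front of all processing.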

We will develop a distributed word database. The word database will be used for deciding the following things about any given space-delimited string:

  • Does the word have 'sub-words'? For instance the word "1,234" has sub-words "1", ",", "234", "1," and ",234". The word "sub-word" has sub-words "sub", "-", "word", "sub-" and "-word".
  • Is the word punctuation?
  • Is the word a number, and if so what is its value?
  • Is the word a plural, and if so, what is the root word?
  • Is the word a common word, such as "a" or "the"? I'm not sure what we will do about indexing common words.
  • Does the word have synonyms, and what are they?
  • Is the word a proper name?
  • Is the word an email address, or a URL?
  • In which languages is the UTF-8 string a word?
  • What senses does the word have?
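The sub-word examples in the list above suggest one possible rule, sketched here in Python (the project itself targets PHP). The rule is inferred from the two examples and is an assumption: split the word into runs of letters, runs of digits and single punctuation characters, then take every contiguous run of those pieces that is shorter than the whole word. The names `TOKEN` and `sub_words` are made up for the example.

```python
import re

# Assumption: pieces are runs of letters, runs of digits, or a single
# punctuation (non-alphanumeric) character.
TOKEN = re.compile(r"[^\W\d_]+|\d+|[\W_]")


def sub_words(word):
    """Return every contiguous run of pieces that is shorter than the whole word."""
    tokens = TOKEN.findall(word)
    n = len(tokens)
    result = []
    for length in range(1, n):              # lengths 1 .. n-1, never the full word
        for start in range(n - length + 1):
            result.append("".join(tokens[start:start + length]))
    return result


number_parts = sub_words("1,234")
hyphen_parts = sub_words("sub-word")
```

With this rule, `number_parts` is "1", ",", "234", "1," and ",234", and `hyphen_parts` is "sub", "-", "word", "sub-" and "-word", matching the two examples. A word with a single piece, such as "cweb", has no sub-words.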

Initially I was planning to have Word IDs as a 64-bit number representing any given string, but I decided against this for a number of reasons. Firstly, a 64-bit ID takes 8 bytes, so a word needs more than 8 characters before the ID saves any space, and most words are fewer than 8 characters long (in English, anyway). Secondly, distributing the Word IDs would have created a lot of unnecessary overhead in the system. There would need to be a centralised repository responsible for nominating the ID of new words, and all systems would have to talk back to this repository whenever they encountered a word for which they didn't have the Word ID.
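The space argument can be checked with a few lines of Python. This is only an illustration: the function name `bytes_saved_by_id` and the sample words are assumptions, and the word is measured by its UTF-8 byte length.

```python
WORD_ID_BYTES = 8  # a 64-bit Word ID occupies 8 bytes


def bytes_saved_by_id(word):
    """Bytes saved by storing an 8-byte ID instead of the word's UTF-8 bytes.

    A negative result means the ID is larger than the word itself.
    """
    return len(word.encode("utf-8")) - WORD_ID_BYTES


savings = {w: bytes_saved_by_id(w) for w in ["a", "the", "search", "collaborative"]}
```

For these words the ID costs 7, 5 and 2 extra bytes for "a", "the" and "search", and only saves space (5 bytes) for "collaborative".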