Difference between revisions of "Cweb"

From ProgClub
Jump to: navigation, search
(Reverting spam.)
 
(15 intermediate revisions by 3 users not shown)
Line 10: Line 10:
  
 
* [[User:John|John]]
 
* [[User:John|John]]
 +
 +
All contributors have agreed to the terms of the [[ProgClub:Copyrights#ProgClub_projects|Contributor License Agreement]]. This excludes any upstream contributors who tend to have different administrative frameworks.
  
 
== Copyright ==
 
== Copyright ==
Line 15: Line 17:
 
Copyright 2011, [[Cweb#Contributors|Contributors]]. Licensed under the [[GPL]].
 
Copyright 2011, [[Cweb#Contributors|Contributors]]. Licensed under the [[GPL]].
  
== System design documentation ==
+
== Source code ==
 +
 
 +
Subversion project isn't configured yet.
 +
 
 +
== Links ==
 +
 
 +
* [https://github.com/deoxxa/mitsukeru mitsukeru] on [http://www.reddit.com/r/programming/comments/krzys/show_proggit_im_working_on_a_search_engine_does/ reddit]
 +
 
 +
== TODO ==
 +
 
 +
Things to do, in rough order of priority:
 +
 
 +
* Create project in svn
 +
 
 +
== Done ==
 +
 
 +
Stuff that's done. Latest on top.
 +
 
 +
* [[User:John|JE]] 2011-08-08: started documentation
 +
 
 +
== System design ==
  
 
Cweb is a Blackbrick project hosted at ProgClub. It will be licensed under the GPL. "Cweb" is for "Collaborative Web", and essentially the software is a distributed search engine implemented on a 64-bit LAMP platform.
 
Cweb is a Blackbrick project hosted at ProgClub. It will be licensed under the GPL. "Cweb" is for "Collaborative Web", and essentially the software is a distributed search engine implemented on a 64-bit LAMP platform.

Latest revision as of 16:08, 5 July 2012

This is a draft document, very much a work in progress. See the talk page for notes and caveats. See Projects for other projects.

Project status

In the planning phase.

Contributors

Members who have contributed to this project. Newest on top.

All contributors have agreed to the terms of the Contributor License Agreement. This excludes any upstream contributors who tend to have different administrative frameworks.

Copyright

Copyright 2011, Contributors. Licensed under the GPL.

Source code

Subversion project isn't configured yet.

Links

TODO

Things to do, in rough order of priority:

  • Create project in svn

Done

Stuff that's done. Latest on top.

  • JE 2011-08-08: started documentation

System design

Cweb is a Blackbrick project hosted at ProgClub. It will be licensed under the GPL. "Cweb" is for "Collaborative Web", and essentially the software is a distributed search engine implemented on a 64-bit LAMP platform.

The site will be implemented by a distributed set of providers. In order to become a provider a user will need to register their system with ProgClub/Blackbrick. They will get a host entry in the cweb.blackbrick.com DNS zone, so for example my cweb provider site would be jj5.cweb.blackbrick.com. I will then need to setup my 64-bit LAMP server to host /blackbrick-cweb, and maybe setup an appropriate NAT on my home router to my LAMP box.

Not sure yet how I'm going to manage HTTPS and certificates. HTTPS would be nice, but maybe we'll make that a v2 feature. Ideally Blackbrick would be able to issue certificates for hosts in the cweb.blackbrick.com zone, but I'm not sure what would be involved in becoming a CA like that. Self-signed certs might also be a possibility, although not preferable.

There will be a front-end for cweb on all provider sites in /blackbrick-cweb/. The user will be able to submit queries from this front-end, and also submit URLs they find useful/useless for particular queries. In this way cweb will accumulate a database of "useful" (and "useless") search results. Users will be able to go to their own, or others', cweb sites. Users will need to be logged-in to their cweb site in order to submit usefull/useless URLs. Votes for useful/useless URLs will be per cweb site, not per user.

Initially we won't be indexing the entire web. We'll start with HTML only, and have a list of domains that we support. As we grow we can enable the indexing of more domains. We'll start with domains like en.wikipedia.org and useful sites like that. Also, initially we will only be supporting English. That's because I don't know anything about other languages. To the extent that I can I will design so as to make the incorporation of other languages possible as the project matures.

There will be a 'master' cweb site, available from master.cweb.blackbrick.com. I might speak to ProgSoc about getting them to provide me a virtual machine on Morpheus for me to use as the cweb master. As the project matures there might be multiple IP addresses on master.cweb.blackbrick.com. The cweb master is responsible for:

  • Nominating and distributing the blacklist
  • Nominating and distributing Cweb Providers (name and 32-bit ID)
  • Nominating and distributing Domain IDs (name and 32-bit ID)
  • Nominating and distributing URL IDs (URL and 64-bit ID)
  • Nominating and distributing Query IDs (string and 64-bit ID)
  • Coordinating the Word database

Cweb will need to be able to function in an untrusted environment, full of liars and spammers. So, provision will need to be made to facilitate data integrity. Essentially all cweb sites will record the Cweb ID of the site that provided them with particular data, and if that Cweb ID ever makes it onto the blacklist then all data from that site will be deleted.

Cweb will be designed to facilitate anonymous queries. This will work by having Cweb sites forward queries at random. When a request for a query is received by a cweb site, cweb will pick a number between 1 and 10. If the number is less than 5 (i.e. 1, 2, 3 or 4) then cweb will handle the query. In any other case (i.e. 5, 6, 7, 8, 9 or 10) cweb will forward the query to another cweb site for handling. This will mean that when a request is received for a query by a cweb site, it is most likely that the request has been forwarded. In this way, no-one on the network will be able to track the originator of a query.

Cweb will have a HTTP client, a HTML parser, a CSS parser and a JavaScript parser. It will create the HTML DOM, and apply the CSS and run JavaScript. It will then establish what is visible text, and record that for indexing. Runs of white space will be converted to a single space character for the index data. There are a few issues, such as what to do with HTML meta data or image alt text. My first impression is that meta data should be ignored, and alt text should be included in the index. Our HTML environment will be implemented in PHP, and to the extent that we can we will make our facilities compliant with web-standards (rather than with particular user agents, e.g. Internet Explorer or Firefox).

Link names are the text between HTML anchor tags. Some facility will be made for recording and distributing link names, as they are probably trusted meta data about a URL. If I link to http://www.progclub.org/wiki/Cweb and call my link Cweb, then there's probably a good chance that the URL I've linked to is related to "Cweb". Also, if the text "Cweb" appears in the URL, there's a good chance that the URL is related to "CWeb". Provisions should be made to incorporate this type of meta data into the index.

We will use UTF-8 as the encoding for our content. Content that is not in UTF-8 will be converted to UTF-8 prior to processing. Initially we will only be supporting sites that provide their content in UTF-8. As the project matures our support will widen.

We will develop a distributed word database. The word database will be used for deciding the following things about any given space-delimited string:

  • Does the word have 'sub-words'. For instance the word "1,234" has sub-words "1", ",", "234", "1," and ",234". The word "sub-word" has sub-words "sub", "-", "word", "sub-" and "-word". The word "JohnElliot" has sub-words "John" and "Elliot". This might be determined algorithmically rather than by a database.
  • Is the word punctuation?
  • Is the word a number, and if so what is its value?
  • Is the word a plural, and if so, what is the root word?
  • Is the word a common word, such as "a", "the", etc. I'm not sure what we will do about indexing common words.
  • Does the word have synonyms, and what are they?
  • Is the word a proper name?
  • Is the word an email address, or a URL?
  • What languages is the UTF-8 string a word in?
  • What senses does the word have?

Initially I was planning to have Word IDs as a 64-bit number representing any given string, but I decided against this for a number of reasons. Firstly, you can get eight UTF-8 characters before you get any savings in terms of space, and most words are less than eight characters long (in English, any way). Secondly, distributing the Word IDs would have created a lot of unnecessary overhead in the system. There would need to be a centralised repository responsible for nominating the ID of new words, and all systems would have to talk back to this repository whenever they encountered a word for which they didn't have the Word ID.

The Cweb index will segregate data into "bundles" at indexing time. Bundles will be "heading text", "navigation text", "alt text" and "content". There might also be "meta data" bundles. So, if some text appears in a heading, it will go in the "heading text" bundle. If text appears in an image alt text attribute it will go in the "alt text" bundle. If text appears anywhere else, it will go in the "content" bundle. See below for the caveat concerning "navigation text". In this way we can weight heading text as more important than content text, navigation text as less important than content text, and alt text as more relevant to image search (when we support image search, which is initially a non-goal).

It might be a good idea to accumulate a database concerning the HTML element in which a web-page's content appears. For instance, in the domain "www.progclub.org", at the URL suffix "/wiki", the content of the page is in the HTML element with ID "content". This means that anything which is not below HTML ID "content" is "navigation text", and anything which is below "content" is "content". Similarly, for the domain "www.progclub.org", for the URL suffix "/blog", the content of the page is in the HTML element with ID "content". It might even be feasible to say that if there is no registered 'content' ID for a particular domain with a particular prefix, if the HTML ID 'content' (and perhaps other synonyms, such as 'main') is discovered, then it will be assumed that content not below that HTML element is navigation text. In the case where the defaults are not satisfactory, they can be overridden by the database.

Cweb sites will be given an a 32-bit ID. So, for example, the Cweb site jj5.cweb.blackbrick.com will have cweb name 'jj5', and cweb ID '1'. Domains will be given a 32-bit ID. So, for example, the domain www.progclub.org will be given domain ID '123'. URLs will be given a 64-bit ID. So, for example, the URL http://www.progclub.org/wiki/Cweb will be given URL ID '456'. Queries will be given a 64-bit ID. So, for example, the query "Cweb" will be given query ID '678' and the query "cweb" will be given query ID "679'.

There will be a table with the schema,

result ( cweb_id, query_id, url_id, weight ) key ( cweb_id, query_id, url_id );

In this table 'weight' will be a number between 0 and 100. The value 50 will indicate impartiality. A value below 50 indicates that the URL has been voted by cweb_id as being less than useful, to some degree. The closer to zero the value is, the less useful the link is considered. A value above 50 indicates that the URL has been voted by cweb_id as being useful, to some degree. The closer to one hundred the value is, the more useful the link is considered. The average weight can then be taken, and factored into the weight given to search results for particular queries.

Other tables in the system will be,

cweb ( cweb_id, cweb_name ) key ( cweb_id );
domain ( domain_id, domain_name, canonical_domain_id ) key ( domain_id );
url ( url_id, url, canonical_url_id ) key ( url_id );
query ( query_id, query ) key ( query_id );

There will be the following set of tables too,

cweb_source ( cweb_id, source_id ) key ( cweb_id, source_id );
domain_source( domain_id, source_id ) key ( domain_id, source_id );
url_source ( url_id, source_id ) key ( url_id, source_id );
query_source ( query_id, source_id ) key ( query_id, source_id );
result_source ( cweb_id, query_id, url_id, source_id ) key ( cweb_id, query_id, url_id, source_id );

The 'source_id' is the ID of the Cweb from which the data was retrieved. There can be multiple Cweb IDs associated with any given record. If any of the Cweb IDs associated with a record are registered on the blacklist, then the associated record is to be deleted.

There will also be the following tables,

http_content ( url_id, time_of_retrieval, mime_type, content );
html_content ( url_id, time_of_index, total_size, heading_text, 
  navigation_text, content_text, alt_text, title_text );
link_content ( source_url_id, url_id, title_text, anchor_text );

The master database will have the following additional tables,

cweb_availability ( cweb_id, url_blocks, queries_per_month );
cweb_url_allocation ( cweb_id, block_id );
cweb_query_allocation ( cweb_id, year, month, count );

URLs will be partitioned into blocks of 1000 URLs. The first block will be URL ID 1 to 999, the second block will be URL 1000 to 1999, the third block 2000 to 2999, and so on. Blocks will be indicated by the first ID in the block, so block 1 will be 1, block 2 will be 1000, block 3 will be 2000, and so on. Cweb sites can nominate how many URL blocks they will index. After a while we'll have a good idea of how much data is associate with the average URL, and we'll be able to let users nominate how much bandwidth they are willing to provide for indexing per month. We'll also be able to have users nominate how much bandwidth they are willing to provide for queries per month.

Cweb sites that are not the master will have a table,

query_max ( last_block_id );
cweb_query ( block_id, cweb_id, quota );

In order to satisfy a query, the requested cweb site will need to contact another cweb site for each block. Every time a cweb site interacts with the master site (for instance, when updating its blacklist) it will receive the 'last_block_id', which is essentially the maximum URL ID rounded down the the nearest 1000 (or 1 if its below 1000). So each cweb can keep its query_max table up-to-date with the number of blocks being indexed by the distributed system. Say the last block ID was 5000, meaning that the system was indexing up to 5999 URLs. In this case each cweb site would need to have an entry in its cweb_query table for each block in the system. Say the query was "Cweb". A site handling the query for "Cweb" would need to know which cweb's it needed to contact to search the entire index. It would need to have a cweb_query record for each block in the system, being 1, 1000, 2000, 3000, 4000 and 5000. It would start by looking for blocks that it already had, whose quota was greater than zero (when the quota reaches zero the cweb_query record is deleted, so essentially it will just be looking for cweb_query records that it has). Say it finds cweb_query records of { ( 1, 1, 50 ), ( 1000, 1, 50 ), ( 2000, 2, 100 ) }. It's then missing cweb_query records for block 3000, 4000 and 5000, so it will then contact the master server with the list of blocks it's missing results for. The master server will look at the request and find suitable servers to contact for each block. It might respond with { ( 3000, 1, 100 ), ( 4000, 2, 100 ), ( 5000, 3, 50 ) }. This means that for block 4 (URL ID = 3000), the query will go to cweb 1, up to 100 times; for block 5 (URL ID = 4000), the query will go to cweb 2, up to 100 times; for block 6 (URL ID = 5000), the query will go to cweb 3, up to 50 times; and so on. The third cell in each response is the quota. A cweb will record the quota, and decrement that cweb_query.quota by one each time it contacts a cweb to handle a request. When the quota reaches zero the cweb_query record will be deleted. In this way, a cweb can establish that for block 1, it can send index search requests to cweb 1 up to 100 times before it needs to contact the master server again for a cweb_query allocation. The master server will track how many allocations it has given each cweb in a given month, and if the allocation reaches the user's defined quota then no more cweb_query allocations will be made for the user's cweb site.