Difference between revisions of "Cweb"

From ProgClub
(Reverting spam.)
 
(6 intermediate revisions by 2 users not shown)
This is a draft document, very much a work in progress. See the [[Talk:Cweb|talk page]] for notes and caveats. See [[Projects]] for other projects.
 
== Project status ==
 
In the planning phase.
 
== Contributors ==
 
Members who have contributed to this project. Newest on top.
 
* [[User:John|John]]
 
All contributors have agreed to the terms of the [[ProgClub:Copyrights#ProgClub_projects|Contributor License Agreement]]. This excludes any upstream contributors who tend to have different administrative frameworks.
 
== Copyright ==
 
Copyright 2011, [[Cweb#Contributors|Contributors]]. Licensed under the [[GPL]].
 
== Source code ==
 
Subversion project isn't configured yet.
 
== Links ==
 
* [https://github.com/deoxxa/mitsukeru mitsukeru] on [http://www.reddit.com/r/programming/comments/krzys/show_proggit_im_working_on_a_search_engine_does/ reddit]
 
== TODO ==
 
Things to do, in rough order of priority:
 
* Create project in svn
 
== Done ==
 
Stuff that's done. Latest on top.
 
* [[User:John|JE]] 2011-08-08: started documentation
 
== System design ==
 
Cweb is a Blackbrick project hosted at ProgClub. It will be licensed under the GPL. "Cweb" is for "Collaborative Web", and essentially the software is a distributed search engine implemented on a 64-bit LAMP platform.
 
The site will be implemented by a distributed set of providers. In order to become a provider a user will need to register their system with ProgClub/Blackbrick. They will get a host entry in the cweb.blackbrick.com DNS zone, so for example my cweb provider site would be jj5.cweb.blackbrick.com. I will then need to set up my 64-bit LAMP server to host /blackbrick-cweb, and maybe set up an appropriate NAT on my home router to my LAMP box.
 
Not sure yet how I'm going to manage HTTPS and certificates. HTTPS would be nice, but maybe we'll make that a v2 feature. Ideally Blackbrick would be able to issue certificates for hosts in the cweb.blackbrick.com zone, but I'm not sure what would be involved in becoming a CA like that. Self-signed certs might also be a possibility, although not preferable.
 
There will be a front-end for cweb on all provider sites in /blackbrick-cweb/. The user will be able to submit queries from this front-end, and also submit URLs they find useful/useless for particular queries. In this way cweb will accumulate a database of "useful" (and "useless") search results. Users will be able to go to their own, or others', cweb sites. Users will need to be logged in to their cweb site in order to submit useful/useless URLs. Votes for useful/useless URLs will be per cweb site, not per user.
 
Initially we won't be indexing the entire web. We'll start with HTML only, and have a list of domains that we support. As we grow we can enable the indexing of more domains. We'll start with domains like en.wikipedia.org and useful sites like that. Also, initially we will only be supporting English. That's because I don't know anything about other languages. To the extent that I can I will design so as to make the incorporation of other languages possible as the project matures.
 
There will be a 'master' cweb site, available from master.cweb.blackbrick.com. I might speak to ProgSoc about getting them to provide me a virtual machine on [http://www.progsoc.org/wiki/Morpheus Morpheus] for me to use as the cweb master. As the project matures there might be multiple IP addresses on master.cweb.blackbrick.com. The cweb master is responsible for:
 
* Nominating and distributing the blacklist
* Nominating and distributing Cweb Providers (name and 32-bit ID)
* Nominating and distributing Domain IDs (name and 32-bit ID)
* Nominating and distributing URL IDs (URL and 64-bit ID)
* Nominating and distributing Query IDs (string and 64-bit ID)
* Coordinating the Word database
 
Cweb will need to be able to function in an untrusted environment, full of liars and spammers. So, provision will need to be made to facilitate data integrity. Essentially all cweb sites will record the Cweb ID of the site that provided them with particular data, and if that Cweb ID ever makes it onto the blacklist then all data from that site will be deleted.
 
Cweb will be designed to facilitate anonymous queries. This will work by having Cweb sites forward queries at random. When a request for a query is received by a cweb site, cweb will pick a number between 1 and 10. If the number is less than 5 (i.e. 1, 2, 3 or 4) then cweb will handle the query. In any other case (i.e. 5, 6, 7, 8, 9 or 10) cweb will forward the query to another cweb site for handling. This will mean that when a request is received for a query by a cweb site, it is most likely that the request has been forwarded. In this way, no-one on the network will be able to track the originator of a query.
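As a rough illustration only, the forwarding rule above might be sketched like this. Cweb itself is planned in PHP; Python is used here just for brevity. The function names, the peer list and the returned tuples are hypothetical and not part of any existing Cweb code.

```python
import random

# Numbers below this value (1, 2, 3 or 4) mean "handle the query here".
HANDLE_THRESHOLD = 5


def should_handle_locally(rng):
    """Pick a number from 1 to 10 and handle the query only if it is below 5."""
    return rng.randint(1, 10) < HANDLE_THRESHOLD


def route_query(peers, rng):
    """Decide what to do with an incoming query.

    Returns ("handle", None) or ("forward", peer). Handling locally when no
    peer is known is an assumption; the design above does not cover that case.
    """
    if should_handle_locally(rng) or not peers:
        return ("handle", None)
    return ("forward", rng.choice(peers))
```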
 
Cweb will have an HTTP client, an HTML parser, a CSS parser and a JavaScript parser. It will create the HTML DOM, and apply the CSS and run JavaScript. It will then establish what is visible text, and record that for indexing. Runs of white space will be converted to a single space character for the index data. There are a few issues, such as what to do with HTML meta data or image alt text. My first impression is that meta data should be ignored, and alt text should be included in the index. Our HTML environment will be implemented in PHP, and to the extent that we can we will make our facilities compliant with web-standards (rather than with particular user agents, e.g. Internet Explorer or Firefox).
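The white space rule is simple enough to show directly. This is only a sketch in Python (the real implementation is planned in PHP), and the function name is made up for the example.

```python
import re


def normalise_whitespace(text):
    """Convert every run of white space into a single space character."""
    return re.sub(r"\s+", " ", text)
```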
 
Link names are the text between HTML anchor tags. Some facility will be made for recording and distributing link names, as they are probably trusted meta data about a URL. If I link to http://www.progclub.org/wiki/Cweb and call my link [http://www.progclub.org/wiki/Cweb Cweb], then there's probably a good chance that the URL I've linked to is related to "Cweb". Also, if the text "Cweb" appears in the URL, there's a good chance that the URL is related to "Cweb". Provisions should be made to incorporate this type of meta data into the index.
 
We will use UTF-8 as the encoding for our content. Content that is not in UTF-8 will be converted to UTF-8 prior to processing. Initially we will only be supporting sites that provide their content in UTF-8. As the project matures our support will widen.
 
We will develop a distributed word database. The word database will be used for deciding the following things about any given space-delimited string:
 
* Does the word have 'sub-words'? For instance the word "1,234" has sub-words "1", ",", "234", "1," and ",234". The word "sub-word" has sub-words "sub", "-", "word", "sub-" and "-word". The word "JohnElliot" has sub-words "John" and "Elliot". This might be determined algorithmically rather than by a database.
* Is the word punctuation?
* Is the word a number, and if so what is its value?
* Is the word a plural, and if so, what is the root word?
* Is the word a common word, such as "a", "the", etc.? I'm not sure what we will do about indexing common words.
* Does the word have synonyms, and what are they?
* Is the word a proper name?
* Is the word an email address, or a URL?
* What languages is the UTF-8 string a word in?
* What senses does the word have?
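The sub-word question in the list above is the one most likely to be answered algorithmically. One possible rule, which reproduces the three examples given, is to split the word into letter runs (breaking at a lower-to-upper case boundary), digit runs and single punctuation marks, and then take every contiguous run of those pieces that is shorter than the whole word. This is a guess at a workable rule, not a settled design. The sketch is in Python for brevity, handles ASCII only, and all names in it are hypothetical.

```python
import re

# A capitalised or lower-case letter run, an all-caps run, a digit run,
# or a single punctuation mark (ASCII only in this sketch).
TOKEN = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+|[^\w\s]|_")


def sub_words(word):
    """Return the sub-words of a space-delimited string, shortest first."""
    tokens = TOKEN.findall(word)
    count = len(tokens)
    found = []
    # Every contiguous run of tokens that is shorter than the whole word.
    for size in range(1, count):
        for start in range(count - size + 1):
            piece = "".join(tokens[start:start + size])
            if piece not in found:
                found.append(piece)
    return found
```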
 
Initially I was planning to have Word IDs as a 64-bit number representing any given string, but I decided against this for a number of reasons. Firstly, you can get eight UTF-8 characters before you get any savings in terms of space, and most words are fewer than eight characters long (in English, anyway). Secondly, distributing the Word IDs would have created a lot of unnecessary overhead in the system. There would need to be a centralised repository responsible for nominating the ID of new words, and all systems would have to talk back to this repository whenever they encountered a word for which they didn't have the Word ID.
 
The Cweb index will segregate data into "bundles" at indexing time. Bundles will be "heading text", "navigation text", "alt text" and "content". There might also be "meta data" bundles. So, if some text appears in a heading, it will go in the "heading text" bundle. If text appears in an image alt text attribute it will go in the "alt text" bundle. If text appears anywhere else, it will go in the "content" bundle. See below for the caveat concerning "navigation text". In this way we can weight heading text as more important than content text, navigation text as less important than content text, and alt text as more relevant to image search (when we support image search, which is initially a non-goal).
 
It might be a good idea to accumulate a database concerning the HTML element in which a web-page's content appears. For instance, in the domain "www.progclub.org", under the URL path prefix "/wiki", the content of the page is in the HTML element with ID "content". This means that anything which is not below HTML ID "content" is "navigation text", and anything which is below "content" is "content". Similarly, for the domain "www.progclub.org", under the URL path prefix "/blog", the content of the page is in the HTML element with ID "content". It might even be feasible to say that if there is no registered 'content' ID for a particular domain with a particular prefix, but an element with the HTML ID 'content' (or perhaps a synonym, such as 'main') is discovered, then it will be assumed that content not below that HTML element is navigation text. In the case where the defaults are not satisfactory, they can be overridden by the database.
 
Cweb sites will be given a 32-bit ID. So, for example, the Cweb site jj5.cweb.blackbrick.com will have cweb name 'jj5', and cweb ID '1'. Domains will be given a 32-bit ID. So, for example, the domain www.progclub.org will be given domain ID '123'. URLs will be given a 64-bit ID. So, for example, the URL http://www.progclub.org/wiki/Cweb will be given URL ID '456'. Queries will be given a 64-bit ID. So, for example, the query "Cweb" will be given query ID '678' and the query "cweb" will be given query ID '679'.
 
There will be a table with the schema,
 
result ( cweb_id, query_id, url_id, weight ) key ( cweb_id, query_id, url_id );
 
In this table 'weight' will be a number between 0 and 100. The value 50 will indicate impartiality. A value below 50 indicates that the URL has been voted by cweb_id as being less than useful, to some degree. The closer to zero the value is, the less useful the link is considered. A value above 50 indicates that the URL has been voted by cweb_id as being useful, to some degree. The closer to one hundred the value is, the more useful the link is considered. The average weight can then be taken, and factored into the weight given to search results for particular queries.
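To make the weighting concrete, here is a small sketch of the result table and the averaging step. It uses Python with an in-memory SQLite database purely for illustration; the real system would sit on the MySQL side of the LAMP stack. The column types, the CHECK constraint, the sample votes and the choice to return 50 when nobody has voted are all assumptions of mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE result (
        cweb_id  INTEGER NOT NULL,
        query_id INTEGER NOT NULL,
        url_id   INTEGER NOT NULL,
        weight   INTEGER NOT NULL CHECK (weight BETWEEN 0 AND 100),
        PRIMARY KEY (cweb_id, query_id, url_id)
    )
""")

# Three cweb sites vote on URL 456 for query 678 (made-up sample data).
conn.executemany(
    "INSERT INTO result VALUES (?, ?, ?, ?)",
    [(1, 678, 456, 80), (2, 678, 456, 60), (3, 678, 456, 40)],
)


def average_weight(conn, query_id, url_id):
    """Average vote for a URL on a query, or 50 (impartial) with no votes."""
    row = conn.execute(
        "SELECT AVG(weight) FROM result WHERE query_id = ? AND url_id = ?",
        (query_id, url_id),
    ).fetchone()
    return 50.0 if row[0] is None else row[0]
```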
 
Other tables in the system will be,
 
cweb ( cweb_id, cweb_name ) key ( cweb_id );
domain ( domain_id, domain_name, canonical_domain_id ) key ( domain_id );
url ( url_id, url, canonical_url_id ) key ( url_id );
query ( query_id, query ) key ( query_id );
 
There will be the following set of tables too,
 
cweb_source ( cweb_id, source_id ) key ( cweb_id, source_id );
domain_source ( domain_id, source_id ) key ( domain_id, source_id );
url_source ( url_id, source_id ) key ( url_id, source_id );
query_source ( query_id, source_id ) key ( query_id, source_id );
result_source ( cweb_id, query_id, url_id, source_id ) key ( cweb_id, query_id, url_id, source_id );
 
The 'source_id' is the ID of the Cweb from which the data was retrieved. There can be multiple Cweb IDs associated with any given record. If any of the Cweb IDs associated with a record are registered on the blacklist, then the associated record is to be deleted.
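The deletion rule might look something like the following sketch. It is in Python for brevity, and it models one record table and its matching *_source table as plain dictionaries rather than SQL tables. The function name and the data layout are hypothetical.

```python
def purge_blacklisted(records, sources, blacklist):
    """Delete every record that has at least one blacklisted source.

    records:   dict mapping a record key (e.g. a url_id) to its data
    sources:   dict mapping the same keys to a set of source cweb IDs
    blacklist: set of blacklisted cweb IDs
    Returns the set of keys that were removed.
    """
    tainted = {key for key, ids in sources.items() if ids & blacklist}
    for key in tainted:
        records.pop(key, None)
        del sources[key]
    return tainted
```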
 
There will also be the following tables,
 
http_content ( url_id, time_of_retrieval, mime_type, content );
html_content ( url_id, time_of_index, total_size, heading_text, navigation_text, content_text, alt_text, title_text );
link_content ( source_url_id, url_id, title_text, anchor_text );
 
The master database will have the following additional tables,
 
cweb_availability ( cweb_id, url_blocks, queries_per_month );
cweb_url_allocation ( cweb_id, block_id );
cweb_query_allocation ( cweb_id, year, month, count );
 
URLs will be partitioned into blocks of 1000 URLs. The first block will be URL ID 1 to 999, the second block will be URL 1000 to 1999, the third block 2000 to 2999, and so on. Blocks will be indicated by the first ID in the block, so block 1 will be 1, block 2 will be 1000, block 3 will be 2000, and so on. Cweb sites can nominate how many URL blocks they will index. After a while we'll have a good idea of how much data is associated with the average URL, and we'll be able to let users nominate how much bandwidth they are willing to provide for indexing per month. We'll also be able to have users nominate how much bandwidth they are willing to provide for queries per month.
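The block numbering can be written down in a couple of lines. This is a Python sketch for illustration only, and the function names are hypothetical.

```python
BLOCK_SIZE = 1000


def block_id(url_id):
    """Return the first URL ID of the block that url_id falls in."""
    if url_id < 1:
        raise ValueError("URL IDs start at 1")
    # IDs 1 to 999 share block 1; every later block starts on a multiple of 1000.
    return max(1, (url_id // BLOCK_SIZE) * BLOCK_SIZE)


def block_ids(last_block_id):
    """List every block ID from the first block up to last_block_id."""
    return [1] + list(range(BLOCK_SIZE, last_block_id + 1, BLOCK_SIZE))
```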
 
Cweb sites that are not the master will have the following tables,
 
query_max ( last_block_id );
cweb_query ( block_id, cweb_id, quota );
 
In order to satisfy a query, the requested cweb site will need to contact another cweb site for each block. Every time a cweb site interacts with the master site (for instance, when updating its blacklist) it will receive the 'last_block_id', which is essentially the maximum URL ID rounded down to the nearest 1000 (or 1 if it's below 1000). So each cweb can keep its query_max table up-to-date with the number of blocks being indexed by the distributed system.

Say the last block ID was 5000, meaning that the system was indexing up to 5999 URLs. In this case each cweb site would need to have an entry in its cweb_query table for each block in the system. Say the query was "Cweb". A site handling the query for "Cweb" would need to know which cwebs it needed to contact to search the entire index. It would need to have a cweb_query record for each block in the system, being 1, 1000, 2000, 3000, 4000 and 5000. It would start by looking for blocks that it already had, whose quota was greater than zero (when the quota reaches zero the cweb_query record is deleted, so essentially it will just be looking for cweb_query records that it has).

Say it finds cweb_query records of { ( 1, 1, 50 ), ( 1000, 1, 50 ), ( 2000, 2, 100 ) }. It's then missing cweb_query records for blocks 3000, 4000 and 5000, so it will then contact the master server with the list of blocks it's missing results for. The master server will look at the request and find suitable servers to contact for each block. It might respond with { ( 3000, 1, 100 ), ( 4000, 2, 100 ), ( 5000, 3, 50 ) }. This means that for block 4 (URL ID = 3000), the query will go to cweb 1, up to 100 times; for block 5 (URL ID = 4000), the query will go to cweb 2, up to 100 times; and for block 6 (URL ID = 5000), the query will go to cweb 3, up to 50 times. The third cell in each response is the quota.

A cweb will record the quota, and decrement that cweb_query.quota by one each time it contacts a cweb to handle a request. When the quota reaches zero the cweb_query record will be deleted. In this way, a cweb can establish that for block 4 (URL ID = 3000), it can send index search requests to cweb 1 up to 100 times before it needs to contact the master server again for a cweb_query allocation. The master server will track how many allocations it has given each cweb in a given month, and if the allocation reaches the user's defined quota then no more cweb_query allocations will be made for the user's cweb site.
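The bookkeeping in this example can be sketched as follows. It is Python for illustration only. The cweb_query table is modelled as a dictionary from block ID to a (cweb ID, quota) pair, the master's reply is simply pasted in rather than fetched, and all function names are hypothetical.

```python
BLOCK_SIZE = 1000


def all_blocks(last_block_id):
    """Every block ID in the system, e.g. 1, 1000, 2000, ... last_block_id."""
    return [1] + list(range(BLOCK_SIZE, last_block_id + 1, BLOCK_SIZE))


def missing_blocks(cweb_query, last_block_id):
    """Blocks with no cweb_query record, to be requested from the master."""
    return [b for b in all_blocks(last_block_id) if b not in cweb_query]


def use_allocation(cweb_query, block):
    """Spend one unit of quota for a block and return the cweb to contact.

    The record is deleted once its quota has been used up.
    """
    cweb_id, quota = cweb_query[block]
    if quota <= 1:
        del cweb_query[block]
    else:
        cweb_query[block] = (cweb_id, quota - 1)
    return cweb_id


# The records from the example above, keyed by block ID.
cweb_query = {1: (1, 50), 1000: (1, 50), 2000: (2, 100)}
missing = missing_blocks(cweb_query, 5000)  # [3000, 4000, 5000]

# The master's reply from the example, recorded locally.
cweb_query.update({3000: (1, 100), 4000: (2, 100), 5000: (3, 50)})
```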

Latest revision as of 16:08, 5 July 2012

This is a draft document, very much a work in progress. See the talk page for notes and caveats. See Projects for other projects.

Project status

In the planning phase.

Contributors

Members who have contributed to this project. Newest on top.

  • John

All contributors have agreed to the terms of the Contributor License Agreement. This excludes any upstream contributors who tend to have different administrative frameworks.

Copyright

Copyright 2011, Contributors. Licensed under the GPL.

Source code

Subversion project isn't configured yet.

Links

  • mitsukeru on reddit

TODO

Things to do, in rough order of priority:

  • Create project in svn

Done

Stuff that's done. Latest on top.

  • JE 2011-08-08: started documentation

System design

Cweb is a Blackbrick project hosted at ProgClub. It will be licensed under the GPL. "Cweb" is for "Collaborative Web", and essentially the software is a distributed search engine implemented on a 64-bit LAMP platform.

The site will be implemented by a distributed set of providers. In order to become a provider a user will need to register their system with ProgClub/Blackbrick. They will get a host entry in the cweb.blackbrick.com DNS zone, so for example my cweb provider site would be jj5.cweb.blackbrick.com. I will then need to set up my 64-bit LAMP server to host /blackbrick-cweb, and maybe set up an appropriate NAT on my home router to my LAMP box.

Not sure yet how I'm going to manage HTTPS and certificates. HTTPS would be nice, but maybe we'll make that a v2 feature. Ideally Blackbrick would be able to issue certificates for hosts in the cweb.blackbrick.com zone, but I'm not sure what would be involved in becoming a CA like that. Self-signed certs might also be a possibility, although not preferable.

There will be a front-end for cweb on all provider sites in /blackbrick-cweb/. The user will be able to submit queries from this front-end, and also submit URLs they find useful/useless for particular queries. In this way cweb will accumulate a database of "useful" (and "useless") search results. Users will be able to go to their own, or others', cweb sites. Users will need to be logged in to their cweb site in order to submit useful/useless URLs. Votes for useful/useless URLs will be per cweb site, not per user.

Initially we won't be indexing the entire web. We'll start with HTML only, and have a list of domains that we support. As we grow we can enable the indexing of more domains. We'll start with domains like en.wikipedia.org and useful sites like that. Also, initially we will only be supporting English. That's because I don't know anything about other languages. To the extent that I can I will design so as to make the incorporation of other languages possible as the project matures.

There will be a 'master' cweb site, available from master.cweb.blackbrick.com. I might speak to ProgSoc about getting them to provide me a virtual machine on Morpheus for me to use as the cweb master. As the project matures there might be multiple IP addresses on master.cweb.blackbrick.com. The cweb master is responsible for:

  • Nominating and distributing the blacklist
  • Nominating and distributing Cweb Providers (name and 32-bit ID)
  • Nominating and distributing Domain IDs (name and 32-bit ID)
  • Nominating and distributing URL IDs (URL and 64-bit ID)
  • Nominating and distributing Query IDs (string and 64-bit ID)
  • Coordinating the Word database

Cweb will need to be able to function in an untrusted environment, full of liars and spammers. So, provision will need to be made to facilitate data integrity. Essentially all cweb sites will record the Cweb ID of the site that provided them with particular data, and if that Cweb ID ever makes it onto the blacklist then all data from that site will be deleted.

Cweb will be designed to facilitate anonymous queries. This will work by having Cweb sites forward queries at random. When a request for a query is received by a cweb site, cweb will pick a number between 1 and 10. If the number is less than 5 (i.e. 1, 2, 3 or 4) then cweb will handle the query. In any other case (i.e. 5, 6, 7, 8, 9 or 10) cweb will forward the query to another cweb site for handling. This will mean that when a request is received for a query by a cweb site, it is most likely that the request has been forwarded. In this way, no-one on the network will be able to track the originator of a query.

Cweb will have an HTTP client, an HTML parser, a CSS parser and a JavaScript parser. It will create the HTML DOM, and apply the CSS and run JavaScript. It will then establish what is visible text, and record that for indexing. Runs of white space will be converted to a single space character for the index data. There are a few issues, such as what to do with HTML meta data or image alt text. My first impression is that meta data should be ignored, and alt text should be included in the index. Our HTML environment will be implemented in PHP, and to the extent that we can we will make our facilities compliant with web-standards (rather than with particular user agents, e.g. Internet Explorer or Firefox).

Link names are the text between HTML anchor tags. Some facility will be made for recording and distributing link names, as they are probably trusted meta data about a URL. If I link to http://www.progclub.org/wiki/Cweb and call my link Cweb, then there's probably a good chance that the URL I've linked to is related to "Cweb". Also, if the text "Cweb" appears in the URL, there's a good chance that the URL is related to "Cweb". Provisions should be made to incorporate this type of meta data into the index.

We will use UTF-8 as the encoding for our content. Content that is not in UTF-8 will be converted to UTF-8 prior to processing. Initially we will only be supporting sites that provide their content in UTF-8. As the project matures our support will widen.

We will develop a distributed word database. The word database will be used for deciding the following things about any given space-delimited string:

  • Does the word have 'sub-words'? For instance the word "1,234" has sub-words "1", ",", "234", "1," and ",234". The word "sub-word" has sub-words "sub", "-", "word", "sub-" and "-word". The word "JohnElliot" has sub-words "John" and "Elliot". This might be determined algorithmically rather than by a database.
  • Is the word punctuation?
  • Is the word a number, and if so what is its value?
  • Is the word a plural, and if so, what is the root word?
  • Is the word a common word, such as "a", "the", etc.? I'm not sure what we will do about indexing common words.
  • Does the word have synonyms, and what are they?
  • Is the word a proper name?
  • Is the word an email address, or a URL?
  • What languages is the UTF-8 string a word in?
  • What senses does the word have?
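
The sub-word examples in the first item are consistent with one simple rule, sketched here in Python as an illustration of the algorithmic option. The rule itself is an assumption inferred from the three examples: split the word into runs of letters (breaking at capital letters), runs of digits, and single punctuation characters, then add each pair of adjacent pieces, leaving out the whole word. The sketch handles ASCII only, and `sub_words` is a hypothetical name.

```python
import re

# Pieces: a capitalised word, a lower-case run, an upper-case run,
# a run of digits, or any single other character (ASCII-only sketch).
_PIECE = re.compile(r"[A-Z][a-z]+|[a-z]+|[A-Z]+|[0-9]+|[^A-Za-z0-9]")


def sub_words(word):
    """Return the sub-words of a space-delimited string, pieces first."""
    pieces = _PIECE.findall(word)
    if len(pieces) < 2:
        return []  # a single piece has no sub-words
    pairs = [a + b for a, b in zip(pieces, pieces[1:])]
    found = []
    for candidate in pieces + pairs:
        # Skip the whole word itself and any repeats.
        if candidate != word and candidate not in found:
            found.append(candidate)
    return found
```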

Initially I was planning to have Word IDs as a 64-bit number representing any given string, but I decided against this for a number of reasons. Firstly, a 64-bit ID takes eight bytes, so a word needs more than eight bytes of UTF-8 before an ID saves any space, and most words are shorter than eight characters (in English, anyway). Secondly, distributing the Word IDs would have created a lot of unnecessary overhead in the system. There would need to be a centralised repository responsible for nominating the ID of new words, and all systems would have to talk back to this repository whenever they encountered a word for which they didn't have the Word ID.

The Cweb index will segregate data into "bundles" at indexing time. Bundles will be "heading text", "navigation text", "alt text" and "content". There might also be "meta data" bundles. So, if some text appears in a heading, it will go in the "heading text" bundle. If text appears in an image alt text attribute it will go in the "alt text" bundle. If text appears anywhere else, it will go in the "content" bundle. See below for the caveat concerning "navigation text". In this way we can weight heading text as more important than content text, navigation text as less important than content text, and alt text as more relevant to image search (when we support image search, which is initially a non-goal).

It might be a good idea to accumulate a database concerning the HTML element in which a web-page's content appears. For instance, in the domain "www.progclub.org", for URLs whose path begins with "/wiki", the content of the page is in the HTML element with ID "content". This means that anything which is not below HTML ID "content" is "navigation text", and anything which is below "content" is "content". Similarly, for the domain "www.progclub.org", for URLs whose path begins with "/blog", the content of the page is in the HTML element with ID "content". It might even be feasible to say that if there is no registered 'content' ID for a particular domain with a particular path prefix, and the HTML ID 'content' (or perhaps a synonym, such as 'main') is discovered, then it will be assumed that anything not below that HTML element is navigation text. In the case where the defaults are not satisfactory, they can be overridden by the database.
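
The bundle idea and the content-ID lookup could be combined roughly as in this Python sketch (illustrative only). It assumes reasonably well-formed HTML, uses an in-memory dictionary in place of the database, puts heading text in the "heading text" bundle wherever the heading appears, and treats everything as "content" when no content element is found. All names here (`CONTENT_ID_REGISTRY`, `content_ids_for`, `BundleExtractor`, `extract_bundles`) are hypothetical.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
DEFAULT_CONTENT_IDS = ("content", "main")

# Stand-in for the database: (domain, path prefix) -> content element ID.
CONTENT_ID_REGISTRY = {
    ("www.progclub.org", "/wiki"): "content",
    ("www.progclub.org", "/blog"): "content",
}


def content_ids_for(domain, path):
    for (reg_domain, prefix), element_id in CONTENT_ID_REGISTRY.items():
        if domain == reg_domain and path.startswith(prefix):
            return (element_id,)
    return DEFAULT_CONTENT_IDS


class BundleExtractor(HTMLParser):
    def __init__(self, content_ids):
        super().__init__(convert_charrefs=True)
        self.content_ids = set(content_ids)
        self.stack = []  # open elements as (tag, is_content_root)
        self.content_depth = 0
        self.heading_depth = 0
        self.saw_content_root = False
        self.bundles = {"heading text": [], "navigation text": [],
                        "alt text": [], "content": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("alt"):
            self.bundles["alt text"].append(attrs["alt"])
        if tag in VOID:
            return  # void elements have no close tag, so are not tracked
        is_root = attrs.get("id") in self.content_ids
        self.stack.append((tag, is_root))
        if is_root:
            self.content_depth += 1
            self.saw_content_root = True
        if tag in HEADINGS:
            self.heading_depth += 1

    def handle_endtag(self, tag):
        if all(open_tag != tag for open_tag, _ in self.stack):
            return  # stray close tag
        while self.stack:  # also closes elements left open inside this one
            open_tag, is_root = self.stack.pop()
            if is_root:
                self.content_depth -= 1
            if open_tag in HEADINGS:
                self.heading_depth -= 1
            if open_tag == tag:
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or (self.stack and self.stack[-1][0] in ("script", "style")):
            return
        if self.heading_depth:
            self.bundles["heading text"].append(text)
        elif self.content_depth:
            self.bundles["content"].append(text)
        else:
            self.bundles["navigation text"].append(text)


def extract_bundles(html, domain, path):
    parser = BundleExtractor(content_ids_for(domain, path))
    parser.feed(html)
    parser.close()
    bundles = parser.bundles
    if not parser.saw_content_root:
        # No content element found: nothing can be called navigation.
        bundles["content"] = bundles["navigation text"] + bundles["content"]
        bundles["navigation text"] = []
    return bundles
```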

Cweb sites will be given a 32-bit ID. So, for example, the Cweb site jj5.cweb.blackbrick.com will have cweb name 'jj5', and cweb ID '1'. Domains will be given a 32-bit ID. So, for example, the domain www.progclub.org will be given domain ID '123'. URLs will be given a 64-bit ID. So, for example, the URL http://www.progclub.org/wiki/Cweb will be given URL ID '456'. Queries will be given a 64-bit ID. So, for example, the query "Cweb" will be given query ID '678' and the query "cweb" will be given query ID '679'.

There will be a table with the schema,

result ( cweb_id, query_id, url_id, weight ) key ( cweb_id, query_id, url_id );

In this table 'weight' will be a number between 0 and 100. The value 50 will indicate impartiality. A value below 50 indicates that the URL has been voted by cweb_id as being less than useful, to some degree. The closer to zero the value is, the less useful the link is considered. A value above 50 indicates that the URL has been voted by cweb_id as being useful, to some degree. The closer to one hundred the value is, the more useful the link is considered. The average weight can then be taken, and factored into the weight given to search results for particular queries.
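
To make the averaging concrete, here is a small Python sketch using SQLite (an illustrative choice of database, not one named by this design). The column types, the CHECK constraint, and the decision to return 50 when nobody has voted are assumptions. The function names are hypothetical.

```python
import sqlite3


def create_result_table(conn):
    conn.execute("""
        CREATE TABLE result (
            cweb_id  INTEGER NOT NULL,
            query_id INTEGER NOT NULL,
            url_id   INTEGER NOT NULL,
            weight   INTEGER NOT NULL CHECK (weight BETWEEN 0 AND 100),
            PRIMARY KEY (cweb_id, query_id, url_id)
        )""")


def average_weight(conn, query_id, url_id):
    """Average the votes of all cwebs for one URL under one query."""
    row = conn.execute(
        "SELECT AVG(weight) FROM result WHERE query_id = ? AND url_id = ?",
        (query_id, url_id)).fetchone()
    # Assumption: with no votes recorded, treat the URL as impartial (50).
    return 50.0 if row[0] is None else row[0]
```

How the average is then factored into the final ranking is left open by the text above, so the sketch stops at the average.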

Other tables in the system will be,

cweb ( cweb_id, cweb_name ) key ( cweb_id );
domain ( domain_id, domain_name, canonical_domain_id ) key ( domain_id );
url ( url_id, url, canonical_url_id ) key ( url_id );
query ( query_id, query ) key ( query_id );

There will be the following set of tables too,

cweb_source ( cweb_id, source_id ) key ( cweb_id, source_id );
domain_source( domain_id, source_id ) key ( domain_id, source_id );
url_source ( url_id, source_id ) key ( url_id, source_id );
query_source ( query_id, source_id ) key ( query_id, source_id );
result_source ( cweb_id, query_id, url_id, source_id ) key ( cweb_id, query_id, url_id, source_id );

The 'source_id' is the ID of the Cweb from which the data was retrieved. There can be multiple Cweb IDs associated with any given record. If any of the Cweb IDs associated with a record are registered on the blacklist, then the associated record is to be deleted.
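
The deletion rule might be sketched as follows for the url tables, again in Python with SQLite for illustration. The `blacklist` table name and its single `cweb_id` column are assumptions, as are the column types. The same pattern would apply to the cweb, domain, query and result tables.

```python
import sqlite3


def create_schema(conn):
    # 'blacklist' is a hypothetical table; the url tables follow the text.
    conn.executescript("""
        CREATE TABLE url (url_id INTEGER PRIMARY KEY, url TEXT,
                          canonical_url_id INTEGER);
        CREATE TABLE url_source (url_id INTEGER, source_id INTEGER,
                                 PRIMARY KEY (url_id, source_id));
        CREATE TABLE blacklist (cweb_id INTEGER PRIMARY KEY);
    """)


def purge_blacklisted_urls(conn):
    """Delete every url record that has at least one blacklisted source."""
    tainted = sorted({row[0] for row in conn.execute(
        "SELECT url_id FROM url_source "
        "WHERE source_id IN (SELECT cweb_id FROM blacklist)")})
    with conn:  # one transaction: commit on success, roll back on error
        for url_id in tainted:
            conn.execute("DELETE FROM url WHERE url_id = ?", (url_id,))
            conn.execute("DELETE FROM url_source WHERE url_id = ?", (url_id,))
    return tainted
```

Note that a record is removed even when it also has sources that are not blacklisted, which is how the rule above reads.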

There will also be the following tables,

http_content ( url_id, time_of_retrieval, mime_type, content );
html_content ( url_id, time_of_index, total_size, heading_text, 
  navigation_text, content_text, alt_text, title_text );
link_content ( source_url_id, url_id, title_text, anchor_text );

The master database will have the following additional tables,

cweb_availability ( cweb_id, url_blocks, queries_per_month );
cweb_url_allocation ( cweb_id, block_id );
cweb_query_allocation ( cweb_id, year, month, count );

URLs will be partitioned into blocks of 1000 URLs. The first block will be URL ID 1 to 999 (one URL short, because IDs start at 1), the second block will be URL 1000 to 1999, the third block 2000 to 2999, and so on. Blocks will be indicated by the first ID in the block, so block 1 will be 1, block 2 will be 1000, block 3 will be 2000, and so on. Cweb sites can nominate how many URL blocks they will index. After a while we'll have a good idea of how much data is associated with the average URL, and we'll be able to let users nominate how much bandwidth they are willing to provide for indexing per month. We'll also be able to have users nominate how much bandwidth they are willing to provide for queries per month.
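
The block arithmetic is simple enough to write down. This Python sketch (illustrative only) uses the hypothetical names `block_id`, `block_number` and `block_range`.

```python
BLOCK_SIZE = 1000


def block_id(url_id):
    """First URL ID of the block containing url_id (1 for the first block)."""
    if url_id < 1:
        raise ValueError("URL IDs start at 1")
    return max(1, (url_id // BLOCK_SIZE) * BLOCK_SIZE)


def block_number(url_id):
    """Ordinal of the block: 1 for IDs 1-999, 2 for 1000-1999, and so on."""
    return url_id // BLOCK_SIZE + 1


def block_range(first_id):
    """Inclusive (first, last) URL IDs for the block identified by first_id."""
    last = (first_id // BLOCK_SIZE) * BLOCK_SIZE + BLOCK_SIZE - 1
    return (first_id, last)
```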

Cweb sites that are not the master will have the tables,

query_max ( last_block_id );
cweb_query ( block_id, cweb_id, quota );

In order to satisfy a query, the requested cweb site will need to contact another cweb site for each block. Every time a cweb site interacts with the master site (for instance, when updating its blacklist) it will receive the 'last_block_id', which is essentially the maximum URL ID rounded down to the nearest 1000 (or 1 if it's below 1000). So each cweb can keep its query_max table up-to-date with the number of blocks being indexed by the distributed system.

Say the last block ID was 5000, meaning that the system was indexing up to 5999 URLs. In this case each cweb site would need to have an entry in its cweb_query table for each block in the system. Say the query was "Cweb". A site handling the query for "Cweb" would need to know which cwebs it needed to contact to search the entire index. It would need to have a cweb_query record for each block in the system, being 1, 1000, 2000, 3000, 4000 and 5000. It would start by looking for blocks that it already had, whose quota was greater than zero (when the quota reaches zero the cweb_query record is deleted, so essentially it will just be looking for cweb_query records that it has). Say it finds cweb_query records of { ( 1, 1, 50 ), ( 1000, 1, 50 ), ( 2000, 2, 100 ) }, where each record is ( block_id, cweb_id, quota ). It's then missing cweb_query records for blocks 3000, 4000 and 5000, so it will contact the master server with the list of blocks it's missing records for.

The master server will look at the request and find suitable servers to contact for each block. It might respond with { ( 3000, 1, 100 ), ( 4000, 2, 100 ), ( 5000, 3, 50 ) }. This means that for block 4 (URL ID = 3000), the query will go to cweb 1, up to 100 times; for block 5 (URL ID = 4000), the query will go to cweb 2, up to 100 times; and for block 6 (URL ID = 5000), the query will go to cweb 3, up to 50 times. The third cell in each response is the quota. A cweb will record the quota, and decrement that cweb_query.quota by one each time it contacts a cweb to handle a request. When the quota reaches zero the cweb_query record will be deleted. In this way, a cweb can establish that for block 4, it can send index search requests to cweb 1 up to 100 times before it needs to contact the master server again for a cweb_query allocation.

The master server will track how many allocations it has given each cweb in a given month, and if the allocation reaches the user's defined quota then no more cweb_query allocations will be made for the user's cweb site.
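
The client side of this allocation scheme might look roughly like the Python sketch below (illustrative only). The master is represented by a callable, `ask_master`, that takes a list of block IDs and returns ( block_id, cweb_id, quota ) tuples. That interface, the class name `QueryRouter`, and the choice to skip a block the master could not allocate are all assumptions.

```python
class QueryRouter:
    """Tracks cweb_query records and decides which cweb serves each block."""

    def __init__(self, last_block_id, ask_master):
        self.last_block_id = last_block_id  # from the query_max table
        self.ask_master = ask_master        # hypothetical master interface
        self.cweb_query = {}                # block_id -> [cweb_id, quota]

    def all_blocks(self):
        # Block IDs are 1, 1000, 2000, ... up to last_block_id.
        return [1] + list(range(1000, self.last_block_id + 1, 1000))

    def route(self):
        """Return {block_id: cweb_id} for one query, spending one quota each."""
        missing = [b for b in self.all_blocks() if b not in self.cweb_query]
        if missing:
            for block_id, cweb_id, quota in self.ask_master(missing):
                self.cweb_query[block_id] = [cweb_id, quota]
        plan = {}
        for block_id in self.all_blocks():
            entry = self.cweb_query.get(block_id)
            if entry is None:
                continue  # assumption: no allocation means the block is skipped
            plan[block_id] = entry[0]
            entry[1] -= 1
            if entry[1] <= 0:
                # Quota spent: delete the record so the master is asked again.
                del self.cweb_query[block_id]
        return plan
```

The returned plan only says where each block's search request should go. Sending the requests and merging the results is outside the sketch.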