Cweb

This is a draft document, very much a work in progress. See the talk page for notes and caveats. See Projects for other projects.

Project status

In the planning phase.

Contributors

Members who have contributed to this project. Newest on top.

  • John

All contributors have agreed to the terms of the Contributor License Agreement. This excludes any upstream contributors, who may operate under different administrative frameworks.

Copyright

Copyright 2011, Contributors. Licensed under the GPL.

Source code

Subversion project isn't configured yet.

Links

  • mitsukeru (https://github.com/deoxxa/mitsukeru) on reddit (http://www.reddit.com/r/programming/comments/krzys/show_proggit_im_working_on_a_search_engine_does/)

TODO

Things to do, in rough order of priority:

  • Create project in svn

Done

Stuff that's done. Latest on top.

  • JE 2011-08-08: started documentation

System design

Cweb is a Blackbrick project hosted at ProgClub. It will be licensed under the GPL. "Cweb" is short for "Collaborative Web": essentially, the software is a distributed search engine implemented on a 64-bit LAMP platform.

The site will be implemented by a distributed set of providers. In order to become a provider a user will need to register their system with ProgClub/Blackbrick. They will get a host entry in the cweb.blackbrick.com DNS zone, so for example my cweb provider site would be jj5.cweb.blackbrick.com. I will then need to set up my 64-bit LAMP server to host /blackbrick-cweb, and maybe set up an appropriate NAT on my home router to my LAMP box.

Not sure yet how I'm going to manage HTTPS and certificates. HTTPS would be nice, but maybe we'll make that a v2 feature. Ideally Blackbrick would be able to issue certificates for hosts in the cweb.blackbrick.com zone, but I'm not sure what would be involved in becoming a CA like that. Self-signed certs might also be a possibility, although not preferable.

There will be a front-end for cweb on all provider sites in /blackbrick-cweb/. The user will be able to submit queries from this front-end, and also submit URLs they find useful/useless for particular queries. In this way cweb will accumulate a database of "useful" (and "useless") search results. Users will be able to go to their own, or others', cweb sites. Users will need to be logged in to their cweb site in order to submit useful/useless URLs. Votes for useful/useless URLs will be per cweb site, not per user.
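
Here's a rough sketch in PHP of what the vote handler might look like, assuming a PDO connection to the site's MySQL database and the 'result' table described under the schema below; the function name and the blunt weight mapping are illustrative only:

// hypothetical sketch of the useful/useless vote handler; assumes $db is a
// PDO connection and the 'result' table described under the schema below
function record_vote( PDO $db, $cweb_id, $query_id, $url_id, $useful ) {
  // votes are per cweb site, so the key is ( cweb_id, query_id, url_id );
  // this is a blunt mapping: useful => 100, useless => 0; a real
  // implementation might nudge the existing weight instead
  $weight = $useful ? 100 : 0;
  $stmt = $db->prepare(
    'REPLACE INTO result ( cweb_id, query_id, url_id, weight )
     VALUES ( :cweb, :query, :url, :weight )' );
  $stmt->execute( array( ':cweb' => $cweb_id, ':query' => $query_id,
    ':url' => $url_id, ':weight' => $weight ) );
}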

Initially we won't be indexing the entire web. We'll start with HTML only, and have a list of domains that we support. As we grow we can enable the indexing of more domains. We'll start with domains like en.wikipedia.org and useful sites like that. Also, initially we will only be supporting English. That's because I don't know anything about other languages. To the extent that I can I will design so as to make the incorporation of other languages possible as the project matures.

There will be a 'master' cweb site, available from master.cweb.blackbrick.com. I might speak to ProgSoc about getting them to provide me a virtual machine on Morpheus for me to use as the cweb master. As the project matures there might be multiple IP addresses on master.cweb.blackbrick.com. The cweb master is responsible for:

  • Nominating and distributing the blacklist
  • Nominating and distributing Cweb Providers (name and 32-bit ID)
  • Nominating and distributing Domain IDs (name and 32-bit ID)
  • Nominating and distributing URL IDs (URL and 64-bit ID)
  • Nominating and distributing Query IDs (string and 64-bit ID)
  • Coordinating the Word database

Cweb will need to be able to function in an untrusted environment, full of liars and spammers. So, provision will need to be made to facilitate data integrity. Essentially all cweb sites will record the Cweb ID of the site that provided them with particular data, and if that Cweb ID ever makes it onto the blacklist then all data from that site will be deleted.
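
As a sketch of how blacklist enforcement might work against the result_source table described below; the same pattern would apply to the other *_source tables, and $db, the table names and the helper name are assumptions taken from the schema in this document:

// hypothetical sketch of blacklist enforcement; $bad_id is a blacklisted
// Cweb ID, $db is a PDO connection to the site's MySQL database
function purge_blacklisted( PDO $db, $bad_id ) {
  // delete every result that was sourced from the blacklisted site
  $db->prepare(
    'DELETE r FROM result r
       JOIN result_source s
         ON s.cweb_id = r.cweb_id
        AND s.query_id = r.query_id
        AND s.url_id = r.url_id
      WHERE s.source_id = :bad' )->execute( array( ':bad' => $bad_id ) );
  // and drop the source records themselves
  $db->prepare( 'DELETE FROM result_source WHERE source_id = :bad' )
     ->execute( array( ':bad' => $bad_id ) );
}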

Cweb will be designed to facilitate anonymous queries. This will work by having Cweb sites forward queries at random. When a request for a query is received by a cweb site, cweb will pick a number between 1 and 10. If the number is less than 5 (i.e. 1, 2, 3 or 4) then cweb will handle the query. In any other case (i.e. 5, 6, 7, 8, 9 or 10) cweb will forward the query to another cweb site for handling. This will mean that when a request is received for a query by a cweb site, it is most likely that the request has been forwarded. In this way, no-one on the network will be able to track the originator of a query.
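
The forward-or-handle decision is simple enough to sketch; handle_query(), forward_query() and pick_random_cweb() are hypothetical helpers:

// sketch of the anonymising dispatch described above
function dispatch_query( $query ) {
  $n = mt_rand( 1, 10 );      // pick a number between 1 and 10
  if ( $n < 5 ) {             // 1, 2, 3 or 4: handle the query locally
    return handle_query( $query );
  }
  // 5 through 10: forward the query to another randomly chosen cweb site
  return forward_query( $query, pick_random_cweb() );
}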

Cweb will have an HTTP client, an HTML parser, a CSS parser and a JavaScript parser. It will create the HTML DOM, apply the CSS and run the JavaScript. It will then establish what is visible text, and record that for indexing. Runs of white space will be converted to a single space character for the index data. There are a few issues, such as what to do with HTML meta data or image alt text. My first impression is that meta data should be ignored, and alt text should be included in the index. Our HTML environment will be implemented in PHP, and to the extent that we can we will make our facilities compliant with web standards (rather than with particular user agents, e.g. Internet Explorer or Firefox).
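
A simplified sketch of the text pass, using PHP's built-in DOM extension; CSS and JavaScript evaluation are omitted here, so this only approximates "visible text" by dropping script and style elements:

// parse the HTML, take the text content, and collapse runs of white
// space to a single space character for the index data
function extract_index_text( $html ) {
  $doc = new DOMDocument();
  @$doc->loadHTML( $html );   // suppress warnings on real-world markup
  // script and style text is never visible, so remove those elements
  foreach ( array( 'script', 'style' ) as $tag ) {
    $nodes = $doc->getElementsByTagName( $tag );
    while ( $nodes->length ) {
      $node = $nodes->item( 0 );
      $node->parentNode->removeChild( $node );
    }
  }
  return trim( preg_replace( '/\s+/u', ' ', $doc->textContent ) );
}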

Link names are the text between HTML anchor tags. Some facility will be made for recording and distributing link names, as they are probably trusted meta data about a URL. If I link to http://www.progclub.org/wiki/Cweb and call my link Cweb, then there's a good chance that the URL I've linked to is related to "Cweb". Also, if the text "Cweb" appears in the URL, there's a good chance that the URL is related to "Cweb". Provisions should be made to incorporate this type of meta data into the index.
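
Harvesting link names from a parsed document (reusing the DOMDocument from the previous sketch) might look like this; the link_content table it would feed is described below:

// sketch: collect ( href, anchor text ) pairs from the document
function extract_links( DOMDocument $doc ) {
  $links = array();
  foreach ( $doc->getElementsByTagName( 'a' ) as $a ) {
    $href = $a->getAttribute( 'href' );
    $text = trim( preg_replace( '/\s+/u', ' ', $a->textContent ) );
    if ( $href !== '' && $text !== '' ) {
      $links[] = array( $href, $text );
    }
  }
  return $links;
}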

We will use UTF-8 as the encoding for our content. Content that is not in UTF-8 will be converted to UTF-8 prior to processing. Initially we will only be supporting sites that provide their content in UTF-8. As the project matures our support will widen.
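
The conversion step itself is a one-liner with mbstring; detecting the source charset (from the Content-Type header or a meta tag) is not shown here:

// normalise content to UTF-8 prior to processing
function to_utf8( $content, $charset ) {
  if ( strtoupper( $charset ) === 'UTF-8' ) {
    return $content;          // already in our canonical encoding
  }
  return mb_convert_encoding( $content, 'UTF-8', $charset );
}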

We will develop a distributed word database. The word database will be used for deciding the following things about any given space-delimited string:

  • Does the word have 'sub-words'? For instance the word "1,234" has sub-words "1", ",", "234", "1," and ",234". The word "sub-word" has sub-words "sub", "-", "word", "sub-" and "-word". The word "JohnElliot" has sub-words "John" and "Elliot". This might be determined algorithmically rather than by a database; a sketch follows this list.
  • Is the word punctuation?
  • Is the word a number, and if so what is its value?
  • Is the word a plural, and if so, what is the root word?
  • Is the word a common word, such as "a", "the", etc.? I'm not sure what we will do about indexing common words.
  • Does the word have synonyms, and what are they?
  • Is the word a proper name?
  • Is the word an email address, or a URL?
  • What languages is the UTF-8 string a word in?
  • What senses does the word have?
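
Here's a sketch of the algorithmic sub-word extraction mentioned in the first item; it handles a single punctuation separator (as in "1,234" and "sub-word") and a CamelCase boundary (as in "JohnElliot"), which covers the examples above but is not a complete treatment:

// derive sub-words from a space-delimited word, per the examples above
function sub_words( $word ) {
  $subs = array();
  // split at the first comma or hyphen: left part, separator, right part,
  // plus the two separator-attached halves
  if ( preg_match( '/^(.+?)([,\-])(.+)$/u', $word, $m ) ) {
    $subs = array( $m[1], $m[2], $m[3], $m[1] . $m[2], $m[2] . $m[3] );
  }
  // split CamelCase at a lower-to-upper boundary, e.g. "JohnElliot"
  foreach ( preg_split( '/(?<=[a-z])(?=[A-Z])/u', $word ) as $part ) {
    if ( $part !== $word ) {
      $subs[] = $part;
    }
  }
  return array_unique( $subs );
}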

Initially I was planning to have Word IDs as a 64-bit number representing any given string, but I decided against this for a number of reasons. Firstly, a 64-bit ID takes eight bytes, so there are no space savings unless a word is longer than eight UTF-8 bytes, and most words are shorter than that (in English, anyway). Secondly, distributing the Word IDs would have created a lot of unnecessary overhead in the system. There would need to be a centralised repository responsible for nominating the ID of new words, and all systems would have to talk back to this repository whenever they encountered a word for which they didn't have the Word ID.

The Cweb index will segregate data into "bundles" at indexing time. Bundles will be "heading text", "navigation text", "alt text" and "content". There might also be "meta data" bundles. So, if some text appears in a heading, it will go in the "heading text" bundle. If text appears in an image alt text attribute it will go in the "alt text" bundle. If text appears anywhere else, it will go in the "content" bundle. See below for the caveat concerning "navigation text". In this way we can weight heading text as more important than content text, navigation text as less important than content text, and alt text as more relevant to image search (when we support image search, which is initially a non-goal).

It might be a good idea to accumulate a database concerning the HTML element in which a web-page's content appears. For instance, in the domain "www.progclub.org", at the URL suffix "/wiki", the content of the page is in the HTML element with ID "content". This means that anything which is not below the HTML ID "content" is "navigation text", and anything which is below "content" is "content". Similarly, for the domain "www.progclub.org", for the URL suffix "/blog", the content of the page is in the HTML element with ID "content". It might even be feasible to say that if there is no registered 'content' ID for a particular domain and prefix, but an HTML element with ID 'content' (or perhaps a synonym, such as 'main') is discovered, then anything not below that element will be assumed to be navigation text. In the case where the defaults are not satisfactory, they can be overridden by the database.
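
A sketch of how the content element lookup might work; $content_id would come from the per-domain database, with 'content' and 'main' as the fall-back synonyms suggested above:

// find the element below which everything counts as "content"; anything
// not below it is "navigation text"
function find_content_node( DOMDocument $doc, $content_id = null ) {
  $xpath = new DOMXPath( $doc );
  $candidates = ( $content_id !== null )
    ? array( $content_id )
    : array( 'content', 'main' );    // default synonyms
  foreach ( $candidates as $id ) {
    $nodes = $xpath->query( '//*[@id="' . $id . '"]' );
    if ( $nodes->length ) {
      return $nodes->item( 0 );
    }
  }
  return null;  // no content element; treat the whole page as content
}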

Cweb sites will be given a 32-bit ID. So, for example, the Cweb site jj5.cweb.blackbrick.com will have cweb name 'jj5', and cweb ID '1'. Domains will be given a 32-bit ID. So, for example, the domain www.progclub.org will be given domain ID '123'. URLs will be given a 64-bit ID. So, for example, the URL http://www.progclub.org/wiki/Cweb will be given URL ID '456'. Queries will be given a 64-bit ID. So, for example, the query "Cweb" will be given query ID '678' and the query "cweb" will be given query ID '679'.

There will be a table with the schema,

result ( cweb_id, query_id, url_id, weight ) key ( cweb_id, query_id, url_id );

In this table 'weight' will be a number between 0 and 100. The value 50 will indicate impartiality. A value below 50 indicates that the URL has been voted by cweb_id as being less than useful, to some degree. The closer to zero the value is, the less useful the link is considered. A value above 50 indicates that the URL has been voted by cweb_id as being useful, to some degree. The closer to one hundred the value is, the more useful the link is considered. The average weight can then be taken, and factored into the weight given to search results for particular queries.
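
So, folding the per-site votes into a ranking could be as simple as the following sketch; exactly how the average is then combined with the index weighting is not settled:

// sketch: average the per-site weights for a given query, best URLs first
$stmt = $db->prepare(
  'SELECT url_id, AVG( weight ) AS avg_weight
     FROM result
    WHERE query_id = :query
    GROUP BY url_id
    ORDER BY avg_weight DESC' );
$stmt->execute( array( ':query' => $query_id ) );
$rows = $stmt->fetchAll( PDO::FETCH_ASSOC );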

Other tables in the system will be,

cweb ( cweb_id, cweb_name ) key ( cweb_id );
domain ( domain_id, domain_name, canonical_domain_id ) key ( domain_id );
url ( url_id, url, canonical_url_id ) key ( url_id );
query ( query_id, query ) key ( query_id );

There will be the following set of tables too,

cweb_source ( cweb_id, source_id ) key ( cweb_id, source_id );
domain_source ( domain_id, source_id ) key ( domain_id, source_id );
url_source ( url_id, source_id ) key ( url_id, source_id );
query_source ( query_id, source_id ) key ( query_id, source_id );
result_source ( cweb_id, query_id, url_id, source_id ) key ( cweb_id, query_id, url_id, source_id );

The 'source_id' is the ID of the Cweb from which the data was retrieved. There can be multiple Cweb IDs associated with any given record. If any of the Cweb IDs associated with a record are registered on the blacklist, then the associated record is to be deleted.

There will also be the following tables,

http_content ( url_id, time_of_retrieval, mime_type, content );
html_content ( url_id, time_of_index, total_size, heading_text, 
  navigation_text, content_text, alt_text, title_text );
link_content ( source_url_id, url_id, title_text, anchor_text );

The master database will have the following additional tables,

cweb_availability ( cweb_id, url_blocks, queries_per_month );
cweb_url_allocation ( cweb_id, block_id );
cweb_query_allocation ( cweb_id, year, month, count );

URLs will be partitioned into blocks of 1000 URLs. The first block will be URL ID 1 to 999, the second block will be URL 1000 to 1999, the third block 2000 to 2999, and so on. Blocks will be indicated by the first ID in the block, so block 1 will be 1, block 2 will be 1000, block 3 will be 2000, and so on. Cweb sites can nominate how many URL blocks they will index. After a while we'll have a good idea of how much data is associated with the average URL, and we'll be able to let users nominate how much bandwidth they are willing to provide for indexing per month. We'll also be able to have users nominate how much bandwidth they are willing to provide for queries per month.
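
The block arithmetic is straightforward; a sketch:

// map a URL ID to the block that contains it: blocks are 1000 wide and
// named by their first ID, except the first block which is named 1
function block_for_url_id( $url_id ) {
  $block = (int)( $url_id / 1000 ) * 1000;  // round down to nearest 1000
  return ( $block === 0 ) ? 1 : $block;
}

So block_for_url_id( 999 ) is 1, block_for_url_id( 1000 ) is 1000, and block_for_url_id( 5999 ) is 5000.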

Cweb sites that are not the master will have a table,

query_max ( last_block_id );
cweb_query ( block_id, cweb_id, quota );

In order to satisfy a query, the requested cweb site will need to contact another cweb site for each block. Every time a cweb site interacts with the master site (for instance, when updating its blacklist) it will receive the 'last_block_id', which is essentially the maximum URL ID rounded down to the nearest 1000 (or 1 if it's below 1000). So each cweb can keep its query_max table up to date with the number of blocks being indexed by the distributed system.

Say the last block ID was 5000, meaning that the system was indexing up to 5999 URLs. In this case each cweb site would need to have an entry in its cweb_query table for each block in the system. Say the query was "Cweb". A site handling the query for "Cweb" would need to know which cwebs it needed to contact to search the entire index. It would need a cweb_query record for each block in the system, being 1, 1000, 2000, 3000, 4000 and 5000. It would start by looking for blocks that it already had, whose quota was greater than zero (when the quota reaches zero the cweb_query record is deleted, so essentially it will just be looking for cweb_query records that it has). Say it finds cweb_query records of { ( 1, 1, 50 ), ( 1000, 1, 50 ), ( 2000, 2, 100 ) }. It's then missing cweb_query records for blocks 3000, 4000 and 5000, so it will contact the master server with the list of blocks it's missing results for. The master server will look at the request and find suitable servers to contact for each block. It might respond with { ( 3000, 1, 100 ), ( 4000, 2, 100 ), ( 5000, 3, 50 ) }. This means that for block 4 (block ID 3000), the query will go to cweb 1, up to 100 times; for block 5 (block ID 4000), the query will go to cweb 2, up to 100 times; and for block 6 (block ID 5000), the query will go to cweb 3, up to 50 times.

The third cell in each response is the quota. A cweb will record the quota, and decrement the cweb_query.quota by one each time it contacts a cweb to handle a request. When the quota reaches zero the cweb_query record will be deleted. In this way, a cweb can establish that, for example, for block 4 it can send index search requests to cweb 1 up to 100 times before it needs to contact the master server again for a cweb_query allocation. The master server will track how many allocations it has given each cweb in a given month, and if the allocation reaches the user's defined quota then no more cweb_query allocations will be made for the user's cweb site.
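
To pull the walk-through above together, here is a sketch of the per-block fan-out; get_cweb_query(), request_allocations(), store_cweb_query(), send_index_request() and decrement_quota() are all hypothetical helpers over the cweb_query table:

// satisfy a query by contacting a cweb for every block in the system
function search_all_blocks( PDO $db, $query_id, $last_block_id ) {
  $results = array();
  $missing = array();
  // blocks are 1, 1000, 2000, ... up to last_block_id
  for ( $block = 1; $block <= $last_block_id;
        $block = ( $block === 1 ) ? 1000 : $block + 1000 ) {
    $alloc = get_cweb_query( $db, $block );  // ( block_id, cweb_id, quota )
    if ( $alloc === null ) {
      $missing[] = $block;   // no usable allocation for this block
      continue;
    }
    $results[] = send_index_request( $alloc['cweb_id'], $block, $query_id );
    decrement_quota( $db, $block );  // record is deleted when quota hits 0
  }
  if ( $missing ) {
    // ask the master for fresh allocations for the blocks we're missing
    foreach ( request_allocations( $missing ) as $alloc ) {
      store_cweb_query( $db, $alloc );
      $results[] = send_index_request(
        $alloc['cweb_id'], $alloc['block_id'], $query_id );
      decrement_quota( $db, $alloc['block_id'] );
    }
  }
  return $results;
}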