Cweb

From ProgClub

This is a draft document, very much a work in progress. See the [[Talk:Cweb|talk page]] for notes and caveats. See [[Projects]] for other projects.

== Project status ==

In the planning phase.

== Contributors ==

Members who have contributed to this project. Newest on top.

* [[User:John|John]]

All contributors have agreed to the terms of the [[ProgClub:Copyrights#ProgClub_projects|Contributor License Agreement]]. This excludes any upstream contributors, who may operate under different administrative frameworks.

== Copyright ==

Copyright 2011, [[Cweb#Contributors|Contributors]]. Licensed under the [[GPL]].

== Source code ==

Subversion project isn't configured yet.

== Links ==

* [https://github.com/deoxxa/mitsukeru mitsukeru] on [http://www.reddit.com/r/programming/comments/krzys/show_proggit_im_working_on_a_search_engine_does/ reddit]

== TODO ==

Things to do, in rough order of priority:

* Create project in svn

== Done ==

Stuff that's done. Latest on top.

* [[User:John|JE]] 2011-08-08: started documentation

== System design ==

Cweb is a Blackbrick project hosted at ProgClub. It will be licensed under the GPL. "Cweb" is short for "Collaborative Web"; essentially, the software is a distributed search engine implemented on a 64-bit LAMP platform.

The site will be implemented by a distributed set of providers. In order to become a provider, a user will need to register their system with ProgClub/Blackbrick. They will get a host entry in the cweb.blackbrick.com DNS zone, so for example my cweb provider site would be jj5.cweb.blackbrick.com. I will then need to set up my 64-bit LAMP server to host /blackbrick-cweb, and maybe set up an appropriate NAT on my home router to my LAMP box.

I'm not sure yet how I'm going to manage HTTPS and certificates. HTTPS would be nice, but maybe we'll make that a v2 feature. Ideally Blackbrick would be able to issue certificates for hosts in the cweb.blackbrick.com zone, but I'm not sure what would be involved in becoming a CA like that. Self-signed certs might also be a possibility, although not preferable.

There will be a front-end for cweb on all provider sites in /blackbrick-cweb/. The user will be able to submit queries from this front-end, and also submit URLs they find useful/useless for particular queries. In this way cweb will accumulate a database of "useful" (and "useless") search results. Users will be able to go to their own, or others', cweb sites. Users will need to be logged in to their own cweb site in order to submit useful/useless URLs. Votes for useful/useless URLs will be per cweb site, not per user.

Initially we won't be indexing the entire web. We'll start with HTML only, and have a list of domains that we support. As we grow we can enable the indexing of more domains. We'll start with useful domains like en.wikipedia.org. Also, initially we will only be supporting English. That's because I don't know anything about other languages. To the extent that I can, I will design the system so as to make the incorporation of other languages possible as the project matures.

There will be a 'master' cweb site, available from master.cweb.blackbrick.com. I might speak to ProgSoc about getting them to provide me a virtual machine on [http://www.progsoc.org/wiki/Morpheus Morpheus] for me to use as the cweb master. As the project matures there might be multiple IP addresses on master.cweb.blackbrick.com. The cweb master is responsible for:

* Nominating and distributing the blacklist
* Nominating and distributing Cweb Providers (name and 32-bit ID)
* Nominating and distributing Domain IDs (name and 32-bit ID)
* Nominating and distributing URL IDs (URL and 64-bit ID)
* Nominating and distributing Query IDs (string and 64-bit ID)
* Coordinating the Word database

Cweb will need to be able to function in an untrusted environment, full of liars and spammers, so provision will need to be made to protect data integrity. Essentially, all cweb sites will record the Cweb ID of the site that provided them with particular data, and if that Cweb ID ever makes it onto the blacklist then all data from that site will be deleted.

Cweb will be designed to facilitate anonymous queries. This will work by having Cweb sites forward queries at random. When a request for a query is received by a cweb site, cweb will pick a number between 1 and 10. If the number is less than 5 (i.e. 1, 2, 3 or 4, a 40% chance) then cweb will handle the query itself. In any other case (i.e. 5 to 10, a 60% chance) cweb will forward the query to another cweb site for handling. This means that when a cweb site receives a request for a query, the request has most likely been forwarded, so no one on the network can tell whether the site it received a query from actually originated it.

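The forwarding rule above can be sketched as follows. This is illustrative Python only (the implementation itself is planned for PHP), and the `handle` and `forward` callbacks are hypothetical names:

```python
import random

def handle_or_forward(query, handle, forward):
    # Pick a number between 1 and 10: 1-4 (a 40% chance) means handle
    # the query locally; 5-10 (a 60% chance) means forward it to a
    # randomly chosen peer cweb site.
    n = random.randint(1, 10)
    if n < 5:
        return handle(query)
    return forward(query)
```

Because a forwarded request looks exactly like an original one, the receiving site cannot distinguish the originator from a relay.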
Cweb will have an HTTP client, an HTML parser, a CSS parser and a JavaScript parser. It will create the HTML DOM, apply the CSS and run the JavaScript. It will then establish what text is visible, and record that for indexing. Runs of white space will be converted to a single space character in the index data. There are a few open issues, such as what to do with HTML meta data or image alt text. My first impression is that meta data should be ignored, and alt text should be included in the index. Our HTML environment will be implemented in PHP, and to the extent that we can we will make our facilities compliant with web standards (rather than with particular user agents, e.g. Internet Explorer or Firefox).

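The white-space normalisation described above amounts to something like the following (a Python sketch; the real implementation is planned for PHP):

```python
import re

def index_text(visible_text):
    # Collapse runs of white space (spaces, tabs, newlines) into a
    # single space character before the text is indexed.
    return re.sub(r"\s+", " ", visible_text).strip()
```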
Link names are the text between HTML anchor tags. Some facility will be made for recording and distributing link names, as they are probably trustworthy meta data about a URL. If I link to http://www.progclub.org/wiki/Cweb and call my link [http://www.progclub.org/wiki/Cweb Cweb], then there's a good chance that the URL I've linked to is related to "Cweb". Also, if the text "Cweb" appears in the URL, there's a good chance that the URL is related to "Cweb". Provisions should be made to incorporate this type of meta data into the index.

We will use UTF-8 as the encoding for our content. Content that is not in UTF-8 will be converted to UTF-8 prior to processing. Initially we will only be supporting sites that provide their content in UTF-8. As the project matures our support will widen.

We will develop a distributed word database. The word database will be used for determining the following things about any given space-delimited string:

* Does the word have 'sub-words'? For instance the word "1,234" has sub-words "1", ",", "234", "1," and ",234". The word "sub-word" has sub-words "sub", "-", "word", "sub-" and "-word". The word "JohnElliot" has sub-words "John" and "Elliot". This might be determined algorithmically rather than by a database.
* Is the word punctuation?
* Is the word a number, and if so, what is its value?
* Is the word a plural, and if so, what is the root word?
* Is the word a common word, such as "a", "the", etc.? I'm not sure what we will do about indexing common words.
* Does the word have synonyms, and if so, what are they?
* Is the word a proper name?
* Is the word an email address, or a URL?
* Which languages is the UTF-8 string a word in?
* What senses does the word have?

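The 'sub-words' question in particular looks answerable algorithmically. A Python sketch, assuming (my reading of the examples above) that the sub-words are every contiguous run of a word's pieces (digit runs, punctuation runs, and camel-case-split letter runs) other than the whole word itself:

```python
import re

def pieces(word):
    # Break the word into runs of digits, letters and punctuation,
    # splitting letter runs at camel-case boundaries.
    runs = re.findall(r"\d+|[^\W\d_]+|[\W_]+", word)
    parts = []
    for run in runs:
        if run[0].isalpha():
            camel = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", run)
            parts.extend(camel or [run])
        else:
            parts.append(run)
    return parts

def sub_words(word):
    # Every contiguous concatenation of the word's pieces, except the
    # word itself, counts as a sub-word.
    parts = pieces(word)
    subs = []
    for i in range(len(parts)):
        for j in range(i + 1, len(parts) + 1):
            s = "".join(parts[i:j])
            if s != word:
                subs.append(s)
    return subs
```

This reproduces the three examples above; whether it generalises to all the cases the word database will meet is an open question.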
Initially I was planning to have Word IDs as a 64-bit number representing any given string, but I decided against this for a number of reasons. Firstly, a word has to be more than eight UTF-8 bytes long before a 64-bit ID saves any space, and most words are shorter than that (in English, anyway). Secondly, distributing the Word IDs would have created a lot of unnecessary overhead in the system. There would need to be a centralised repository responsible for nominating the ID of new words, and all systems would have to talk back to this repository whenever they encountered a word for which they didn't have the Word ID.

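The space argument is easy to check: a 64-bit ID occupies eight bytes, and an ASCII English word encoded as UTF-8 costs one byte per character. A quick illustration (the word list is arbitrary):

```python
# A 64-bit Word ID costs 8 bytes; most English words encoded as UTF-8
# are already at or under that size, so the ID would save nothing.
ID_BYTES = 8
for word in ("the", "search", "engine", "collaborative"):
    stored = len(word.encode("utf-8"))
    print(f"{word!r}: {stored} bytes as UTF-8 vs {ID_BYTES} bytes as an ID")
```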
The Cweb index will segregate data into "bundles" at indexing time. Bundles will be "heading text", "navigation text", "alt text" and "content". There might also be "meta data" bundles. So, if some text appears in a heading, it will go in the "heading text" bundle. If text appears in an image alt text attribute it will go in the "alt text" bundle. If text appears anywhere else, it will go in the "content" bundle. See below for the caveat concerning "navigation text". In this way we can weight heading text as more important than content text, navigation text as less important than content text, and alt text as more relevant to image search (when we support image search, which is initially a non-goal).

It might be a good idea to accumulate a database recording the HTML element in which a web-page's content appears. For instance, in the domain "www.progclub.org", at the URL suffix "/wiki", the content of the page is in the HTML element with ID "content". This means that anything which is not below HTML ID "content" is "navigation text", and anything which is below "content" is "content". Similarly, for the domain "www.progclub.org", for the URL suffix "/blog", the content of the page is in the HTML element with ID "content". It might even be feasible to say that, where there is no registered 'content' ID for a particular domain and prefix, but the HTML ID 'content' (or perhaps a synonym, such as 'main') is discovered, content not below that HTML element will be assumed to be navigation text. In the case where the defaults are not satisfactory, they can be overridden by the database.

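As a sketch of that default heuristic, here is how the content / navigation-text split might look using Python's html.parser (illustrative only; the class name and the 'content'/'main' id list are assumptions drawn from the paragraph above, and a registered per-domain override would replace the default id list):

```python
from html.parser import HTMLParser

# Void elements never produce a closing tag, so they must not affect depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class ContentSplitter(HTMLParser):
    """Classify visible text as 'content' (below an element whose id is
    'content' or 'main') or 'navigation text' (everything else)."""

    def __init__(self, content_ids=("content", "main")):
        super().__init__()
        self.content_ids = content_ids
        self.depth = 0           # nesting depth below the content element
        self.content = []
        self.navigation = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") in self.content_ids:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if data.strip():
            (self.content if self.depth else self.navigation).append(data.strip())
```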
Cweb sites will be given a 32-bit ID. So, for example, the Cweb site jj5.cweb.blackbrick.com will have cweb name 'jj5', and cweb ID '1'. Domains will be given a 32-bit ID. So, for example, the domain www.progclub.org will be given domain ID '123'. URLs will be given a 64-bit ID. So, for example, the URL http://www.progclub.org/wiki/Cweb will be given URL ID '456'. Queries will be given a 64-bit ID. So, for example, the query "Cweb" will be given query ID '678' and the query "cweb" will be given query ID '679'.

There will be a table with the schema,

  result ( cweb_id, query_id, url_id, weight ) key ( cweb_id, query_id, url_id );

In this table 'weight' will be a number between 0 and 100. The value 50 will indicate impartiality. A value below 50 indicates that the URL has been voted by cweb_id as being less than useful, to some degree. The closer to zero the value is, the less useful the link is considered. A value above 50 indicates that the URL has been voted by cweb_id as being useful, to some degree. The closer to one hundred the value is, the more useful the link is considered. The average weight can then be taken, and factored into the weight given to search results for particular queries.

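A minimal sketch of that aggregation (Python for illustration; treating a URL with no votes as impartial is my assumption, not a settled design):

```python
def average_weight(votes):
    # votes maps cweb_id -> weight in 0..100, where 50 is impartial,
    # below 50 means "less than useful" and above 50 means "useful".
    if not votes:
        return 50.0  # no votes recorded: treat the URL as impartial
    return sum(votes.values()) / len(votes)
```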
Other tables in the system will be,

  cweb ( cweb_id, cweb_name ) key ( cweb_id );
  domain ( domain_id, domain_name, canonical_domain_id ) key ( domain_id );
  url ( url_id, url, canonical_url_id ) key ( url_id );
  query ( query_id, query ) key ( query_id );

There will be the following set of tables too,

  cweb_source ( cweb_id, source_id ) key ( cweb_id, source_id );
  domain_source ( domain_id, source_id ) key ( domain_id, source_id );
  url_source ( url_id, source_id ) key ( url_id, source_id );
  query_source ( query_id, source_id ) key ( query_id, source_id );
  result_source ( cweb_id, query_id, url_id, source_id ) key ( cweb_id, query_id, url_id, source_id );

The 'source_id' is the ID of the Cweb from which the data was retrieved. There can be multiple Cweb IDs associated with any given record. If any of the Cweb IDs associated with a record are registered on the blacklist, then the associated record is to be deleted.

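The deletion rule might look like this in SQL terms. A self-contained sketch using Python's sqlite3, with the url/url_source pair standing in for any of the table/source-table pairs above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE url ( url_id INTEGER PRIMARY KEY, url TEXT );
    CREATE TABLE url_source ( url_id INTEGER, source_id INTEGER,
                              PRIMARY KEY ( url_id, source_id ) );
""")
conn.executemany("INSERT INTO url VALUES ( ?, ? )",
                 [(1, "http://example.org/a"), (2, "http://example.org/b")])
conn.executemany("INSERT INTO url_source VALUES ( ?, ? )",
                 [(1, 10), (1, 11), (2, 12)])

def purge(conn, blacklisted_cweb_id):
    # Delete every record that has the blacklisted Cweb ID among its
    # sources, then drop the now-orphaned source rows.
    conn.execute("""DELETE FROM url WHERE url_id IN
                    ( SELECT url_id FROM url_source WHERE source_id = ? )""",
                 (blacklisted_cweb_id,))
    conn.execute("""DELETE FROM url_source WHERE url_id NOT IN
                    ( SELECT url_id FROM url )""")
    conn.commit()

purge(conn, 11)  # cweb 11 lands on the blacklist; url 1 came from it
```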
There will also be the following tables,

  http_content ( url_id, time_of_retrieval, mime_type, content );
  html_content ( url_id, time_of_index, total_size, heading_text,
    navigation_text, content_text, alt_text, title_text );
  link_content ( source_url_id, url_id, title_text, anchor_text );

The master database will have the following additional tables,

  cweb_availability ( cweb_id, url_blocks, queries_per_month );
  cweb_url_allocation ( cweb_id, block_id );
  cweb_query_allocation ( cweb_id, year, month, count );

URLs will be partitioned into blocks of 1000 URLs. The first block will be URL ID 1 to 999, the second block will be URL ID 1000 to 1999, the third block 2000 to 2999, and so on. Blocks will be identified by the first ID in the block, so the first block's ID will be 1, the second's 1000, the third's 2000, and so on. Cweb sites can nominate how many URL blocks they will index. After a while we'll have a good idea of how much data is associated with the average URL, and we'll be able to let users nominate how much bandwidth they are willing to provide for indexing per month. We'll also be able to have users nominate how much bandwidth they are willing to provide for queries per month.

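Under that numbering, the block for a given URL ID can be computed directly (a Python sketch; note the first block holds only 999 URLs, since IDs start at 1):

```python
def block_id(url_id):
    # Blocks are identified by the first URL ID they contain: 1 for
    # IDs 1-999, then 1000, 2000, and so on in runs of 1000.
    if url_id < 1000:
        return 1
    return (url_id // 1000) * 1000
```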
Cweb sites that are not the master will have the tables,

  query_max ( last_block_id );
  cweb_query ( block_id, cweb_id, quota );

In order to satisfy a query, the cweb site handling the request will need to contact another cweb site for each block. Every time a cweb site interacts with the master site (for instance, when updating its blacklist) it will receive the 'last_block_id', which is essentially the maximum URL ID rounded down to the nearest 1000 (or 1 if it's below 1000). So each cweb can keep its query_max table up-to-date with the number of blocks being indexed by the distributed system.

Say the last block ID was 5000, meaning that the system was indexing up to 5999 URLs. In this case each cweb site would need to have an entry in its cweb_query table for each block in the system. Say the query was "Cweb". A site handling the query for "Cweb" would need to know which cwebs it needed to contact to search the entire index. It would need to have a cweb_query record for each block in the system, being 1, 1000, 2000, 3000, 4000 and 5000. It would start by looking for blocks that it already had, whose quota was greater than zero (when the quota reaches zero the cweb_query record is deleted, so essentially it will just be looking for cweb_query records that it has). Say it finds cweb_query records of { ( 1, 1, 50 ), ( 1000, 1, 50 ), ( 2000, 2, 100 ) }. It's then missing cweb_query records for blocks 3000, 4000 and 5000, so it will contact the master server with the list of blocks it's missing results for. The master server will look at the request and find suitable servers to contact for each block. It might respond with { ( 3000, 1, 100 ), ( 4000, 2, 100 ), ( 5000, 3, 50 ) }. This means that for block 4 (URL ID = 3000), the query will go to cweb 1, up to 100 times; for block 5 (URL ID = 4000), the query will go to cweb 2, up to 100 times; and for block 6 (URL ID = 5000), the query will go to cweb 3, up to 50 times. The third cell in each response is the quota. A cweb will record the quota, and decrement that cweb_query.quota by one each time it contacts a cweb to handle a request. When the quota reaches zero the cweb_query record will be deleted. In this way, a cweb can establish that, for example, for block 3000 it can send index search requests to cweb 1 up to 100 times before it needs to contact the master server again for a cweb_query allocation.

The master server will track how many allocations it has given each cweb in a given month, and if the allocation reaches the user's defined quota then no more cweb_query allocations will be made for the user's cweb site.
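That bookkeeping can be sketched in miniature. This is illustrative Python, not the planned PHP implementation; QueryRouter and the master callback are hypothetical names:

```python
class QueryRouter:
    """Track cweb_query allocations: for each URL block, which cweb to
    contact and how many more times we may do so before asking the master."""

    def __init__(self, master):
        self.master = master     # callable: missing block ids -> {block: (cweb_id, quota)}
        self.cweb_query = {}     # block_id -> [cweb_id, remaining quota]

    def route(self, last_block_id):
        # All block ids in the system: 1, 1000, 2000, ..., last_block_id.
        blocks = [1] + list(range(1000, last_block_id + 1, 1000))
        missing = [b for b in blocks if b not in self.cweb_query]
        if missing:
            # Ask the master for allocations covering the missing blocks.
            for block, (cweb_id, quota) in self.master(missing).items():
                self.cweb_query[block] = [cweb_id, quota]
        plan = {}
        for b in blocks:
            entry = self.cweb_query[b]
            plan[b] = entry[0]   # which cweb handles this block's search
            entry[1] -= 1        # one use of the allocation
            if entry[1] == 0:
                del self.cweb_query[b]   # quota exhausted; re-fetch next time
        return plan
```

Each call to route returns, per block, the cweb site to contact for that slice of the index, refreshing exhausted allocations from the master as it goes.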

Latest revision as of 15:08, 5 July 2012
