Wikipedia data
Utils
Show the first 10 lines of an SQL file: head -n 10 ./dump/page.sql
Inspect the size of the volume used by the database: docker system df -v | grep 'wikipedia-solver-mariadb-data'
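The dumps are distributed compressed (for example page.sql.gz), so a quick look at the header does not require unpacking the whole file first. A minimal sketch, assuming `gzip` is installed; the function name `peek_dump` and its default of 10 lines are illustrative choices, not part of the original notes:

```shell
# peek_dump FILE [N]: print the first N lines (default 10) of a gzip-compressed dump.
# Hypothetical helper; it streams the file, so nothing is written to disk.
peek_dump() {
  # head exits after N lines, which also stops gzip early on large dumps.
  gzip -dc "$1" | head -n "${2:-10}"
}
```

Example: `peek_dump ./dump/page.sql.gz` prints the dump header, and `peek_dump ./dump/page.sql.gz 50` extends the view to the table definition.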
Remove a volume
# List all volumes
docker volume ls
# Remove a volume
docker volume rm data_wikipedia-solver-mariadb-data
# Or remove the volumes together with the containers when stopping the stack
docker-compose down --volumes
MySQL Related
Issues with Wikipedia dump table pagelinks: https://stackoverflow.com/questions/43954631/issues-with-wikipedia-dump-table-pagelinks
Importing Wikipedia dump to MySQL: https://stackoverflow.com/questions/40384864/importing-wikipedia-dump-to-mysql
Import data.sql MySQL Docker Container: https://stackoverflow.com/questions/43880026/import-data-sql-mysql-docker-container
MySQL any way to import a huge (32 GB) sql dump faster?: https://dba.stackexchange.com/questions/83125/mysql-any-way-to-import-a-huge-32-gb-sql-dump-faster
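One commonly documented technique for speeding up large InnoDB imports (it appears in the MySQL manual's notes on bulk data loading) is to disable autocommit for the session and commit once at the end. The sketch below wraps the decompressed dump stream accordingly. `wrap_dump` is a hypothetical helper, and whether it helps depends on the server configuration, so treat it as something to measure, not as a guaranteed speed-up:

```shell
# wrap_dump FILE: emit a gzip-compressed dump wrapped in a single transaction.
# Hypothetical helper; it only writes SQL to stdout and never touches a database.
wrap_dump() {
  # Turn off per-statement commits for this session before the dump starts.
  echo "SET autocommit=0;"
  gzip -dc "$1"
  # Commit everything at once after the last statement of the dump.
  echo "COMMIT;"
}
```

The output can then be piped into the database client inside the container, for example `wrap_dump ./dump/page.sql.gz | docker exec -i <container> mariadb -u <user> -p<password> <database>`. The container name, credentials, and database are placeholders, and older images may ship the client as `mysql` instead of `mariadb`.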
Dump Links
- Database layout: https://www.mediawiki.org/wiki/Manual:Database_layout
- Wikipedia database download: https://en.wikipedia.org/wiki/Wikipedia:Database_download
- enwiki dumps: https://dumps.wikimedia.org/enwiki/
page.sql.gz
MySQL full version
-- MariaDB dump 10.19 Distrib 10.5.23-MariaDB, for debian-linux-gnu (x86_64)
--
-- Host: db1206 Database: enwiki
-- ------------------------------------------------------
-- Server version 10.6.17-MariaDB-log
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
--
-- Table structure for table `page`
--
DROP TABLE IF EXISTS `page`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `page` (
`page_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
`page_namespace` int(11) NOT NULL DEFAULT 0,
`page_title` varbinary(255) NOT NULL DEFAULT '',
`page_is_redirect` tinyint(1) unsigned NOT NULL DEFAULT 0,
`page_is_new` tinyint(1) unsigned NOT NULL DEFAULT 0,
`page_random` double unsigned NOT NULL DEFAULT 0,
`page_touched` binary(14) NOT NULL,
`page_links_updated` varbinary(14) DEFAULT NULL,
`page_latest` int(8) unsigned NOT NULL DEFAULT 0,
`page_len` int(8) unsigned NOT NULL DEFAULT 0,
`page_content_model` varbinary(32) DEFAULT NULL,
`page_lang` varbinary(35) DEFAULT NULL,
PRIMARY KEY (`page_id`),
UNIQUE KEY `page_name_title` (`page_namespace`,`page_title`),
KEY `page_random` (`page_random`),
KEY `page_len` (`page_len`),
KEY `page_redirect_namespace_len` (`page_is_redirect`,`page_namespace`,`page_len`)
) ENGINE=InnoDB AUTO_INCREMENT=77490241 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Dumping data for table `page`
--