![](https://img.51dongshi.com/20241130/wz/18297492052.jpg)
Today I was configuring Redis Sentinel to monitor some Redis servers, and along the way I wondered: what happens if two Redis instances are each made a slave of the other? Here is the experiment:

1. Set up two Redis instances: redis1 configured as the master, and redis2 configured as its slave with `slaveof redis1`.
2. Start redis1 and redis2.
3. Once both are up and redis2 is successfully replicating from redis1, connect to redis1 with redis-cli and make redis1 a slave of redis2: `slaveof [redis2 IP] [redis2 port]`

The result: both instances log a failed SYNC over and over. Two Redis instances clearly cannot be slaves of each other.

Log of redis1 after running slaveof:

```
[14793] 06 Sep 17:36:20.426 * SLAVE OF 10.18.129.49:9778 enabled (user request)
[14793] 06 Sep 17:36:20.636 - Accepted 10.18.129.49:44277
[14793] 06 Sep 17:36:20.637 - Client closed connection
[14793] 06 Sep 17:36:20.804 * Connecting to MASTER...
[14793] 06 Sep 17:36:20.804 * MASTER SLAVE sync started
[14793] 06 Sep 17:36:20.804 * Non blocking connect for SYNC fired the event.
[14793] 06 Sep 17:36:20.804 * Master replied to PING, replication can continue...
[14793] 06 Sep 17:36:20.804 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
[14793] 06 Sep 17:36:21.636 - Accepted 10.18.129.49:44279
[14793] 06 Sep 17:36:21.637 - Client closed connection
[14793] 06 Sep 17:36:21.804 * Connecting to MASTER...
[14793] 06 Sep 17:36:21.804 * MASTER SLAVE sync started
[14793] 06 Sep 17:36:21.804 * Non blocking connect for SYNC fired the event.
[14793] 06 Sep 17:36:21.804 * Master replied to PING, replication can continue...
[14793] 06 Sep 17:36:21.804 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
[14793] 06 Sep 17:36:22.636 - Accepted 10.18.129.49:44281
[14793] 06 Sep 17:36:22.637 - Client closed connection
[14793] 06 Sep 17:36:22.804 * Connecting to MASTER...
[14793] 06 Sep 17:36:22.804 * MASTER SLAVE sync started
[14793] 06 Sep 17:36:22.804 * Non blocking connect for SYNC fired the event.
[14793] 06 Sep 17:36:22.804 * Master replied to PING, replication can continue..
```

Log of redis2:

```
[14796] 06 Sep 17:36:20.426 - Client closed connection
[14796] 06 Sep 17:36:20.636 * Connecting to MASTER...
[14796] 06 Sep 17:36:20.636 * MASTER SLAVE sync started
[14796] 06 Sep 17:36:20.636 * Non blocking connect for SYNC fired the event.
[14796] 06 Sep 17:36:20.636 * Master replied to PING, replication can continue...
[14796] 06 Sep 17:36:20.636 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
[14796] 06 Sep 17:36:20.804 - Accepted 10.18.129.49:51034
[14796] 06 Sep 17:36:20.805 - Client closed connection
[14796] 06 Sep 17:36:21.636 * Connecting to MASTER...
[14796] 06 Sep 17:36:21.636 * MASTER SLAVE sync started
[14796] 06 Sep 17:36:21.636 * Non blocking connect for SYNC fired the event.
[14796] 06 Sep 17:36:21.636 * Master replied to PING, replication can continue...
[14796] 06 Sep 17:36:21.637 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
[14796] 06 Sep 17:36:21.804 - Accepted 10.18.129.49:51036
[14796] 06 Sep 17:36:21.805 - Client closed connection
[14796] 06 Sep 17:36:22.636 - DB 0: 20 keys (0 volatile) in 32 slots HT.
[14796] 06 Sep 17:36:22.636 - 0 clients connected (0 slaves), 801176 bytes in use
[14796] 06 Sep 17:36:22.636 * Connecting to MASTER...
[14796] 06 Sep 17:36:22.636 * MASTER SLAVE sync started
[14796] 06 Sep 17:36:22.636 * Non blocking connect for SYNC fired the event.
[14796] 06 Sep 17:36:22.636 * Master replied to PING, replication can continue..
```

Both instances end up stuck in an endless loop of failed SYNC attempts.

My first question was: why does redis2, the original slave, issue SYNC again? The first line of the redis2 log above shows that the original master-slave connection was closed.

I read replication.c, the source that implements replication setup. Below is the code redis1 runs for the slaveof command. Partway through it calls disconnectSlaves(), which is what drops the existing master-slave connection:

```c
void slaveofCommand(redisClient *c) {
    if (!strcasecmp(c->argv[1]->ptr,"no") &&
        !strcasecmp(c->argv[2]->ptr,"one")) {
        // omitted
    } else {
        // omitted

        /* There was no previous master or the user specified a different one,
         * we can continue. */
        sdsfree(server.masterhost);
        server.masterhost = sdsdup(c->argv[1]->ptr);
        server.masterport = port;
        if (server.master) freeClient(server.master);
        disconnectSlaves(); /* Force our slaves to resync with us as well. */
        cancelReplicationHandshake();
        server.repl_state = REDIS_REPL_CONNECT;
        redisLog(REDIS_NOTICE,"SLAVE OF %s:%d enabled (user request)",
            server.masterhost, server.masterport);
    }
    addReply(c,shared.ok);
}
```

The comment next to disconnectSlaves() reads: "Force our slaves to resync with us as well." In other words: I (redis1) disconnect you (redis2) first, and once I have finished syncing with my own master, you can come back and sync from me. That is why redis2 lost its connection to redis1. And since redis2 was a slave from the start, whenever it loses its master it keeps reconnecting and issuing SYNC until it succeeds.

That explains why redis2 also issues SYNC. The second question is why the SYNC on both sides keeps failing, and the cause turns out to be closely related to the first. Both instances log the same error: ERR Can't SYNC while not connected with my master. In the source, that message comes from:

```c
void syncCommand(redisClient *c) {
    /* ignore SYNC if already slave or in monitor mode */
    if (c->flags & REDIS_SLAVE) return;

    /* Refuse SYNC requests if we are a slave but the link with our master
     * is not ok... */
    if (server.masterhost && server.repl_state != REDIS_REPL_CONNECTED) {
        addReplyError(c,"Can't SYNC while not connected with my master");
        return;
    }

    /* SYNC can't be issued when the server has pending data to send to
     * the client about already issued commands. We need a fresh reply
     * buffer registering the differences between the BGSAVE and the current
     * dataset, so that we can copy to other slaves if needed. */
    if (listLength(c->reply) != 0) {
        addReplyError(c,"SYNC is invalid with pending input");
        return;
    }
    // omitted
}
```

syncCommand is what a Redis instance runs, in its role as master, when it receives a SYNC command from a slave. Note the comment above: "Refuse SYNC requests if we are a slave but the link with our master is not ok...". When redis1, acting as master, receives SYNC from its slave, it evaluates `if (server.masterhost && server.repl_state != REDIS_REPL_CONNECTED)`. At that moment redis1 has just been made a slave of another master (redis2) but has not finished syncing with it. To finish, redis1 must send SYNC to redis2 and get a successful reply, but redis2 can only serve redis1's SYNC after redis1 has served redis2's SYNC. That is a deadlock. The condition therefore evaluates to true, and redis1 replies with the error "Can't SYNC while not connected with my master". The same holds for redis2, so both sides stay stuck in the "Can't SYNC while not connected with my master" state.
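The circular wait described above can be reproduced with a toy model. The sketch below is not Redis code: the `node` struct, `accepts_sync`, `try_sync`, `slaveof`, `simulate_mutual_slaveof` and `simulate_chain` are all hypothetical names made up for illustration. It assumes only the two behaviors quoted from replication.c: SLAVEOF puts the instance back into the "connect" state and drops its slaves, and a slave whose own master link is not up refuses SYNC.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified replication states, loosely named after the REDIS_REPL_* constants. */
typedef enum { REPL_NONE, REPL_CONNECT, REPL_CONNECTED } repl_state;

/* Hypothetical toy model of one instance; not the real redisServer struct. */
typedef struct node {
    struct node *master;  /* NULL when the instance is a plain master */
    repl_state state;
} node;

/* Same condition as the guard in syncCommand(): a slave whose own master
 * link is not established refuses SYNC requests. */
static int accepts_sync(const node *n) {
    return !(n->master != NULL && n->state != REPL_CONNECTED);
}

/* One reconnect attempt by a slave. Returns 1 if the link is (now) up. */
static int try_sync(node *slave) {
    if (slave->master == NULL || slave->state == REPL_CONNECTED) return 1;
    if (!accepts_sync(slave->master)) return 0;  /* "ERR Can't SYNC while ..." */
    slave->state = REPL_CONNECTED;
    return 1;
}

/* Toy SLAVEOF: point n at a new master and force n's own slaves to resync,
 * standing in for the disconnectSlaves() call. */
static void slaveof(node *n, node *new_master, node **all, size_t count) {
    n->master = new_master;
    n->state = REPL_CONNECT;
    for (size_t i = 0; i < count; i++) {
        if (all[i]->master == n) all[i]->state = REPL_CONNECT;
    }
}

/* redis2 replicates from redis1, then redis1 is made a slave of redis2.
 * Returns how many of the two links are up after `rounds` retry rounds. */
int simulate_mutual_slaveof(int rounds) {
    node redis1 = { NULL, REPL_NONE };
    node redis2 = { &redis1, REPL_CONNECT };
    node *all[] = { &redis1, &redis2 };

    int ok = try_sync(&redis2);  /* normal setup succeeds */
    assert(ok && redis2.state == REPL_CONNECTED);

    slaveof(&redis1, &redis2, all, 2);  /* the experiment */
    for (int i = 0; i < rounds; i++) {
        try_sync(&redis1);
        try_sync(&redis2);
    }
    return (redis1.state == REPL_CONNECTED) + (redis2.state == REPL_CONNECTED);
}

/* Contrast case without a cycle: redis1 is made a slave of an independent
 * master redis3, so redis1 can sync first and redis2 can follow. */
int simulate_chain(int rounds) {
    node redis3 = { NULL, REPL_NONE };
    node redis1 = { NULL, REPL_NONE };
    node redis2 = { &redis1, REPL_CONNECT };
    node *all[] = { &redis1, &redis2, &redis3 };

    try_sync(&redis2);
    slaveof(&redis1, &redis3, all, 3);
    for (int i = 0; i < rounds; i++) {
        try_sync(&redis1);
        try_sync(&redis2);
    }
    return (redis1.state == REPL_CONNECTED) + (redis2.state == REPL_CONNECTED);
}
```

In this model `simulate_mutual_slaveof` never gets either link up no matter how many rounds it retries, because each side's guard waits on the other. `simulate_chain` recovers within one round, since redis3 has no master of its own and can accept redis1's SYNC right away. The model ignores timing, networking and the actual data transfer, so it only illustrates the ordering problem.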