Elasticsearch: reindex to increase the number of shards
The Reindex API is a mechanism for recreating an index with new settings applied.
We have a candidate-application index with only 1 shard, and we need 6.
First of all, to create the new index we need the mappings of the existing one:
# GET candidate-application/_mapping
{
"candidate-application" : {
"mappings" : {
"properties" : {
"employerId" : {
"type" : "keyword"
},
"id" : {
"type" : "keyword"
},
...
}
}
}
}
Next, create a new index with the same mappings, but with the desired number of shards and replicas:
PUT /candidate-application-2
{
"settings": {
"number_of_shards": 6,
"number_of_replicas": 0
},
"mappings": {
"properties": {
"employerId" : {
"type" : "keyword"
},
"id" : {
"type" : "keyword"
},
...
}
}
}
Note: while reindexing we set the number of replicas to 0; in theory this makes it a little faster.
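Copying the mappings by hand gets error-prone for larger indices. The following is a minimal sketch in Python of how the create-index body could be derived from the `_mapping` response. The function names `build_create_body` and `build_replicas_body` and the trimmed sample data are illustrative assumptions, not part of the original procedure. The second helper builds a body for `PUT /<index>/_settings` so that replicas can be raised again after the reindex; the value to restore depends on the cluster and is assumed here.

```python
import json


def build_create_body(mapping_response, source_index, shards, replicas=0):
    """Build a create-index body that reuses the mappings of source_index."""
    mappings = mapping_response[source_index]["mappings"]
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": mappings,
    }


def build_replicas_body(replicas):
    """Body for PUT /<index>/_settings to change the replica count later."""
    return {"index": {"number_of_replicas": replicas}}


# Trimmed, hypothetical sample of a GET candidate-application/_mapping response.
sample_mapping = {
    "candidate-application": {
        "mappings": {
            "properties": {
                "employerId": {"type": "keyword"},
                "id": {"type": "keyword"},
            }
        }
    }
}

body = build_create_body(sample_mapping, "candidate-application", shards=6)
print(json.dumps(body, indent=2))
print(json.dumps(build_replicas_body(1)))  # replica count 1 is an assumption
```

The printed body can be sent as the payload of the `PUT /candidate-application-2` request.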
Next, start the reindex:
POST _reindex?slices=5&wait_for_completion=false
{
"source": {
"index": "candidate-application",
"size": 10000
},
"dest": {
"index": "candidate-application-2",
"version_type": "external"
}
}
Notes:
- version_type external: asks Elasticsearch to copy documents as is, keeping their versions
- wait_for_completion=false: asks Elasticsearch to run the operation asynchronously
- slices: number of threads (the docs say it should match the number of shards; in our case the source has only one shard, but increasing this value still speeds up the reindex)
- size: number of docs copied at once, 1,000 by default
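To make the role of each parameter explicit, here is a minimal sketch in Python that assembles the same request from named arguments. The helper name `build_reindex_request` and its argument names are assumptions made for illustration; it only builds the path and body and does not talk to a cluster.

```python
import json
from urllib.parse import urlencode


def build_reindex_request(source, dest, slices=5, batch_size=10000, wait=False):
    """Return (path, body) for a sliced reindex request."""
    query = urlencode({
        "slices": slices,                           # parallel slices
        "wait_for_completion": str(wait).lower(),   # "false" -> run async
    })
    body = {
        "source": {"index": source, "size": batch_size},  # docs per batch
        "dest": {"index": dest, "version_type": "external"},
    }
    return f"_reindex?{query}", body


path, body = build_reindex_request(
    "candidate-application", "candidate-application-2"
)
print("POST", path)
print(json.dumps(body, indent=2))
```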
In response we will receive something like:
{
"task" : "Jwk8EOKOSLKxzqHSa2VWuQ:5105786962"
}
To check the current status:
GET _tasks?detailed=true&actions=*reindex
{
"nodes" : {
"Jwk8EOKOSLKxzqHSa2VWuQ" : {
"name" : "es01",
"transport_address" : "62.149.5.105:9300",
"host" : "62.149.5.105",
"ip" : "62.149.5.105:9300",
"roles" : [
"ingest",
"master",
"data",
"ml"
],
"attributes" : {
"ml.machine_memory" : "16818429952",
"xpack.installed" : "true",
"ml.max_open_jobs" : "20"
},
"tasks" : {
"Jwk8EOKOSLKxzqHSa2VWuQ:5106832820" : {
"node" : "Jwk8EOKOSLKxzqHSa2VWuQ",
"id" : 5106832820,
"type" : "transport",
"action" : "indices:data/write/reindex",
"status" : {
"total" : 1013102,
"updated" : 0,
"created" : 60000,
"deleted" : 0,
"batches" : 60,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0
},
"description" : "reindex from [candidate-application] to [candidate-application-mac-1][_doc]",
"start_time_in_millis" : 1600430372126,
"running_time_in_nanos" : 12301230038,
"cancellable" : true,
"headers" : { }
}
}
}
}
}
Here total is the overall number of docs, created is the number of docs copied so far, and running_time_in_nanos is the time elapsed since the start.
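These three fields are enough to compute progress. A minimal sketch in Python, assuming the `_tasks` response has already been parsed into a dict; the function name `summarize_reindex_tasks` is hypothetical and the sample is a trimmed copy of the response above. Unlike a single-node lookup, it walks every node in the response.

```python
def summarize_reindex_tasks(response):
    """Return (task_id, percent, elapsed_seconds) for each task on each node."""
    rows = []
    for node in response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            status = task["status"]
            total = status["total"]
            # A task with total == 0 has nothing to report, so percent is None.
            percent = None if total == 0 else status["created"] * 100 / total
            elapsed = task["running_time_in_nanos"] / 1e9
            rows.append((task_id, percent, elapsed))
    return rows


# Trimmed copy of the sample response shown above.
sample = {
    "nodes": {
        "Jwk8EOKOSLKxzqHSa2VWuQ": {
            "tasks": {
                "Jwk8EOKOSLKxzqHSa2VWuQ:5106832820": {
                    "status": {"total": 1013102, "created": 60000},
                    "running_time_in_nanos": 12301230038,
                }
            }
        }
    }
}

rows = summarize_reindex_tasks(sample)
for task_id, percent, elapsed in rows:
    shown = "empty" if percent is None else f"{percent:.1f}%"
    print(f"{task_id} - {shown} in {elapsed:.1f}s")
```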
After completion, compare the document counts:
GET /_cat/indices/candidate-application*?v&h=index,docs.count&s=index
GET candidate-application/_count
GET candidate-application-mac-1/_count
Notes:
- the first request may show wrong info until the last request has been run
- the last request may take some time when run for the first time (a refresh is happening under the hood)
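The count comparison can be automated and used as a gate for the alias switch that follows. A minimal sketch in Python, assuming both `_count` responses are already parsed and that nothing is written to the source index during the reindex (otherwise exact equality is too strict). The function name `alias_swap_if_counts_match` is hypothetical.

```python
def alias_swap_if_counts_match(source_count, dest_count, alias, old_index, new_index):
    """Return the body for POST /_aliases, or raise if the counts disagree."""
    src = source_count["count"]
    dst = dest_count["count"]
    if src != dst:
        raise ValueError(
            f"count mismatch: {old_index}={src}, {new_index}={dst}"
        )
    # Both actions go into one request, so the alias moves in a single step.
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical parsed _count responses for the two indices.
actions = alias_swap_if_counts_match(
    {"count": 1013102},
    {"count": 1013102},
    alias="candidate-application-alias",
    old_index="candidate-application",
    new_index="candidate-application-mac-1",
)
print(actions)
```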
Switch aliases
POST /_aliases
{
"actions": [
{
"remove": {
"index": "candidate-application",
"alias": "candidate-application-alias"
}
},
{
"add": {
"index": "candidate-application-mac-1",
"alias": "candidate-application-alias"
}
}
]
}
To see the current status of aliases:
GET _cat/aliases/candidate-application*?v&h=alias,index
Timing
time curl -s -X POST -u 'elastic:*******' -H 'Content-Type: application/json' 'https://es01.rabota.ua:9200/_reindex?slices=5' -d '{
"source": {
"index": "candidate-application",
"size": 10000
},
"dest": {
"index": "candidate-application-mac-1",
"version_type": "external"
}
}'
An index with one million docs was reindexed in 14 seconds.
A test on a big index with 16M docs (60 GB) takes approximately 30 minutes, which is fine, because copying that much data is not fast on its own.
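For a rough comparison of the two runs, the throughput can be worked out from the figures above. This is a back-of-the-envelope sketch in Python; the inputs are the rounded numbers quoted in the text, so the results are order-of-magnitude estimates only.

```python
def docs_per_second(docs, seconds):
    """Average throughput of a reindex run."""
    return docs / seconds


small = docs_per_second(1_000_000, 14)        # one million docs in 14 seconds
big = docs_per_second(16_000_000, 30 * 60)    # 16M docs in about 30 minutes

print(f"small index: {small:,.0f} docs/s")  # small index: 71,429 docs/s
print(f"big index:   {big:,.0f} docs/s")    # big index:   8,889 docs/s
```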
While waiting, I put together a small script to see what is going on:
# Credentials for HTTP basic auth
$username = 'elastic'
$password = '*******'
$base64 = [System.Convert]::ToBase64String([System.Text.Encoding]::ASCII.GetBytes("$($username):$($password)"))
$headers = @{ Authorization = "Basic $base64" }
# Fetch all running reindex tasks, grouped by node
$res = Invoke-RestMethod "https://es01.rabota.ua:9200/_tasks?detailed=true&actions=*reindex" -Headers $headers | Select-Object -ExpandProperty nodes
# Only the first node in the response is inspected
$key = $res | Get-Member -MemberType NoteProperty | Select-Object -ExpandProperty Name -First 1
$taskKeys = $res.$key.tasks | Get-Member -MemberType NoteProperty | Select-Object -ExpandProperty Name
foreach ($taskKey in $taskKeys) {
    $task = $res.$key.tasks.$taskKey
    # Skip tasks that report no documents
    if ($task.status.total -eq 0) {
        Write-Host "$taskKey - empty"
        continue
    }
    # Progress in percent and elapsed time (nanoseconds converted to milliseconds)
    $percent = [int]($task.status.created / $task.status.total * 100)
    $timer = [TimeSpan]::FromMilliseconds($task.running_time_in_nanos/1000000).ToString()
    Write-Host "$taskKey - $($percent)% in $timer"
}
The output will be something like:
Jwk8EOKOSLKxzqHSa2VWuQ:5150731438 - empty
Jwk8EOKOSLKxzqHSa2VWuQ:5150731439 - 64% in 00:20:45.4927814
Jwk8EOKOSLKxzqHSa2VWuQ:5150731441 - 66% in 00:20:45.4925209
Jwk8EOKOSLKxzqHSa2VWuQ:5150731443 - 68% in 00:20:45.4923133
Jwk8EOKOSLKxzqHSa2VWuQ:5150731446 - 67% in 00:20:45.4921202
Jwk8EOKOSLKxzqHSa2VWuQ:5150731449 - 66% in 00:20:45.4919510
Note: the first, empty record is fine; it acts as the parent of the child threads.
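The per-slice lines can be folded into one overall figure with a rough estimate of the time left. A minimal sketch in Python, assuming each task is reduced to its `created`, `total` and `running_time_in_nanos` values and that the copy rate stays constant (which it may not). The function name `overall_progress` and the sample numbers are made up for illustration.

```python
def overall_progress(tasks):
    """Return (percent_done, estimated_seconds_left) across all slices."""
    # The parent task reports total == 0, so it is skipped.
    slices = [t for t in tasks if t["total"] > 0]
    created = sum(t["created"] for t in slices)
    total = sum(t["total"] for t in slices)
    if created == 0:
        return 0.0, None  # nothing copied yet, no basis for an estimate
    elapsed = max(t["running_time_in_nanos"] for t in slices) / 1e9
    # Constant-rate assumption: remaining time scales with remaining docs.
    remaining = elapsed * (total - created) / created
    return created * 100 / total, remaining


# Hypothetical values: one empty parent and two child slices.
tasks = [
    {"created": 0, "total": 0, "running_time_in_nanos": 60_000_000_000},
    {"created": 50, "total": 100, "running_time_in_nanos": 60_000_000_000},
    {"created": 70, "total": 100, "running_time_in_nanos": 60_000_000_000},
]

percent, seconds_left = overall_progress(tasks)
print(f"{percent:.0f}% done, about {seconds_left:.0f}s left")  # 60% done, about 40s left
```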
Here are the results for the big index, measured with this aggregation query:
{
"size": 0,
"aggs": {
"status": {
"terms": {
"field": "statusId",
"size": 10
}
}
}
}
On the old index with a single shard it takes 550 ms; on the new index, 20 ms.