Wednesday, February 10, 2016

Why Puppet was using 96% CPU while ran as a daemon or through command prompt?

Problem :  Puppet was consuming 96% CPU while it ran as a daemon or ran as "puppet agent -t".
Environment: RHEL 6
Solution : Make sure application puppet module is tested with out any error and unwanted issue.
Root Cause Analysis: 

top has shown following:


PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11403 root 20 0 259m 140m 4516 R 96.7 1.9 9:08.77 puppet
1. Traced puppet agent which was run at command prompt while I kept daemon off.
$strace -Tvv  puppet agent -t  &> ./strace_log

2. read() system calls are spawned a lot and continuously printing following messages:
[...]
22:36:37 read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0 .000031="">
[...]
3. While I was tracking filename of corresponding file descriptor, I found /home/qpadm/temp was opened that had size 101GB.
4. read() call was continuously trying to read data from this file and read size was 4096 Bytes in every call. Trace looked below:
22:36:37 open("/home/qpadm/temp", O_RDONLY) = 4 <0 .000032="">
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0 .000026="">
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0 .000026="">
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0 .000027="">
22:36:37 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0 .000026="">
22:36:37 fstat(4, {st_dev=makedev(202, 1), st_ino=790228, st_mode=S_IFREG|0600, st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=209715224, st_size=107374182400, st_atime=2016/02/03-07:37:15, st_mtime=2016/02/03-05:11:57, st_ctime=2016/02/03-07:36:48}) = 0 <0 .000027="">
22:36:37 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6d03f3f000 <0 .000030="">
22:36:37 read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0 .000035="">
22:36:37 read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0 .000031="">
In above trace, you see file /home/qpadm/temp is opened in read mode as FD(file descriptor) is 4 and same FD is passed to read() call that returns 4096 Bytes incrementally. It supposed to take long time to read 101GB file[:) would never be finished using huge read calls those would have neen performed by kernel]. That's why CPU spiked to 60-80% consumed by puppet agent.
We need to check why above file is there and what is the usage of it and why it has large size.
This one is I think main culprit.
Another thing I noticed is either ovo was reinstalled or being migrated to newer version. In relation to this, I found following calls :
22:36:00 open("/root/ovoinstall/HPOvPacc.xml", O_RDONLY) = 4 <0 .000022="">

22:36:07 open("/root/ovoinstall/OVO-Agent-Migrate.xml", O_RDONLY) = 4 <0 .000032="">

22:36:07 fstat(4, {st_dev=makedev(202, 1), st_ino=786799, st_mode=S_IFREG|0744, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=4240, st_size=2168841, st_atime=2016/02/03-17:04:19, st_mtime=2016/01/03-22:12:54, st_ctime=2016/01/03-22:15:32}) = 0 <0 .000027="">
22:36:07 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6d03f3f000 <0 .000029="">
22:36:07 read(4, "\355\253\356\333\3\0\0\0\0\1HPOvConf-11.11.025-1\0\0"..., 4096) = 4096 <0 .000032="">
22:36:07 read(4, "\"Unable to copy file\"\nOVCSL_ERRO"..., 4096) = 4096 <0 .000029="">

Application team found that temp was not needed but was called in application puppet module. They disabled it and removed that file. It fixed the issue.


Wednesday, February 3, 2016

why vxdisk does throw error "VxVM vxdctl ERROR V-5-1-16007 Data Corruption Protection Activated - User Corrective Action Needed To recover"

Problem:

Following errors has been encountered :

[root@customer-pet-db-3a ~]# /sbin/vxdctl enable
VxVM vxdctl ERROR V-5-1-16007  Data Corruption Protection Activated - User Corrective Action Needed
To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
Then, execute 'vxdisk rm' on the following devices before reinitiating device discovery:
        fas31400_85, fas31400_88, fas31400_92, fas31400_98, fas31400_93
        fas31400_96, fas31400_94, fas31400_95, fas31400_97, fas31400_99
        fas31402_41, fas31401_39
[root@customer-pet-db-3a ~]#


[root@customer-pet-db-3a ~]# /sbin/vxdisk scandisks
VxVM vxdisk ERROR V-5-1-16007  Data Corruption Protection Activated - User Corrective Action Needed
To recover, first ensure that the OS device tree is up to date (requires OS specific commands).
Then, execute 'vxdisk rm' on the following devices before reinitiating device discovery:
        fas31400_85, fas31400_88, fas31400_92, fas31400_98, fas31400_93
        fas31400_96, fas31400_94, fas31400_95, fas31400_97, fas31400_99
        fas31402_41, fas31401_39
[root@customer-pet-db-3a ~]#

Solution referred :


1. check disk :

[root@customer-pet-db-3a ~]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
cciss/c0d0   auto:none       -            -            online invalid
fas31400_10  auto:cdsdisk    flex3a_myapp_arch  flex3a_arch_dg online thinrclm
fas31400_11  auto:cdsdisk    flex3a_dcd_arch  flex3a_arch_dg online thinrclm
fas31400_12  auto:cdsdisk    flex3a_myapps_arch  flex3a_arch_dg online thinrclm
fas31400_13  auto:cdsdisk    flex3a_rpt_arch  flex3a_arch_dg online thinrclm
fas31400_85  auto:cdsdisk    myapps_pet_data00  flex3a_myapps_pet_data_dg online thinrclm
fas31400_88  auto:cdsdisk    myapps_pet_data01  flex3a_myapps_pet_data_dg online thinrclm
fas31400_89  auto:cdsdisk    customerumprd_arch01  flex3a_pet_customerumprd_arch_dg online thinrclm
fas31400_90  auto:cdsdisk    customerumprd_u02  flex3a_pet_customerumprd_dg online thinrclm
fas31400_91  auto:cdsdisk    customerumprd_u01  flex3a_pet_customerumprd_dg online thinrclm
fas31400_92  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_93  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_94  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_95  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_96  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_97  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_98  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31400_99  auto:cdsdisk    -            (myapp_pet_data_dg) online thinrclm
fas31401_39  auto:cdsdisk    myapps_pet_data02  flex3a_myapps_pet_data_dg online thinrclm
fas31402_41  auto:cdsdisk    myapps_pet_data03  flex3a_myapps_pet_data_dg online thinrclm
[root@customer-pet-db-3a ~]#

2. Removed all disk of myapp and myapps like as it is stated above :


[root@customer-pet-db-3a ~]# for i in fas31400_85 fas31400_88 fas31400_93 fas31400_94 fas31400_95 fas31400_96 fas31400_97 fas31400_98 fas31400_99 fas31401_39 fas31402_41; do vxdisk rm $i; done
VxVM vxdisk ERROR V-5-1-534 Device fas31400_85: Device is in use
VxVM vxdisk ERROR V-5-1-534 Device fas31400_88: Device is in use
VxVM vxdisk ERROR V-5-1-534 Device fas31401_39: Device is in use
VxVM vxdisk ERROR V-5-1-534 Device fas31402_41: Device is in use
[root@customer-pet-db-3a ~]# vxdg list
NAME         STATE           ID
flex3a_pet_customerumprd_arch_dg enabled,cds          1446853137.45.customer-pet-db-3a.example.net
flex3a_arch_dg enabled,cds          1376078555.40.customer-pet-db-3a.example.net
flex3a_myapps_pet_data_dg enabled,cds          1375818468.64.customer-pet-db-3c.example.net
flex3a_pet_customerumprd_dg enabled,cds          1446683837.59.customer-pet-db-3b.example.net

flex3a_myapps_pet_data_dg  is still in use. Need to deport it.

[root@customer-pet-db-3a ~]# /sbin/vxdg deport flex3a_myapps_pet_data_dg

[root@customer-pet-db-3a ~]# vxdg list
NAME         STATE           ID
flex3a_pet_customerumprd_arch_dg enabled,cds          1446853137.45.customer-pet-db-3a.example.net
flex3a_arch_dg enabled,cds          1376078555.40.customer-pet-db-3a.example.net
flex3a_pet_customerumprd_dg enabled,cds          1446683837.59.customer-pet-db-3b.example.net
[root@customer-pet-db-3a ~]#


Remove disks which were in use : 

[root@customer-pet-db-3a ~]# for i in fas31400_85 fas31400_88 fas31400_93 fas31400_94 fas31400_95 fas31400_96 fas31400_97 fas31400_98 fas31400_99 fas31401_39 fas31402_41; do vxdisk rm $i; done
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_93: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_94: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_95: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_96: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_97: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_98: no such object in the configuration
VxVM vxdisk ERROR V-5-1-639 Failed to obtain locks:
        fas31400_99: no such object in the configuration
[root@customer-pet-db-3a ~]#

don't worry above error as we have already removed disk of myapp. That's why above errors are displayed.

check disk now:

[root@customer-pet-db-3a ~]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
cciss/c0d0   auto:none       -            -            online invalid
fas31400_10  auto:cdsdisk    flex3a_myapp_arch  flex3a_arch_dg online thinrclm
fas31400_11  auto:cdsdisk    flex3a_dcd_arch  flex3a_arch_dg online thinrclm
fas31400_12  auto:cdsdisk    flex3a_myapps_arch  flex3a_arch_dg online thinrclm
fas31400_13  auto:cdsdisk    flex3a_rpt_arch  flex3a_arch_dg online thinrclm
fas31400_89  auto:cdsdisk    customerumprd_arch01  flex3a_pet_customerumprd_arch_dg online thinrclm
fas31400_90  auto:cdsdisk    customerumprd_u02  flex3a_pet_customerumprd_dg online thinrclm
fas31400_91  auto:cdsdisk    customerumprd_u01  flex3a_pet_customerumprd_dg online thinrclm
[root@customer-pet-db-3a ~]#

3. Enable and Scan disk again :

[root@customer-pet-db-3a ~]# /sbin/vxdctl enable
[root@customer-pet-db-3a ~]# /sbin/vxdisk scandisks

Didn't see any more error.

4. Export dg now :

[root@customer-pet-db-3a ~]# /sbin/vxdg -C -n flex3a_myapp_pet_data_dg import myapp_pet_data_dg
[root@customer-pet-db-3a ~]# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
cciss/c0d0   auto:none       -            -            online invalid
fas31400_10  auto:cdsdisk    flex3a_myapp_arch  flex3a_arch_dg online thinrclm
fas31400_11  auto:cdsdisk    flex3a_dcd_arch  flex3a_arch_dg online thinrclm
fas31400_12  auto:cdsdisk    flex3a_myapps_arch  flex3a_arch_dg online thinrclm
fas31400_13  auto:cdsdisk    flex3a_rpt_arch  flex3a_arch_dg online thinrclm
fas31400_85  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31400_88  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31400_89  auto:cdsdisk    customerumprd_arch01  flex3a_pet_customerumprd_arch_dg online thinrclm
fas31400_90  auto:cdsdisk    customerumprd_u02  flex3a_pet_customerumprd_dg online thinrclm
fas31400_91  auto:cdsdisk    customerumprd_u01  flex3a_pet_customerumprd_dg online thinrclm
fas31400_92  auto:cdsdisk    myapp_pet_data00  flex3a_myapp_pet_data_dg online thinrclm
fas31400_93  auto:cdsdisk    myapp_pet_data06  flex3a_myapp_pet_data_dg online thinrclm
fas31400_94  auto:cdsdisk    myapp_pet_data03  flex3a_myapp_pet_data_dg online thinrclm
fas31400_95  auto:cdsdisk    myapp_pet_data07  flex3a_myapp_pet_data_dg online thinrclm
fas31400_96  auto:cdsdisk    myapp_pet_data05  flex3a_myapp_pet_data_dg online thinrclm
fas31400_97  auto:cdsdisk    myapp_pet_data04  flex3a_myapp_pet_data_dg online thinrclm
fas31400_98  auto:cdsdisk    myapp_pet_data01  flex3a_myapp_pet_data_dg online thinrclm
fas31400_99  auto:cdsdisk    myapp_pet_data02  flex3a_myapp_pet_data_dg online thinrclm
fas31401_39  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31401_40  auto:cdsdisk    myapp_pet_data08  flex3a_myapp_pet_data_dg online thinrclm
fas31401_41  auto:cdsdisk    myapp_pet_data11  flex3a_myapp_pet_data_dg online thinrclm
fas31401_42  auto:cdsdisk    myapp_pet_data10  flex3a_myapp_pet_data_dg online thinrclm
fas31401_43  auto:cdsdisk    myapp_pet_data09  flex3a_myapp_pet_data_dg online thinrclm
fas31402_41  auto:cdsdisk    -            (flex3a_myapps_pet_data_dg) online thinrclm
fas31402_45  auto:cdsdisk    myapp_pet_data14  flex3a_myapp_pet_data_dg online thinrclm
fas31402_46  auto:cdsdisk    myapp_pet_data12  flex3a_myapp_pet_data_dg online thinrclm
fas31402_47  auto:cdsdisk    myapp_pet_data15  flex3a_myapp_pet_data_dg online thinrclm
fas31402_48  auto:cdsdisk    myapp_pet_data13  flex3a_myapp_pet_data_dg online thinrclm
[root@customer-pet-db-3a ~]#

Received error for myapps, 

[root@customer-pet-db-3a ~]# /sbin/vxdg -C -n flex3a_myapps_pet_data_dg import myapps_pet_data_dg
VxVM vxdg ERROR V-5-1-10978 Disk group myapps_pet_data_dg: import failed:
No valid disk found containing disk group
[root@customer-pet-db-3a ~]#

DG is already renamed as flex3a_myapps_pet_data_dg. Just import it simply.

[root@customer-pet-db-3a ~]# vxdg import flex3a_myapps_pet_data_dg

[root@customer-pet-db-3a ~]#


check dg now :

[root@customer-pet-db-3a ~]# vxdg list|grep -i myapp
flex3a_myapp_pet_data_dg enabled,cds          1375815375.127.customer-pet-db-3c.example.net
flex3a_myapps_pet_data_dg enabled,cds          1375818468.64.customer-pet-db-3c.example.net
[root@customer-pet-db-3a ~]#

Confirmed with DBA, they can start DB instances.


Thursday, December 24, 2015

why did kinit throw error "Preauthentication failed while getting initial credentials"?

Problem: 
While executed following command :

#kinit -k -t /root/utilscripts/nsupdateuser.keytab nsupdate@example.com

it threw error:

kinit: Preauthentication failed while getting initial credentials

Solution: Password of user may be wrong, Try to reset and test again.

Why did kinit throw error "KDC reply did not match expectations while getting initial credentials"

Problem: 

While executing following command:

#kinit username@MYDOMAIN.COM -k -t username.keytab

it threw error :

kinit: KDC reply did not match expectations while getting initial credentials

Solution : 

user doesn't have remote access to the machine. 

How to manage kerberos keytab file?

1. Use klist to display the keytab file entries:

klist -e -k -t  mykeytabfile.keytab
or klist -ekt nsupdateuser.ktab
or type command:

#ktutil     # execute this command
ktutil:     # this prompt will appear
ktutil: read_kt /etc/apache2/http.keytab   #read keytab file
ktuilt: list                             #list all princples

example:

[root@customer-prod-util-101 utilscripts]# klist -e -k -t  nsupdateuser.ktab
Keytab name: FILE:nsupdateuser.ktab
KVNO Timestamp         Principal
---- ----------------- --------------------------------------------------------
   1 09/08/15 21:50:45 nsupdate@site1.example.com (aes256-cts-hmac-sha1-96)
[root@customer-prod-util-101 utilscripts]#


2. Following is an example of the keytab file creation process using kerberos method :

  > ktutil
  ktutil:  addent -password -p username@example.com -k 1 -e rc4-hmac
  Password for username@example.com: [enter your password]
  ktutil:  addent -password -p username@example.com -k 1 -e aes256-cts
  Password for username@example.com: [enter your password]
  ktutil:  wkt username.keytab
  ktutil:  quit

Following is an example using Heimdal Kerberos:

> ktutil -k username.keytab add -p username@example.com -e arcfour-hmac-md5 -V 1


3. Obtain a ticket-granting ticket using the keytab for testing:

You can check that the keytab contains the appropriate encryption key by attempting to use it to obtain a ticket-granting ticket. This can be done using the kinit command:

#kinit -k -t /etc/nsupdateuser.keytab nsupdate@example.com    # here nsupdate is username exiting in AD. this has privileges to update dns records on win DNS.
#klist                <#will show if ticket is created or not

example :

[root@customer-prod-util-101 ~]#  kinit -k -t /root/utilscripts/nsupdateuser.ktab nsupdate   # uses default domain
[root@customer-prod-util-101 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: nsupdate@site1.example.com

Valid starting     Expires            Service principal
12/24/15 05:29:49  12/24/15 15:29:49  krbtgt/site1.example.com@site1.example.com
        renew until 12/31/15 05:29:49
[root@customer-prod-util-101 ~]#


   Or try to login to test if keytab file works :

Test with out keytab file:

#kinit username@MYDOMAIN.COM
password>     // pass password of username

Test with keytab file:

#kinit username@MYDOMAIN.COM -k -t username.keytab


4. Using a keytab to authenticate scripts:

To execute a script so it has valid Kerberos credentials, use:

  > kinit username@example.com -k -t mykeytab; myscript

list out principle :

>ktutil
ktutil:  rkt nsupdateuser.ktab
ktuilt: list

5. Merging keytab files:

> ktutil
  ktutil: read_kt mykeytab-1
  ktutil: read_kt mykeytab-2
  ktutil: read_kt mykeytab-3
  ktutil: write_kt krb5.keytab
  ktutil: quit


6. Delete principle:

#ktuil
ktutil: rkt
ktuilt: list
ktutil: delete_entry slot-number
ktuilt: wkt
ktuilt: quit


7. Destroy cached ticket:

kdestroy -A  //all cache will be destroyed
kdestroy -C  //this cache will be deleted only

#kdestroy -c "FILE:/tmp/krb5cc_0"

Monday, September 28, 2015

Why did df command throw "disk" Input output error

Issue/Symptom  : While DBA was starting oracle instance, it was failing. On checking FS, it was found that arch volumes are not mounted. While tried to remount them and checked through "df -h", it was throwing below error:
[root@customer-pet-oracle-3d ~]#  df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/cciss/c0d0p3      58G   48G  7.3G  87% /
/dev/cciss/c0d0p1     494M   18M  452M   4% /boot
tmpfs                  63G  232M   63G   1% /dev/shm
tmpfs                 4.0K     0  4.0K   0% /dev/vx
df: `/cinprds1_arch00': Input/output error
df: `/cinprd1_arch00': Input/output error
df: `/customerdcdp1_arch00': Input/output error
df: `/customerrptp1_arch00': Input/output error
example-prod-sea1utilnas-1a-pet:/vol/customerpet_data
                      450G  335G  116G  75% /filers/example-prod-sea1utilnas-1a-pet/customerpet_data
[root@customer-pet-oracle-3d ~]
OS Environment : RHEL 5.5
Software/Application :
DB : oracle 11.2.0.4
VxVm : VRTSvxvm-5.1.100.000-SP1_RHEL5, Symantec License Manager vxlicrep utility version 3.02.51.010
vxfs : VRTSvxfs-5.1.100.000-SP1_GA_RHEL5
Customer Environment : ATT PET Oracle DB
Investigation :
1.
$sanlun lun show|grep -i minipet_arc

customer-pet-sea1bfiler-1a:  /vol/customer_MINIPET_ARCH/lun1                /dev/sdag        host1    FCP        500.1g (536952700928)   GOOD
customer-pet-sea1bfiler-1a:  /vol/customer_MINIPET_ARCH/lun0                /dev/sdah        host1    FCP        500.1g (536952700928)   GOOD
customer-pet-sea1bfiler-1a:  /vol/customer_MINIPET_ARCH/lun2                /dev/sdai        host1    FCP          250g (268435456000)   GOOD
customer-pet-sea1bfiler-1a:  /vol/customer_MINIPET_ARCH/lun3                /dev/sdaj        host1    FCP          250g (268435456000)   GOOD
2. Check fstab entry how it is :
fstab was :
/dev/vx/dsk/minipet_arch_dg/cinprds_minipet_vol_arch00 /cinprds1_arch00 vxfs    _netdev 0 1
3. Check netfs service if running:
$/etc/init.d/netfs status
4. Search dg in log as root  :
$ awk '/arch_dg/ {print $0}' /var/log/messages.*

Sep 25 19:26:08 customer-pet-oracle-3d kernel: vxfs: msgcnt 1 mesg 037: V-2-37: vx_metaioerr - vx_inode_iodone - /dev/vx/dsk/minipet_arch_dg/rpt_minipet_vol_arch00 file system meta data write error in dev/block 0/1104
Sep 25 19:26:08 customer-pet-oracle-3d vxvm:vxconfigd: V-5-1-7935 Disk group minipet_arch_dg: update failed: Disk group has no valid configuration copies
Sep 25 19:26:08 customer-pet-oracle-3d vxvm:vxconfigd: V-5-1-7934 Disk group minipet_arch_dg: Disabled by errors
[...]
Sep 25 19:30:01 customer-pet-oracle-3d kernel: VxVM vxio V-5-3-1285 voldmp_errbuf_sio_start: Failed to flush the error buffer ffff811130c6aa00 on device 0xc900130 to DMP<4>vxfs: msgcnt 5 mesg 039: V-2-39: vx_writesuper - /dev/vx/dsk/minipet_arch_dg/cinprds_minipet_vol_arch00 file system super-block write error
Sep 25 19:30:01 customer-pet-oracle-3d kernel: vxfs: msgcnt 6 mesg 037: V-2-37: vx_metaioerr - vx_dirbread - /dev/vx/dsk/minipet_arch_dg/cinprds_minipet_vol_arch00 file system meta data write error in dev/block 0/1104
[...]
Sep 25 19:40:01 customer-pet-oracle-3d kernel: vxfs: msgcnt 21 mesg 039: V-2-39: vx_writesuper - /dev/vx/dsk/minipet_arch_dg/rpt_minipet_vol_arch00 file system super-block write error
Sep 25 19:40:01 customer-pet-oracle-3d kernel: vxfs: msgcnt 22 mesg 008: V-2-8: vx_direrr: vx_readdir_int_1 - /dev/vx/dsk/minipet_arch_dg/rpt_minipet_vol_arch00 file system dir inode 5 dev/block 0/150297879 dirent inode 0 error 5
Sep 25 19:40:01 customer-pet-oracle-3d kernel: vxfs: msgcnt 23 mesg 039: V-2-39: vx_writesuper - /dev/vx/dsk/minipet_arch_dg/rpt_minipet_vol_arch00 file system super-block write error
[...]
Sep 26 04:18:13 customer-pet-oracle-3d kernel: vxfs: msgcnt 334 mesg 016: V-2-16: vx_ilisterr: vx_iread - /dev/vx/dsk/minipet_arch_dg/rpt_minipet_vol_arch00 file system error reading inode 3
Sep 26 04:18:13 customer-pet-oracle-3d kernel: vxfs: msgcnt 335 mesg 039: V-2-39: vx_writesuper - /dev/vx/dsk/minipet_arch_dg/rpt_minipet_vol_arch00 file system super-block write error
Sep 26 04:18:13 customer-pet-oracle-3d kernel: vxfs: msgcnt 336 mesg 031: V-2-31: vx_disable - /dev/vx/dsk/minipet_arch_dg/rpt_minipet_vol_arch00 file system disabled
Above confirms that few blocks are corrupted in disks which are under disk group "minipet_arch_dg"

Permanent Solution :
1. Unmount file system if mounted. 2. Run file system check through fsck :
$fsck -F vxfs -o full 
example :
$/opt/VRTS/bin/fsck -o full -y /dev/vx/rdsk/minipet_arch_dg/rpt_minipet_vol_arch00

or 

$fsck.vxfs -o full /dev/vx/dsk/minipet_arch_dg/rpt_minipet_vol_arch00
3. OR:
or reboot system [make sure fsck is enabled in fstab]

Root Cause Analysis :

Error messages in system log confirm that disk blocks are corrupted. vxiod was failing to write data. vxconfigd informed kernel that it was unable to change vxfs config. 

Tuesday, August 11, 2015

Why does liwewise throw error like "Problem executing /opt/pbis/bin/ad-cache --delete-all >/dev/null 2>/dev/null"?

OS Environment :  Linux
Application : pbis aka likewise with version 8.3
Problem: While executing following command, it was throwing error :

/opt/likewise/bin/lwconfig --file /opt/pbis/bin/lwconfig.txt
Problem executing '/opt/pbis/bin/ad-cache --delete-all >/dev/null 2>/dev/null'

Error: Error returned by external program

Solution : As ad-cache is linked to lsa, so copy lsa binary from machine where pbis 8.0 is installed to the affected machine.
Solution : ad-cache is linked to lsa binary and this has bug.