In this paper, we extend our previous work on direct recognition of single-channel multi-talker mixed speech using permutation invariant training (PIT). We propose to adapt the PIT models with auxiliary features such as pitch and i-vectors, and to exploit gender information through multi-task learning that jointly optimizes for speech recognition and speaker-pair prediction. We also compare CNN-BLSTMs against the BLSTM-RNNs used in our previous PIT-ASR model. Experimental results on artificially mixed two-talker AMI data indicate that the proposed model improvements reduce the word error rate (WER) by 10.0% relative to our previous work for both speakers in the mixed speech. Our results also confirm that PIT can be easily combined with advanced techniques to improve performance on multi-talker speech recognition.